Unlocking Video Content with TwelveLabs Marengo and Amazon Bedrock: A Fresh Approach to Multimodal Video Understanding
In the realms of media and entertainment, advertising, education, and enterprise training, the confluence of visual, audio, and motion elements has revolutionized storytelling and information conveyance. Unlike traditional text-based content that relies on clear word meanings, video presents a unique complexity that poses significant challenges for AI systems. This post explores those challenges and highlights an innovative solution: the TwelveLabs Marengo Embed 3.0 model, integrated with Amazon Bedrock.
The Challenge of Video Content for AI
Video content is inherently multidimensional, integrating various components to create a rich narrative experience:
- Visual Information: Objects, scenes, people, and actions.
- Audio Components: Dialogue, music, sound effects.
- Text Elements: Captions, subtitles, and on-screen text.
This complexity leads to considerable business challenges, such as:
- Searching through vast video archives.
- Locating specific scenes efficiently.
- Automating content categorization.
- Extracting meaningful insights for decision-making.
The Unique Solution: Multi-Vector Architecture
The TwelveLabs Marengo model tackles these challenges with a multi-vector architecture that crafts specialized embeddings for various content modalities. Instead of compressing all information into a single vector, Marengo generates distinct representations that preserve the multifaceted essence of video data. This allows for more precise analysis across visual, temporal, and audio dimensions.
Real-Time Enhancements with Amazon Bedrock
Amazon Bedrock now bolsters its capabilities with support for the TwelveLabs Marengo model, providing real-time text and image processing through synchronous inference. This integration enables businesses to:
- Implement faster video search functionalities using natural language queries.
- Facilitate interactive product discovery through sophisticated image similarity matching.
In this post, we will demonstrate how Marengo’s embedding model enhances video understanding using multimodal AI, specifically building a video semantic search and analysis solution utilizing Amazon OpenSearch Serverless as the vector database.
Understanding Video Embeddings
Embeddings are dense vector representations that capture the semantic meaning of data in a high-dimensional space. They serve as numerical fingerprints encoding the essence of content in a manner machines can comprehend and compare.
For instance:
- Text embeddings can recognize relationships, such as "king" and "queen."
- Image embeddings can identify similarity between breeds, like golden retrievers and Labradors.
Video embeddings extend this idea across visual, audio, and temporal signals, offering:
- Enhanced accuracy in complex multimodal scenarios.
- Efficient scalability for extensive enterprise video datasets.
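Similarity between embeddings is typically measured with cosine similarity, which compares vector direction rather than magnitude. The toy illustration below uses made-up three-dimensional vectors purely to show the mechanics; real Marengo embeddings have 1,024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-dimensional "embeddings" for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
truck = [0.1, 0.2, 0.95]
```

Semantically related items ("king" and "queen") end up with a higher cosine similarity than unrelated ones ("king" and "truck"), which is exactly the property vector search exploits.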
Developing a Semantic Search Solution
In this section, we’ll explore how to apply the Marengo model with practical code examples, using a sample video from Netflix Open Content licensed under Creative Commons.
Prerequisites
Ensure you have:
- AWS account with access to Amazon Bedrock and OpenSearch Serverless.
- Required Python libraries installed, such as boto3 and opensearch-py.
Create a Video Embedding
Use the following Python snippet to generate embeddings from a video stored in an S3 bucket:
```python
import boto3

bedrock_client = boto3.client("bedrock-runtime")

model_id = "us.twelvelabs.marengo-embed-3-0-v1:0"
video_s3_uri = ""       # Replace with your S3 URI
aws_account_id = ""     # Replace with bucket owner ID
s3_bucket_name = ""     # Replace with output S3 bucket name
s3_output_prefix = ""   # Replace with output prefix

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": video_s3_uri,
                    "bucketOwner": aws_account_id,
                }
            }
        },
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    },
)
```
This call produces multiple segment-level embeddings from the video, capturing its various sections for precise temporal search and analysis.
Indexing with Amazon OpenSearch Serverless
Amazon OpenSearch Serverless provides efficient storage of and search over the generated embeddings. Here’s how to create a vector search collection (collection names must use lowercase letters, numbers, and hyphens):

```python
import boto3

aoss_client = boto3.client("opensearchserverless")

collection = aoss_client.create_collection(
    name="your-collection-name",
    type="VECTORSEARCH",
)
```
Once created, develop an index with attributes that include a vector field for embeddings:
```python
index_mapping = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "segment_id": {"type": "integer"},
            "start_time": {"type": "float"},
            "end_time": {"type": "float"},
            "embedding": {
                "type": "knn_vector",  # OpenSearch's vector type (not Elasticsearch's dense_vector)
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil",
                },
            },
        }
    },
}
```
Cross-Modal Semantic Search
The true power of Marengo lies in its cross-modal semantic search capabilities. You can query using text, images, or audio, enabling a comprehensive search experience. For instance, searching for “jazz music playing” can surface clips of musicians performing, matching audio passages, and concert scenes.
Example Code for Text Search
```python
import json

text_query = "a person smoking in a room"

# Text embeddings are generated through synchronous inference
model_input = {
    "inputType": "text",
    "text": {
        "inputText": text_query
    }
}

response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(model_input)
)
query_embedding = json.loads(response["body"].read())["data"][0]["embedding"]

# k-NN search in OpenSearch for the five closest video segments
search_body = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": 5
            }
        }
    }
}
response = opensearch_client.search(index="your_index_name", body=search_body)
```
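Each hit returned by the search carries the segment metadata stored at indexing time, so matches map directly back to timestamps in the source video. A small helper for unpacking the response (field names follow the index mapping above):

```python
def top_matches(search_response, limit=5):
    """Extract (video_id, start_time, end_time, score) tuples from a k-NN response."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        (
            h["_source"]["video_id"],
            h["_source"]["start_time"],
            h["_source"]["end_time"],
            h["_score"],
        )
        for h in hits[:limit]
    ]

# Usage:
# for video_id, start, end, score in top_matches(response):
#     print(f"{video_id}: {start:.1f}s-{end:.1f}s (score {score:.3f})")
```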
Conclusion
The integration of TwelveLabs Marengo with Amazon Bedrock is a game-changer for video understanding. By utilizing a multi-vector, multimodal approach, organizations can transform vast amounts of video data into searchable, actionable content. With advanced capabilities such as cross-modal search and real-time processing, businesses can better manage, analyze, and leverage their video assets.
As video content continues to dominate the digital landscape, innovative models like Marengo offer a solid foundation for building intelligent video analysis solutions.
About the Authors
- Wei Teh: Machine Learning Solutions Architect at AWS with a passion for cutting-edge AI solutions.
- Lana Zhang: Specialist Solutions Architect focusing on Generative AI, committed to transforming business solutions across various industries.
- Yanyan Zhang: Senior Generative AI Data Scientist at AWS, dedicated to advancing AI technologies for customer success.
Explore the code snippets in our GitHub repository.