Unlocking Video Content with TwelveLabs Marengo and Amazon Bedrock: A Fresh Approach to Multimodal Video Understanding
In the realms of media and entertainment, advertising, education, and enterprise training, the confluence of visual, audio, and motion elements has revolutionized storytelling and information conveyance. Unlike traditional text-based content that relies on clear word meanings, video presents a unique complexity that poses significant challenges for AI systems. This post explores those challenges and highlights an innovative solution: the TwelveLabs Marengo Embed 3.0 model, integrated with Amazon Bedrock.
The Challenge of Video Content for AI
Video content is inherently multidimensional, integrating various components to create a rich narrative experience:
- Visual Information: Objects, scenes, people, and actions.
- Audio Components: Dialogue, music, sound effects.
- Text Elements: Captions, subtitles, and on-screen text.
This complexity leads to considerable business challenges, such as:
- Searching through vast video archives.
- Locating specific scenes efficiently.
- Automating content categorization.
- Extracting meaningful insights for decision-making.
The Unique Solution: Multi-Vector Architecture
The TwelveLabs Marengo model tackles these challenges with a multi-vector architecture that crafts specialized embeddings for various content modalities. Instead of compressing all information into a single vector, Marengo generates distinct representations that preserve the multifaceted essence of video data. This allows for more precise analysis across visual, temporal, and audio dimensions.
Real-Time Enhancements with Amazon Bedrock
Amazon Bedrock now bolsters its capabilities with support for the TwelveLabs Marengo model, providing real-time text and image processing through synchronous inference. This integration enables businesses to:
- Implement faster video search functionalities using natural language queries.
- Facilitate interactive product discovery through sophisticated image similarity matching.
In this post, we will demonstrate how Marengo’s embedding model enhances video understanding using multimodal AI, specifically building a video semantic search and analysis solution utilizing Amazon OpenSearch Serverless as the vector database.
Understanding Video Embeddings
Embeddings are dense vector representations that capture the semantic meaning of data in a high-dimensional space. They serve as numerical fingerprints encoding the essence of content in a manner machines can comprehend and compare.
For instance:
- Text embeddings can recognize relationships, such as "king" and "queen."
- Image embeddings can identify similarity between breeds, like golden retrievers and Labradors.
Video embeddings extend this idea across visual, audio, and temporal signals, offering:
- Enhanced accuracy in complex multimodal scenarios.
- Efficient scalability for extensive enterprise video datasets.
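Similarity between embeddings is typically measured with cosine similarity, which compares vector direction rather than magnitude. The toy illustration below uses made-up three-dimensional vectors purely to show the mechanics; real Marengo embeddings have 1,024 dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 means identical direction, near 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Made-up 3-dimensional "embeddings" for illustration only
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.12]
truck = [0.1, 0.2, 0.95]
```

Semantically related items ("king" and "queen") end up with a higher cosine similarity than unrelated ones ("king" and "truck"), which is exactly the property vector search exploits.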
Developing a Semantic Search Solution
In this section, we’ll explore how to apply the Marengo model with practical code examples, using a sample video from Netflix Open Content licensed under Creative Commons.
Prerequisites
Ensure you have:
- AWS account with access to Amazon Bedrock and OpenSearch Serverless.
- Required Python libraries installed, such as boto3 and opensearch-py.
Create a Video Embedding
Use the following Python snippet to generate embeddings from a video stored in an S3 bucket:
```python
import boto3

bedrock_client = boto3.client("bedrock-runtime")

model_id = "us.twelvelabs.marengo-embed-3-0-v1:0"
video_s3_uri = ""       # Replace with your S3 URI
aws_account_id = ""     # Replace with bucket owner ID
s3_bucket_name = ""     # Replace with output S3 bucket name
s3_output_prefix = ""   # Replace with output prefix

response = bedrock_client.start_async_invoke(
    modelId=model_id,
    modelInput={
        "inputType": "video",
        "video": {
            "mediaSource": {
                "s3Location": {
                    "uri": video_s3_uri,
                    "bucketOwner": aws_account_id,
                }
            }
        },
    },
    outputDataConfig={
        "s3OutputDataConfig": {
            "s3Uri": f"s3://{s3_bucket_name}/{s3_output_prefix}"
        }
    },
)
```
This call produces multiple segment-level embeddings from the video, capturing its various sections for precise temporal search and analysis.
Indexing with Amazon OpenSearch Serverless
Amazon OpenSearch Serverless provides efficient storage of and search over the generated embeddings. Here’s how to create a vector search collection (collection names must use lowercase letters, numbers, and hyphens):

```python
import boto3

aoss_client = boto3.client("opensearchserverless")

collection = aoss_client.create_collection(
    name="your-collection-name",
    type="VECTORSEARCH",
)
```
Once created, develop an index with attributes that include a vector field for embeddings:
```python
index_mapping = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "video_id": {"type": "keyword"},
            "segment_id": {"type": "integer"},
            "start_time": {"type": "float"},
            "end_time": {"type": "float"},
            "embedding": {
                "type": "knn_vector",  # OpenSearch's vector type (not Elasticsearch's dense_vector)
                "dimension": 1024,
                "method": {
                    "name": "hnsw",
                    "engine": "nmslib",
                    "space_type": "cosinesimil",
                },
            },
        }
    },
}
```
Cross-Modal Semantic Search
The true power of Marengo lies in its cross-modal semantic search capabilities. You can query using text, images, or audio, enabling a comprehensive search experience. For instance, searching for “jazz music playing” can surface clips of musicians performing, matching audio passages, and concert scenes.
Example Code for Text Search
```python
import json

text_query = "a person smoking in a room"

# Text embeddings are generated through synchronous inference
model_input = {
    "inputType": "text",
    "text": {
        "inputText": text_query
    }
}

response = bedrock_client.invoke_model(
    modelId="us.twelvelabs.marengo-embed-3-0-v1:0",
    body=json.dumps(model_input)
)
query_embedding = json.loads(response["body"].read())["data"][0]["embedding"]

# k-NN search in OpenSearch for the five closest video segments
search_body = {
    "size": 5,
    "query": {
        "knn": {
            "embedding": {
                "vector": query_embedding,
                "k": 5
            }
        }
    }
}
response = opensearch_client.search(index="your_index_name", body=search_body)
```
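Each hit returned by the search carries the segment metadata stored at indexing time, so matches map directly back to timestamps in the source video. A small helper for unpacking the response (field names follow the index mapping above):

```python
def top_matches(search_response, limit=5):
    """Extract (video_id, start_time, end_time, score) tuples from a k-NN response."""
    hits = search_response.get("hits", {}).get("hits", [])
    return [
        (
            h["_source"]["video_id"],
            h["_source"]["start_time"],
            h["_source"]["end_time"],
            h["_score"],
        )
        for h in hits[:limit]
    ]

# Usage:
# for video_id, start, end, score in top_matches(response):
#     print(f"{video_id}: {start:.1f}s-{end:.1f}s (score {score:.3f})")
```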
Conclusion
The integration of TwelveLabs Marengo with Amazon Bedrock is a game-changer for video understanding. By utilizing a multi-vector, multimodal approach, organizations can transform vast amounts of video data into searchable, actionable content. With advanced capabilities such as cross-modal search and real-time processing, businesses can better manage, analyze, and leverage their video assets.
As video content continues to dominate the digital landscape, innovative models like Marengo offer a solid foundation for building intelligent video analysis solutions.
About the Authors
- Wei Teh: Machine Learning Solutions Architect at AWS with a passion for cutting-edge AI solutions.
- Lana Zhang: Specialist Solutions Architect focusing on Generative AI, committed to transforming business solutions across various industries.
- Yanyan Zhang: Senior Generative AI Data Scientist at AWS, dedicated to advancing AI technologies for customer success.
Explore the code snippets in our GitHub repository.