Building a Scalable Multimodal Video Search System with Amazon Nova and OpenSearch

Transforming Video Datasets into Semantic Search Capabilities

This article provides a comprehensive guide on implementing a scalable multimodal video search system that utilizes Amazon Nova models and Amazon OpenSearch Service. Dive into the world of natural language search across extensive video datasets, moving beyond traditional tagging and keyword methods.

Processing Large Video Libraries Efficiently

Learn how we efficiently processed 792,270 videos from two datasets to facilitate advanced search capabilities while managing costs effectively.

Solution Architecture Overview

Understand the architecture of our system, which integrates ingestion and search workflows, and allows for various search methods (text-to-video, video-to-video, and hybrid).

Prerequisites for Implementation

Before starting, ensure you have the necessary AWS account and services configured, including IAM roles and OpenSearch Service.

Step-by-Step Walkthrough

Explore detailed steps for setting up the system, from creating IAM roles to processing videos, generating embeddings, and implementing diverse search functionalities.

Performance Insights

Discover the performance metrics and cost considerations, including query latencies and storage requirements as we scale to handle large datasets.

Conclusion

Find out how this architecture not only meets current needs but also provides a robust foundation for future enhancements and scaling.

About the Authors

Meet the minds behind this solution and their expertise in media, entertainment, and AI technologies.

Building a Scalable Multimodal Video Search System with Amazon Nova and OpenSearch

The volume of video content is growing rapidly. To make use of it, organizations need scalable, efficient search systems that accept natural language queries. This blog post demonstrates how to build a multimodal video search system using Amazon Nova models and Amazon OpenSearch Service. We guide you through moving beyond manual tagging and keyword-based search to semantic search that captures the richness of video content.

Processing Large Datasets at Scale

To illustrate this solution, we processed 792,270 videos sourced from two datasets hosted on the AWS Open Data Registry: Multimedia Commons (787,479 videos with an average duration of 37 seconds) and MEVA (4,791 videos averaging 5 minutes). Together they contain 8,480 hours of video (about 30.5M seconds). Processing all of it took 41 hours of wall-clock time, and the first-year total cost was $27,328 with on-demand pricing or $23,632 with Reserved Instances.

Ingestion Breakdown

Amazon EC2 Compute:
- 4× c7i.48xlarge spot at $2.57/hour × 41 hours = $421
Amazon Bedrock Nova Multimodal Embeddings:
- (30.5M seconds) × $0.00056/second batch pricing = $17,096
Nova Pro Tagging:
- 792K videos × 600 tokens (avg.) ≈ 475M tokens = $571
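
The arithmetic behind these line items can be checked with a short script. This is a minimal sketch that uses only the figures quoted above. The tagging cost is taken as reported because the per-token rate is not stated, and the constant and function names are illustrative.

```python
# Figures quoted in the ingestion breakdown above.
SPOT_PRICE_PER_HOUR = 2.57            # c7i.48xlarge spot price, USD
INSTANCES = 4
WALL_CLOCK_HOURS = 41
VIDEO_SECONDS = 8_480 * 3600          # 8,480 hours of video, about 30.5M seconds
EMBEDDING_PRICE_PER_SECOND = 0.00056  # batch pricing, USD
TAGGING_COST = 571.0                  # USD, as reported (per-token rate not given)
VIDEOS = 792_270
AVG_TOKENS_PER_VIDEO = 600


def ingestion_costs() -> dict[str, float]:
    """Recompute each ingestion line item and their sum, in USD."""
    ec2 = INSTANCES * SPOT_PRICE_PER_HOUR * WALL_CLOCK_HOURS
    embeddings = VIDEO_SECONDS * EMBEDDING_PRICE_PER_SECOND
    return {
        "ec2": ec2,
        "embeddings": embeddings,
        "tagging": TAGGING_COST,
        "total": ec2 + embeddings + TAGGING_COST,
    }


def tagging_tokens() -> int:
    """Total tokens processed for tagging, from the average per video."""
    return VIDEOS * AVG_TOKENS_PER_VIDEO


if __name__ == "__main__":
    for item, usd in ingestion_costs().items():
        print(f"{item:>10}: ${usd:,.0f}")
    print(f"tagging tokens: {tagging_tokens():,}")
```

The three line items sum to roughly $18,088. The rest of the first-year figure is not itemized in this section.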

The solution generates audio-visual embeddings in AUDIO_VIDEO_COMBINED mode, which should be the backbone of your indexing strategy.

Solution Overview

The architecture is built around two workflows: ingestion and search. Together they enable multimodal video search at scale.

Video Ingestion Pipeline

The ingestion pipeline runs on four Amazon EC2 c7i.48xlarge instances with 600 parallel workers and processes about 19,400 videos per hour. Because the asynchronous API is limited to 30 concurrent jobs per account, the pipeline needs a job queue that continually submits new jobs and polls for completion.
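
A submit-and-poll queue of this kind could look like the sketch below. It is not the authors' implementation. The `submit` and `is_done` callables are hypothetical stand-ins for whatever starts a job and checks its status, and the only figure taken from the text is the limit of 30 concurrent jobs.

```python
import time
from collections import deque
from typing import Callable, Iterable

MAX_CONCURRENT_JOBS = 30  # per-account limit quoted in the text


def run_job_queue(
    video_uris: Iterable[str],
    submit: Callable[[str], str],    # hypothetical: starts a job, returns its id
    is_done: Callable[[str], bool],  # hypothetical: True once the job has finished
    max_concurrent: int = MAX_CONCURRENT_JOBS,
    poll_interval: float = 5.0,
) -> list[str]:
    """Process every URI while keeping at most max_concurrent jobs in flight."""
    pending = deque(video_uris)
    in_flight: dict[str, str] = {}  # job id -> video URI
    completed: list[str] = []

    while pending or in_flight:
        # Top up the in-flight set until the concurrency limit is reached.
        while pending and len(in_flight) < max_concurrent:
            uri = pending.popleft()
            in_flight[submit(uri)] = uri

        # Poll every in-flight job and retire the ones that have finished.
        finished = [job_id for job_id in in_flight if is_done(job_id)]
        for job_id in finished:
            completed.append(in_flight.pop(job_id))

        # Back off only when nothing finished, so free slots are refilled promptly.
        if in_flight and not finished:
            time.sleep(poll_interval)

    return completed
```

A production queue would also need retries, failure handling, and throttling of the status calls, which this sketch leaves out.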

Upload Videos: Store videos in Amazon S3.
Process Using Nova: The asynchronous API segments video into 15-second chunks to efficiently generate embeddings.
Generate Tags: Use Nova Pro (or Nova Lite) to assign 10-15 tags to each video.
Index the Data: Store embeddings and metadata tags in dual OpenSearch indexes, facilitating efficient retrieval based on diverse search modes.
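
The service performs the segmentation itself, but it helps to know how many embeddings to expect per video. The sketch below computes 15-second segment boundaries for a given duration. How the API treats a final partial segment is an assumption here (it is kept as a shorter segment).

```python
def segment_bounds(duration_s: float, segment_s: float = 15.0) -> list[tuple[float, float]]:
    """Return (start, end) pairs that cover a video in fixed-length segments.

    The last segment is shorter when the duration is not a multiple of segment_s.
    """
    if duration_s <= 0 or segment_s <= 0:
        raise ValueError("duration_s and segment_s must be positive")

    bounds: list[tuple[float, float]] = []
    start = 0.0
    while start < duration_s:
        end = min(start + segment_s, duration_s)
        bounds.append((start, end))
        start = end
    return bounds


if __name__ == "__main__":
    # A 37-second clip (the Multimedia Commons average) yields three segments.
    print(segment_bounds(37))
    # A 5-minute clip (the MEVA average) yields twenty.
    print(len(segment_bounds(300)))
```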

Types of Searches Enabled

Text-to-video Search: Converts natural language queries into embeddings for semantic similarity.
Video-to-video Search: Finds similar content through direct comparison of video embeddings.
Hybrid Search: Combines vector similarity (weighted 70%) with keyword matching (30%) to enhance accuracy.
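
One way to combine the two signals with a 70/30 weighting is to normalize each result set and take a weighted sum. The sketch below does this on the client side with min-max normalization. The text does not say how the scores are normalized or where the merge happens, so both choices are assumptions.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Scale scores to [0, 1]; a set with a single distinct value maps to 1.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {key: 1.0 for key in scores}
    return {key: (value - lo) / (hi - lo) for key, value in scores.items()}


def hybrid_rank(
    vector_scores: dict[str, float],
    keyword_scores: dict[str, float],
    vector_weight: float = 0.7,   # weighting quoted in the text
    keyword_weight: float = 0.3,
) -> list[tuple[str, float]]:
    """Merge k-NN and BM25 scores per video id into one ranked list."""
    v = min_max(vector_scores)
    t = min_max(keyword_scores)
    combined = {
        vid: vector_weight * v.get(vid, 0.0) + keyword_weight * t.get(vid, 0.0)
        for vid in v.keys() | t.keys()
    }
    # Highest score first; ties are broken by id so the order is deterministic.
    return sorted(combined.items(), key=lambda kv: (-kv[1], kv[0]))


if __name__ == "__main__":
    ranked = hybrid_rank({"a": 0.9, "b": 0.5, "c": 0.1}, {"b": 12.0, "d": 4.0})
    for video_id, score in ranked:
        print(video_id, round(score, 3))
```

A video missing from one result set contributes zero for that signal, so results that appear in both sets are favored.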

Walkthrough

Step 1: Create IAM Roles and Policies

Configure an IAM role that can invoke Amazon Bedrock models, read and write S3 objects, and index documents in OpenSearch Service.

Step 2: Set Up OpenSearch Service Indexes

Create two indexes in OpenSearch Service, one focused on vector embeddings and one for text metadata. This allows for seamless hybrid and semantic search queries.
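
As an illustration, the two index definitions might look like the request bodies below. The field names, the HNSW method settings, and the default dimension of 1024 are assumptions, not values taken from this post. Match the dimension to the embedding size you request from the model.

```python
def vector_index_body(dimension: int = 1024) -> dict:
    """Index body for per-segment embeddings, with k-NN search enabled."""
    return {
        "settings": {"index": {"knn": True}},
        "mappings": {
            "properties": {
                "video_id": {"type": "keyword"},
                "segment_index": {"type": "integer"},
                "embedding": {
                    "type": "knn_vector",
                    "dimension": dimension,  # assumed; must equal the model output size
                    "method": {
                        "name": "hnsw",
                        "engine": "lucene",
                        "space_type": "cosinesimil",
                    },
                },
            }
        },
    }


def text_index_body() -> dict:
    """Index body for per-video tags, searched with BM25 text matching."""
    return {
        "mappings": {
            "properties": {
                "video_id": {"type": "keyword"},
                "tags": {"type": "text"},
            }
        }
    }
```

With the opensearch-py client, each body can be passed to `client.indices.create(index=..., body=...)`. The index names are yours to choose.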

Step 3: Process Videos with Nova Multimodal Embeddings

Use the Amazon Bedrock asynchronous API to generate embeddings for your uploaded videos. The API segments each video into 15-second chunks and produces one embedding per chunk, so each embedding describes a short, coherent span of content.
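
A job submission might be sketched as follows. `start_async_invoke` is the boto3 `bedrock-runtime` call for asynchronous invocation. The model ID and the request field names are assumptions based on our reading of the Nova Multimodal Embeddings request schema, so verify them against the current Amazon Bedrock documentation before use. The client is passed in so the payload can be inspected without calling AWS.

```python
# Assumed model identifier; confirm it in the Amazon Bedrock console or docs.
MODEL_ID = "amazon.nova-2-multimodal-embeddings-v1:0"


def build_embedding_request(
    video_s3_uri: str,
    segment_seconds: int = 15,  # chunk length quoted in the text
    dimension: int = 1024,      # assumed embedding size
) -> dict:
    """Build the model input for a segmented audio-visual embedding job.

    Field names are assumptions and should be checked against the schema.
    """
    return {
        "taskType": "SEGMENTED_EMBEDDING",
        "segmentedEmbeddingParams": {
            "embeddingPurpose": "GENERIC_INDEX",
            "embeddingDimension": dimension,
            "video": {
                "format": "mp4",
                "embeddingMode": "AUDIO_VIDEO_COMBINED",
                "source": {"s3Location": {"uri": video_s3_uri}},
                "segmentationConfig": {"durationSeconds": segment_seconds},
            },
        },
    }


def start_embedding_job(client, video_s3_uri: str, output_s3_uri: str,
                        model_id: str = MODEL_ID) -> str:
    """Submit one asynchronous job and return its invocation ARN.

    `client` is expected to be boto3.client("bedrock-runtime").
    """
    response = client.start_async_invoke(
        modelId=model_id,
        modelInput=build_embedding_request(video_s3_uri),
        outputDataConfig={"s3OutputDataConfig": {"s3Uri": output_s3_uri}},
    )
    return response["invocationArn"]
```

The returned ARN can then be polled with the client's `get_async_invoke` call until the job reports completion.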

Step 4: Generate Metadata Tags

Generate descriptive tags using Nova Pro or Nova Lite, leveraging a predefined taxonomy for optimal search capabilities.
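
Model output usually needs light cleanup before it is indexed. The sketch below assumes the model returns tags as a comma- or newline-separated string. It keeps only tags found in the taxonomy, removes duplicates, and enforces the 10-15 range mentioned earlier. The function name and the output format are assumptions.

```python
from typing import Iterable


def clean_tags(
    raw: str,
    taxonomy: Iterable[str],
    min_tags: int = 10,
    max_tags: int = 15,
) -> list[str]:
    """Normalize raw model output into a list of taxonomy tags.

    Raises ValueError when fewer than min_tags valid tags remain, so the
    caller can retry the request or flag the video for review.
    """
    allowed = {tag.lower() for tag in taxonomy}
    tags: list[str] = []
    for part in raw.replace("\n", ",").split(","):
        tag = part.strip().lower()
        # Keep first occurrences only, and drop anything outside the taxonomy.
        if tag and tag in allowed and tag not in tags:
            tags.append(tag)

    tags = tags[:max_tags]
    if len(tags) < min_tags:
        raise ValueError(f"expected at least {min_tags} tags, got {len(tags)}")
    return tags
```

Restricting tags to a fixed vocabulary keeps the keyword side of hybrid search consistent, because the same concept is always indexed under the same term.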

Step 5: Index Embeddings and Tags in OpenSearch Service

Efficiently store your video embeddings and tags in OpenSearch Service using bulk indexing.
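
The `_bulk` endpoint accepts newline-delimited JSON in which each document is preceded by an action line. The following sketch builds such a payload and splits a large document list into batches. The `doc_id` field and the example index name are hypothetical.

```python
import json
from typing import Iterator


def bulk_body(index_name: str, docs: list[dict], id_field: str = "doc_id") -> str:
    """Serialize documents into the NDJSON payload expected by _bulk."""
    lines: list[str] = []
    for doc in docs:
        # Action line: index this document under an explicit id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc[id_field]}}))
        # Source line: the document itself, without the id field.
        lines.append(json.dumps({k: v for k, v in doc.items() if k != id_field}))
    # The bulk payload must end with a newline.
    return "\n".join(lines) + "\n"


def chunked(items: list, size: int) -> Iterator[list]:
    """Yield consecutive batches of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


if __name__ == "__main__":
    docs = [{"doc_id": "v1_0", "video_id": "v1", "embedding": [0.1, 0.2]}]
    print(bulk_body("video-embeddings", docs), end="")
```

Batch size is a tuning knob. Embedding documents are large, so smaller batches keep request payloads manageable.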

Step 6: Implement Search Functionality

After ingestion, implement low-latency search capabilities with well-defined APIs for natural language queries, video discovery, and hybrid search.
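
The two underlying queries can be expressed in the OpenSearch query DSL as shown in this sketch. The field names `embedding` and `tags` are assumptions and must match your index mappings. For text-to-video search the query vector comes from embedding the query text with the same model, and for video-to-video search it is the embedding of the reference video.

```python
from typing import Sequence


def knn_query(query_vector: Sequence[float], k: int = 10, field: str = "embedding") -> dict:
    """Approximate nearest-neighbor query against the vector index."""
    return {
        "size": k,
        "query": {"knn": {field: {"vector": list(query_vector), "k": k}}},
    }


def tag_query(text: str, k: int = 10, field: str = "tags") -> dict:
    """BM25 full-text query against the tag index."""
    return {
        "size": k,
        "query": {"match": {field: {"query": text}}},
    }
```

With opensearch-py, either body can be sent with `client.search(index=..., body=...)`. Hybrid search runs both queries and merges the two result lists.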

Performance Insights

After indexing all videos, we measured the following query latencies:

Semantic k-NN search: ~76ms
BM25 text search: ~30ms
Hybrid search: ~106ms

Storage Requirements

k-NN index: 28.8 GB
Text index: 1.0 GB

At about 29.8 GB combined, the two indexes fit comfortably on a modern OpenSearch cluster.

Conclusion

Through this post, we explored a comprehensive solution for building a multimodal video search system capable of handling large datasets and enabling rich search capabilities. By integrating Amazon Nova models with OpenSearch Service, you can leverage semantic search capabilities to unlock the full potential of your video content.

About the Authors

Hammad Ausaf – Principal Solutions Architect in Media and Entertainment, passionate about providing the best solutions to AWS customers.

Rajat Jain – Technical Account Manager in Media and Entertainment, a GenAI/ML enthusiast dedicated to building innovative solutions.

For inquiries or to learn more about the technologies discussed in this walkthrough, explore Amazon Nova Multimodal Embeddings and Hybrid Search with Amazon OpenSearch Service.
