Automating Accessible Audio Descriptions for Visual Content Using AWS AI Services
A Comprehensive Guide to Leveraging Generative AI for Accessibility Compliance
Solution Overview
Services Used
Prerequisites
Solution Walkthrough
Clean Up
Conclusion
About the Authors
According to the World Health Organization, at least 2.2 billion people worldwide have a near or distance vision impairment. This statistic underscores the importance of accessibility in media for visually impaired audiences. Under legislation such as the Americans with Disabilities Act (ADA) in the United States, visual formats such as television shows and movies are often required to provide accessibility options, typically in the form of audio description tracks that narrate on-screen visual elements.
However, producing audio descriptions can be costly, averaging $25 per minute when using third-party services. Creating them in-house also demands significant resources, including content creators, audio engineers, and narration talent. This raises the question: can generative AI, particularly the services offered by AWS, automate this process?
AWS Nova Foundation Models: A Game-Changer
At re:Invent 2024, Amazon announced the Amazon Nova foundation models, now accessible through Amazon Bedrock. The family includes:
- Amazon Nova Lite: A fast, low-cost multimodal model for processing image, video, and text inputs.
- Amazon Nova Pro: A versatile multimodal model offering a strong balance of speed, accuracy, and cost across a wide range of tasks.
- Amazon Nova Premier: The most capable model in the family, intended for complex tasks and for use as a teacher in model distillation.
Automating Audio Descriptions
In this blog post, we discuss how to leverage services like Amazon Nova, Amazon Rekognition, and Amazon Polly to automate the generation of audio descriptions for video content. This method can dramatically decrease the time and cost associated with making videos accessible to visually impaired viewers.
Note: The blog will not provide a complete production-ready solution but will feature pseudocode snippets, guidance, and links to resources to facilitate your development.
Solution Overview
The proposed architecture integrates several AWS services into an end-to-end audio description workflow. We recommend running the script in an Amazon SageMaker notebook, which provides a managed environment with the AWS SDK preinstalled.
Key AWS Services Used
- Amazon S3: For storing video files, text descriptions, and audio outputs.
- Amazon Rekognition: To detect and segment video scenes using visual cues.
- Amazon Bedrock: To access the Amazon Nova Pro model for analyzing video content and generating detailed descriptions.
- Amazon Polly: For converting text descriptions into high-quality audio.
Prerequisites
To implement this solution, ensure you have:
- The AWS SDK for Python (Boto3) installed and configured with credentials for the services above.
- A mechanism for video slicing, such as the `moviepy` library for Python.
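As an illustration, video slicing with `moviepy` might look like the following sketch. This assumes moviepy 1.x (where `VideoFileClip.subclip` is the slicing method); `chunk_bounds` and `slice_video` are hypothetical helper names, not part of any library.

```python
def chunk_bounds(duration, chunk_len):
    """Return (start, end) pairs covering `duration` seconds of video
    in chunks of at most `chunk_len` seconds each."""
    bounds = []
    start = 0.0
    while start < duration:
        end = min(start + chunk_len, duration)
        bounds.append((start, end))
        start = end
    return bounds

def slice_video(video_path, out_dir, chunk_len=60):
    # moviepy is imported lazily so chunk_bounds stays usable without it
    from moviepy.editor import VideoFileClip  # assumes moviepy 1.x
    paths = []
    with VideoFileClip(video_path) as clip:
        for i, (start, end) in enumerate(chunk_bounds(clip.duration, chunk_len)):
            out_path = f"{out_dir}/chunk_{i:03d}.mp4"
            clip.subclip(start, end).write_videofile(out_path, logger=None)
            paths.append(out_path)
    return paths
```

Keeping the chunk math in its own function makes the slicing boundaries easy to verify before any encoding work happens.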
Solution Walkthrough
1. Initializing AWS Environment
Start by defining the necessary AWS configuration, including the Region and the Nova Pro model ID used for visual analysis:
```python
import boto3

class VideoAnalyzer:
    def __init__(self):
        self.aws_region = "us-east-1"
        self.model_id = "amazon.nova-pro-v1:0"
        self.chunk_delay = 20  # seconds to pause between chunk requests
        # Initialize AWS clients (Bedrock and Rekognition)
        self.bedrock = boto3.client("bedrock-runtime", region_name=self.aws_region)
        self.rekognition = boto3.client("rekognition", region_name=self.aws_region)
```
2. Segmenting Video Content
Use Amazon Rekognition to detect scene boundaries based on various cues (e.g., shot boundaries, black frames):
```python
def get_segment_results(job_id):
    # Implement the function to retrieve segmentation data
    ...
```
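Filled in, the segmentation step could look like the following sketch, which starts an asynchronous Rekognition segment-detection job and polls for results. The Region, polling interval, and the `start_segmentation`/`to_time_ranges` helper names are illustrative assumptions; the Rekognition API calls themselves (`start_segment_detection`, `get_segment_detection`) are real.

```python
import time

def to_time_ranges(segments):
    """Convert Rekognition segment dicts into (start_ms, end_ms) pairs."""
    return [(s["StartTimestampMillis"], s["EndTimestampMillis"]) for s in segments]

def start_segmentation(bucket, key):
    import boto3  # imported here so to_time_ranges stays usable without AWS
    rekognition = boto3.client("rekognition", region_name="us-east-1")
    resp = rekognition.start_segment_detection(
        Video={"S3Object": {"Bucket": bucket, "Name": key}},
        SegmentTypes=["SHOT", "TECHNICAL_CUE"],  # shot boundaries, black frames, etc.
    )
    return resp["JobId"]

def get_segment_results(job_id):
    import boto3
    rekognition = boto3.client("rekognition", region_name="us-east-1")
    # Poll until the asynchronous job leaves the IN_PROGRESS state
    while True:
        resp = rekognition.get_segment_detection(JobId=job_id)
        if resp["JobStatus"] != "IN_PROGRESS":
            break
        time.sleep(5)
    # Page through all results using NextToken
    segments = resp.get("Segments", [])
    while "NextToken" in resp:
        resp = rekognition.get_segment_detection(
            JobId=job_id, NextToken=resp["NextToken"])
        segments.extend(resp.get("Segments", []))
    return to_time_ranges(segments)
```

In production you would likely replace the polling loop with Rekognition's SNS completion notifications.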
3. Analyzing Video Scenes
Utilize the Nova Pro model to analyze each video segment and generate descriptive text.
```python
def analyze_chunk(chunk_path):
    # Logic to convert video chunk into base64 and analyze
    ...
```
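One way to fill in the analysis step is the Bedrock Converse API, which accepts video content blocks for Nova models; Boto3 handles the base64 encoding of the bytes for you (with the lower-level `invoke_model` API you would base64-encode the chunk yourself). The Region, prompt wording, and inference settings below are illustrative assumptions.

```python
def build_prompt():
    return ("Describe the key visual elements of this video segment in one or "
            "two sentences, as an audio description for visually impaired viewers.")

def analyze_chunk(chunk_path):
    import boto3  # imported here so build_prompt stays usable without AWS
    bedrock = boto3.client("bedrock-runtime", region_name="us-east-1")
    with open(chunk_path, "rb") as f:
        video_bytes = f.read()  # Converse accepts raw bytes for the video block
    response = bedrock.converse(
        modelId="amazon.nova-pro-v1:0",
        messages=[{
            "role": "user",
            "content": [
                {"video": {"format": "mp4", "source": {"bytes": video_bytes}}},
                {"text": build_prompt()},
            ],
        }],
        inferenceConfig={"maxTokens": 300, "temperature": 0.2},
    )
    # Extract the generated description text from the model response
    return response["output"]["message"]["content"][0]["text"]
```

A low temperature keeps the descriptions factual rather than speculative, which matters for accessibility narration.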
4. File Management and Consolidation
Compile all analysis results into a comprehensive text file, which serves as the basis for audio descriptions.
```python
def analyze_video(video_path, bucket):
    # Orchestrate video analysis and save the results
    ...
```
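The consolidation step can be sketched as follows. Here `slice_video` and `analyze_chunk` are assumed helpers (hypothetical names) that slice the source video into fixed-length chunks and call Nova Pro on each one; the 60-second chunk length, output path, and file name are illustrative.

```python
def format_entry(start_s, end_s, description):
    """Render one description line with HH:MM:SS timestamps."""
    def hms(t):
        t = int(t)
        return f"{t // 3600:02d}:{(t % 3600) // 60:02d}:{t % 60:02d}"
    return f"[{hms(start_s)} - {hms(end_s)}] {description}"

def analyze_video(video_path, bucket):
    import boto3  # imported here so format_entry stays usable without AWS
    lines = []
    # slice_video / analyze_chunk are assumed helpers for slicing and analysis
    for i, chunk_path in enumerate(slice_video(video_path, "/tmp/chunks")):
        start, end = i * 60, (i + 1) * 60  # assumes 60-second chunks
        lines.append(format_entry(start, end, analyze_chunk(chunk_path)))
    out_file = "descriptions.txt"
    with open(out_file, "w") as f:
        f.write("\n".join(lines))
    # Persist the consolidated descriptions to S3 alongside the source video
    boto3.client("s3").upload_file(out_file, bucket, out_file)
    return out_file
```

Timestamping each entry keeps the text file usable later for aligning the narration with the original video.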
5. Text-to-Speech Conversion
Send the description text to Amazon Polly for voice synthesis, generating an MP3 audio file.
```python
def generate_audio(text_file, output_audio_file):
    # Logic for generating audio from the text analysis
    ...
```
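A hedged sketch of the synthesis step is below. Because Polly's `synthesize_speech` has a per-request character limit (3,000 billed characters), the text is split into pieces first and the resulting MP3 streams are naively concatenated; the voice, Region, and the `chunk_text` helper name are illustrative assumptions.

```python
def chunk_text(text, limit=2900):
    """Split text into pieces under Polly's per-request size limit,
    breaking on newlines where possible."""
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        if len(current) + len(line) > limit and current:
            chunks.append(current)
            current = ""
        current += line
    if current:
        chunks.append(current)
    return chunks

def generate_audio(text_file, output_audio_file):
    import boto3  # imported here so chunk_text stays usable without AWS
    polly = boto3.client("polly", region_name="us-east-1")
    with open(text_file) as f:
        text = f.read()
    with open(output_audio_file, "wb") as out:
        for piece in chunk_text(text):
            resp = polly.synthesize_speech(
                Text=piece,
                OutputFormat="mp3",
                VoiceId="Joanna",  # any neural-capable voice works
                Engine="neural",
            )
            out.write(resp["AudioStream"].read())
```

For long-form content, Polly's asynchronous `start_speech_synthesis_task` API (which writes directly to S3) is a better fit than concatenating synchronous responses.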
Clean Up
Remember to delete any temporary resources created during the workflow (S3 objects, local video chunks, and any running SageMaker notebook instances) to avoid unnecessary costs.
Conclusion
By employing AWS services like S3, Rekognition, Nova Pro, and Polly, media creators can fully automate the process of generating audio descriptions, significantly reducing time and costs. This not only aids in creating accessible content but also helps businesses comply with accessibility regulations.
Future Considerations
The outlined solution is applicable to various forms of visual media beyond just films and TV shows. With further development and scaling considerations, it can serve as a robust tool for improving accessibility in all forms of visual storytelling.
For more information about the Amazon Nova model family and its capabilities, explore the documentation on Amazon’s official website.
About the Authors
Dylan Martin is an AWS Solutions Architect primarily focused on generative AI, bringing extensive experience from various roles in software engineering and security.
Ankit Patel is a Solutions Developer at AWS, specializing in turning customer ideas into rapid prototype applications using AWS technologies.
This automated audio description approach can help bridge the accessibility gap, ensuring that everyone can enjoy and engage with visual content.