Transforming Video Metadata Management with AI: Lessons from DPG Media
The Power of AI in Enhancing Video Metadata: A Case Study with DPG Media
This post was co-written with Lucas Desard, Tom Lauwers, and Sam Landuydt from DPG Media.
DPG Media is a leading media company in Benelux operating multiple online platforms and TV channels. DPG Media’s VTM GO platform alone offers over 500 days of non-stop content.
With a growing library of long-form video content, DPG Media recognizes the importance of efficiently managing and enhancing video metadata such as actor information, genre, summary of episodes, the mood of the video, and more. Having descriptive metadata is key to providing accurate TV guide descriptions, improving content recommendations, and enhancing the consumer’s ability to explore content that aligns with their interests and current mood.
This post shows how DPG Media introduced AI-powered processes using Amazon Bedrock and Amazon Transcribe into its video publication pipelines in just 4 weeks, as an evolution towards more automated annotation systems.
The challenge: Extracting and generating metadata at scale
DPG Media receives video productions accompanied by a wide range of marketing materials such as visual media and brief descriptions. These materials often lack standardization and vary in quality. As a result, DPG Media Producers have to run a screening process to consume and understand the content sufficiently to generate the missing metadata, such as brief summaries. For some content, additional screening is performed to generate subtitles and captions.
As DPG Media grows, they need a more scalable way of capturing metadata that enhances the consumer experience on online video services and aids in understanding key content characteristics.
The following were some initial challenges in automation:
– Language diversity – The services host both Dutch and English shows. Some local shows feature Flemish dialects, which can be difficult for some large language models (LLMs) to understand.
– Variability in content volume – They offer a range of content volume, from single-episode films to multi-season series.
– Release frequency – New shows, episodes, and movies are released daily.
– Data aggregation – Metadata needs to be available at the top-level asset (program or movie) and must be reliably aggregated across different seasons.
Solution overview
To address the challenges of automation, DPG Media decided to implement a combination of AI techniques and existing metadata to generate new, accurate content and category descriptions, mood, and context.
The project focused solely on audio processing due to its cost-efficiency and faster processing time. Video data analysis with AI wasn’t required for generating detailed, accurate, and high-quality metadata.
The general architecture of the metadata pipeline consists of two primary steps:
– Generate transcriptions of audio tracks: use speech recognition models to generate accurate transcripts of the audio content.
– Generate metadata: use LLMs to extract and generate detailed metadata from the transcriptions.
In the following sections, we discuss the components of the pipeline in more detail.
Step 1. Generate transcriptions of audio tracks
To generate the necessary audio transcripts for metadata extraction, the DPG Media team evaluated two different transcription strategies: Whisper-v3-large, which requires at least 10 GB of vRAM and high operational processing, and Amazon Transcribe, a managed service with the added benefit of automatic model updates from AWS over time and speaker diarization. The evaluation focused on two key factors: price-performance and transcription quality.
To evaluate the transcription accuracy quality, the team compared the results against ground truth subtitles on a large test set, using the following metrics:
– Word error rate (WER) – This metric measures the percentage of words that are incorrectly transcribed compared to the ground truth. A lower WER indicates a more accurate transcription.
– Match error rate (MER) – MER assesses the proportion of correct words that were accurately matched in the transcription. A lower MER signifies better accuracy.
– Word information lost (WIL) – This metric quantifies the amount…
**[Continue reading on the original source >>](provide the URL)**