Optimizing NVIDIA Nemotron Speech ASR on Amazon EC2 for Domain Adaptation

Fine-Tuning and Deploying NVIDIA’s Parakeet TDT 0.6B V2 for Enhanced Automatic Speech Recognition with AWS


Collaboration Between AWS, NVIDIA, and Heidi
Explore how to fine-tune the NVIDIA Parakeet TDT 0.6B V2 model using AWS infrastructure for superior ASR capabilities across various domains, including healthcare.


Introduction: Advancements in Speech Recognition Technology

Solution Overview: Introducing Heidi’s AI Care Partner

Synthesizing Domain-Specific Data for Enhanced Accuracy

Understanding the NVIDIA Parakeet TDT 0.6B V2 Model Architecture

Benefits of Fine-Tuning ASR Models for Specific Domains

Model Architecture and Components Explained

Setting Up the Environment for Fine-Tuning

Implementing the Fine-Tuning Process

Performance Monitoring: Optimizing for Success

Model Deployment Strategies: Ensuring Efficient Inference

Conclusion and Next Steps: Building on AWS for Future Innovations

Acknowledgements

About the Authors

Empowering Speech Technologies with AWS and NVIDIA: Fine-Tuning ASR for Specialized Domain Applications

This post is the result of a collaboration between AWS, NVIDIA, and Heidi.

Automatic speech recognition (ASR), often referred to as speech-to-text (STT), is rapidly becoming essential in various industries, including healthcare, customer service, and media production. While pre-trained models offer robust capabilities for general speech, fine-tuning these models for specific applications can significantly enhance their accuracy and performance.

The Power of Fine-Tuning ASR Models

In this post, we walk through fine-tuning the NVIDIA Nemotron Speech ASR model, Parakeet TDT 0.6B V2, on synthetic speech data to improve transcription accuracy in specialized domains. We detail an end-to-end workflow that combines AWS infrastructure with popular open-source frameworks:

  • Amazon EC2 GPU Instances (p4d.24xlarge with NVIDIA A100 GPUs): For distributed training at scale.
  • NVIDIA NeMo: Framework for ASR model fine-tuning and optimization.
  • DeepSpeed: For memory-efficient distributed training across multiple nodes.
  • MLflow and TensorBoard: For comprehensive experiment tracking.
  • Amazon EKS: For scalable model serving.
  • Amazon FSx for Lustre: For high-performance model weight storage.
  • AI Gateway and Langfuse: For production-grade API management and observability.
  • Docker: For consistent, reproducible environments across training and inference.

This architecture shows how AWS managed services can be combined with leading open-source AI tools to build production-ready, domain-adapted ASR systems that provide measurable business value, from initial fine-tuning through to elastic, observable deployment.

Solution Overview: Heidi’s AI Care Partner

Heidi is an AI Care Partner designed to alleviate the administrative burden in healthcare, focusing on handling documentation, clinical evidence, and patient communications. The platform supports over 2.4 million consultations weekly in 110 languages across 190 countries. By streamlining the workload for clinicians, Heidi helps them reclaim precious time while ensuring the accuracy and integrity of clinical records.

Out-of-the-box ASR models often falter when faced with medical terminology, regional accents, and code-switching between clinical and conversational language. These limitations lead to transcription errors and add to clinicians’ cognitive load, pulling their attention away from patient care. Accurate documentation is not just a matter of convenience: it is critical for clinical safety, liability protection, and trust in the tools being used.

To address these challenges, Heidi collaborated with the AWS Generative AI Innovation Center (GenAIIC) to fine-tune and adapt the model to the nuanced linguistic, acoustic, and contextual factors of real-world clinical environments. By leveraging advancements in text-to-speech (TTS) models, Heidi generated high-quality, multilingual synthetic speech interspersed with realistic background noises. This approach allowed for the creation of a diverse training dataset that emphasizes low-resource languages and rare medical terms.

Synthesizing Domain-Specific Data

To improve the performance of the NVIDIA Parakeet TDT 0.6B V2 on specialized medical terminology, we established a synthetic data generation pipeline that combines large language models (LLMs), neural TTS synthesis, and noise augmentation. First, a lexicon of medical terms (drug names, anatomical entities, and procedural phrases) was compiled. These terms served as inputs for a domain-adapted LLM, which generated contextually rich transcripts resembling real-world clinical dictations.

The generated transcripts were converted into speech using a neural TTS system tailored for specific accents. To further diversify the data, a multi-stage audio augmentation pipeline overlaid ambient noise and applied controlled perturbations to replicate real-world recording conditions.
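The noise-overlay step can be illustrated with a minimal sketch. It assumes audio is held as floating-point NumPy arrays in the range [-1, 1]; the function name `mix_at_snr` and the choice of signal-to-noise ratio (SNR) as the control knob are illustrative assumptions, not details taken from the pipeline described in this post.

```python
import numpy as np


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay `noise` on `speech` so the mix has the requested SNR in dB.

    Hypothetical helper for illustration; not the authors' augmentation code.
    """
    # Loop or trim the noise clip so it covers the whole utterance.
    repeats = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, repeats)[: len(speech)]

    speech_power = np.mean(speech ** 2)
    noise_power = np.mean(noise ** 2)
    if speech_power == 0 or noise_power == 0:
        return speech.copy()  # nothing meaningful to mix

    # SNR_dB = 10 * log10(P_speech / P_noise), solved for the noise gain.
    target_noise_power = speech_power / (10 ** (snr_db / 10))
    gain = np.sqrt(target_noise_power / noise_power)
    mixed = speech + gain * noise

    # Rescale only if the sum would clip; scaling both parts keeps the SNR.
    peak = np.max(np.abs(mixed))
    return mixed / peak if peak > 1.0 else mixed
```

Sampling `snr_db` from a range (for example, quieter consultation rooms versus busy wards) is one way to vary acoustic conditions across the synthetic dataset.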

Introduction to the NVIDIA Parakeet TDT 0.6B V2 Model

The NVIDIA Parakeet TDT 0.6B V2 is a 600-million-parameter ASR model engineered for high-quality English transcription. It is built on NVIDIA NeMo’s FastConformer architecture and uses a Token-and-Duration Transducer (TDT) decoder. Key features include:

  • Automatic punctuation and capitalization
  • Word-level timestamp predictions
  • Robust performance on spoken numbers and song lyrics
  • Support for audio segments of up to 24 minutes in length
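As a rough sketch of how these features are typically accessed, the snippet below follows the usage pattern published on the model's public model card. The wrapper function `transcribe_with_timestamps` is a hypothetical name introduced here, and the exact shape of the returned objects can differ between NeMo versions, so treat this as an assumption to verify against your installed release.

```python
from typing import List

# Model identifier as published by NVIDIA.
MODEL_NAME = "nvidia/parakeet-tdt-0.6b-v2"


def transcribe_with_timestamps(audio_paths: List[str]) -> List[dict]:
    """Transcribe audio files and return text plus word-level timestamps.

    Hypothetical wrapper for illustration only.
    """
    # Imported lazily so this module can be loaded without NeMo installed.
    import nemo.collections.asr as nemo_asr

    asr_model = nemo_asr.models.ASRModel.from_pretrained(model_name=MODEL_NAME)
    hypotheses = asr_model.transcribe(audio_paths, timestamps=True)

    results = []
    for hyp in hypotheses:
        # Each hypothesis carries the punctuated, capitalized text and a
        # timestamp dictionary keyed by granularity (for example "word").
        results.append({"text": hyp.text, "words": hyp.timestamp["word"]})
    return results
```

Calling `transcribe_with_timestamps(["dictation.wav"])` (the file name is a placeholder) downloads the checkpoint on first use and needs a GPU for reasonable speed.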

The model achieves an impressive average Word Error Rate (WER) of 6.05% across multiple benchmark datasets.
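For readers less familiar with the metric, WER is the number of word substitutions, deletions, and insertions needed to turn the model output into the reference transcript, divided by the number of reference words. A WER of 6.05% therefore means roughly six word errors per 100 reference words. The sketch below is a minimal implementation with lowercasing as the only text normalization, which is a simplifying assumption; benchmark evaluations usually apply a fuller normalizer.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Compute WER = (substitutions + deletions + insertions) / reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # Word-level Levenshtein distance, keeping only one row of the DP table.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# A drug name split into two wrong words costs one substitution plus one insertion.
reference = "patient was prescribed metoprolol fifty milligrams daily"
hypothesis = "patient was prescribed metro pearl fifty milligrams daily"
print(f"{word_error_rate(reference, hypothesis):.3f}")  # 2 errors / 7 words
```

The example also shows why rare medical terms matter: a single misrecognized drug name accounts for the entire error in an otherwise perfect transcript.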

Why Fine-Tune the Model?

Although the base model displays commendable performance, fine-tuning it for specialized domains offers significant advantages:

  • Domain-specific terminology: Improving recognition of unique vocabulary and jargon.
  • Accent and dialect adaptation: Enhancing performance across various regional linguistic patterns.
  • Noise resilience: Optimizing performance in domain-specific acoustic environments.

Setting Up Your Environment for Fine-Tuning

Our fine-tuning approach leverages distributed training on Amazon EC2 instances, encapsulated in a Docker container for consistent deployment.

Docker-Based Environment Setup

The Docker image packages the dependencies needed for fine-tuning: NVIDIA’s PyTorch container, the NeMo framework, DeepSpeed, MLflow, and TensorBoard.

Resource Requirements

For efficient fine-tuning, we recommend the p4d.24xlarge instance type, which provides 8 NVIDIA A100 GPUs with 40 GB of high-bandwidth memory each. This leaves headroom for the model’s weights, gradients, optimizer state, and activations during distributed training.

Implementing the Fine-Tuning Process

Our fine-tuning strategy takes a modular approach: a dedicated ASRTrainer class manages model initialization and training.
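The internals of ASRTrainer are not shown here, but NeMo ASR training generally reads its data from a JSON-lines manifest in which each line holds an `audio_filepath`, a `duration` in seconds, and the `text` transcript. The sketch below builds such a manifest from synthetic audio and transcript pairs. The helper names `wav_duration_seconds` and `write_manifest` are hypothetical, and the sketch assumes uncompressed PCM WAV files.

```python
import json
import wave
from pathlib import Path
from typing import Iterable, Tuple


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def write_manifest(pairs: Iterable[Tuple[str, str]], manifest_path: str) -> int:
    """Write one JSON object per (audio path, transcript) pair; return the count.

    Hypothetical helper for illustration; not part of NeMo or the authors' code.
    """
    count = 0
    with open(manifest_path, "w", encoding="utf-8") as out:
        for audio_path, text in pairs:
            entry = {
                "audio_filepath": str(Path(audio_path).resolve()),
                "duration": round(wav_duration_seconds(audio_path), 3),
                "text": text,
            }
            # ensure_ascii=False keeps accented characters readable in the file.
            out.write(json.dumps(entry, ensure_ascii=False) + "\n")
            count += 1
    return count
```

Keeping separate manifests for training and validation makes it straightforward to hold out real clinical recordings for evaluation while training on synthetic speech.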

Performance Monitoring and Optimization

Continuous monitoring during training is crucial for confirming that the model is actually learning. We use MLflow to track training metrics and model checkpoints, while DeepSpeed provides memory optimization techniques that make it possible to train large models on limited hardware.
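To give a sense of scale for the memory savings, the sketch below estimates per-GPU memory for model states under mixed-precision Adam, using the accounting from the ZeRO technique that underlies DeepSpeed: 2 bytes per parameter for half-precision weights, 2 for gradients, and 12 for full-precision optimizer state. This post does not state which ZeRO stage was used, so the stages shown are assumptions, and activation memory is deliberately excluded.

```python
def model_state_gb(num_params: float, num_gpus: int = 1, zero_stage: int = 0) -> float:
    """Estimate per-GPU memory (GB, 1e9 bytes) for weights, gradients, and Adam state.

    Illustrative estimate only; activations and buffers are not included.
    """
    weights, grads, optim = 2.0, 2.0, 12.0  # bytes per parameter
    if zero_stage >= 1:
        optim /= num_gpus    # stage 1 partitions optimizer state
    if zero_stage >= 2:
        grads /= num_gpus    # stage 2 also partitions gradients
    if zero_stage >= 3:
        weights /= num_gpus  # stage 3 also partitions the weights
    return num_params * (weights + grads + optim) / 1e9


params = 0.6e9  # Parakeet TDT 0.6B V2
for stage in (0, 1, 2, 3):
    print(f"ZeRO stage {stage}: {model_state_gb(params, num_gpus=8, zero_stage=stage):.2f} GB per GPU")
```

For a 600-million-parameter model the model states are small relative to the 40 GB on each A100 in a p4d.24xlarge, so in practice most of the memory budget goes to activations for long audio sequences and larger batch sizes.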

Model Deployment: Ensuring Scalability and Efficiency

Deploying the model means balancing latency, cost, security, observability, and elasticity. By building on AWS services, we create a serving layer that scales with demand and manages resources efficiently.

Exposing the Model

To streamline user access, we expose our ASR model via standard APIs, utilizing tools like FastAPI for creating endpoints that interact with the model while maintaining security and observability standards.
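A minimal sketch of such an endpoint is shown below. The route names, the `create_app` factory, and the injected `transcribe` callable are all hypothetical choices made for illustration; the actual service described in this post may be structured differently. Authentication and tracing are omitted for brevity, and file uploads in FastAPI require the `python-multipart` package.

```python
import time
from typing import Callable, Optional, Tuple


def build_response(text: str, audio_seconds: float, elapsed_seconds: float) -> dict:
    """Shape the JSON payload, including a real-time factor for monitoring."""
    rtf: Optional[float] = elapsed_seconds / audio_seconds if audio_seconds > 0 else None
    return {"text": text, "audio_seconds": audio_seconds, "real_time_factor": rtf}


def create_app(transcribe: Callable[[bytes], Tuple[str, float]]):
    """Build a FastAPI app around a callable that maps audio bytes to
    (transcript, audio duration in seconds). Hypothetical factory."""
    # Imported lazily so the helper above can be used without FastAPI installed.
    from fastapi import FastAPI, File, HTTPException, UploadFile

    app = FastAPI(title="asr-service")

    @app.get("/healthz")
    def healthz() -> dict:
        # Lightweight probe target for Kubernetes liveness/readiness checks.
        return {"status": "ok"}

    @app.post("/v1/transcribe")
    def transcribe_endpoint(file: UploadFile = File(...)) -> dict:
        # A plain `def` endpoint runs in FastAPI's thread pool, so the
        # blocking model call does not stall the event loop.
        payload = file.file.read()
        if not payload:
            raise HTTPException(status_code=400, detail="empty audio upload")
        started = time.perf_counter()
        text, audio_seconds = transcribe(payload)
        return build_response(text, audio_seconds, time.perf_counter() - started)

    return app
```

Injecting the `transcribe` callable keeps the HTTP layer independent of the model runtime, which makes the endpoint easy to test with a stub before wiring in the fine-tuned model.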

AI Gateway and Observability

Integrating AI Gateway and Langfuse into our EKS infrastructure streamlines orchestration and monitoring, allowing for end-to-end visibility across model serving and user interactions.

Conclusion and Next Steps

In this post, we showcased how AWS provides a comprehensive, production-ready solution for fine-tuning and deploying custom ASR models. From distributed training on AWS GPU instances to scalable inference on Amazon EKS, organizations can now create domain-specific speech recognition systems that yield tangible business results.

We encourage you to experiment with the resources provided throughout this post and to adapt these solutions to your own use cases. For additional support, reach out to your AWS account team to explore potential collaborations with the AWS Generative AI Innovation Center (GenAIIC). Happy building!

Acknowledgments

We extend our gratitude to all individuals involved in this collaboration for their invaluable contributions and insights.

About the Authors

The post is authored by a dedicated team including specialists from AWS and NVIDIA, each bringing a wealth of experience in cloud architecture, applied data science, and machine learning solutions. Together, they aim to drive innovation and help organizations harness the full potential of AI technologies in their operations.
