Create Real-Time Voice Applications Using Amazon SageMaker AI and vLLM

Real-Time Speech-to-Text with Amazon SageMaker AI and vLLM: A Comprehensive Guide to Bidirectional Streaming

Revolutionizing Real-Time Speech Transcription with Amazon SageMaker AI and vLLM

Voice agents, live captioning services, and contact center analytics all depend on efficient real-time speech-to-text. Traditional request-response systems often fall short here: they introduce latency that disrupts the fluid, conversational experience these applications need.

As of November 2025, Amazon SageMaker AI supports bidirectional streaming, which lets developers stream data to and from an endpoint continuously. This post explores how to combine Amazon SageMaker AI with the vLLM Realtime API for real-time audio transcription.

Key Features for Voice AI Applications

Building a production-ready voice AI application requires several critical components working harmoniously to maintain low latency. Here’s how Amazon SageMaker AI and vLLM address these needs:

1. Real-Time Speech Models

At the heart of any voice AI system is a robust automatic speech recognition (ASR) model. vLLM supports real-time audio processing and produces transcription tokens incrementally. The vLLM Realtime API is designed to keep GPU overhead low, which reduces per-token latency, and because vLLM is open source you keep full control over model configuration.

2. Bidirectional Streaming Infrastructure

Conventional APIs require the complete audio file to be sent before processing starts. Voice AI applications instead need a persistent full-duplex connection that carries audio in one direction and transcriptions in the other at the same time. SageMaker AI supports this with native HTTP/2 bidirectional streaming on the client side and bridges it to the streaming interface exposed by the model container.
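
To make the full-duplex idea concrete, here is a minimal sketch in which sending and receiving run as concurrent tasks. The in-memory queues and the fake_server coroutine are hypothetical stand-ins for the real connection and endpoint; nothing here calls a SageMaker AI or vLLM API.

```python
import asyncio


async def fake_server(uplink: asyncio.Queue, downlink: asyncio.Queue) -> None:
    """Hypothetical stand-in for the endpoint: emits one token per audio chunk."""
    index = 0
    while True:
        chunk = await uplink.get()
        if chunk is None:  # sentinel: the client has finished sending audio
            break
        await downlink.put(f"token{index}")
        index += 1
    await downlink.put(None)  # sentinel: no more transcription tokens


async def send_audio(chunks, uplink: asyncio.Queue) -> None:
    """Push audio chunks upstream without waiting for any response."""
    for chunk in chunks:
        await uplink.put(chunk)
        await asyncio.sleep(0)  # yield so the receiver can run in between
    await uplink.put(None)


async def receive_tokens(downlink: asyncio.Queue) -> list:
    """Collect transcription tokens as they arrive."""
    tokens = []
    while True:
        token = await downlink.get()
        if token is None:
            break
        tokens.append(token)
    return tokens


async def stream(chunks) -> list:
    uplink, downlink = asyncio.Queue(), asyncio.Queue()
    server = asyncio.create_task(fake_server(uplink, downlink))
    # Sender and receiver run concurrently: this is the full-duplex part.
    _, tokens = await asyncio.gather(
        send_audio(chunks, uplink), receive_tokens(downlink)
    )
    await server
    return tokens
```

Calling asyncio.run(stream(chunks)) with a list of byte chunks returns the tokens the stand-in server produced. In a request-response design, by contrast, receive_tokens could not start until send_audio had finished.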

3. Audio Processing and Encoding

Audio comes in varying formats that must be standardized before reaching the ASR model. The client-side pipeline manages this conversion, while the vLLM API establishes a clear protocol for streaming audio and receiving transcription tokens.
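
As an illustration of what the client-side conversion might involve, the sketch below resamples mono audio, packs it as 16-bit PCM, and base64-encodes it for transport in a JSON message. The 16 kHz target rate, the PCM16 format, and the helper names are assumptions made for this example, not requirements stated in this post; check the model's documentation for the format it expects.

```python
import base64
import struct

TARGET_RATE = 16_000  # assumed input rate; verify against your model's requirements


def resample_linear(samples, src_rate, dst_rate=TARGET_RATE):
    """Resample mono float samples by linear interpolation (no anti-alias filter)."""
    if src_rate == dst_rate or not samples:
        return list(samples)
    out_len = int(len(samples) * dst_rate / src_rate)
    out = []
    for i in range(out_len):
        pos = i * src_rate / dst_rate
        left = int(pos)
        right = min(left + 1, len(samples) - 1)
        frac = pos - left
        out.append(samples[left] * (1.0 - frac) + samples[right] * frac)
    return out


def to_pcm16(samples):
    """Clip floats to [-1, 1] and pack them as little-endian signed 16-bit PCM."""
    ints = [int(max(-1.0, min(1.0, s)) * 32767) for s in samples]
    return struct.pack(f"<{len(ints)}h", *ints)


def encode_chunk(pcm_bytes):
    """Base64-encode raw PCM bytes so they can travel inside a JSON message."""
    return base64.b64encode(pcm_bytes).decode("ascii")
```

Linear interpolation keeps the example dependency-free. A production pipeline would more likely use a dedicated resampler with proper filtering.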

4. Connection Management

SageMaker AI keeps WebSocket connections alive with ping/pong keepalive frames and exposes endpoint metrics through Amazon CloudWatch, giving you the observability a production deployment needs.

In summary, while vLLM delivers performance-focused, open-source model serving, SageMaker AI provides managed infrastructure that supports operational readiness.

Solution Overview

By following this guide, you will set up a custom Docker container that runs a real-time speech model within a SageMaker AI endpoint. Here’s what you’ll accomplish:

  • Create a custom Docker container with bidirectional streaming capabilities.
  • Deploy the Voxtral-Mini-4B-Realtime-2602 model on SageMaker AI.
  • Develop a Python client for real-time audio streaming and transcription.
  • Implement a live microphone demo using Gradio for real-time speech transcription.

The Realtime API Protocol

The vLLM Realtime API transcribes audio incrementally while it is still being streamed. The message flow is:

  1. Establish the connection.
  2. Send a session update to select the desired model.
  3. Stream audio chunks.
  4. Receive transcription tokens in real time.

The model begins processing as soon as it has enough audio context, so transcription starts before the full recording has been sent.
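
The message flow above can be sketched as a handful of JSON builders plus a small parser. The event names used here (session.update, input_audio_buffer.append, input_audio_buffer.commit, transcription.delta, transcription.done) and their fields follow the common Realtime-API convention and should be treated as assumptions; consult the vLLM documentation for the exact schema. The helper function names are hypothetical.

```python
import base64
import json


def session_update(model: str) -> str:
    """Step 2: select the model for this session (assumed event shape)."""
    return json.dumps({"type": "session.update", "model": model})


def audio_append(pcm_bytes: bytes) -> str:
    """Step 3: wrap one raw audio chunk as a base64 payload (assumed event shape)."""
    payload = base64.b64encode(pcm_bytes).decode("ascii")
    return json.dumps({"type": "input_audio_buffer.append", "audio": payload})


def audio_commit() -> str:
    """Signal that no more audio will follow (assumed event shape)."""
    return json.dumps({"type": "input_audio_buffer.commit"})


def collect_transcript(raw_events) -> str:
    """Step 4: join incremental deltas until a done event arrives."""
    parts = []
    for raw in raw_events:
        event = json.loads(raw)
        if event.get("type") == "transcription.delta":
            parts.append(event.get("delta", ""))
        elif event.get("type") == "transcription.done":
            break
    return "".join(parts)
```

Keeping the builders free of any network code means the same messages can be sent over whichever transport the client uses.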

Deployment Steps

After creating your custom Docker container, you can deploy it to a SageMaker AI endpoint. This involves three steps: creating the model, which defines the container and its environment, creating an endpoint configuration, and creating the endpoint itself:

# Create model
voxtral_model = Model.create(
    model_name=model_name,
    ...
)

# Create and deploy the endpoint
endpoint_config = EndpointConfig.create(...)
endpoint = Endpoint.create(...)
endpoint.wait_for_status("InService")

With the endpoint live, you can communicate with it using the AWS SDK or Python client, enabling audio streaming and receiving immediate transcription.

Testing and Demonstration

To evaluate your setup, you have two testing options:

  1. File-based Client: Stream pre-recorded audio files to test the system.
  2. Live Microphone Demo: Use Gradio to capture live audio from your microphone to demonstrate real-time transcription capabilities.

Running the Client

Execute your Python client script to stream audio:

python sagemaker_bidi_client.py ./audio.wav --region us-east-1

Transcription text will appear in real time as your audio streams in.
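
The client script itself is not reproduced in this post. As a rough sketch of what its front end might look like, assuming the command-line shape shown above and a WAV input, the following parses the arguments and splits the file into fixed-duration chunks ready for streaming. The function names, the chunk_ms parameter, and the 100 ms default are hypothetical choices for this example.

```python
import argparse
import wave


def parse_args(argv=None):
    """Parse a command line of the form: <audio_path> --region <region>."""
    parser = argparse.ArgumentParser(
        description="Stream a WAV file for transcription"
    )
    parser.add_argument("audio_path", help="path to a WAV file")
    parser.add_argument(
        "--region", default="us-east-1", help="AWS Region of the endpoint"
    )
    return parser.parse_args(argv)


def iter_wav_chunks(path, chunk_ms=100):
    """Yield raw PCM data from a WAV file in pieces of roughly chunk_ms each."""
    with wave.open(path, "rb") as wav:
        frames_per_chunk = max(1, wav.getframerate() * chunk_ms // 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data
```

Small chunks keep latency low because the model can start on the first 100 ms while the rest of the file is still being read. Larger chunks reduce per-message overhead at the cost of responsiveness.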

Conclusion

In this post, we showcased how to deploy Mistral AI’s Voxtral-Mini-4B-Realtime-2602 model with Amazon SageMaker AI’s bidirectional streaming capabilities. This powerful combination enables seamless real-time transcription and a host of other promising applications from voice agents to interactive audio generation.

Next Steps

Now that you have a working solution, consider extending its capabilities by:

  • Creating additional features like transcript exports or downstream processing.
  • Tuning the endpoint for specific performance and cost needs.
  • Testing out different models supported by the vLLM Realtime API.

By leveraging these technologies, you’ll be well on your way to building cutting-edge, real-time voice applications that meet the growing demands of users today.

About the Authors

  • Christian Kamwangala: AI/ML Specialist Solutions Architect at AWS, focusing on optimizing AI solutions.
  • Vivek Gangasani: Lead GenAI Specialist Architect for SageMaker Inference, developing strategies for optimizing inference performance.
  • Lingran Xia: Software engineer focused on improving machine learning inference performance.
  • Chinmay Bapat: Engineering Manager at AWS, leading efforts in scalable infrastructure for generative AI inference.

With the right tools and knowledge, the future of voice AI applications is brighter than ever. Happy coding!
