Real-Time Speech-to-Text with Amazon SageMaker AI and vLLM: A Comprehensive Guide to Bidirectional Streaming

Revolutionizing Real-Time Speech Transcription with Amazon SageMaker AI and vLLM

Voice agents, live captioning services, and contact center analytics all depend on real-time speech-to-text. Traditional request-response APIs fall short here: the client must send a complete audio file before any transcription comes back, and the added latency disrupts the fluid experience these applications need.

Since November 2025, Amazon SageMaker AI has offered bidirectional streaming, which lets developers stream data to an endpoint continuously and receive results over the same connection. This post shows how to combine Amazon SageMaker AI with the vLLM Realtime API for real-time audio transcription.

Key Features for Voice AI Applications

A production-ready voice AI application needs several components working together to keep latency low. Here is how Amazon SageMaker AI and vLLM address each one:

1. Real-Time Speech Models

At the heart of any voice AI system is an automatic speech recognition (ASR) model. vLLM supports real-time audio processing and produces transcription tokens incrementally as audio arrives. The vLLM Realtime API is designed to minimize GPU overhead, which lowers per-token latency, and because vLLM is open source you keep control over model configuration.

2. Bidirectional Streaming Infrastructure

Conventional APIs require the complete audio file to be sent before processing starts. Voice AI applications instead need a persistent full-duplex connection that carries audio in one direction and transcription in the other at the same time. SageMaker AI supports this with native HTTP/2 bidirectional streaming on the client side, which it bridges to the WebSocket connection on the container side.
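To make the full-duplex idea concrete, here is a minimal sketch that runs a sender and a receiver concurrently with asyncio. It does not talk to SageMaker AI: the in-memory queues and the `fake_endpoint` coroutine are stand-ins, and every name in it is hypothetical.

```python
import asyncio


async def fake_endpoint(inbound: asyncio.Queue, outbound: asyncio.Queue) -> None:
    """Stand-in for the server side: emits one 'token' per audio chunk received."""
    while True:
        chunk = await inbound.get()
        if chunk is None:  # end-of-stream marker from the sender
            await outbound.put(None)
            return
        await outbound.put(f"token-for-{len(chunk)}-bytes")


async def send_audio(chunks, inbound: asyncio.Queue) -> None:
    """Push audio chunks without waiting for any response."""
    for chunk in chunks:
        await inbound.put(chunk)
        await asyncio.sleep(0)  # yield so the receiver can run between sends
    await inbound.put(None)


async def receive_tokens(outbound: asyncio.Queue) -> list:
    """Collect tokens as they arrive, independently of the sender."""
    tokens = []
    while (token := await outbound.get()) is not None:
        tokens.append(token)
    return tokens


async def run_duplex(chunks) -> list:
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    server = asyncio.create_task(fake_endpoint(inbound, outbound))
    # Sender and receiver run at the same time over the same "connection".
    _, tokens = await asyncio.gather(
        send_audio(chunks, inbound), receive_tokens(outbound)
    )
    await server
    return tokens


tokens = asyncio.run(run_duplex([b"\x00" * 320, b"\x00" * 640]))
```

The point of the design is that neither side blocks the other: the sender never waits for a transcription before pushing the next chunk, which is what a request-response API cannot offer.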

3. Audio Processing and Encoding

Audio comes in varying formats that must be standardized before reaching the ASR model. The client-side pipeline manages this conversion, while the vLLM API establishes a clear protocol for streaming audio and receiving transcription tokens.
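As an illustration of the client-side conversion, the sketch below turns floating-point samples into 16-bit little-endian PCM and base64-encodes fixed-duration chunks. The target format (mono, 16 kHz, PCM16, base64) is an assumption that this post does not state; check the requirements of the model you deploy. The function names are hypothetical.

```python
import base64
import struct

TARGET_SAMPLE_RATE = 16_000  # assumed sample rate; verify against your model


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian 16-bit PCM bytes."""
    clipped = [max(-1.0, min(1.0, s)) for s in samples]  # guard against overflow
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack(f"<{len(ints)}h", *ints)


def encode_chunks(pcm: bytes, chunk_ms: int = 100, sample_rate: int = TARGET_SAMPLE_RATE):
    """Split mono PCM16 bytes into fixed-duration chunks and base64-encode each."""
    bytes_per_chunk = sample_rate * chunk_ms // 1000 * 2  # 2 bytes per sample
    for start in range(0, len(pcm), bytes_per_chunk):
        yield base64.b64encode(pcm[start:start + bytes_per_chunk]).decode("ascii")


# One second of silence split into 100 ms chunks.
silence = float_to_pcm16([0.0] * TARGET_SAMPLE_RATE)
encoded = list(encode_chunks(silence))
```

Smaller chunks reduce the delay before the first token but increase message overhead; 100 ms is only a starting point, not a recommendation from the source.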

4. Connection Management

SageMaker AI maintains WebSocket connections with ping/pong keepalive frames and offers endpoint monitoring via Amazon CloudWatch, providing essential production observability.

In summary, while vLLM delivers performance-focused, open-source model serving, SageMaker AI provides managed infrastructure that supports operational readiness.

Solution Overview

By following this guide, you will set up a custom Docker container that runs a real-time speech model within a SageMaker AI endpoint. Here’s what you’ll accomplish:

Create a custom Docker container with bidirectional streaming capabilities.
Deploy the Voxtral-Mini-4B-Realtime-2602 model on SageMaker AI.
Develop a Python client for real-time audio streaming and transcription.
Implement a live microphone demo using Gradio for real-time speech transcription.

The Realtime API Protocol

The vLLM Realtime API transcribes audio incrementally while it is still being streamed. The message flow is:

Connection establishment.
Sending session updates to select the desired model.
Streaming audio chunks.
Receiving transcription tokens in real-time.

The model begins processing as soon as it has enough audio context, ensuring a dynamic and responsive interaction.
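The message flow above can be sketched as a set of JSON messages plus a small handler for incoming events. The event names used here (`session.update`, `input_audio_buffer.append`, `input_audio_buffer.commit`, `transcription.delta`, `transcription.done`) and their fields reflect one reading of the vLLM Realtime API and should be treated as assumptions; verify them against the vLLM documentation for the version you run. The helper and class names are hypothetical.

```python
import json


def session_update(model: str) -> str:
    """Message that selects the model for the session (assumed event name)."""
    return json.dumps({"type": "session.update", "model": model})


def audio_append(b64_audio: str) -> str:
    """Message carrying one base64-encoded audio chunk (assumed event name)."""
    return json.dumps({"type": "input_audio_buffer.append", "audio": b64_audio})


def audio_commit(final: bool = False) -> str:
    """Message that commits buffered audio; final=True marks end of input (assumed)."""
    message = {"type": "input_audio_buffer.commit"}
    if final:
        message["final"] = True
    return json.dumps(message)


class TranscriptCollector:
    """Accumulates incremental transcription events into one string."""

    def __init__(self) -> None:
        self.parts = []
        self.done = False

    def handle(self, raw: str) -> None:
        event = json.loads(raw)
        kind = event.get("type")
        if kind == "transcription.delta":
            self.parts.append(event.get("delta", ""))
        elif kind == "transcription.done":
            self.done = True
        # Other event types are ignored in this sketch.

    @property
    def text(self) -> str:
        return "".join(self.parts)


collector = TranscriptCollector()
collector.handle(json.dumps({"type": "transcription.delta", "delta": "Hello"}))
collector.handle(json.dumps({"type": "transcription.delta", "delta": " world"}))
```

Keeping message construction separate from transport makes the same helpers usable over a direct WebSocket connection to vLLM or over a SageMaker AI bidirectional stream.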

Deployment Steps

After creating your custom Docker container, you can deploy it to a SageMaker AI endpoint. This involves defining the model environment, setting up the container, and deploying it as follows:

# Create model
voxtral_model = Model.create(
    model_name=model_name,
    ...
)

# Create and deploy the endpoint
endpoint_config = EndpointConfig.create(...)
endpoint = Endpoint.create(...)
endpoint.wait_for_status("InService")

Once the endpoint is in service, you can call it through the AWS SDK or a Python client to stream audio and receive transcription as it is produced.

Testing and Demonstration

You can test the setup in two ways:

File-based Client: Stream pre-recorded audio files to test the system.
Live Microphone Demo: Use Gradio to capture live audio from your microphone to demonstrate real-time transcription capabilities.
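For the file-based option, a client typically reads the recording in small pieces rather than sending it whole. The sketch below uses the standard-library `wave` module to yield fixed-duration chunks from a WAV file; it is not the client from this post, and the function names are hypothetical. `make_silent_wav` exists only to produce test input without a real recording.

```python
import io
import wave


def iter_wav_chunks(source, chunk_ms: int = 100):
    """Yield (raw_pcm_bytes, duration_seconds) for consecutive chunks of a WAV file.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        frame_size = wav.getsampwidth() * wav.getnchannels()
        frames_per_chunk = max(1, rate * chunk_ms // 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data, (len(data) // frame_size) / rate


def make_silent_wav(seconds: float, rate: int = 16_000) -> io.BytesIO:
    """Build an in-memory mono 16-bit WAV containing silence."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
    buffer.seek(0)
    return buffer


wav_chunks = list(iter_wav_chunks(make_silent_wav(0.25)))
```

Sleeping for each chunk's duration between sends paces a pre-recorded file at roughly real-time speed, which makes the test resemble live microphone input.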

Running the Client

Execute your Python client script to stream audio:

python sagemaker_bidi_client.py ./audio.wav --region us-east-1

Transcription text will appear in real time as your audio streams in.
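The command above implies a script that takes an audio path and a `--region` option. The source of `sagemaker_bidi_client.py` is not shown here, so the following is only a guess at how such an argument parser might look; the defaults and help strings are assumptions.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Argument parser mirroring the command line shown above (hypothetical)."""
    parser = argparse.ArgumentParser(
        description="Stream an audio file to a SageMaker AI endpoint"
    )
    parser.add_argument("audio_path", help="path to the audio file to stream")
    parser.add_argument(
        "--region", default="us-east-1", help="AWS Region of the endpoint"
    )
    return parser


# Parse the same arguments used in the example command.
args = build_parser().parse_args(["./audio.wav", "--region", "us-east-1"])
```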

Conclusion

In this post, we showed how to deploy Mistral AI’s Voxtral-Mini-4B-Realtime-2602 model with Amazon SageMaker AI’s bidirectional streaming capabilities. The combination enables real-time transcription and supports related applications, from voice agents to interactive audio generation.

Next Steps

Now that you have a working solution, consider extending its capabilities by:

Creating additional features like transcript exports or downstream processing.
Tuning the endpoint for specific performance and cost needs.
Testing out different models supported by the vLLM Realtime API.

With these pieces in place, you have a foundation for building real-time voice applications.

About the Authors

Christian Kamwangala: AI/ML Specialist Solutions Architect at AWS, focusing on optimizing AI solutions.
Vivek Gangasani: Lead GenAI Specialist Architect for SageMaker Inference, developing strategies for optimizing inference performance.
Lingran Xia: Software engineer focused on improving machine learning inference performance.
Chinmay Bapat: Engineering Manager at AWS, leading efforts in scalable infrastructure for generative AI inference.

With the right tools and knowledge, the future of voice AI applications is brighter than ever. Happy coding!
