Real-Time Speech-to-Text with Amazon SageMaker AI and vLLM: A Comprehensive Guide to Bidirectional Streaming
Revolutionizing Real-Time Speech Transcription with Amazon SageMaker and vLLM
Voice agents, live captioning services, and contact center analytics all depend on efficient real-time speech-to-text. Traditional request-response systems fall short here: the client has to send the audio before any text comes back, and the resulting latency disrupts the fluid experience these applications need.
Since November 2025, Amazon SageMaker AI has supported bidirectional streaming, which lets developers stream data to an endpoint continuously. This post shows how to combine Amazon SageMaker AI and the vLLM Realtime API for real-time audio transcription.
Key Features for Voice AI Applications
A production-ready voice AI application needs several components that work together while keeping latency low. Here is how Amazon SageMaker AI and vLLM address each of them:
1. Real-Time Speech Models
At the heart of any voice AI system is an Automatic Speech Recognition (ASR) model. vLLM supports real-time audio processing and produces transcription tokens incrementally. The vLLM Realtime API is designed to keep GPU overhead low, which helps reduce per-token latency, and because vLLM is open source you keep control over model configuration.
2. Bidirectional Streaming Infrastructure
Conventional APIs require the complete audio file to be sent before processing starts. Voice AI applications instead benefit from a persistent full-duplex connection that carries audio in one direction and transcriptions in the other at the same time. SageMaker AI supports this with native HTTP/2 bidirectional streaming, bridging the client-facing stream to the interface exposed on the server side.
3. Audio Processing and Encoding
Audio comes in varying formats that must be standardized before reaching the ASR model. The client-side pipeline manages this conversion, while the vLLM API establishes a clear protocol for streaming audio and receiving transcription tokens.
4. Connection Management
SageMaker AI maintains WebSocket connections with ping/pong keepalive frames and offers endpoint monitoring via Amazon CloudWatch, providing essential production observability.
In summary, while vLLM delivers performance-focused, open-source model serving, SageMaker AI provides managed infrastructure that supports operational readiness.
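The client-side conversion described under Audio Processing and Encoding can be sketched as follows. This is a minimal illustration, not the post's actual pipeline: it assumes the model expects little-endian 16-bit PCM and that audio travels as base64 text inside JSON messages. Both are assumptions to verify against the model and API documentation, and the helper names `float_to_pcm16` and `encode_chunk` are hypothetical.

```python
import base64
import struct


def float_to_pcm16(samples):
    """Convert floats in [-1.0, 1.0] to little-endian signed 16-bit PCM bytes.

    Assumption: the ASR model expects 16-bit PCM; check the model card.
    """
    clipped = [max(-1.0, min(1.0, s)) for s in samples]
    ints = [int(round(s * 32767)) for s in clipped]
    return struct.pack("<%dh" % len(ints), *ints)


def encode_chunk(pcm_bytes):
    """Base64-encode raw PCM bytes so they can be embedded in a JSON message."""
    return base64.b64encode(pcm_bytes).decode("ascii")


# Out-of-range samples (2.0 here) are clipped rather than wrapped around.
pcm = float_to_pcm16([0.0, 0.25, -1.0, 2.0])
payload = encode_chunk(pcm)
```

Clipping before scaling avoids integer overflow in `struct.pack`, which would otherwise raise an error on samples outside the valid range.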
Solution Overview
By following this guide, you will set up a custom Docker container that runs a real-time speech model within a SageMaker AI endpoint. Here’s what you’ll accomplish:
- Create a custom Docker container with bidirectional streaming capabilities.
- Deploy the Voxtral-Mini-4B-Realtime-2602 model on SageMaker AI.
- Develop a Python client for real-time audio streaming and transcription.
- Implement a live microphone demo using Gradio for real-time speech transcription.
The Realtime API Protocol
The vLLM Realtime API transcribes audio while it is still being streamed. The message flow includes:
- Connection establishment.
- Sending session updates to select the desired model.
- Streaming audio chunks.
- Receiving transcription tokens in real-time.
The model begins processing as soon as it has enough audio context, so transcription tokens start arriving before the audio stream has finished.
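The client side of this message flow might look like the sketch below. The event names (`session.update`, `input_audio_buffer.append`, `input_audio_buffer.commit`) and the final commit message are assumptions modeled on OpenAI-style realtime protocols, so confirm them against the vLLM Realtime API documentation before relying on them. The helper `build_client_messages` and the 3,200-byte chunk size (100 ms of 16 kHz, 16-bit mono audio) are illustrative choices.

```python
import base64
import json


def build_client_messages(pcm_bytes, model, chunk_bytes=3200):
    """Return the JSON messages a client would send for one utterance.

    Assumption: event names follow an OpenAI-style realtime protocol.
    """
    # 1. Select the model for this session.
    messages = [json.dumps({"type": "session.update", "model": model})]
    # 2. Stream the audio as base64-encoded chunks.
    for start in range(0, len(pcm_bytes), chunk_bytes):
        chunk = pcm_bytes[start:start + chunk_bytes]
        messages.append(json.dumps({
            "type": "input_audio_buffer.append",
            "audio": base64.b64encode(chunk).decode("ascii"),
        }))
    # 3. Signal that no more audio is coming (assumed event).
    messages.append(json.dumps({"type": "input_audio_buffer.commit"}))
    return messages


silence = bytes(8000)  # 250 ms of silent 16 kHz, 16-bit mono audio
msgs = build_client_messages(silence, "Voxtral-Mini-4B-Realtime-2602")
```

In a real client these messages would be written to the open connection one at a time while transcription events are read concurrently, rather than being collected in a list first.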
Deployment Steps
After creating your custom Docker container, you can deploy it to a SageMaker AI endpoint. This involves defining the model environment, setting up the container, and deploying it as follows:
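# Assumes Model, EndpointConfig, and Endpoint are the SageMaker resource classes
# already imported from the SageMaker Python SDK, and that model_name is defined.
# Create the model
voxtral_model = Model.create(
    model_name=model_name,
    ...
)
# Create the endpoint configuration and deploy the endpoint
endpoint_config = EndpointConfig.create(...)
endpoint = Endpoint.create(...)
# Block until the endpoint is ready to serve traffic
endpoint.wait_for_status("InService")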
With the endpoint live, you can communicate with it using the AWS SDK or a Python client, streaming audio to it and receiving transcription as it is produced.
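The key property of a bidirectional client is that sending and receiving run concurrently instead of taking turns. The sketch below shows that pattern with `asyncio`, using in-memory queues and a hypothetical `fake_endpoint` coroutine in place of the real SageMaker AI connection. It does not call any AWS API; all function names are illustrative.

```python
import asyncio


async def fake_endpoint(inbound, outbound):
    """Hypothetical stand-in for the endpoint: emits one token per audio chunk."""
    count = 0
    while True:
        chunk = await inbound.get()
        if chunk is None:  # end-of-audio sentinel
            break
        count += 1
        await outbound.put(f"token{count} ")
    await outbound.put(None)  # end-of-transcript sentinel


async def send_audio(chunks, inbound):
    """Push audio chunks without waiting for transcription results."""
    for chunk in chunks:
        await inbound.put(chunk)
        await asyncio.sleep(0)  # yield so the receiver can run between chunks
    await inbound.put(None)


async def receive_text(outbound, parts):
    """Collect transcription pieces as soon as they arrive."""
    while True:
        piece = await outbound.get()
        if piece is None:
            break
        parts.append(piece)


async def transcribe(chunks):
    """Run the sender and receiver concurrently over one simulated connection."""
    inbound, outbound = asyncio.Queue(), asyncio.Queue()
    parts = []
    await asyncio.gather(
        fake_endpoint(inbound, outbound),
        send_audio(chunks, inbound),
        receive_text(outbound, parts),
    )
    return "".join(parts)


text = asyncio.run(transcribe([b"a", b"b", b"c"]))
```

With a real endpoint, the two queues would be replaced by the input and output sides of the bidirectional stream, while the sender and receiver tasks keep the same shape.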
Testing and Demonstration
You can test your setup in two ways:
- File-based Client: Stream pre-recorded audio files to test the system.
- Live Microphone Demo: Use Gradio to capture live audio from your microphone to demonstrate real-time transcription capabilities.
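For the file-based option, the audio file has to be cut into small pieces before it is streamed. Here is a minimal sketch using Python's standard `wave` module, assuming uncompressed PCM WAV input. The helper name `iter_wav_chunks` and the 100 ms chunk duration are illustrative choices, not values taken from the original client.

```python
import io
import wave


def iter_wav_chunks(wav_file, chunk_ms=100):
    """Yield raw PCM byte chunks of roughly chunk_ms milliseconds each."""
    with wave.open(wav_file, "rb") as wav:
        frames_per_chunk = max(1, wav.getframerate() * chunk_ms // 1000)
        while True:
            data = wav.readframes(frames_per_chunk)
            if not data:
                break
            yield data


# Build a 250 ms, 16 kHz, mono, 16-bit silent WAV in memory for demonstration.
buffer = io.BytesIO()
with wave.open(buffer, "wb") as out:
    out.setnchannels(1)
    out.setsampwidth(2)
    out.setframerate(16000)
    out.writeframes(bytes(2 * 4000))  # 4000 frames of 2 bytes each
buffer.seek(0)

chunks = list(iter_wav_chunks(buffer))
```

`iter_wav_chunks` also accepts a file path such as `./audio.wav`. Smaller chunks lower the delay before the first token but increase the number of messages sent.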
Running the Client
Execute your Python client script to stream audio:
python sagemaker_bidi_client.py ./audio.wav --region us-east-1
Transcription text will appear in real time as your audio streams in.
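The command line above implies a script that takes a positional audio path and a `--region` option. A minimal sketch of that argument handling with `argparse` follows. The `us-east-1` default and the help strings are assumptions, and the real `sagemaker_bidi_client.py` may accept further options not shown here.

```python
import argparse


def parse_args(argv=None):
    """Parse the client's command line (hypothetical reconstruction)."""
    parser = argparse.ArgumentParser(
        description="Stream an audio file to a SageMaker AI endpoint."
    )
    parser.add_argument("audio_path", help="Path to the audio file to stream")
    parser.add_argument(
        "--region",
        default="us-east-1",  # assumed default, not confirmed by the source
        help="AWS Region that hosts the endpoint",
    )
    return parser.parse_args(argv)


args = parse_args(["./audio.wav", "--region", "us-east-1"])
```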
Conclusion
In this post, we showed how to deploy Mistral AI’s Voxtral-Mini-4B-Realtime-2602 model with Amazon SageMaker AI’s bidirectional streaming capabilities. The combination enables real-time transcription and supports other streaming applications, from voice agents to interactive audio generation.
Next Steps
Now that you have a working solution, consider extending its capabilities by:
- Creating additional features like transcript exports or downstream processing.
- Tuning the endpoint for specific performance and cost needs.
- Testing out different models supported by the vLLM Realtime API.
By leveraging these technologies, you’ll be well on your way to building cutting-edge, real-time voice applications that meet the growing demands of users today.
About the Authors
- Christian Kamwangala: AI/ML Specialist Solutions Architect at AWS, focusing on optimizing AI solutions.
- Vivek Gangasani: Lead GenAI Specialist Architect for SageMaker Inference, developing strategies for optimizing inference performance.
- Lingran Xia: Software engineer focused on improving machine learning inference performance.
- Chinmay Bapat: Engineering Manager at AWS, leading efforts in scalable infrastructure for generative AI inference.
With the right tools and knowledge, the future of voice AI applications is brighter than ever. Happy coding!