Building Production-Grade Real-Time Voice Agents with Stream and Amazon Bedrock
This post was co-authored with Neevash Ramdial, Technical Marketing Leader at Stream.
Creating voice agents that are responsive and feel natural to users is a challenging endeavor. It involves seamlessly orchestrating speech-to-speech models, managing low-latency audio streaming, and ensuring a smooth connection lifecycle. Additionally, maintaining a consistent experience across web, mobile, and desktop platforms adds another layer of complexity.
In this post, we’ll explore how you can harness Stream’s Vision Agents open-source framework in conjunction with Amazon Bedrock and Amazon Nova 2 Sonic to build real-time voice agents that are production-ready in minutes. We’ll dive into integration details, provide code examples, and highlight advanced capabilities like function calling, automatic reconnection, and multilingual voice support.
The Challenge
Building voice-enabled AI applications requires synchronizing multiple intricate systems to function together reliably. You need to deal with the complexities of real-time audio streaming while integrating key services like speech recognition, language models, and text-to-speech (TTS). Each of these elements has distinct latency characteristics and failure modes.
Typically, a voice interaction includes:
- Capturing audio from the user’s microphone.
- Streaming it to a speech-to-text service.
- Processing the transcript through a language model.
- Generating a response.
- Converting that response back to speech.
- Delivering it to the user.
This entire process must occur within a few hundred milliseconds to maintain a natural conversation flow. Delays can disrupt this flow and lead to user frustration.
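To see why latency compounds, the turn cycle above can be sketched as sequential async stages. The functions here are stand-ins, not real STT, LLM, or TTS clients, and the `asyncio.sleep` calls simulate per-service latency; the point is that the stages run back to back, so their delays add up against the conversational budget.

```python
import asyncio

# Hypothetical stage functions standing in for real STT, LLM, and TTS services.
async def speech_to_text(audio_chunk: bytes) -> str:
    await asyncio.sleep(0.05)  # simulated network + inference latency
    return "what's the weather"

async def generate_reply(transcript: str) -> str:
    await asyncio.sleep(0.1)
    return f"Here is an answer to: {transcript}"

async def text_to_speech(reply: str) -> bytes:
    await asyncio.sleep(0.05)
    return reply.encode("utf-8")  # stand-in for synthesized audio

async def run_turn(audio_chunk: bytes) -> bytes:
    # The three stages run sequentially, so their latencies add up;
    # each must stay well under the few-hundred-millisecond budget.
    transcript = await speech_to_text(audio_chunk)
    reply = await generate_reply(transcript)
    return await text_to_speech(reply)

audio_out = asyncio.run(run_turn(b"\x00" * 320))
```

A speech-to-speech model like Nova Sonic collapses the first and last stages into one bidirectional stream, which is a large part of its latency advantage.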
Beyond the core AI functions, production voice applications also need to tackle real-world deployment issues such as unreliable network connections, browser compatibility challenges, session timeouts, and graceful degradation when services become unavailable. Much of your development time can be consumed by creating reconnection logic, managing WebRTC connections, and addressing edge cases, often leading teams to either spend months building custom solutions or opting for limited off-the-shelf products.
Vision Agents abstracts away this infrastructure complexity while retaining the flexibility to customize the AI experience.
Solution Overview
The solution comprises three key components:
- Amazon Nova 2 Sonic: This speech-to-speech foundation model, available through Amazon Bedrock, handles bidirectional audio streaming and function calling, avoiding the need for separate STT and TTS services.
- Stream’s Vision Agents: An open-source Python framework designed for real-time voice and video AI agents. It provides a plugin-based architecture with over 25 integrations and production deployment tooling, and lets you use Stream’s global edge network or integrate the real-time communication provider of your choice.
- Stream’s Edge Network: A globally distributed edge network offering sub-500 ms join times and sub-30 ms audio latency. This serves as the real-time transport layer between clients and your agent backend.
Together, these components form a cohesive stack that enables high-quality, real-time interactions for voice agents. Stream manages real-time media transport and client experiences, Amazon Nova 2 Sonic brings in AI intelligence, and Vision Agents provides the integration middleware.
Architecture Overview
The system is structured around a clear separation of concerns:
- Customer AWS account: Handles business logic, orchestration, and Amazon Bedrock integration for accessing models.
- Stream AWS account: Manages the global WebRTC/SFU media plane, signaling, and the Vision Agent runtime.
When a user interacts with the system, their audio is captured and transmitted securely to the nearest Stream SFU, which handles NAT traversal and bandwidth estimation. Then, the audio is processed by the Vision Agent worker, which interacts with Amazon Nova 2 Sonic in the customer AWS account.
The audio flows bidirectionally:
- Incoming user speech is decoded and streamed to Amazon Nova Sonic.
- Nova Sonic responds with audio frames, which are then sent back to the user using RTP packets through the WebRTC session.
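A minimal sketch of this bidirectional pumping, using `asyncio.Queue` objects as stand-ins for the WebRTC track on the user side and the model's audio stream on the Nova Sonic side (the real Vision Agents internals differ; this only illustrates the two concurrent directions):

```python
import asyncio

# Each pump moves frames from a source stream to a sink until a None
# sentinel signals that the stream has closed.
async def pump(source: asyncio.Queue, sink: asyncio.Queue, label: str) -> int:
    frames = 0
    while True:
        frame = await source.get()
        if frame is None:  # sentinel: stream closed
            return frames
        await sink.put((label, frame))
        frames += 1

async def session():
    mic, model_in = asyncio.Queue(), asyncio.Queue()
    model_out, speaker = asyncio.Queue(), asyncio.Queue()
    # Feed a few fake 20 ms PCM frames in each direction, then close.
    for q in (mic, model_out):
        for _ in range(3):
            q.put_nowait(b"\x00" * 640)
        q.put_nowait(None)
    uplink = pump(mic, model_in, "to-model")        # user speech -> Nova Sonic
    downlink = pump(model_out, speaker, "to-user")  # Nova Sonic audio -> RTP
    # Both directions run concurrently, which is what keeps the
    # conversation full-duplex rather than strictly turn-based.
    return await asyncio.gather(uplink, downlink)

sent, received = asyncio.run(session())
```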
End-to-End Media Flow
- User Joins: The application utilizes Stream’s audio client SDK to capture audio and join a call.
- SFU Termination: A regional SFU node handles the WebRTC connection.
- Vision Agent Processing: A Vision Agent worker manages the session, decoding audio and streaming it to Amazon Nova Sonic.
- Response Handling: The system processes response frames and delivers audio back to the user with low latency.
This architecture allows developers to focus on AI capabilities and user experience while Vision Agents handles the underlying infrastructure management.
Getting Started
Prerequisites
- AWS Credentials: Configure through environment variables or IAM roles; prefer temporary credentials over long-term access keys.
- Stream Account: Obtain an Audio API key and secret.
- Python Installation: Use Python 3.12 or later.
- uv Package Manager: Install with `pip install uv`.
- Vision Agents: Install with `uv add vision-agents`.
Step 1: Create a New Project
```shell
mkdir voice-agent
cd voice-agent
uv init
uv add "vision-agents[getstream,aws]"
```
Step 2: Configure Environment Variables
Create a `.env` file to manage configuration without hardcoding sensitive data.

```
# Stream API credentials
STREAM_API_KEY=test/getstream/api_key
STREAM_API_SECRET=test/getstream/api_secret

# AWS credentials
AWS_PROFILE=your_aws_profile_name
AWS_REGION=us-east-1
```
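Before starting the agent, it can be worth verifying that these variables actually loaded. The helper below is illustrative, not part of Vision Agents; it simply reports which required settings are unset so the process can fail fast with a clear message instead of erroring mid-call.

```python
import os

# Names match the .env entries above. check_env is a hypothetical helper
# that returns the subset of required variables that are unset or empty.
REQUIRED = ("STREAM_API_KEY", "STREAM_API_SECRET", "AWS_REGION")

def check_env(required=REQUIRED):
    return [name for name in required if not os.environ.get(name)]

missing = check_env()
# e.g. raise SystemExit(f"Missing: {missing}") in main() before joining calls
```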
Step 3: Build Your First Voice Agent
Create a main.py file with the following code:
```python
import asyncio

from dotenv import load_dotenv
from vision_agents.core import Agent, User, Runner
from vision_agents.core.agents import AgentLauncher
from vision_agents.plugins import aws, getstream

load_dotenv()


async def create_agent(**kwargs) -> Agent:
    agent = Agent(
        edge=getstream.Edge(),
        agent_user=User(name="Helpful Assistant", id="agent"),
        instructions="You are a helpful voice assistant. Be concise and friendly.",
        llm=aws.Realtime(
            model="amazon.nova-2-sonic-v1:0",
            region_name="us-east-1",
            voice_id="matthew",
        ),
    )
    return agent


async def join_call(agent: Agent, call_type: str, call_id: str, **kwargs) -> None:
    call = await agent.create_call(call_type, call_id)
    async with agent.join(call):
        await asyncio.sleep(2)
        await agent.llm.simple_response(
            text="Greet the user warmly and ask how you can help."
        )
        await agent.finish()


if __name__ == "__main__":
    Runner(AgentLauncher(create_agent=create_agent, join_call=join_call)).cli()
```
Step 4: Run the Voice Agent
Execute your agent with:
```shell
uv run main.py run
```
In just a few lines of code, you’ve created a fully functional, real-time voice agent using Amazon Nova Sonic through Stream’s client SDK.
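From here, function calling is a natural next step. The exact registration API in Vision Agents is not shown above, so the sketch below only illustrates the general shape: a tool definition following Amazon Bedrock's toolSpec convention, plus a hypothetical local handler that is dispatched when the model emits a tool-use event. Both `get_weather` and `handle_tool_use` are illustrative names, not framework APIs.

```python
import json

# Hypothetical weather lookup exposed to the model as a tool.
def get_weather(city: str) -> dict:
    # Stand-in for a real weather API call.
    return {"city": city, "forecast": "sunny", "temp_c": 21}

# Tool definition in the Bedrock toolSpec shape: a name, a description the
# model uses to decide when to call it, and a JSON Schema for its input.
weather_tool_spec = {
    "toolSpec": {
        "name": "get_weather",
        "description": "Look up the current weather for a city.",
        "inputSchema": {
            "json": json.dumps({
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            })
        },
    }
}

# When the model requests a tool, route the call to the local function
# and return the result for the model to speak back to the user.
def handle_tool_use(name: str, arguments: dict) -> dict:
    handlers = {"get_weather": get_weather}
    return handlers[name](**arguments)
```

The same dispatch pattern extends to order lookups, scheduling, or any backend action you want the voice agent to perform mid-conversation.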
Use Cases
Now that the technical groundwork is laid, let’s explore meaningful use cases across various industries:
1. Voice Interfaces for No-Screen Environments
In scenarios where screens are impractical—like driving or field service—voice becomes the main interface. The seamless conversation flow offered by Vision Agents and Nova 2 Sonic allows users to issue commands and receive responses naturally without any screen interaction.
2. High-Volume Inbound Phone Support
By deploying voice agents capable of handling large volumes of inbound calls, organizations can reduce queue times and efficiently manage repetitive requests. This allows human agents to focus on more complex issues.
Conclusion
This post provided an in-depth look at how to build real-time voice agents using Stream’s Vision Agents framework and Amazon Bedrock with Amazon Nova 2 Sonic. We discussed architecture, the bidirectional streaming protocol, reconnection handling, function calling, and multilingual support.
Stream’s low-latency edge network combined with Amazon Nova Sonic’s speech capabilities creates a robust foundation for developing voice AI applications. With Vision Agents handling the intricate orchestration, developers can focus their efforts on enhancing user experiences with unique logic.
If you’re eager to explore further, consider extending your agent with custom functions for specific applications. Visit the Vision Agents repository for examples, plugin docs, and community support, and don’t miss the AWS plugin documentation for deeper integration insights.
About the Authors
Manasi Bhutada: An ISV Solutions Architect at AWS with a focus on delivering scalable solutions.
Jagdeep Singh Soni: A Senior AI/ML Solutions Architect specializing in generative AI and Amazon Bedrock solutions.
Neevash Ramdial: A technical marketing leader at Stream, passionate about enabling developers to build responsive AI agents.
Start building your own voice solutions today!