Leveraging AWS and Pipecat to Build Intelligent Voice Agents: A Comprehensive Guide
This post is a collaboration between AWS and Pipecat.
Deploying intelligent voice agents that maintain natural, human-like conversations can be a complex task. It requires real-time streaming capabilities across web, mobile, and phone channels while navigating heavy traffic and unreliable network conditions. Even minor delays can disrupt conversational flow, causing users to perceive the agent as unresponsive or unreliable. In scenarios such as customer support, virtual assistants, and outbound campaigns, a natural conversational experience is crucial.
In this series of posts, we will explore how streaming architectures can address these challenges using Pipecat voice agents on the Amazon Bedrock AgentCore Runtime.
Benefits of AgentCore Runtime for Voice Agents
Creating real-time voice agents poses unique challenges: low-latency streaming, security through strict isolation, and dynamic scalability to handle unpredictable conversation volumes. Without a well-designed architecture, you may encounter issues like audio jitter, scalability constraints, inflated costs from over-provisioning, and increased complexity.
Amazon Bedrock AgentCore Runtime tackles these challenges by offering a secure, serverless environment designed to scale dynamic AI agents. Each conversation session operates within isolated microVMs for enhanced security, automatically scaling to handle traffic spikes and managing continuous sessions for up to eight hours—perfect for long, multi-turn interactions. Moreover, it charges only for resources actively used, minimizing costs tied to idle infrastructure.
Pipecat, a flexible framework for building real-time voice AI pipelines, runs seamlessly on AgentCore Runtime with minimal setup. You can encapsulate your Pipecat voice pipeline as a container and deploy it directly to AgentCore Runtime, leveraging its bidirectional streaming for real-time audio and built-in observability to trace agent reasoning and tool interactions.
Important Note:
AgentCore Runtime requires ARM64 (Graviton) containers, so ensure your Docker images are built for the linux/arm64 platform.
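As a sketch, a minimal Dockerfile for an ARM64 Pipecat agent container might look like the following (the base image, file names, and entry point are illustrative placeholders, not a prescribed layout):

```dockerfile
# Build for ARM64 (Graviton) with buildx, e.g.:
#   docker buildx build --platform linux/arm64 -t my-voice-agent .
FROM --platform=linux/arm64 python:3.12-slim

WORKDIR /app

# Install the agent's Python dependencies (e.g. pipecat-ai)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Illustrative entry point that starts the Pipecat pipeline server
CMD ["python", "agent.py"]
```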
Streaming Architectures for Voice Agents on AgentCore Runtime
Before diving in, it’s beneficial to understand two common voice agent architectures:
- Cascaded Models: where you chain speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS) models in a pipeline.
- Speech-to-Speech Models: where a single model consumes and produces audio directly, like Amazon Nova Sonic.
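Conceptually, a cascaded pipeline chains three stages per conversational turn. Here is a minimal, framework-agnostic sketch; the stage functions are stand-ins for real streaming STT, LLM, and TTS clients, not any particular API:

```python
def cascaded_turn(audio_chunk: bytes, stt, llm, tts) -> bytes:
    """One conversational turn through a cascaded STT -> LLM -> TTS pipeline.

    stt, llm, and tts are placeholders for streaming model clients; a
    speech-to-speech model like Amazon Nova Sonic collapses all three
    stages into a single model call.
    """
    transcript = stt(audio_chunk)   # speech -> text
    reply_text = llm(transcript)    # text -> text
    return tts(reply_text)          # text -> speech

# Toy stand-ins to show the data flow through the stages
audio_out = cascaded_turn(
    b"\x00\x01",
    stt=lambda audio: "hello",
    llm=lambda text: f"you said: {text}",
    tts=lambda text: text.encode(),
)
print(audio_out)  # b'you said: hello'
```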
Latency is a crucial factor in building voice agents, directly impacting how natural and reliable conversations appear. Ideally, the end-to-end latency should be under one second to maintain a smooth interaction.
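To make the sub-second target concrete, it helps to think in terms of a per-turn latency budget across the pipeline stages. The numbers below are illustrative assumptions for a cascaded architecture, not measurements:

```python
# Illustrative per-turn latency budget (milliseconds); real values vary
# widely by model, region, and network conditions.
budget_ms = {
    "client_to_agent_network": 50,   # the first hop, the focus of this post
    "stt_final_transcript": 200,
    "llm_time_to_first_token": 350,
    "tts_first_audio_byte": 200,
    "agent_to_client_playout": 100,
}

total_ms = sum(budget_ms.values())
print(total_ms)  # 900
assert total_ms < 1000, "budget exceeds the sub-second target"
```

Note how the LLM's Time-to-First-Token dominates the budget, which is why latency-optimized models matter so much.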
To achieve low latency, consider the following bi-directional streaming paths:
- Client to Agent: Your voice agents run on a variety of devices and applications, from web browsers to mobile apps, each with distinct network conditions.
- Agent to Model: Bidirectional streaming with speech models is essential. Most speech models provide real-time WebSocket APIs that your agent runtime can utilize for audio input and text or speech output. Choosing models optimized for latency, like Amazon Nova Sonic, is vital for achieving quick Time-to-First-Token (TTFT).
- Telephony: For traditional calls, integration with a telephony provider is necessary, typically achieved via handoff and/or Session Initiation Protocol (SIP) transfer.
In this first part, we’ll focus on the Client to Agent connection, emphasizing how to minimize first-hop network latency and discussing considerations related to other voice agent architecture components.
Various Network Transport Approaches
To illustrate effective network transport approaches, we’ll consider the following methods:
| Approach | Description | Performance Consistency | Ease of Implementation | Suitable For |
|---|---|---|---|---|
| WebSockets | Connects web and mobile applications directly to agents. | Good | Simple | Prototyping and lightweight use cases. |
| WebRTC (TURN-assisted) | Direct connection leveraging TURN servers. | Excellent | Medium | Production use with low latency. |
| WebRTC (Managed) | Connect through a globally distributed infrastructure. | Excellent | Simple | Production use with global optimization. |
| Telephony | Access via traditional phone calls. | Excellent | Medium | Contact center and telephony use cases. |
Example Approach: Using WebSockets Bi-Directional Streaming
WebSockets are a great starting point: most clients support them, and AgentCore Runtime supports them natively. You can deploy Pipecat voice agents on AgentCore Runtime using persistent, bidirectional WebSocket connections for audio streaming between client devices and agent logic.
The connection flow operates in three simple steps:
- Client Requests Endpoint: A POST request is sent to an intermediary server to get a secure WebSocket connection endpoint.
- Intermediary Handles Auth: The intermediary server uses the AWS SDK to generate an AWS SigV4 pre-signed URL for authentication.
- Direct Connection Established: The client connects to the agent using the pre-signed URL, facilitating bidirectional audio streaming while bypassing the intermediary for ongoing communications.
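In practice the intermediary would use the AWS SDK for the pre-signing step. For illustration only, here is a self-contained sketch of SigV4 query-string pre-signing using just the standard library; the host, path, and service name are hypothetical placeholders, and a real implementation would also handle session tokens:

```python
import datetime
import hashlib
import hmac
import urllib.parse


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_wss_url(host, path, region, service, access_key, secret_key, expires=300):
    """Build a SigV4 pre-signed wss:// URL (illustrative sketch of the algorithm)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/{service}/aws4_request"

    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items())
    )

    # Canonical request: method, URI, query, headers, signed headers, payload hash
    canonical_request = "\n".join([
        "GET", path, query,
        f"host:{host}\n", "host",
        hashlib.sha256(b"").hexdigest(),  # empty payload for the handshake
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])

    # Derive the signing key by chaining HMACs over date, region, and service
    key = _sign(_sign(_sign(_sign(b"AWS4" + secret_key.encode(), datestamp),
                            region), service), "aws4_request")
    signature = hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return f"wss://{host}{path}?{query}&X-Amz-Signature={signature}"


url = presign_wss_url("runtime.example.amazonaws.com", "/stream",
                      "us-east-1", "bedrock-agentcore",
                      "AKIDEXAMPLE", "example-secret-key")
print(url.split("?")[0])
```

The client can then open its WebSocket directly against the returned URL; the signature in the query string authenticates the connection without the client ever holding AWS credentials.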
Example Approach: Using WebRTC
While WebSockets work well for simple deployments, WebRTC excels in performance. It uses a fast, lightweight network path, typically relying on UDP for low latency. Where UDP is unavailable, it falls back to TCP for reliability, at the cost of slightly higher latency.
Pipecat supports SmallWebRTCTransport for direct peer-to-peer connections. It operates without complex media servers, making deployment within AgentCore Runtime straightforward.
The connection flow involves:
- Signaling: The client initiates with a Session Description Protocol (SDP) offer, which is processed by the intermediary server.
- Connectivity Establishment: The optimal network path is determined using the Interactive Connectivity Establishment (ICE) protocol.
Configuring AgentCore Runtime for WebRTC Connectivity
When using WebRTC, you must configure ICE_SERVER_URLS in both your intermediary server and runtime environment. This lets the agent reach TURN servers over UDP, so a media path can be established even across restrictive networks.
Using Managed WebRTC
Managed WebRTC providers can offer TURN servers and globally distributed media servers to simplify deployment while enhancing performance. Consider leveraging these offerings for production-level voice agents.
Conclusion
The Amazon Bedrock AgentCore Runtime delivers a secure, serverless infrastructure to reliably scale voice agents. We’ve highlighted how low latency is essential for intuitive conversations, examining crucial transport modes such as WebSockets, TURN-assisted WebRTC, managed WebRTC, and telephony integrations.
Start simple with WebSockets for quick prototyping, then consider moving to WebRTC with AgentCore in VPC mode, or leverage managed providers for larger production deployments.
In the upcoming Part 2 of this series, we’ll further explore streaming strategies for agent-to-model communication and other factors affecting end-to-end latency.
Get hands-on with the Pipecat on AgentCore code samples today and determine the best transport layer for your use case!
Additional Resources
About the Authors
Kwindla Hultman Kramer is the Co-founder and CEO at Daily, pioneering low-latency real-time voice, video, and multimodal AI infrastructure.
Paul Kompfner is a Member of Technical Staff at Daily and an expert in streaming infrastructure and voice-based agentic systems.
Kosti Vasilakakis is a Principal PM at AWS, deeply involved in the design and development of Bedrock AgentCore services.
Lana Zhang is a Senior Solutions Architect at AWS, specializing in AI and generative AI applications.
Sundar Raghavan is a Solutions Architect at AWS, focusing on developing integrations with AI agent frameworks.
Daniel Wirjo is a Solutions Architect at AWS, known for his collaborations with startups to drive innovation on AWS.
This blog post serves as a guide for deploying intelligent voice agents efficiently and will continue to evolve with new insights and techniques. Stay tuned for the next installment!