Leveraging AWS and Pipecat to Build Intelligent Voice Agents: A Comprehensive Guide
This post is a collaboration between AWS and Pipecat.
Deploying intelligent voice agents that maintain natural, human-like conversations can be a complex task. It requires real-time streaming capabilities across web, mobile, and phone channels while navigating heavy traffic and unreliable network conditions. Even minor delays can disrupt conversational flow, causing users to perceive the agent as unresponsive or unreliable. In scenarios such as customer support, virtual assistants, and outbound campaigns, a natural conversational experience is crucial.
In this series of posts, we will explore how streaming architectures can address these challenges using Pipecat voice agents on the Amazon Bedrock AgentCore Runtime.
Benefits of AgentCore Runtime for Voice Agents
Creating real-time voice agents poses unique challenges: low-latency streaming, security through strict isolation, and dynamic scalability to handle unpredictable conversation volumes. Without a well-designed architecture, you may encounter issues like audio jitter, scalability constraints, inflated costs from over-provisioning, and increased complexity.
Amazon Bedrock AgentCore Runtime tackles these challenges by offering a secure, serverless environment designed to scale dynamic AI agents. Each conversation session operates within isolated microVMs for enhanced security, automatically scaling to handle traffic spikes and managing continuous sessions for up to eight hours—perfect for long, multi-turn interactions. Moreover, it charges only for resources actively used, minimizing costs tied to idle infrastructure.
Pipecat, a flexible framework for building real-time voice AI pipelines, runs seamlessly on AgentCore Runtime with minimal setup. You can encapsulate your Pipecat voice pipeline as a container and deploy it directly to AgentCore Runtime, leveraging its bidirectional streaming for real-time audio and built-in observability to trace agent reasoning and tool interactions.
Important Note:
AgentCore Runtime requires ARM64 (Graviton) containers, so ensure your Docker images are built for the linux/arm64 platform.
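As a sketch, a minimal Dockerfile for an ARM64 Pipecat agent container might look like the following (the base image, file names, and entry point are illustrative placeholders, not a prescribed layout):

```dockerfile
# Build for ARM64 (Graviton) with buildx, e.g.:
#   docker buildx build --platform linux/arm64 -t my-voice-agent .
FROM --platform=linux/arm64 python:3.12-slim

WORKDIR /app

# Install the agent's Python dependencies (e.g. pipecat-ai)
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY . .

# Illustrative entry point that starts the Pipecat pipeline server
CMD ["python", "agent.py"]
```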
Streaming Architectures for Voice Agents on AgentCore Runtime
Before diving in, it’s beneficial to understand two common voice agent architectures:
- Cascaded Models: where you chain speech-to-text (STT), a large language model (LLM), and text-to-speech (TTS) models in a pipeline.
- Speech-to-Speech Models: where a single model consumes and produces audio directly, like Amazon Nova Sonic.
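Conceptually, a cascaded pipeline chains three stages per conversational turn. Here is a minimal, framework-agnostic sketch; the stage functions are stand-ins for real streaming STT, LLM, and TTS clients, not any particular API:

```python
def cascaded_turn(audio_chunk: bytes, stt, llm, tts) -> bytes:
    """One conversational turn through a cascaded STT -> LLM -> TTS pipeline.

    stt, llm, and tts are placeholders for streaming model clients; a
    speech-to-speech model like Amazon Nova Sonic collapses all three
    stages into a single model call.
    """
    transcript = stt(audio_chunk)   # speech -> text
    reply_text = llm(transcript)    # text -> text
    return tts(reply_text)          # text -> speech

# Toy stand-ins to show the data flow through the stages
audio_out = cascaded_turn(
    b"\x00\x01",
    stt=lambda audio: "hello",
    llm=lambda text: f"you said: {text}",
    tts=lambda text: text.encode(),
)
print(audio_out)  # b'you said: hello'
```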
Latency is a crucial factor in building voice agents, directly impacting how natural and reliable conversations appear. Ideally, the end-to-end latency should be under one second to maintain a smooth interaction.
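To make the sub-second target concrete, it helps to think in terms of a per-turn latency budget across the pipeline stages. The numbers below are illustrative assumptions for a cascaded architecture, not measurements:

```python
# Illustrative per-turn latency budget (milliseconds); real values vary
# widely by model, region, and network conditions.
budget_ms = {
    "client_to_agent_network": 50,   # the first hop, the focus of this post
    "stt_final_transcript": 200,
    "llm_time_to_first_token": 350,
    "tts_first_audio_byte": 200,
    "agent_to_client_playout": 100,
}

total_ms = sum(budget_ms.values())
print(total_ms)  # 900
assert total_ms < 1000, "budget exceeds the sub-second target"
```

Note how the LLM's Time-to-First-Token dominates the budget, which is why latency-optimized models matter so much.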
To achieve low latency, consider the following bi-directional streaming paths:
- Client to Agent: Your voice agents run on a variety of devices and applications, from web browsers to mobile apps, each with distinct network conditions.
- Agent to Model: Bidirectional streaming with speech models is essential. Most speech models provide real-time WebSocket APIs that your agent runtime can utilize for audio input and text or speech output. Choosing models optimized for latency, like Amazon Nova Sonic, is vital for achieving quick Time-to-First-Token (TTFT).
- Telephony: For traditional calls, integration with a telephony provider is necessary, typically achieved via handoff and/or Session Initiation Protocol (SIP) transfer.
In this first part, we’ll focus on the Client to Agent connection, emphasizing how to minimize first-hop network latency and discussing considerations related to other voice agent architecture components.
Various Network Transport Approaches
To illustrate effective network transport approaches, we’ll consider the following methods:
| Approach | Description | Performance Consistency | Ease of Implementation | Suitable For |
|---|---|---|---|---|
| WebSockets | Connects web and mobile applications directly to agents. | Good | Simple | Prototyping and lightweight use cases. |
| WebRTC (TURN-assisted) | Direct connection leveraging TURN servers. | Excellent | Medium | Production use with low latency. |
| WebRTC (Managed) | Connect through a globally distributed infrastructure. | Excellent | Simple | Production use with global optimization. |
| Telephony | Access via traditional phone calls. | Excellent | Medium | Contact center and telephony use cases. |
Example Approach: Using WebSockets Bi-Directional Streaming
WebSockets are a great starting point: most clients support them, and AgentCore Runtime supports them natively. You can deploy Pipecat voice agents on AgentCore Runtime using persistent, bidirectional WebSocket connections for audio streaming between client devices and agent logic.
The connection flow operates in three simple steps:
- Client Requests Endpoint: A POST request is sent to an intermediary server to get a secure WebSocket connection endpoint.
- Intermediary Handles Auth: The intermediary server uses the AWS SDK to generate an AWS SigV4 pre-signed URL for authentication.
- Direct Connection Established: The client connects to the agent using the pre-signed URL, facilitating bidirectional audio streaming while bypassing the intermediary for ongoing communications.
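In practice the intermediary would use the AWS SDK for the pre-signing step. For illustration only, here is a self-contained sketch of SigV4 query-string pre-signing using just the standard library; the host, path, and service name are hypothetical placeholders, and a real implementation would also handle session tokens:

```python
import datetime
import hashlib
import hmac
import urllib.parse


def _sign(key: bytes, msg: str) -> bytes:
    return hmac.new(key, msg.encode(), hashlib.sha256).digest()


def presign_wss_url(host, path, region, service, access_key, secret_key, expires=300):
    """Build a SigV4 pre-signed wss:// URL (illustrative sketch of the algorithm)."""
    now = datetime.datetime.now(datetime.timezone.utc)
    amz_date = now.strftime("%Y%m%dT%H%M%SZ")
    datestamp = now.strftime("%Y%m%d")
    scope = f"{datestamp}/{region}/{service}/aws4_request"

    params = {
        "X-Amz-Algorithm": "AWS4-HMAC-SHA256",
        "X-Amz-Credential": f"{access_key}/{scope}",
        "X-Amz-Date": amz_date,
        "X-Amz-Expires": str(expires),
        "X-Amz-SignedHeaders": "host",
    }
    query = "&".join(
        f"{urllib.parse.quote(k, safe='')}={urllib.parse.quote(v, safe='')}"
        for k, v in sorted(params.items())
    )

    # Canonical request: method, URI, query, headers, signed headers, payload hash
    canonical_request = "\n".join([
        "GET", path, query,
        f"host:{host}\n", "host",
        hashlib.sha256(b"").hexdigest(),  # empty payload for the handshake
    ])
    string_to_sign = "\n".join([
        "AWS4-HMAC-SHA256", amz_date, scope,
        hashlib.sha256(canonical_request.encode()).hexdigest(),
    ])

    # Derive the signing key by chaining HMACs over date, region, and service
    key = _sign(_sign(_sign(_sign(b"AWS4" + secret_key.encode(), datestamp),
                            region), service), "aws4_request")
    signature = hmac.new(key, string_to_sign.encode(), hashlib.sha256).hexdigest()
    return f"wss://{host}{path}?{query}&X-Amz-Signature={signature}"


url = presign_wss_url("runtime.example.amazonaws.com", "/stream",
                      "us-east-1", "bedrock-agentcore",
                      "AKIDEXAMPLE", "example-secret-key")
print(url.split("?")[0])
```

The client can then open its WebSocket directly against the returned URL; the signature in the query string authenticates the connection without the client ever holding AWS credentials.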
Example Approach: Using WebRTC
While WebSockets work well for simple deployments, WebRTC excels in performance. It uses a fast, lightweight network path, typically relying on UDP for low latency. Where UDP is unavailable, it falls back to TCP for reliability, at the cost of slightly higher latency.
Pipecat supports SmallWebRTCTransport for direct peer-to-peer connections. It operates without complex media servers, making deployment within AgentCore Runtime straightforward.
The connection flow involves:
- Signaling: The client initiates with a Session Description Protocol (SDP) offer, which is processed by the intermediary server.
- Connectivity Establishment: The optimal network path is determined using the Interactive Connectivity Establishment (ICE) protocol.
Configuring AgentCore Runtime for WebRTC Connectivity
When using WebRTC, you must configure ICE_SERVER_URLS in both your intermediary server and runtime environment. This lets the agent reach TURN servers over UDP, so a media path can be established even across restrictive networks.
Using Managed WebRTC
Managed WebRTC providers can offer TURN servers and globally distributed media servers to simplify deployment while enhancing performance. Consider leveraging these offerings for production-level voice agents.
Conclusion
The Amazon Bedrock AgentCore Runtime delivers a secure, serverless infrastructure to reliably scale voice agents. We’ve highlighted how low latency is essential for intuitive conversations, examining crucial transport modes such as WebSockets, TURN-assisted WebRTC, managed WebRTC, and telephony integrations.
Start simple with WebSockets for quick prototyping, then consider moving to WebRTC with AgentCore in VPC mode, or leverage managed providers for larger production deployments.
In the upcoming Part 2 of this series, we’ll further explore streaming strategies for agent-to-model communication and other factors affecting end-to-end latency.
Get hands-on with the Pipecat on AgentCore code samples today and determine the best transport layer for your use case!
Additional Resources
About the Authors
Kwindla Hultman Kramer is the Co-founder and CEO at Daily, pioneering low-latency real-time voice, video, and multimodal AI infrastructure.
Paul Kompfner is a Member of Technical Staff at Daily and an expert in streaming infrastructure and voice-based agentic systems.
Kosti Vasilakakis is a Principal PM at AWS, deeply involved in the design and development of Bedrock AgentCore services.
Lana Zhang is a Senior Solutions Architect at AWS, specializing in AI and generative AI applications.
Sundar Raghavan is a Solutions Architect at AWS, focusing on developing integrations with AI agent frameworks.
Daniel Wirjo is a Solutions Architect at AWS, known for his collaborations with startups to drive innovation on AWS.
This blog post serves as a guide for deploying intelligent voice agents efficiently and will continue to evolve with new insights and techniques. Stay tuned for the next installment!