Announcing Amazon Polly’s New Bidirectional Streaming API: Revolutionizing Real-Time Text-to-Speech Experiences

Enhancing Conversational AI with Amazon Polly’s Bidirectional Streaming API

Building natural conversational experiences is hard, particularly when text-to-speech (TTS) has to keep pace with real-time interactions. Today, we are introducing the new Bidirectional Streaming API for Amazon Polly, which lets an application send text and receive synthesized audio at the same time.

The Need for Real-Time Text-to-Speech

Traditional TTS services operate on a request-response model that requires the complete text to be assembled before synthesis can begin. This is a significant hindrance in conversational applications powered by large language models (LLMs), where text is generated incrementally: users have to wait for the model to finish generating the complete response before they hear any synthesized audio.

Traditional Limitations

Imagine a virtual assistant powered by an LLM that takes several seconds to generate each response. With the traditional model, users must wait for three steps to finish, one after another:

The LLM finishes generating the complete response.
The TTS service synthesizes the full text.
The audio file is downloaded before playback begins.

These delays can significantly detract from user experience, particularly in applications demanding real-time interactions.

Introducing Bidirectional Streaming

The new Bidirectional Streaming API effectively addresses these pain points. With the StartSpeechSynthesisStream API, you can now:

Send Text Incrementally: Stream text to Amazon Polly as it becomes available, without waiting for a complete sentence or response.
Receive Audio Immediately: Get synthesized audio in real time as it is generated.
Control Synthesis Timing: Use configuration options to trigger synthesis of the text sent so far immediately.
True Duplex Communication: Send text and receive audio simultaneously over a single connection.

Key Components

Component	Event Direction	Purpose
TextEvent	Inbound	Send text to be synthesized
CloseStreamEvent	Inbound	Signal the end of text input
AudioEvent	Outbound	Receive synthesized audio chunks
StreamClosedEvent	Outbound	Confirmation of stream completion
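
To make the table concrete, here is a minimal sketch in Java that models the four events and checks that each direction of a stream ends with its closing event. The types `StreamProtocolSketch`, `Direction`, and `EventKind` are hypothetical and are not the SDK's classes. The ordering rule (zero or more data events followed by exactly one closing event) is an assumption drawn from the purposes listed in the table, not from the API specification.

```java
import java.util.List;

/** Illustrative model of the stream events in the table; not the SDK's own types. */
final class StreamProtocolSketch {

    enum Direction { INBOUND, OUTBOUND }

    enum EventKind {
        TEXT_EVENT(Direction.INBOUND, false),
        CLOSE_STREAM_EVENT(Direction.INBOUND, true),
        AUDIO_EVENT(Direction.OUTBOUND, false),
        STREAM_CLOSED_EVENT(Direction.OUTBOUND, true);

        final Direction direction;
        final boolean terminal; // true if the event ends its direction of the stream

        EventKind(Direction direction, boolean terminal) {
            this.direction = direction;
            this.terminal = terminal;
        }
    }

    /**
     * Checks one direction of a stream: every event must belong to that direction,
     * and the terminal event must appear exactly once, as the last element.
     * This ordering rule is an assumption made for illustration.
     */
    static boolean isWellFormed(Direction direction, List<EventKind> events) {
        if (events.isEmpty()) {
            return false;
        }
        for (int i = 0; i < events.size(); i++) {
            EventKind event = events.get(i);
            boolean last = (i == events.size() - 1);
            if (event.direction != direction || event.terminal != last) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        List<EventKind> inbound = List.of(
                EventKind.TEXT_EVENT, EventKind.TEXT_EVENT, EventKind.CLOSE_STREAM_EVENT);
        List<EventKind> outbound = List.of(
                EventKind.AUDIO_EVENT, EventKind.STREAM_CLOSED_EVENT);
        System.out.println("inbound ok:  " + isWellFormed(Direction.INBOUND, inbound));
        System.out.println("outbound ok: " + isWellFormed(Direction.OUTBOUND, outbound));
    }
}
```

The two directions are validated independently because, in a duplex stream, audio events can arrive while text events are still being sent.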

Comparing Traditional Methods with Bidirectional Streaming

Traditional Implementations

Previously, achieving low-latency TTS required application-level workarounds:

Server-side logic to split text into chunks
Multiple parallel API calls to Amazon Polly
Complex audio reassembly
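
The workaround listed above can be sketched roughly as follows. This is an illustration of the general pattern, not the code used by any particular application: the class `ChunkedTtsSketch`, the regex-based sentence rule, and the `synthesize` function (a stand-in for one request-response TTS call) are all assumptions. For simplicity the calls run sequentially; a parallel version would also need to restore chunk order before concatenating.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

/** Sketch of the chunk, synthesize, and reassemble pattern used before native streaming. */
final class ChunkedTtsSketch {

    /** Splits text after sentence-ending punctuation; a deliberately simple rule. */
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        for (String s : text.trim().split("(?<=[.!?])\\s+")) {
            if (!s.isEmpty()) {
                sentences.add(s);
            }
        }
        return sentences;
    }

    /** Calls the synthesizer once per sentence and concatenates the audio in order. */
    static byte[] synthesizeAll(String text, Function<String, byte[]> synthesize) {
        ByteArrayOutputStream audio = new ByteArrayOutputStream();
        for (String sentence : splitSentences(text)) {
            byte[] chunk = synthesize.apply(sentence); // one request per sentence
            audio.write(chunk, 0, chunk.length);       // reassembly: append in order
        }
        return audio.toByteArray();
    }

    public static void main(String[] args) {
        // Fake synthesizer: returns the sentence's UTF-8 bytes instead of real audio.
        Function<String, byte[]> fake = s -> s.getBytes(StandardCharsets.UTF_8);
        String text = "Hello world. How are you? Fine.";
        System.out.println(splitSentences(text));
        System.out.println(synthesizeAll(text, fake).length + " bytes");
    }
}
```

Every piece of this (the splitting rule, the per-chunk calls, the ordering guarantee) is code the application has to own and maintain.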

The Benefits of Native Bidirectional Streaming

The new API removes these workarounds:

No Separation Logic Required: The application no longer splits text into chunks itself, which removes a source of errors.
Single Persistent Connection: One connection replaces many separate requests, reducing overhead.
Native Streaming: Both text and audio flow in real time.
Lower Latency: Audio starts arriving while text is still being sent.

Performance Benchmarks

To illustrate the real-world impact of the new API, we’ve benchmarked both the traditional SynthesizeSpeech API and the new StartSpeechSynthesisStream API, processing 7,045 characters (approximately 970 words) in us-west-2. Here’s how they compare:

Metric	Traditional SynthesizeSpeech	Bidirectional Streaming	Improvement
Total processing time	115,226 ms (~115s)	70,071 ms (~70s)	39% faster
API calls	27	1	27x fewer
Sentences sent	27 (sequential)	27 (streamed as words)	—
Total audio bytes	2,354,292	2,324,636	—

The gain is architectural: with the bidirectional API, text input and audio output overlap instead of running one after the other. In this benchmark, that overlap cut total processing time from about 115 seconds to about 70 seconds while producing nearly the same amount of audio.
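
As a quick check of the "39% faster" figure, the snippet below recomputes it from the two total processing times in the table. The class and method names are illustrative; only the two millisecond values come from the benchmark.

```java
import java.util.Locale;

/** Recomputes the improvement column from the benchmark table's raw timings. */
final class BenchmarkMath {

    /** Percentage of the baseline time that the new approach saves. */
    static double percentFaster(long baselineMs, long newMs) {
        return 100.0 * (baselineMs - newMs) / baselineMs;
    }

    public static void main(String[] args) {
        long traditionalMs = 115_226; // SynthesizeSpeech, from the table
        long streamingMs = 70_071;    // StartSpeechSynthesisStream, from the table
        System.out.println("saved ms: " + (traditionalMs - streamingMs));
        System.out.println(String.format(Locale.ROOT, "reduction: %.1f%%",
                percentFaster(traditionalMs, streamingMs)));
    }
}
```

The difference is 45,155 ms, or roughly 39.2% of the traditional total, which matches the rounded value in the table.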

Technical Implementation

Getting Started

Developers can use the bidirectional streaming API through various AWS SDKs, including Java, JavaScript, .NET, and more. Here’s a basic example of setting up the asynchronous client with the AWS SDK for Java 2.x:

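import software.amazon.awssdk.auth.credentials.DefaultCredentialsProvider;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.polly.PollyAsyncClient;

PollyAsyncClient pollyClient = PollyAsyncClient.builder()
    .region(Region.US_WEST_2)  // same region as the benchmark above
    .credentialsProvider(DefaultCredentialsProvider.create())  // default credential chain
    .build();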

Sending Text Events

Text events are sent through a reactive streams Publisher, so each text fragment can be emitted as soon as it is available.
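
The publisher pattern can be sketched with the JDK alone. The AWS SDK for Java 2.x uses the `org.reactivestreams.Publisher` interface; the example below substitutes the JDK's equivalent `java.util.concurrent.Flow` API so that it runs without extra dependencies. It publishes plain strings to a collecting subscriber rather than real text events to Amazon Polly, and the names `TextPublisherSketch`, `CollectingSubscriber`, and `publishAll` are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Flow;
import java.util.concurrent.SubmissionPublisher;
import java.util.concurrent.TimeUnit;

/** Sketch of incremental publishing with the JDK Flow API (a stand-in for the SDK stream). */
final class TextPublisherSketch {

    /** Collects every item it receives; stands in for the receiving side of the stream. */
    static final class CollectingSubscriber implements Flow.Subscriber<String> {
        final List<String> received = new CopyOnWriteArrayList<>();
        final CountDownLatch done = new CountDownLatch(1);

        @Override public void onSubscribe(Flow.Subscription s) { s.request(Long.MAX_VALUE); }
        @Override public void onNext(String item) { received.add(item); }
        @Override public void onError(Throwable t) { done.countDown(); }
        @Override public void onComplete() { done.countDown(); }
    }

    /** Publishes each fragment as soon as it is available, then completes the stream. */
    static List<String> publishAll(List<String> fragments) {
        CollectingSubscriber subscriber = new CollectingSubscriber();
        try (SubmissionPublisher<String> publisher = new SubmissionPublisher<>()) {
            publisher.subscribe(subscriber);
            for (String fragment : fragments) {
                publisher.submit(fragment); // in real use, wrap the fragment in a text event
            }
        } // close() signals onComplete, the analogue of ending text input
        try {
            if (!subscriber.done.await(5, TimeUnit.SECONDS)) {
                throw new IllegalStateException("stream did not complete in time");
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException("interrupted while waiting for completion", e);
        }
        return subscriber.received;
    }

    public static void main(String[] args) {
        System.out.println(publishAll(List.of("Hello, ", "this is ", "streaming text.")));
    }
}
```

In an LLM integration, the loop over `fragments` would be replaced by the model's token callback, so text is submitted as each token arrives.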

Handling Audio Events

Audio arrives through a response handler, which receives each audio chunk as soon as it is generated so that playback or processing can start immediately.
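
A handler of this kind might look like the sketch below. The callback names `onAudioChunk` and `onStreamClosed` and the class `AudioSinkSketch` are hypothetical; they illustrate the shape of chunk-by-chunk handling and are not the SDK's response handler interface. The `player` consumer stands in for whatever plays or forwards the audio.

```java
import java.io.ByteArrayOutputStream;
import java.util.function.Consumer;

/** Hypothetical audio handler: forwards each chunk immediately and keeps a full copy. */
final class AudioSinkSketch {
    private final Consumer<byte[]> player; // e.g. writes to an audio output device
    private final ByteArrayOutputStream all = new ByteArrayOutputStream();
    private int chunkCount;
    private boolean closed;

    AudioSinkSketch(Consumer<byte[]> player) {
        this.player = player;
    }

    /** Called once per audio chunk, in arrival order. */
    void onAudioChunk(byte[] chunk) {
        chunkCount++;
        all.write(chunk, 0, chunk.length); // keep a copy, for example to save to a file
        player.accept(chunk);              // hand off right away so playback can start
    }

    /** Called when the stream reports that it has finished. */
    void onStreamClosed() {
        closed = true;
    }

    int chunkCount() { return chunkCount; }
    boolean isClosed() { return closed; }
    byte[] allAudio() { return all.toByteArray(); }

    public static void main(String[] args) {
        AudioSinkSketch sink = new AudioSinkSketch(
                chunk -> System.out.println("playing " + chunk.length + " bytes"));
        sink.onAudioChunk(new byte[] {1, 2, 3});
        sink.onAudioChunk(new byte[] {4, 5});
        sink.onStreamClosed();
        System.out.println(sink.chunkCount() + " chunks, " + sink.allAudio().length + " bytes total");
    }
}
```

Forwarding each chunk before the stream ends is what lets playback begin while synthesis is still in progress.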

Complete Example: Streaming Text from an LLM

The class below is a skeleton that marks where the integration with LLM-generated content goes:

public class LLMIntegrationExample {
    // Implementation of bidirectional streaming logic here
}
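
To show how text and audio interleave in such an integration, here is a self-contained simulation. It does not call Amazon Polly: `FakeSpeechStream` is a hypothetical stand-in that "synthesizes" each completed word by returning its UTF-8 bytes, and the word-boundary trigger is an assumption chosen to keep the example small. The point is the order of events in the log, not the audio itself.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

/** Simulates feeding LLM tokens into a duplex speech stream and logs the event order. */
final class DuplexFlowSketch {

    /** Hypothetical stand-in for a streaming session; NOT the Polly SDK. */
    static final class FakeSpeechStream {
        private final Consumer<byte[]> onAudio;
        private final Runnable onClosed;
        private final StringBuilder pending = new StringBuilder();

        FakeSpeechStream(Consumer<byte[]> onAudio, Runnable onClosed) {
            this.onAudio = onAudio;
            this.onClosed = onClosed;
        }

        /** Accepts a text fragment and emits fake audio for each completed word. */
        void sendText(String fragment) {
            pending.append(fragment);
            int space;
            while ((space = pending.indexOf(" ")) >= 0) {
                emit(pending.substring(0, space));
                pending.delete(0, space + 1);
            }
        }

        /** Ends text input: flushes any remaining text, then confirms closure. */
        void close() {
            emit(pending.toString());
            pending.setLength(0);
            onClosed.run();
        }

        private void emit(String word) {
            if (!word.isEmpty()) {
                onAudio.accept(word.getBytes(StandardCharsets.UTF_8));
            }
        }
    }

    /** Feeds LLM-style tokens into the stream and returns the interleaved event log. */
    static List<String> run(List<String> tokens) {
        List<String> log = new ArrayList<>();
        FakeSpeechStream stream = new FakeSpeechStream(
                audio -> log.add("AUDIO " + audio.length + "B"),
                () -> log.add("CLOSED"));
        for (String token : tokens) {
            log.add("TEXT " + token.trim());
            stream.sendText(token);
        }
        stream.close();
        return log;
    }

    public static void main(String[] args) {
        // Tokens arrive in arbitrary pieces, as they would from an LLM.
        run(List.of("Hel", "lo ", "wor", "ld")).forEach(System.out::println);
    }
}
```

In the resulting log, the first AUDIO entry appears before the last TEXT entry: audio for "Hello" is available while the rest of the response is still being generated. A real integration would replace `FakeSpeechStream` with the SDK's streaming call.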

Business Benefits

Improved User Experience

The bidirectional streaming API substantially enhances the user experience:

Reduced Perceived Wait Time: Audio playback begins even while the LLM is generating responses, making interactions feel more seamless.
Higher Engagement: Quicker and more responsive interactions lead to increased user satisfaction.
Streamlined Implementation: A single API call simplifies development, removing unnecessary complexity.

Reduced Operational Costs

Streamlined architecture can lead to significant cost savings:

Cost Factor	Traditional Chunking	Bidirectional Streaming
Infrastructure	WebSocket servers, load balancers	Direct client-to-Polly connection
Development	Custom chunking logic	SDK handles complexity
Maintenance	Multiple components to monitor	Single integration point
API Calls	Multiple calls per request	Single streaming session

By removing intermediate servers, organizations can reduce infrastructure costs and speed up development.

Use Cases

The bidirectional streaming API is ideal for various applications:

Conversational AI Assistants: Stream LLM responses directly to speech.
Real-time Translation: Synthesize translated text as it’s generated.
IVR Systems: Provide dynamic, responsive phone systems.
Accessibility Tools: Enhance real-time screen readers and TTS applications.
Gaming: Create dynamic dialogue and narration for NPCs.
Live Captioning: Enable audio output for live transcription.

Conclusion

The Bidirectional Streaming API for Amazon Polly marks a significant advancement in real-time speech synthesis. It mitigates latency issues that have long been barriers in conversational AI, enabling far more fluid interactions.

Key Takeaways

Reduced Latency: Audio playback starts while text is still being generated.
Simplified Architecture: No need for complex workarounds.
Native Integration: Built specifically for LLM streaming.
Flexible Control: Synthesis timing can be finely controlled.

As you build responsive and immersive applications, whether virtual assistants, accessibility tools, or something else entirely, the bidirectional streaming API provides a robust foundation for your conversational experiences.

Next Steps

The new Bidirectional Streaming API is now generally available. Here’s how to get started:

Update to the latest AWS SDK compatible with the bidirectional streaming API.
Review the API documentation for in-depth details.
Experiment with the provided example code to experience low-latency streaming firsthand.

We can’t wait to see what you build with this powerful new capability. Please share your feedback and use cases with us!

About the Authors

Scott Mishra

Scott is a Sr. Solutions Architect for Amazon Web Services, specializing in generative AI solutions.

Praveen Gadi

Praveen is a Sr. Solutions Architect, focusing on integration solutions and maximizing cloud investments.

Paul Wu

Paul is a Solutions Architect dedicated to helping customers achieve their business objectives through AWS.

Damian Pukaluk

Damian is a Software Development Engineer at AWS Polly, instrumental in delivering innovative TTS solutions.
