Unlocking Enhanced Performance with Amazon Bedrock Custom Model Import
Overview of Performance Improvements
Experience reduced latency, faster time-to-first-token, and improved throughput with cutting-edge optimizations.
How the Optimization Works
Learn about artifact caching and the process that reduces initialization overhead.
Performance Improvements
Understand the metrics and benchmarks that showcase the benefits of the new optimizations.
Technical Implementation: Compilation Caching Architecture
Delve into the architecture that drives enhanced performance through effective caching strategies.
Benchmarking Setup
Explore the conditions under which performance tests were conducted to validate optimizations.
Performance Metrics Definitions
Gain clarity on the key performance metrics essential for evaluating model efficiency.
Inference Performance Gains
Examine the significant enhancements and gains in inference performance for various models.
Performance Consistency Across Load Conditions
Discover how optimization benefits maintain consistency across varying user loads.
Customer Impact
Analyze the tangible benefits these optimizations bring to businesses using Amazon Bedrock.
Conclusion
Summarize the transformational improvements in model performance and the path forward for users.
About the Authors
Meet the experts behind the latest enhancements and their contributions to the AI and machine learning landscape.
Unleashing Performance: Amazon Bedrock Custom Model Import Optimizations
In today’s data-driven landscape, the need for quick, efficient model deployment and inference has never been more critical. Amazon Bedrock’s latest enhancements to Custom Model Import (CMI) aim to meet this need, providing significant performance improvements. Let’s dive into how these optimizations work and what they mean for developers and businesses alike.
Optimized Performance: The Key Enhancements
The introduction of advanced PyTorch compilation and CUDA graph optimizations has led to substantial reductions in end-to-end latency and faster time-to-first-token responses. Custom Model Import lets you bring your own foundation models into Amazon Bedrock for scalable deployment and inference, and these optimizations apply to imported models automatically.
However, these optimizations introduce additional model initialization work, which can lengthen container cold-start times. Amazon Bedrock mitigates this through compilation artifact caching, preserving the performance gains while keeping cold-start times within expected bounds.
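To make the idea concrete, here is a minimal, self-contained sketch of the kind of PyTorch-level optimization described, not Bedrock's serving code: `torch.compile` generates optimized kernels for the model, and the `"reduce-overhead"` mode additionally uses CUDA graphs to cut per-call kernel-launch overhead. The toy model and tensor shapes are placeholders, and a CUDA-capable GPU is assumed.

```python
import torch
import torch.nn as nn

# Stand-in model; the imported foundation models in question are transformer LLMs.
model = nn.Sequential(
    nn.Linear(4096, 4096), nn.GELU(), nn.Linear(4096, 4096)
).cuda().eval()

# torch.compile traces the model and generates optimized kernels;
# mode="reduce-overhead" additionally captures CUDA graphs to reduce launch overhead.
compiled_model = torch.compile(model, mode="reduce-overhead")

x = torch.randn(1, 4096, device="cuda")
with torch.inference_mode():
    _ = compiled_model(x)    # first call: pays the compilation / graph-capture cost
    out = compiled_model(x)  # later calls: replay the optimized graph with low latency
```

The one-time cost of that first call is exactly the initialization overhead the compilation artifact cache is designed to avoid paying repeatedly.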
Initial Delay vs. Subsequent Instances
When deploying models utilizing these optimizations, customers may initially experience a delay during the first startup of a model instance. This is a one-time initialization delay; subsequent instances will initialize without this overhead, striking a balance between high performance and rapid startup times during scaling.
How the Optimization Works
Caching Compilation Artifacts
The Amazon Bedrock inference engine now caches compilation artifacts, eliminating redundant computational work during startup. The first instance of a model generates optimized computational graphs and kernel configurations and stores them for future reuse.
A unique identifier—formed from model configuration parameters (batch size, context length, etc.)—ensures that these cached artifacts match the requirements of each model instance, promoting correctness and optimal performance. Additional integrity verification safeguards against corruption during transfer or storage, automatically regenerating artifacts when needed while maintaining model availability.
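As an illustration of how such a cache might be keyed and verified, here is a small hypothetical sketch in Python. The function names, key contents, and file layout are assumptions for illustration only and do not reflect Bedrock's internal implementation.

```python
import hashlib
import json
from pathlib import Path

def artifact_cache_key(config: dict) -> str:
    """Derive a deterministic key from model configuration parameters
    (for example batch size and context length), so cached artifacts are
    only reused by instances with a matching configuration."""
    canonical = json.dumps(config, sort_keys=True)
    return hashlib.sha256(canonical.encode()).hexdigest()

def load_cached_artifact(cache_dir: Path, key: str) -> bytes | None:
    """Return the cached artifact if present and uncorrupted, else None
    (signalling that the artifact should be regenerated)."""
    blob_path = cache_dir / f"{key}.bin"
    digest_path = cache_dir / f"{key}.sha256"
    if not blob_path.exists() or not digest_path.exists():
        return None
    blob = blob_path.read_bytes()
    if hashlib.sha256(blob).hexdigest() != digest_path.read_text().strip():
        return None  # integrity check failed: fall back to recompilation
    return blob

key = artifact_cache_key(
    {"model": "llama-3.1-8b", "max_batch_size": 32, "context_length": 8192}
)
```

The key property is that a configuration change produces a different key, so a mismatched artifact is never loaded, and a failed checksum simply triggers regeneration rather than an outage.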
Performance Improvements
Extensive testing across different model sizes and workloads shows the substantial impact of these enhancements. Benchmarks were conducted under production-mimicking conditions, measuring key inference metrics at various concurrency levels from 1 to 32 concurrent requests.
Technical Implementation Breakdown
The improved performance can be credited to several core processes executed during the first model instance’s startup:
- Computational Graph Optimization: Analyzing neural network architecture to generate an optimized execution plan tailored to specific hardware.
- Kernel Compilation: Compiling GPU kernels adjusted for the model’s unique configuration, yielding highly optimized CUDA code.
- Memory Planning: Developing optimal memory allocation strategies to minimize fragmentation and data movement.
Previously, each new model instance required independent execution of these computationally intensive operations, consuming significant initialization time. With the advent of compilation caching, later model instances can retrieve pre-compiled artifacts, drastically improving startup efficiency.
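PyTorch's own compiler stack exposes a similar idea locally: TorchInductor persists compiled kernels in an on-disk cache, configurable via the `TORCHINDUCTOR_CACHE_DIR` environment variable, so a later process compiling the same model can skip much of the work. The snippet below is a rough local analogue of artifact reuse, not Bedrock's caching mechanism; the model and cache path are placeholders.

```python
import os
import time
import torch
import torch.nn as nn

# Point Inductor's on-disk cache at a shared location so compiled kernels
# produced by one process can be reused by later processes.
os.environ["TORCHINDUCTOR_CACHE_DIR"] = "/tmp/shared_compile_cache"

model = nn.Sequential(nn.Linear(2048, 2048), nn.GELU(), nn.Linear(2048, 2048)).eval()
compiled = torch.compile(model)

x = torch.randn(8, 2048)
start = time.perf_counter()
with torch.inference_mode():
    compiled(x)  # pays full compilation cost only if the cache is empty
print(f"first call took {time.perf_counter() - start:.2f}s")
```

Run the script twice: the second run finds the cached kernels on disk and starts up noticeably faster, which is the same effect the compilation artifact cache delivers for new model instances at scale.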
Rigorous Benchmarking Setup
Our benchmarking process involved a controlled environment to isolate the performance enhancements attributed to the compilation caching optimizations. We evaluated:
- Workload Patterns: Medium and large I/O token configurations to mirror real-world applications.
- Concurrency Levels: Assessing how performance holds up under increasing load conditions (1 to 32 concurrent requests).
Captured latency statistics included minimum, maximum, average, and percentile values, providing a thorough view of performance across diverse metrics.
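For readers who want to run a comparable measurement against their own endpoint, a simple harness along these lines captures the same statistics. The `invoke_model` stub is a placeholder for a real call (for example, to the Bedrock Runtime API), and the request counts and concurrency levels are arbitrary choices.

```python
import statistics
import time
from concurrent.futures import ThreadPoolExecutor

def invoke_model(prompt: str) -> float:
    """Placeholder for a single inference request; returns latency in ms."""
    start = time.perf_counter()
    time.sleep(0.05)  # stand-in for the real model call
    return (time.perf_counter() - start) * 1000

def run_benchmark(concurrency: int, num_requests: int = 100) -> dict:
    """Issue num_requests calls with the given concurrency and summarize latency."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        latencies = list(pool.map(invoke_model, ["test prompt"] * num_requests))
    return {
        "concurrency": concurrency,
        "min_ms": min(latencies),
        "max_ms": max(latencies),
        "avg_ms": statistics.mean(latencies),
        "p50_ms": statistics.median(latencies),
        "p90_ms": statistics.quantiles(latencies, n=10)[-1],
    }

for level in (1, 4, 16, 32):
    print(run_benchmark(level))
```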
Key Performance Metrics Defined
- Time to First Token (TTFT): How quickly users see the initial response; critical for user experience in interactive applications.
- End-to-End Latency (E2E): Total time from request submission to full response delivery, encompassing all processing stages.
- Throughput: Total tokens processed per second across concurrent requests, indicating overall serving capacity.
- Output Tokens Per Second (OTPS): The rate of token generation during the response phase, particularly vital for real-time applications.
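To make the definitions precise, the following sketch shows one way to compute these metrics from per-request timestamps. The `RequestTiming` fields and the exact treatment of the first token in OTPS are our assumptions for illustration, not an official specification.

```python
from dataclasses import dataclass

@dataclass
class RequestTiming:
    submit_s: float        # when the request was sent
    first_token_s: float   # when the first output token arrived
    last_token_s: float    # when the final token arrived
    output_tokens: int     # number of tokens generated

def ttft_ms(t: RequestTiming) -> float:
    return (t.first_token_s - t.submit_s) * 1000

def e2e_latency_ms(t: RequestTiming) -> float:
    return (t.last_token_s - t.submit_s) * 1000

def otps(t: RequestTiming) -> float:
    # Generation-phase rate: tokens produced per second after the first token.
    return t.output_tokens / (t.last_token_s - t.first_token_s)

def aggregate_throughput(timings: list[RequestTiming]) -> float:
    # Total tokens per second across all concurrent requests in the benchmark window.
    window = max(t.last_token_s for t in timings) - min(t.submit_s for t in timings)
    return sum(t.output_tokens for t in timings) / window

example = RequestTiming(submit_s=0.0, first_token_s=0.12, last_token_s=2.5, output_tokens=256)
print(ttft_ms(example), e2e_latency_ms(example), round(otps(example), 1))
```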
Inference Performance Gains
The improvements from compilation caching greatly enhance user experiences and infrastructure efficiency. For instance:
Granite 20B Code Model:
- TTFT: Reduced by 87.8% (from 989.9 ms to 120.9 ms).
- E2E Latency: Dropped by 58.8% (from 12,829 ms to 5,290 ms).
- Throughput: Increased by 25% (from 360.5 to 450.8 tokens/sec).
Llama 3.1 8B Instruct Model:
- TTFT: Dropped by 76.7% (from 366.9 ms to 85.5 ms).
- E2E Latency: Improved by 18.4% (from 3,102 ms to 2,532 ms).
- Throughput: Boosted by 29.1% (from 714.3 to 922.0 tokens/sec).
These benchmarks showcase how the optimizations maintain benefits across various model architectures, reinforcing their utility across different applications.
Consistent Performance Across Load Conditions
The optimizations ensure consistent improvements regardless of concurrency levels (1-32 concurrent requests). This reliability is critical during scaling events, where auto-scaling can add new instances without compromising performance. Cached compilation artifacts allow these instances to deliver optimal performance right from the start.
Customer Impact
The benefits seen from these optimizations extend beyond mere performance metrics, enhancing the overall user experience.
- Reduced Latency: Better responsiveness for AI-driven applications such as chatbots and content generators.
- Higher Throughput: Efficient use of existing infrastructure allows service to a greater volume of users.
- Rapid Scaling: Predictable instance initialization contributes to maintaining performance during traffic surges.
Conclusion
Amazon Bedrock’s Custom Model Import now offers transformative enhancements in inference performance. Through advanced optimizations and compilation artifact caching, users can experience lower latency, quicker responses, and increased throughput—all without requiring any special interventions.
For both existing and new users, these capabilities are available to streamline your deployment processes. To unlock these performance improvements, import your custom models to Amazon Bedrock Custom Model Import today, and consult the documentation for guidance on supported model architectures.
About the Authors
Nick McCarthy is a Senior Generative AI Specialist Solutions Architect on the Amazon Bedrock team, focused on model customization. With a passion for technology and science, he brings deep expertise across various industries.
Prashant Patel is a Senior Software Development Engineer dedicated to scaling large language models, with a finance and research background contributing to robust enterprise applications.
Yashowardhan Shinde specializes in large language model inference challenges, blending research insights with engineering know-how to build scalable systems.
Yanyan Zhang is a Senior Generative AI Data Scientist, leveraging cutting-edge technologies to help clients achieve significant results with generative AI.