Accelerating Auto Scaling for Generative AI Models with Sub-minute Metrics in Amazon SageMaker Inference
Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. You can now use sub-minute metrics to significantly reduce overall scaling latency for generative AI models. With this enhancement, you can improve the responsiveness of your generative AI applications as demand fluctuates.
Challenges in Generative AI Inference Deployment
The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models often take seconds to process a single request, while handling only a limited number of concurrent requests. This creates a critical need for rapid detection of load spikes and fast auto scaling to maintain business continuity. Organizations implementing generative AI seek comprehensive solutions that reduce infrastructure costs, minimize latency, and optimize throughput to meet the demands of these sophisticated models.
SageMaker offers industry-leading capabilities to address these inference challenges. It provides endpoints for generative AI inference that optimize the use of accelerators, reducing deployment costs and latency. The SageMaker inference optimization toolkit can deliver higher throughput while reducing costs for generative AI models. In addition, SageMaker inference provides streaming support for LLMs, enabling real-time token streaming for lower perceived latency and more responsive AI experiences.
Faster Auto Scaling Metrics
To optimize real-time inference workloads, SageMaker employs Application Auto Scaling, dynamically adjusting the number of instances and model copies based on real-time demand changes. With the introduction of two new sub-minute Amazon CloudWatch metrics – ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy – SageMaker now provides a more direct and accurate representation of the system load, enabling faster auto scaling responses to increased demand.
By using these high-resolution metrics, you can achieve significantly faster auto scaling, reducing detection time and improving the overall scale-out time of generative AI models. This capability is crucial for handling fluctuations in request volumes and maintaining optimal performance by minimizing queuing delays.
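To see what the new metrics report, you can query them directly from CloudWatch. The following is a minimal sketch in Python with boto3; the inference component name, the dimension name, and the one-minute aggregation period are illustrative assumptions, so check the metric's actual dimensions for your endpoint in the CloudWatch console.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical inference component name used only for illustration.
INFERENCE_COMPONENT = "my-llm-inference-component"

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=15)

# Pull the ConcurrentRequestsPerCopy metric for the last 15 minutes.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "concurrency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SageMaker",
                    "MetricName": "ConcurrentRequestsPerCopy",
                    # Dimension name is an assumption; confirm it for your
                    # endpoint in the CloudWatch console.
                    "Dimensions": [
                        {"Name": "InferenceComponentName", "Value": INFERENCE_COMPONENT},
                    ],
                },
                "Period": 60,       # aggregation period in seconds (assumption)
                "Stat": "Maximum",  # peak concurrency within each period
            },
            "ReturnData": True,
        }
    ],
    StartTime=start,
    EndTime=end,
)

result = response["MetricDataResults"][0]
for timestamp, value in zip(result["Timestamps"], result["Values"]):
    print(timestamp, value)
```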
Components of Auto Scaling
The auto scaling process in SageMaker real-time inference endpoints involves monitoring traffic, triggering scaling actions, provisioning new instances, and load balancing requests across scaled-out resources. Application Auto Scaling supports both target tracking and step scaling policies, allowing for efficient scaling in response to fluctuations in demand.
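As an illustration of the step scaling path, the sketch below (Python with boto3) registers an endpoint variant as a scalable target, attaches a step scaling policy, and wires it to a CloudWatch alarm on the new ConcurrentRequestsPerModel metric. The endpoint name, thresholds, step sizes, and alarm dimensions are placeholder assumptions, not recommendations.

```python
import boto3

aas = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

ENDPOINT = "my-llm-endpoint"  # hypothetical endpoint name
RESOURCE_ID = f"endpoint/{ENDPOINT}/variant/AllTraffic"

# Register the production variant's instance count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Step scaling: add instances in larger steps as concurrency climbs
# further above the alarm threshold.
policy = aas.put_scaling_policy(
    PolicyName="concurrency-step-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 2},
        ],
    },
)

# A CloudWatch alarm on the new ConcurrentRequestsPerModel metric
# triggers the step scaling policy when concurrency breaches the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="llm-concurrency-breach",
    Namespace="AWS/SageMaker",
    MetricName="ConcurrentRequestsPerModel",
    # Dimension names are assumptions; verify them for your endpoint.
    Dimensions=[
        {"Name": "EndpointName", "Value": ENDPOINT},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```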
By leveraging these new sub-minute metrics and auto scaling policies, you can significantly reduce the time it takes to scale up an endpoint, ensuring optimal performance for generative AI models.
Get Started with Faster Auto Scaling
Implementing these new metrics for faster auto scaling is straightforward. By defining scalable targets and setting up target tracking or step scaling policies in Application Auto Scaling, you can leverage the benefits of faster scale-out events for your generative AI models.
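As a concrete starting point, the sketch below (Python with boto3) registers an inference component's copy count as a scalable target and attaches a target tracking policy on the new ConcurrentRequestsPerCopy metric. The component name, capacity limits, target value, and cooldowns are placeholder assumptions you would tune for your own workload.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference component deployed on a SageMaker endpoint.
INFERENCE_COMPONENT = "my-llm-inference-component"
RESOURCE_ID = f"inference-component/{INFERENCE_COMPONENT}"

# 1. Register the number of model copies as the scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# 2. Target tracking on the new sub-minute concurrency metric:
#    keep roughly 5 concurrent requests per model copy.
aas.put_scaling_policy(
    PolicyName="concurrency-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # target concurrent requests per copy (assumption)
        "CustomizedMetricSpecification": {
            "Namespace": "AWS/SageMaker",
            "MetricName": "ConcurrentRequestsPerCopy",
            # Dimension name is an assumption; confirm it in CloudWatch.
            "Dimensions": [
                {"Name": "InferenceComponentName", "Value": INFERENCE_COMPONENT},
            ],
            "Statistic": "Average",
        },
        "ScaleOutCooldown": 60,   # seconds (placeholder)
        "ScaleInCooldown": 180,   # seconds (placeholder)
    },
)
```

With target tracking, Application Auto Scaling adds or removes model copies to keep the observed concurrency near the target value, so you tune a single number instead of managing individual alarms.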
Additionally, utilizing SageMaker inference components for deploying multiple generative AI models on a single endpoint further enhances the scalability and efficiency of your AI workloads. By combining concurrency-based and invocation-based auto scaling policies, you can achieve a more adaptive and efficient scaling behavior for your container-based applications.
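To illustrate combining policies, the sketch below adds an invocation-based target tracking policy alongside the concurrency-based one from the previous example. The target value is a placeholder, and the predefined metric type shown is an assumption to verify against the Application Auto Scaling documentation for inference components.

```python
import boto3

aas = boto3.client("application-autoscaling")

INFERENCE_COMPONENT = "my-llm-inference-component"  # hypothetical name
RESOURCE_ID = f"inference-component/{INFERENCE_COMPONENT}"

# Invocation-based target tracking, complementing the concurrency-based
# policy above. The predefined metric type is an assumption; confirm it
# in the Application Auto Scaling documentation for your Region.
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # invocations per copy per minute (placeholder)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy",
        },
        "ScaleOutCooldown": 120,  # seconds (placeholder)
        "ScaleInCooldown": 300,   # seconds (placeholder)
    },
)
```

When multiple target tracking policies are attached to the same target, Application Auto Scaling scales out if any policy calls for it and scales in only when all of them do, which is what makes pairing a fast concurrency signal with a steadier invocation signal useful.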
Sample Runs and Results
Through sample runs with Meta Llama models, we have observed significant improvements in the time required to invoke scale-out events. The introduction of ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics has reduced the overall end-to-end scale-out time, enhancing the responsiveness and efficiency of generative AI model deployments on SageMaker endpoints.
Conclusion
By leveraging the new metrics and auto scaling capabilities in Amazon SageMaker, you can optimize the performance and cost-efficiency of your generative AI models. We encourage you to try out these new features and explore their benefits for your AI workloads. For detailed implementation steps and sample notebooks, visit our GitHub repository.
About the Authors
James Park, Praveen Chamarthi, Dr. Changsha Ma, Saurabh Trikande, Kunal Shah, and Marc Karp are experts in AI/ML and cloud computing at Amazon Web Services. Their collective experience and expertise contribute to the development of innovative solutions for machine learning workloads on AWS.