Accelerating Auto Scaling for Generative AI Models with Sub-minute Metrics in Amazon SageMaker Inference
Today, we are excited to announce a new capability in Amazon SageMaker inference that can help you reduce the time it takes for your generative artificial intelligence (AI) models to scale automatically. You can now use sub-minute metrics to significantly reduce overall scaling latency for generative AI models. With this enhancement, you can improve the responsiveness of your generative AI applications as demand fluctuates.
Challenges in Generative AI Inference Deployment
The rise of foundation models (FMs) and large language models (LLMs) has brought new challenges to generative AI inference deployment. These advanced models often take seconds to process a single request, while handling only a limited number of concurrent requests. This creates a critical need for rapid detection of load spikes and fast auto scaling to maintain business continuity. Organizations implementing generative AI seek comprehensive solutions that reduce infrastructure costs, minimize latency, and optimize throughput to meet the demands of these sophisticated models.
SageMaker offers industry-leading capabilities to address these inference challenges. It provides endpoints for generative AI inference that optimize the use of accelerators, reducing deployment costs and latency. The SageMaker inference optimization toolkit can deliver higher throughput while reducing costs for generative AI models. In addition, SageMaker inference provides streaming support for LLMs, enabling real-time token streaming for lower perceived latency and more responsive AI experiences.
Faster Auto Scaling Metrics
To optimize real-time inference workloads, SageMaker employs Application Auto Scaling, dynamically adjusting the number of instances and model copies based on real-time demand changes. With the introduction of two new sub-minute Amazon CloudWatch metrics – ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy – SageMaker now provides a more direct and accurate representation of the system load, enabling faster auto scaling responses to increased demand.
By using these high-resolution metrics, you can achieve significantly faster auto scaling, reducing detection time and improving the overall scale-out time of generative AI models. This capability is crucial for handling fluctuations in request volumes and maintaining optimal performance by minimizing queuing delays.
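To see what the new metrics report, you can query them directly from CloudWatch. The following is a minimal sketch in Python with boto3; the inference component name, the dimension name, and the one-minute aggregation period are illustrative assumptions, so check the metric's actual dimensions for your endpoint in the CloudWatch console.

```python
import datetime
import boto3

cloudwatch = boto3.client("cloudwatch")

# Hypothetical inference component name used only for illustration.
INFERENCE_COMPONENT = "my-llm-inference-component"

end = datetime.datetime.utcnow()
start = end - datetime.timedelta(minutes=15)

# Pull the ConcurrentRequestsPerCopy metric for the last 15 minutes.
response = cloudwatch.get_metric_data(
    MetricDataQueries=[
        {
            "Id": "concurrency",
            "MetricStat": {
                "Metric": {
                    "Namespace": "AWS/SageMaker",
                    "MetricName": "ConcurrentRequestsPerCopy",
                    # Dimension name is an assumption; confirm it for your
                    # endpoint in the CloudWatch console.
                    "Dimensions": [
                        {"Name": "InferenceComponentName", "Value": INFERENCE_COMPONENT},
                    ],
                },
                "Period": 60,       # aggregation period in seconds (assumption)
                "Stat": "Maximum",  # peak concurrency within each period
            },
            "ReturnData": True,
        }
    ],
    StartTime=start,
    EndTime=end,
)

result = response["MetricDataResults"][0]
for timestamp, value in zip(result["Timestamps"], result["Values"]):
    print(timestamp, value)
```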
Components of Auto Scaling
The auto scaling process in SageMaker real-time inference endpoints involves monitoring traffic, triggering scaling actions, provisioning new instances, and load balancing requests across scaled-out resources. Application Auto Scaling supports both target tracking and step scaling policies, allowing for efficient scaling in response to fluctuations in demand.
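As an illustration of the step scaling path, the sketch below (Python with boto3) registers an endpoint variant as a scalable target, attaches a step scaling policy, and wires it to a CloudWatch alarm on the new ConcurrentRequestsPerModel metric. The endpoint name, thresholds, step sizes, and alarm dimensions are placeholder assumptions, not recommendations.

```python
import boto3

aas = boto3.client("application-autoscaling")
cloudwatch = boto3.client("cloudwatch")

ENDPOINT = "my-llm-endpoint"  # hypothetical endpoint name
RESOURCE_ID = f"endpoint/{ENDPOINT}/variant/AllTraffic"

# Register the production variant's instance count as a scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# Step scaling: add instances in larger steps as concurrency climbs
# further above the alarm threshold.
policy = aas.put_scaling_policy(
    PolicyName="concurrency-step-scaling",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:variant:DesiredInstanceCount",
    PolicyType="StepScaling",
    StepScalingPolicyConfiguration={
        "AdjustmentType": "ChangeInCapacity",
        "MetricAggregationType": "Maximum",
        "Cooldown": 60,
        "StepAdjustments": [
            {"MetricIntervalLowerBound": 0, "MetricIntervalUpperBound": 20, "ScalingAdjustment": 1},
            {"MetricIntervalLowerBound": 20, "ScalingAdjustment": 2},
        ],
    },
)

# A CloudWatch alarm on the new ConcurrentRequestsPerModel metric
# triggers the step scaling policy when concurrency breaches the threshold.
cloudwatch.put_metric_alarm(
    AlarmName="llm-concurrency-breach",
    Namespace="AWS/SageMaker",
    MetricName="ConcurrentRequestsPerModel",
    # Dimension names are assumptions; verify them for your endpoint.
    Dimensions=[
        {"Name": "EndpointName", "Value": ENDPOINT},
        {"Name": "VariantName", "Value": "AllTraffic"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=1,
    Threshold=10,
    ComparisonOperator="GreaterThanOrEqualToThreshold",
    AlarmActions=[policy["PolicyARN"]],
)
```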
By leveraging these new sub-minute metrics and auto scaling policies, you can significantly reduce the time it takes to scale up an endpoint, ensuring optimal performance for generative AI models.
Get Started with Faster Auto Scaling
Implementing these new metrics for faster auto scaling is straightforward. By defining scalable targets and setting up target tracking or step scaling policies in Application Auto Scaling, you can leverage the benefits of faster scale-out events for your generative AI models.
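As a concrete starting point, the sketch below (Python with boto3) registers an inference component's copy count as a scalable target and attaches a target tracking policy on the new ConcurrentRequestsPerCopy metric. The component name, capacity limits, target value, and cooldowns are placeholder assumptions you would tune for your own workload.

```python
import boto3

aas = boto3.client("application-autoscaling")

# Hypothetical inference component deployed on a SageMaker endpoint.
INFERENCE_COMPONENT = "my-llm-inference-component"
RESOURCE_ID = f"inference-component/{INFERENCE_COMPONENT}"

# 1. Register the number of model copies as the scalable target.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=1,
    MaxCapacity=4,
)

# 2. Target tracking on the new sub-minute concurrency metric:
#    keep roughly 5 concurrent requests per model copy.
aas.put_scaling_policy(
    PolicyName="concurrency-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 5.0,  # target concurrent requests per copy (assumption)
        "CustomizedMetricSpecification": {
            "Namespace": "AWS/SageMaker",
            "MetricName": "ConcurrentRequestsPerCopy",
            # Dimension name is an assumption; confirm it in CloudWatch.
            "Dimensions": [
                {"Name": "InferenceComponentName", "Value": INFERENCE_COMPONENT},
            ],
            "Statistic": "Average",
        },
        "ScaleOutCooldown": 60,   # seconds (placeholder)
        "ScaleInCooldown": 180,   # seconds (placeholder)
    },
)
```

With target tracking, Application Auto Scaling adds or removes model copies to keep the observed concurrency near the target value, so you tune a single number instead of managing individual alarms.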
Additionally, utilizing SageMaker inference components for deploying multiple generative AI models on a single endpoint further enhances the scalability and efficiency of your AI workloads. By combining concurrency-based and invocation-based auto scaling policies, you can achieve a more adaptive and efficient scaling behavior for your container-based applications.
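To illustrate combining policies, the sketch below adds an invocation-based target tracking policy alongside the concurrency-based one from the previous example. The target value is a placeholder, and the predefined metric type shown is an assumption to verify against the Application Auto Scaling documentation for inference components.

```python
import boto3

aas = boto3.client("application-autoscaling")

INFERENCE_COMPONENT = "my-llm-inference-component"  # hypothetical name
RESOURCE_ID = f"inference-component/{INFERENCE_COMPONENT}"

# Invocation-based target tracking, complementing the concurrency-based
# policy above. The predefined metric type is an assumption; confirm it
# in the Application Auto Scaling documentation for your Region.
aas.put_scaling_policy(
    PolicyName="invocations-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=RESOURCE_ID,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 60.0,  # invocations per copy per minute (placeholder)
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy",
        },
        "ScaleOutCooldown": 120,  # seconds (placeholder)
        "ScaleInCooldown": 300,   # seconds (placeholder)
    },
)
```

When multiple target tracking policies are attached to the same target, Application Auto Scaling scales out if any policy calls for it and scales in only when all of them do, which is what makes pairing a fast concurrency signal with a steadier invocation signal useful.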
Sample Runs and Results
Through sample runs with Meta Llama models, we have observed significant improvements in the time required to invoke scale-out events. The introduction of ConcurrentRequestsPerModel and ConcurrentRequestsPerCopy metrics has reduced the overall end-to-end scale-out time, enhancing the responsiveness and efficiency of generative AI model deployments on SageMaker endpoints.
Conclusion
By leveraging the new metrics and auto scaling capabilities in Amazon SageMaker, you can optimize the performance and cost-efficiency of your generative AI models. We encourage you to try out these new features and explore their benefits for your AI workloads. For detailed implementation steps and sample notebooks, visit our GitHub repository.
About the Authors
James Park, Praveen Chamarthi, Dr. Changsha Ma, Saurabh Trikande, Kunal Shah, and Marc Karp are experts in AI/ML and cloud computing at Amazon Web Services. Their collective experience and expertise contribute to the development of innovative solutions for machine learning workloads on AWS.