Optimizing AI Deployment with Salesforce and AWS: Insights on Model Serving Innovations
This post is a joint collaboration between Salesforce and AWS and is being cross-published on both the Salesforce Engineering Blog and the AWS Machine Learning Blog.
In the rapidly evolving realm of artificial intelligence, optimizing model deployment is paramount. The Salesforce AI Platform Model Serving team is at the forefront of this challenge, diligently working to provide robust services for hosting large language models (LLMs) and other AI workloads. This blog post delves into how Salesforce, in partnership with Amazon Web Services (AWS), has harnessed the capabilities of Amazon SageMaker AI to enhance GPU utilization and resource efficiency while achieving significant cost savings.
The Challenge: Balancing Performance and Cost
For organizations of all sizes, deploying machine learning models efficiently and cost-effectively poses numerous challenges. The Salesforce AI Platform team manages various proprietary LLMs, including CodeGen and XGen, utilizing SageMaker AI for optimized inference deployment. With models ranging from a few gigabytes to 30 GB, each with unique performance and infrastructure demands, the team faced two critical challenges:
- Underutilization of High-Performance GPUs: Their larger models, deployed on high-performance GPUs, often received low traffic, leaving costly accelerators idle.
- High Costs for Mid-Sized Models: Conversely, their medium-sized models required high-throughput processing but were often over-provisioned, incurring unnecessary costs.
The stakes were high: the balance between optimizing infrastructure costs and maintaining high AI performance was essential for sustainable growth.
Solution: Leveraging Amazon SageMaker AI Inference Components
To tackle these challenges, Salesforce adopted Amazon SageMaker AI inference components, which enable multiple foundation models (FMs) to be deployed on the same endpoint. This approach not only improved resource utilization but also gave the team granular control over how accelerator and memory resources are allocated to each model.
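To make this concrete, here is a minimal sketch of how two differently sized models can be declared as inference components on one shared endpoint. The component names, model names, and resource figures below are hypothetical (not Salesforce's actual configuration), but the payload shape matches what boto3's `sagemaker` client accepts in `create_inference_component()`: each model states its own accelerator and memory requirements, and SageMaker AI packs the copies onto the endpoint's instances.

```python
def inference_component_request(component_name, endpoint_name, model_name,
                                accelerators, min_memory_mb, copies):
    """Build the request payload for one model hosted as an inference component."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                # The slice of the instance this model reserves for itself.
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        # How many copies of the model to run initially.
        "RuntimeConfig": {"CopyCount": copies},
    }

# Two hypothetical models of different sizes sharing one endpoint:
large = inference_component_request(
    "ic-codegen-large", "shared-llm-endpoint", "codegen-large",
    accelerators=4, min_memory_mb=65536, copies=1)
small = inference_component_request(
    "ic-codegen-small", "shared-llm-endpoint", "codegen-small",
    accelerators=1, min_memory_mb=16384, copies=2)

# Against a real endpoint, each payload would be applied with:
#   boto3.client("sagemaker").create_inference_component(**large)
```

Because each component carries its own `ComputeResourceRequirements`, a large model and a small model no longer need dedicated fleets; they share one instance pool sized to their combined demand.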
Key Benefits of Inference Components
- Optimized Resource Management: SageMaker AI efficiently allocates GPU resources, maximizing utilization and driving cost savings.
- Independent Model Scaling: Each model can scale according to its specific resource needs, ensuring optimal performance without unnecessary expense.
- Dynamic Instance Scaling: The system can automatically add or remove instances, maintaining availability while minimizing idle compute resources.
- Flexible Resource Allocation: Organizations can scale down to zero copies for less critical models, freeing resources while keeping essential models ready for traffic.
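The scale-to-zero behavior in the last point is driven through Application Auto Scaling, which treats each inference component's copy count as a scalable dimension. The sketch below (component names and capacities are made up) shows the shape of the `register_scalable_target()` payload; setting `MinCapacity` to 0 lets a low-traffic model release its GPU share entirely while critical models keep warm copies.

```python
def copy_count_scaling_target(component_name, min_copies, max_copies):
    """Request payload for application-autoscaling's register_scalable_target()."""
    return {
        "ServiceNamespace": "sagemaker",
        "ResourceId": f"inference-component/{component_name}",
        "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
        "MinCapacity": min_copies,  # 0 = this model may scale down to zero copies
        "MaxCapacity": max_copies,
    }

# A non-critical model allowed to scale to zero between bursts of traffic:
target = copy_count_scaling_target("ic-codegen-small", min_copies=0, max_copies=4)

# With boto3 this would be registered as:
#   boto3.client("application-autoscaling").register_scalable_target(**target)
```

A scaling policy (for example, target tracking on invocations per copy) would then adjust the copy count between these bounds automatically.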
Configuring and Managing Inference Endpoints
Salesforce’s deployment process involves creating a SageMaker AI endpoint with defined configurations for instance types and initial counts. Using inference components, they can then set specific resource requirements for each model, with the number of model copies, and the underlying instances, adjusting dynamically based on traffic demands.
This intelligent setup maximized GPU utilization and reduced overhead, enabling seamless resource sharing among multiple models. The outcome? A substantial reduction in operational costs while maintaining high-performance standards across the board.
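As a rough illustration of the endpoint side of this setup, the configuration below sketches a `create_endpoint_config()` payload for an inference-component endpoint. The names, role ARN, and instance choices are illustrative assumptions; the key pieces are `ManagedInstanceScaling`, which lets SageMaker AI add or remove instances as component copies come and go, and a routing strategy that spreads requests across copies.

```python
def shared_endpoint_config(config_name, instance_type, initial, min_count, max_count):
    """Endpoint config for a fleet shared by multiple inference components."""
    return {
        "EndpointConfigName": config_name,
        # Placeholder role ARN; a real deployment supplies its own IAM role.
        "ExecutionRoleArn": "arn:aws:iam::123456789012:role/SageMakerExecutionRole",
        "ProductionVariants": [{
            "VariantName": "AllTraffic",
            "InstanceType": instance_type,
            "InitialInstanceCount": initial,
            # SageMaker AI grows or shrinks the fleet within these bounds
            # as inference components scale their copies.
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": min_count,
                "MaxInstanceCount": max_count,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }],
    }

cfg = shared_endpoint_config("shared-llm-config", "ml.g5.12xlarge",
                             initial=1, min_count=1, max_count=4)
```

Note that no model is named in the variant itself: models attach later as inference components, which is what allows several of them to share, and release, the same instances.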
Real-World Application: CodeGen & Inference Components
Salesforce’s suite of proprietary models, like CodeGen, is leveraged in various applications to assist developers in efficient coding practices. By using inference components, the company was able to efficiently host multiple model variants on a unified endpoint, optimizing both performance and cost-management strategies.
Benefits Seen Post-Implementation
- Optimized Resource Allocation: Efficient sharing of GPU resources across models eliminates unnecessary provisioning.
- Cost Savings: The dynamic scaling capabilities have led to significant reductions in infrastructure costs.
- Enhanced Performance: Smaller models benefited from high-performance GPUs, achieving low latency without an increase in operational expenses.
Conclusion: Future-Proofing AI Infrastructure
Through the strategic implementation of Amazon SageMaker AI inference components, Salesforce has redefined its AI infrastructure management, achieving impressive cost reduction and performance enhancement metrics. The ability to pack models intelligently and allocate resources dynamically has positioned Salesforce to thrive in a competitive landscape.
Looking ahead, Salesforce plans to utilize advanced capabilities such as SageMaker AI’s rolling updates for inference endpoints, streamlining model updates while minimizing operational overhead. This forward-thinking strategy not only enhances deployment efficiency but also paves the way for future AI innovations.
For further insights, check out our detailed articles on high-performance model deployment and getting started with Amazon SageMaker AI.
About the Authors
Rishu Aggarwal: Director of Engineering at Salesforce, focusing on LLM deployment and optimization.
Rielah De Jesus: Principal Solutions Architect at AWS, advocate for cloud migration and technical advisor for enterprise customers.
Pavithra Hariharasudhan: Senior Technical Account Manager at AWS, committed to operational excellence in cloud operations.
Ruchita Jadav: Senior Member of Technical Staff at Salesforce with a focus on scalable AI solutions and inference optimization.
Marc Karp: ML Architect at the Amazon SageMaker Service team, dedicated to designing and managing ML workloads effectively.