Enhancing Salesforce Model Endpoints Using Amazon SageMaker AI Inference Components

This post is a joint collaboration between Salesforce and AWS and is being cross-published on both the Salesforce Engineering Blog and the AWS Machine Learning Blog.

In the rapidly evolving realm of artificial intelligence, optimizing model deployment is paramount. The Salesforce AI Platform Model Serving team is at the forefront of this challenge, diligently working to provide robust services for hosting large language models (LLMs) and other AI workloads. This blog post delves into how Salesforce, in partnership with Amazon Web Services (AWS), has harnessed the capabilities of Amazon SageMaker AI to enhance GPU utilization and resource efficiency while achieving significant cost savings.

The Challenge: Balancing Performance and Cost

For organizations of all sizes, deploying machine learning models efficiently and cost-effectively poses numerous challenges. The Salesforce AI Platform team manages various proprietary LLMs, including CodeGen and XGen, utilizing SageMaker AI for optimized inference deployment. With models ranging from a few gigabytes to 30 GB, each with unique performance and infrastructure demands, the team faced two critical challenges:

  1. Underutilization of High-Performance GPUs: Larger models deployed on high-performance GPUs often served low traffic, leaving expensive accelerators idle.
  2. High Costs for Mid-Sized Models: Conversely, medium-sized models needed high-throughput serving but were routinely over-provisioned, incurring unnecessary costs.

The stakes were high: the balance between optimizing infrastructure costs and maintaining high AI performance was essential for sustainable growth.

Solution: Leveraging Amazon SageMaker AI Inference Components

To tackle these challenges, Salesforce adopted Amazon SageMaker AI inference components, which make it possible to deploy multiple foundation models (FMs) on the same endpoint. This approach not only improved resource utilization but also allowed granular control over resource allocation.
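To make the shared-endpoint pattern concrete, here is a minimal boto3 sketch of creating an inference-component-enabled endpoint. The endpoint name, role ARN, instance type, and capacity limits are illustrative assumptions, not Salesforce's actual configuration. Note that the production variant defines only compute capacity; no model is attached at this stage.

```python
import boto3

sm = boto3.client("sagemaker")

# Endpoint config for an inference component endpoint: the variant
# carries only capacity settings (no ModelName), because models are
# attached later as inference components. Names and sizes are illustrative.
sm.create_endpoint_config(
    EndpointConfigName="shared-llm-endpoint-config",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "InstanceType": "ml.g5.12xlarge",
            "InitialInstanceCount": 1,
            # Let SageMaker add or remove instances as component copies scale
            "ManagedInstanceScaling": {
                "Status": "ENABLED",
                "MinInstanceCount": 1,
                "MaxInstanceCount": 4,
            },
            "RoutingConfig": {"RoutingStrategy": "LEAST_OUTSTANDING_REQUESTS"},
        }
    ],
)

sm.create_endpoint(
    EndpointName="shared-llm-endpoint",
    EndpointConfigName="shared-llm-endpoint-config",
)
```

Because the variant carries no model of its own, any number of inference components can later be attached to, or detached from, this endpoint without recreating it.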

Key Benefits of Inference Components

  • Optimized Resource Management: SageMaker AI efficiently allocates GPU resources, maximizing utilization and driving cost savings.
  • Independent Model Scaling: Each model can scale according to its specific resource needs, ensuring optimal performance without unnecessary expense.
  • Dynamic Instance Scaling: The system can automatically add or remove instances, maintaining availability while minimizing idle compute resources.
  • Flexible Resource Allocation: Organizations can scale down to zero copies for less critical models, freeing resources while keeping essential models ready for traffic (see the scaling sketch after this list).
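The last two benefits are driven by Application Auto Scaling acting on an inference component's copy count. Here is a minimal sketch, assuming a hypothetical component named codegen-small-ic; the target value and cooldowns are placeholders to tune per workload.

```python
import boto3

aas = boto3.client("application-autoscaling")
resource_id = "inference-component/codegen-small-ic"  # hypothetical component

# Register the component's copy count as a scalable target.
# MinCapacity=0 enables scale-to-zero for less critical models.
aas.register_scalable_target(
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    MinCapacity=0,
    MaxCapacity=4,
)

# Target tracking on invocations per copy: SageMaker adds copies under
# load and removes them again as traffic subsides.
aas.put_scaling_policy(
    PolicyName="codegen-small-target-tracking",
    ServiceNamespace="sagemaker",
    ResourceId=resource_id,
    ScalableDimension="sagemaker:inference-component:DesiredCopyCount",
    PolicyType="TargetTrackingScaling",
    TargetTrackingScalingPolicyConfiguration={
        "TargetValue": 4.0,  # target invocations per copy
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
)
```

One caveat: target tracking has no signal while zero copies are running, so AWS's scale-to-zero guidance pairs a policy like this with a step-scaling policy on a metric such as NoCapacityInvocationFailures to trigger scale-out on the first request after an idle period.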

Configuring and Managing Inference Endpoints

Salesforce’s deployment process involves creating a SageMaker AI endpoint with defined configurations for instance types and initial counts. Using inference components, they can set specific resource requirements for each model, adjusting the number of instances dynamically based on traffic demands.
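The per-model resource requirements are declared when a model is attached to the endpoint as an inference component. A sketch of that step follows; the model name, resource sizes, and copy count are assumptions for illustration, not Salesforce's actual values.

```python
import boto3

sm = boto3.client("sagemaker")

# Attach one model to the shared endpoint as an inference component.
# ComputeResourceRequirements reserves this model's slice of the
# instance (accelerators, CPU cores, memory), so several components
# can be packed onto the same hardware.
sm.create_inference_component(
    InferenceComponentName="xgen-7b-ic",
    EndpointName="shared-llm-endpoint",
    VariantName="AllTraffic",
    Specification={
        "ModelName": "xgen-7b",  # an existing SageMaker model
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "NumberOfCpuCoresRequired": 8,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    RuntimeConfig={"CopyCount": 2},  # initial number of model copies
)
```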

This intelligent setup maximized GPU utilization and reduced overhead, enabling seamless resource sharing among multiple models. The outcome? A substantial reduction in operational costs while maintaining high-performance standards across the board.

Real-World Application: CodeGen & Inference Components

Salesforce’s suite of proprietary models, like CodeGen, is leveraged in various applications to assist developers in efficient coding practices. By using inference components, the company was able to efficiently host multiple model variants on a unified endpoint, optimizing both performance and cost-management strategies.
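Routing then happens per request: callers select a variant by naming its inference component when invoking the shared endpoint. A minimal sketch, with a hypothetical codegen-2b-ic component and an illustrative payload shape:

```python
import json

import boto3

smr = boto3.client("sagemaker-runtime")

# Route this request to one specific model variant on the shared
# endpoint by naming its inference component.
response = smr.invoke_endpoint(
    EndpointName="shared-llm-endpoint",
    InferenceComponentName="codegen-2b-ic",
    ContentType="application/json",
    Body=json.dumps({"inputs": "def fibonacci(n):"}),
)
print(response["Body"].read().decode())
```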

Benefits Seen Post-Implementation

  • Optimized Resource Allocation: Efficient sharing of GPU resources across models eliminates unnecessary provisioning.
  • Cost Savings: The dynamic scaling capabilities have led to significant reductions in infrastructure costs.
  • Enhanced Performance: Smaller models benefited from high-performance GPUs, achieving low latency without an increase in operational expenses.

Conclusion: Future-Proofing AI Infrastructure

Through the strategic implementation of Amazon SageMaker AI inference components, Salesforce has redefined its AI infrastructure management, achieving impressive cost reduction and performance enhancement metrics. The ability to pack models intelligently and allocate resources dynamically has positioned Salesforce to thrive in a competitive landscape.

Looking ahead, Salesforce plans to adopt advanced capabilities such as SageMaker AI's rolling updates for inference component endpoints, streamlining model updates while minimizing operational overhead. This forward-thinking strategy not only enhances deployment efficiency but also paves the way for future AI innovations.
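As a sketch of what such a rolling update could look like, assuming the UpdateInferenceComponent API's rolling-update deployment configuration; the component, model, and alarm names here are hypothetical:

```python
import boto3

sm = boto3.client("sagemaker")

# Roll out a new model version one copy at a time, pausing between
# batches and rolling back automatically if the named CloudWatch
# alarm fires during the update.
sm.update_inference_component(
    InferenceComponentName="codegen-2b-ic",
    Specification={
        "ModelName": "codegen-2b-v2",  # updated model version
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 8192,
        },
    },
    DeploymentConfig={
        "RollingUpdatePolicy": {
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            "WaitIntervalInSeconds": 120,
        },
        "AutoRollbackConfiguration": {
            "Alarms": [{"AlarmName": "codegen-2b-p99-latency"}]
        },
    },
)
```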

For further insights, check out our detailed articles on high-performance model deployment and getting started with Amazon SageMaker AI.


About the Authors

Rishu Aggarwal: Director of Engineering at Salesforce, focusing on LLM deployment and optimization.

Rielah De Jesus: Principal Solutions Architect at AWS, advocate for cloud migration and technical advisor for enterprise customers.

Pavithra Hariharasudhan: Senior Technical Account Manager at AWS, committed to operational excellence in cloud operations.

Ruchita Jadav: Senior Member of Technical Staff at Salesforce with a focus on scalable AI solutions and inference optimization.

Marc Karp: ML Architect on the Amazon SageMaker service team, dedicated to helping customers design and manage ML workloads effectively.
