Enhancing Salesforce Model Endpoints Using Amazon SageMaker AI Inference Tools


This post is a joint collaboration between Salesforce and AWS and is being cross-published on both the Salesforce Engineering Blog and the AWS Machine Learning Blog.

In the rapidly evolving realm of artificial intelligence, optimizing model deployment is paramount. The Salesforce AI Platform Model Serving team is at the forefront of this challenge, diligently working to provide robust services for hosting large language models (LLMs) and other AI workloads. This blog post delves into how Salesforce, in partnership with Amazon Web Services (AWS), has harnessed the capabilities of Amazon SageMaker AI to enhance GPU utilization and resource efficiency while achieving significant cost savings.

The Challenge: Balancing Performance and Cost

For organizations of all sizes, deploying machine learning models efficiently and cost-effectively poses numerous challenges. The Salesforce AI Platform team manages various proprietary LLMs, including CodeGen and XGen, utilizing SageMaker AI for optimized inference deployment. With models ranging from a few gigabytes to 30 GB, each with unique performance and infrastructure demands, the team faced two critical challenges:

  1. Underutilization of High-Performance GPUs: Their larger models, deployed on high-performance GPUs, often experienced low traffic patterns leading to resource waste.
  2. High Costs for Mid-Sized Models: Conversely, their medium-sized models required high-throughput processing. These models, however, were often over-provisioned, incurring unnecessary costs.

The stakes were high: balancing infrastructure costs against AI performance was essential for sustainable growth.

Solution: Leveraging Amazon SageMaker AI Inference Components

To tackle these challenges, Salesforce utilized the inference components of Amazon SageMaker AI, enabling them to deploy multiple foundation models (FMs) on the same endpoint. This approach not only improved resource utilization but also allowed for granular control over resource allocation.

Key Benefits of Inference Components

  • Optimized Resource Management: SageMaker AI efficiently allocates GPU resources, maximizing utilization and driving cost savings.
  • Independent Model Scaling: Each model can scale according to its specific resource needs, ensuring optimal performance without unnecessary expense.
  • Dynamic Instance Scaling: The system can automatically add or remove instances, maintaining availability while minimizing idle compute resources.
  • Flexible Resource Allocation: Organizations can scale down to zero copies for less critical models, freeing resources while keeping essential models ready for traffic.
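As a concrete sketch of the approach above: with inference components, each model on a shared endpoint declares its own compute requirements and copy count. The component and model names, GPU counts, and memory figures below are hypothetical (the post does not publish Salesforce's actual configuration); the dicts are shaped as requests for the SageMaker `CreateInferenceComponent` API.

```python
# One GPU endpoint, multiple models: each model becomes an inference
# component with its own slice of compute (all values illustrative).
ENDPOINT_NAME = "shared-llm-endpoint"  # hypothetical endpoint name

def make_component_request(component_name, model_name, gpus, memory_mb, copies):
    """Build a CreateInferenceComponent request dict for one model."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": ENDPOINT_NAME,
        "VariantName": "AllTraffic",
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": gpus,
                "MinMemoryRequiredInMb": memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copies},
    }

# A large code model and a smaller variant packed onto the same endpoint.
codegen_large = make_component_request(
    "codegen-16b", "codegen-16b-model", gpus=4, memory_mb=65536, copies=1)
codegen_small = make_component_request(
    "codegen-2b", "codegen-2b-model", gpus=1, memory_mb=16384, copies=2)

# Each dict would be passed to the SageMaker control-plane client:
# sm = boto3.client("sagemaker")
# sm.create_inference_component(**codegen_large)
# sm.create_inference_component(**codegen_small)
```

Because resource requirements are declared per component, SageMaker AI can pack several models onto the same instances instead of reserving a dedicated endpoint per model.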

Configuring and Managing Inference Endpoints

Salesforce’s deployment process involves creating a SageMaker AI endpoint with defined configurations for instance types and initial counts. Using inference components, they can set specific resource requirements for each model, adjusting the number of instances dynamically based on traffic demands.
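The dynamic, per-model scaling described above is driven by Application Auto Scaling on the component's copy count. A minimal sketch, with hypothetical names and target values; the request dicts follow the `RegisterScalableTarget` and `PutScalingPolicy` APIs, and `MinCapacity=0` is what enables scale-to-zero for less critical models.

```python
# Hypothetical component name; actual Salesforce settings aren't published.
component_name = "codegen-2b"
resource_id = f"inference-component/{component_name}"

# Register the component's desired copy count as a scalable target.
# MinCapacity=0 lets a low-traffic model scale all the way to zero copies.
scalable_target = {
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    "MinCapacity": 0,
    "MaxCapacity": 4,
}

# Target-tracking policy: add or remove copies to hold invocations
# per copy near the target value (tune per model).
scaling_policy = {
    "PolicyName": f"{component_name}-target-tracking",
    "ServiceNamespace": "sagemaker",
    "ResourceId": resource_id,
    "ScalableDimension": "sagemaker:inference-component:DesiredCopyCount",
    "PolicyType": "TargetTrackingScaling",
    "TargetTrackingScalingPolicyConfiguration": {
        "TargetValue": 10.0,  # illustrative invocations-per-copy target
        "PredefinedMetricSpecification": {
            "PredefinedMetricType": "SageMakerInferenceComponentInvocationsPerCopy"
        },
        "ScaleInCooldown": 300,
        "ScaleOutCooldown": 60,
    },
}

# These would be passed to the Application Auto Scaling client:
# aas = boto3.client("application-autoscaling")
# aas.register_scalable_target(**scalable_target)
# aas.put_scaling_policy(**scaling_policy)
```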

This intelligent setup maximized GPU utilization and reduced overhead, enabling seamless resource sharing among multiple models. The outcome? A substantial reduction in operational costs while maintaining high-performance standards across the board.

Real-World Application: CodeGen & Inference Components

Salesforce’s proprietary models, such as CodeGen, power applications that help developers write code more efficiently. By using inference components, the company hosted multiple model variants on a unified endpoint, optimizing both performance and cost.
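On a shared endpoint, callers select which model variant to hit by naming the inference component at invocation time. A sketch of such a request (endpoint, component name, and payload are hypothetical), shaped for the SageMaker Runtime `InvokeEndpoint` API:

```python
import json

# Hypothetical request: InferenceComponentName routes the call to one
# specific model variant hosted on the shared endpoint.
invoke_request = {
    "EndpointName": "shared-llm-endpoint",
    "InferenceComponentName": "codegen-2b",
    "ContentType": "application/json",
    "Body": json.dumps({
        "inputs": "def fibonacci(n):",
        "parameters": {"max_new_tokens": 64},
    }),
}

# runtime = boto3.client("sagemaker-runtime")
# response = runtime.invoke_endpoint(**invoke_request)
```

Switching variants is then a one-field change on the caller's side, with no new endpoint to stand up.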

Benefits Seen Post-Implementation

  • Optimized Resource Allocation: Efficient sharing of GPU resources across models eliminates unnecessary provisioning.
  • Cost Savings: The dynamic scaling capabilities have led to significant reductions in infrastructure costs.
  • Enhanced Performance: Smaller models benefited from high-performance GPUs, achieving low latency without an increase in operational expenses.

Conclusion: Future-Proofing AI Infrastructure

Through the strategic implementation of Amazon SageMaker AI inference components, Salesforce has redefined its AI infrastructure management, achieving impressive cost reduction and performance enhancement metrics. The ability to pack models intelligently and allocate resources dynamically has positioned Salesforce to thrive in a competitive landscape.

Looking ahead, Salesforce plans to utilize advanced capabilities such as SageMaker AI’s rolling updates for inference endpoints, streamlining model updates while minimizing operational overhead. This forward-thinking strategy not only enhances deployment efficiency but also paves the way for future AI innovations.
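A rolling update of this kind replaces copies of an inference component in batches rather than all at once. The sketch below builds a request dict shaped for the `UpdateInferenceComponent` API; the component name, new model version, and batch/wait values are hypothetical, and the exact deployment-config fields should be checked against the current SageMaker API reference.

```python
# Hypothetical rolling update: swap in a new model version one copy
# at a time, pausing between batches (all values illustrative).
rolling_update = {
    "InferenceComponentName": "codegen-2b",
    "Specification": {
        "ModelName": "codegen-2b-model-v2",  # hypothetical new version
        "ComputeResourceRequirements": {
            "NumberOfAcceleratorDevicesRequired": 1,
            "MinMemoryRequiredInMb": 16384,
        },
    },
    "DeploymentConfig": {
        "RollingUpdatePolicy": {
            # Replace one copy per batch, waiting 120s between batches.
            "MaximumBatchSize": {"Type": "COPY_COUNT", "Value": 1},
            "WaitIntervalInSeconds": 120,
        },
    },
}

# sm = boto3.client("sagemaker")
# sm.update_inference_component(**rolling_update)
```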

For further insights, check out our detailed articles on high-performance model deployment and getting started with Amazon SageMaker AI.


About the Authors

Rishu Aggarwal: Director of Engineering at Salesforce, focusing on LLM deployment and optimization.

Rielah De Jesus: Principal Solutions Architect at AWS, advocate for cloud migration and technical advisor for enterprise customers.

Pavithra Hariharasudhan: Senior Technical Account Manager at AWS, committed to operational excellence in cloud operations.

Ruchita Jadav: Senior Member of Technical Staff at Salesforce with a focus on scalable AI solutions and inference optimization.

Marc Karp: ML Architect at the Amazon SageMaker Service team, dedicated to designing and managing ML workloads effectively.
