Ensuring Reliable GPU Capacity for Large Language Model Inference with Amazon SageMaker Training Plans

Deploying large language models (LLMs) for inference poses unique challenges, especially during critical evaluation periods, limited-duration production testing, or burst workloads. Capacity constraints can not only delay deployments but also impact the overall performance of applications, leading to unpredictable outcomes. Fortunately, Amazon SageMaker AI training plans now offer a game-changing solution by allowing customers to reserve compute capacity specifically for inference workloads.

Understanding the Problem: Capacity Constraints

Imagine a data science team tasked with evaluating several fine-tuned language models over two weeks. They require seamless access to powerful GPU instances, like the ml.p5.48xlarge, to conduct comparative benchmarks. However, on-demand capacity can be unpredictable during peak usage times in their chosen AWS Region. In this scenario, ensuring uninterrupted access to the required resources becomes crucial for maintaining evaluation timelines.

Introducing Amazon SageMaker AI Training Plans

Amazon SageMaker AI training plans were initially designed for training workloads but now offer the flexibility to support inference endpoints. This means that organizations can reserve GPU capacity in advance, ensuring predictable availability for time-sensitive inference tasks.

Key Benefits

  • Predictability: Secure GPU resources for the intended duration without concerns about on-demand availability.
  • Cost Control: Create a budget-friendly approach by reserving capacity at a fixed rate for specified time periods.

The Journey of a Data Scientist Using Training Plans

Let’s walk through a typical example of how a data scientist can utilize Amazon SageMaker training plans to reserve capacity for model evaluation and manage their endpoint effectively throughout the reservation lifecycle.

Solution Overview

SageMaker AI training plans allow teams to reserve compute capacity tailored for specific time windows. When creating a training plan, the team specifies the target resource type as "endpoint" to secure p-family GPU instances for inference workloads.

Phases of the Training Plan Workflow

  1. Identify Capacity Requirements: Determine the instance type, instance count, and duration needed for the inference workload.
  2. Search Available Offerings: Query for capacity that matches the requirements and desired time window using the SageMaker API.
  3. Create Reservation: Choose a suitable offering and create the training plan reservation, thereby generating an Amazon Resource Name (ARN).
  4. Deploy and Manage Endpoint: Configure the SageMaker AI endpoint using the reserved capacity and oversee its lifecycle during the reservation period.

Step-by-Step Implementation

Prerequisites

Ensure that you have the following set up:

  • An AWS account with IAM permissions to access SageMaker.
  • Necessary SDKs installed for executing commands.
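As a quick pre-flight check (a minimal sketch, assuming the AWS CLI v2 is installed and credentials are configured), you can confirm account access and the default Region before running the commands below:

```shell
# Hypothetical pre-flight check: confirm the AWS CLI can reach your account
# and that a default Region is configured. Uncomment the last line to run it.
preflight_check() {
  aws sts get-caller-identity --query "Account" --output text || return 1
  aws configure get region || return 1
}
# preflight_check
```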

Step 1: Search for Capacity Offerings and Create a Reservation

The team identifies available p-family GPU capacity matching their evaluation needs. Using the search-training-plan-offerings API, they specify parameters that align with their timeline.

Example Command:

aws sagemaker search-training-plan-offerings \
--target-resources "endpoint" \
--instance-type "ml.p5.48xlarge" \
--instance-count 1 \
--duration-hours 168 \
--start-time-after "2025-01-27T15:48:14-04:00" \
--end-time-before "2025-01-31T14:48:14-05:00"

After running the command, they receive a list of available offerings with pricing and availability details.
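To select an offering programmatically rather than by eye, the same call can be narrowed with a JMESPath `--query`. This is a sketch; the `TrainingPlanOfferings` and `TrainingPlanOfferingId` field names are our assumption about the response shape:

```shell
# Hypothetical helper: return the ID of the first offering that matches
# the evaluation requirements. Field names assume the response shape of
# search-training-plan-offerings; verify against your CLI version.
first_offering_id() {
  aws sagemaker search-training-plan-offerings \
    --target-resources "endpoint" \
    --instance-type "ml.p5.48xlarge" \
    --instance-count 1 \
    --duration-hours 168 \
    --query "TrainingPlanOfferings[0].TrainingPlanOfferingId" \
    --output text
}
# OFFERING_ID=$(first_offering_id)   # uncomment to run against your account
```

The resulting ID feeds directly into the create-training-plan call in the next step.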

Step 2: Create the Training Plan Reservation

Once a suitable offering is identified, the team can make a reservation.
Example Command:

aws sagemaker create-training-plan \
--training-plan-offering-id "tpo-SHA-256-hash-value" \
--training-plan-name "p4-for-inference-endpoint"

The reservation generates an ARN essential for linking the endpoint to the reserved capacity.
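Before linking an endpoint to the reservation, it is worth confirming the plan's state. A minimal sketch, assuming describe-training-plan reports a `Status` field (for example, `Scheduled` before the start time and then `Active`):

```shell
# Hypothetical status check for the reservation created above.
plan_status() {
  aws sagemaker describe-training-plan \
    --training-plan-name "p4-for-inference-endpoint" \
    --query "Status" --output text
}
# plan_status   # uncomment to run against your account
```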

Step 3: Configure Endpoint with Training Plan Reservation

The team now sets up an endpoint configuration linking to the reserved capacity.
Example Command:

aws sagemaker create-endpoint-config \
--endpoint-config-name "ftp-ep-config" \
--production-variants '[{
  "VariantName": "AllTraffic",
  "ModelName": "my-model",
  "InitialInstanceCount": 1,
  "InstanceType": "ml.p5.48xlarge",
  "CapacityReservationConfig": {
    "CapacityReservationPreference": "capacity-reservations-only",
    "MlReservationArn": "arn:aws:sagemaker:us-east-1:123456789123:training-plan/p4-for-inference-endpoint"
  }
}]'

Step 4: Deploy the Endpoint

Once the configuration is completed, the next step is to deploy the endpoint.
Example Command:

aws sagemaker create-endpoint \
--endpoint-name "my-endpoint" \
--endpoint-config-name "ftp-ep-config"

The endpoint now runs entirely within the reserved training plan capacity.
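Because create-endpoint is asynchronous, the endpoint sits in `Creating` for several minutes before it can serve traffic. A short sketch using the CLI's built-in endpoint-in-service waiter:

```shell
# Block until the endpoint leaves "Creating", then print its final status.
wait_for_endpoint() {
  aws sagemaker wait endpoint-in-service --endpoint-name "$1"
  aws sagemaker describe-endpoint --endpoint-name "$1" \
    --query "EndpointStatus" --output text
}
# wait_for_endpoint "my-endpoint"   # uncomment to run against your account
```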

Step 5: Invoking the Endpoint During the Reservation

With the endpoint in service, evaluation workloads can commence using the reserved capacity to ensure performance and availability are maintained.
Example Command:

aws sagemaker-runtime invoke-endpoint \
--endpoint-name "my-endpoint" \
--body fileb://input.json \
--content-type "application/json" \
Output.json
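The model's response is written to Output.json. When the evaluation wraps up, or the reservation nears its end, tearing down the resources avoids on-demand charges beyond the reserved window. A minimal cleanup sketch, assuming the resource names used in the steps above:

```shell
# Remove the endpoint, its configuration, and the model in dependency order.
cleanup() {
  aws sagemaker delete-endpoint --endpoint-name "my-endpoint"
  aws sagemaker delete-endpoint-config --endpoint-config-name "ftp-ep-config"
  aws sagemaker delete-model --model-name "my-model"
}
# cleanup   # uncomment to run against your account
```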

Conclusion

Amazon SageMaker AI training plans provide an efficient way to reserve p-family GPU capacity and deploy SageMaker AI inference endpoints with predictable availability. The data science team in our example used a training plan to run their model evaluations without interruption from capacity constraints.

Whether for competitive benchmarks or limited-duration tests, training plans provide a controlled evaluation environment while keeping costs predictable. Every step, from reserving capacity to deploying and managing the endpoint, benefits from knowing the required GPU capacity is guaranteed for the reservation window.

Acknowledgments

Special thanks to the contributors who helped shape this post.

About the Authors

Kareem Syed-Mohammed: Product Manager at AWS focusing on Gen AI model development.
Chaoneng Quan: Software Development Engineer on the AWS SageMaker team, optimizing GPU capacity management.
Dan Ferguson: Solutions Architect at AWS, guiding customers through machine learning integrations.

By leveraging these insights and best practices, organizations can streamline their deployment processes and make informed decisions when evaluating language models, ensuring they are equipped to handle the demands of modern AI applications.
