
Optimizing GPU Instance Utilization with AWS Batch for SageMaker Training Jobs

In this post, we explore how Amazon Search improved GPU instance utilization by implementing AWS Batch for SageMaker training jobs. This managed solution has allowed us to efficiently orchestrate machine learning (ML) training workloads on GPU-accelerated instance families such as P5 and P4. Additionally, we’ll provide a step-by-step guide on how to implement a similar system.

Machine Learning at Amazon Search

At Amazon Search, we rely on hundreds of GPU-accelerated instances to train and evaluate ML models that refine our customers’ product discovery experience. Our scientists often train multiple models simultaneously to identify the best features, model architectures, and hyperparameters that optimize performance.

Previously, we used a first-in-first-out (FIFO) queue to manage our model training and evaluation. This approach became increasingly cumbersome, and we recognized the need for a more sophisticated method of prioritizing jobs. We established three tiers of job priority: high priority for production models, medium priority for exploratory research, and low priority for hyperparameter sweeps and batch inference.

We also required a system capable of handling interruptions. For example, if a job failed or an instance type became saturated, it was crucial for the job to automatically switch to another compatible instance type while still adhering to our prioritization framework. Lastly, we sought a managed solution to allow our focus to remain on model development, rather than on managing our infrastructure.

After evaluating different options, we adopted AWS Batch for SageMaker training jobs, as it best addressed our needs. The integration between AWS Batch and SageMaker enabled us to run jobs according to our prioritization criteria, allowing our applied scientists to submit multiple jobs concurrently without needing to manage resources manually. This new system increased our peak utilization of GPU-accelerated instances from 40% to over 80%.

AWS Batch for SageMaker Training Job Implementation

We leveraged three key AWS technologies to establish our job queue.

  1. Service Environments: These defined the maximum GPU capacity available for each instance family (for example, P5 and P4), aligned with our reserved capacity.
  2. Share Identifiers: These prioritized our workloads, ensuring that production jobs had guaranteed access to their capacity.
  3. Amazon CloudWatch: We used this service to monitor our SageMaker training jobs, providing alerting capabilities for critical events or deviations from expected behavior.

Service Environments

We set up service environments representing the total GPU capacity available for each instance family, defining fixed limits that matched our reserved capacity. By allocating portions of this capacity to different production workloads through Share Identifiers, we created a controlled environment for resource usage.
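To make this concrete, here is a minimal sketch of creating a service environment with boto3. The environment name, instance count, and exact field names are assumptions on our part; check the current AWS Batch API reference for create_service_environment before relying on them.

    import boto3

    batch = boto3.client("batch")

    # Sketch: cap the number of P5 instances AWS Batch may use for SageMaker
    # training jobs. Field names are assumptions based on the AWS Batch
    # service-environment API; verify against the current documentation.
    response = batch.create_service_environment(
        serviceEnvironmentName="p5-training",        # placeholder name
        serviceEnvironmentType="SAGEMAKER_TRAINING",
        capacityLimits=[
            {
                "maxCapacity": 100,                  # reserved P5 instances
                "capacityUnit": "NUM_INSTANCES",
            }
        ],
    )
    print(response["serviceEnvironmentArn"])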

For instance, consider 100 GPU instances split evenly between two production experiments, ProdExp1 and ProdExp2, with 50 instances each. If ProdExp2 used only 25 of its instances, ProdExp1 could borrow the idle 25 and scale up to 75. If ProdExp2 later needed its full allocation, the scheduler would rebalance and reclaim the borrowed capacity. This assured both availability and high utilization of resources.

Share Identifiers

We used Share Identifiers to allocate fractions of each service environment’s capacity to production experiments. These string tags, applied at the time of job submission, allowed AWS Batch to track usage and enforce fair-share scheduling.

We defined preset Share Identifiers with quotas for initiatives requiring dedicated capacity. These quotas served as fairness targets, allowing unused capacity to be borrowed, while ensuring that contention for resources was managed through preemption.
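Concretely, these quotas live in an AWS Batch fair-share scheduling policy attached to the job queue. The sketch below uses the standard create_scheduling_policy call; the policy name, share identifiers, and weights are illustrative placeholders rather than our production values. Note that in AWS Batch, a lower weightFactor grants a share proportionally more capacity.

    import boto3

    batch = boto3.client("batch")

    # Illustrative fair-share policy: two production experiments with equal
    # weights. shareDecaySeconds controls how long past usage counts against
    # a share; computeReservation holds back capacity for inactive shares.
    policy = batch.create_scheduling_policy(
        name="gpu-fair-share",  # placeholder policy name
        fairsharePolicy={
            "shareDecaySeconds": 3600,
            "computeReservation": 0,
            "shareDistribution": [
                {"shareIdentifier": "ProdExp1", "weightFactor": 1.0},
                {"shareIdentifier": "ProdExp2", "weightFactor": 1.0},
            ],
        },
    )
    print(policy["arn"])

The job queue is then created with this policy attached, so that submissions carrying a Share Identifier are scheduled against their quota.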

Prioritization in Share Identifiers

Within each Share Identifier, job priorities (ranging from 0 to 99) influenced execution order. Priority-based preemption, however, activated only once the share's allocated capacity was reached. This ensured that higher-priority jobs always took precedence while the load stayed balanced across shares.
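As a sketch of what a prioritized submission looks like, the snippet below assumes the submit_service_job operation from the AWS Batch service-jobs API; the job name, queue name, and payload contents are placeholders, and the field names should be verified against the current documentation.

    import json

    import boto3

    batch = boto3.client("batch")

    # Hypothetical submission: shareIdentifier ties the job to its quota,
    # and schedulingPriority (0-99) orders jobs within that share.
    response = batch.submit_service_job(
        jobName="prod-ranking-model",         # placeholder
        jobQueue="gpu-training-queue",        # placeholder
        serviceJobType="SAGEMAKER_TRAINING",
        shareIdentifier="ProdExp1",
        schedulingPriority=90,                # high priority within the share
        serviceRequestPayload=json.dumps({
            # Abbreviated body of a SageMaker CreateTrainingJob request.
            "TrainingJobName": "prod-ranking-model-001",
        }),
    )
    print(response["jobArn"])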

Amazon CloudWatch

We utilized Amazon CloudWatch to monitor our SageMaker training jobs, tracking job status across various states such as SUBMITTED, RUNNING, and FAILED. By publishing these metrics and statuses to CloudWatch, we maintained operational efficiency without requiring custom monitoring systems.
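As a small illustration (the namespace, metric name, and threshold below are placeholders of our own, not metrics the service publishes by default), a poller that watches job states can publish a custom metric and alarm on failures:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a data point whenever the poller observes a FAILED job.
    # Namespace and metric name are illustrative placeholders.
    cloudwatch.put_metric_data(
        Namespace="SearchML/TrainingQueue",
        MetricData=[{
            "MetricName": "FailedTrainingJobs",
            "Value": 1,
            "Unit": "Count",
        }],
    )

    # Alarm when more than 5 failures occur within a 5-minute window.
    cloudwatch.put_metric_alarm(
        AlarmName="training-job-failures",
        Namespace="SearchML/TrainingQueue",
        MetricName="FailedTrainingJobs",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=5,
        ComparisonOperator="GreaterThanThreshold",
    )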

Operational Impact on Team Performance

Implementing AWS Batch for SageMaker training jobs significantly enhanced operational performance, allowing researchers to run experiments without resource availability concerns. This change not only resulted in shorter queue times but also increased GPU utilization and accelerated model training turnaround times.

Setting Up AWS Batch for SageMaker Training Jobs

To establish a similar environment, you can reference the tutorial that outlines how to orchestrate multiple GPU-intensive training jobs using AWS Batch with SageMaker. Here’s a brief overview of the setup process:

Prerequisites

  1. Clone the GitHub repository with the assets for deployment:

    git clone https://github.com/aws/amazon-sagemaker-examples/
    cd build_and_train_models/sm-training-queues-pytorch/

Create AWS Batch Resources

The example provides utility functions to automate the creation of the service environment, scheduling policy, and job queue. After executing the following commands, you can navigate to the AWS Batch dashboard to inspect the resources:

    cd smtj_batch_utils
    python create_resources.py

This creates two queues: a FIFO queue for CPU workloads and a fair-share queue for GPU workloads with predefined share weights.

Running LLM Fine-Tuning Jobs on SageMaker

To run fine-tuning workloads, execute the notebook, which submits SageMaker training jobs through AWS Batch, complete with input channels for the training and validation datasets.
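In outline, the notebook builds an ordinary SageMaker estimator and submits it through the Batch-backed queue instead of calling fit directly. The sketch below uses the standard PyTorch estimator; the TrainingQueue wrapper and its submit method are assumptions based on the example repository, and the role, framework versions, and S3 URIs are placeholders:

    from sagemaker.pytorch import PyTorch

    # Standard SageMaker estimator for a fine-tuning script; entry point,
    # role, versions, and hyperparameters are placeholders.
    estimator = PyTorch(
        entry_point="train.py",
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        framework_version="2.3",
        py_version="py311",
        instance_count=1,
        instance_type="ml.p4d.24xlarge",
        hyperparameters={"epochs": 3, "lr": 2e-5},
    )

    # Instead of estimator.fit(...), submit through the Batch-backed queue.
    # TrainingQueue and submit() are assumed from the example repository.
    from sagemaker.aws_batch.queue import TrainingQueue  # assumed import path

    queue = TrainingQueue("gpu-training-queue")  # placeholder queue name
    queue.submit(
        estimator,
        inputs={
            "train": "s3://amzn-s3-demo-bucket/data/train/",
            "validation": "s3://amzn-s3-demo-bucket/data/validation/",
        },
        share_identifier="ProdExp1",
    )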

Conclusion

By implementing AWS Batch for SageMaker training jobs, Amazon Search dramatically improved GPU resource utilization and streamlined training job management. This sophisticated setup allowed for automatic prioritization of workloads, enhancing the model development process significantly.

Organizations facing similar challenges in their ML training infrastructure should explore AWS Batch’s integration with SageMaker. The solution not only simplifies job management but also ensures efficient resource usage.

To begin, access our implementation guide and sample code in the amazon-sagemaker-examples repository on GitHub.

About the Authors

Mona Mona is a Generative AI Specialist Solutions Architect at Amazon and an author who is passionate about leveraging AI for customer solutions.
Mayank Jha is a Senior Machine Learning Engineer at Amazon Search, focused on model training optimization.
Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect at AWS.
James Park is a Solutions Architect at AWS, specializing in AI and ML implementations.

Thanks to everyone who collaborated to make this post possible!
