
Optimizing GPU Instance Utilization with AWS Batch for SageMaker Training Jobs

In this post, we explore how Amazon Search improved GPU instance utilization by implementing AWS Batch for SageMaker training jobs. This managed solution has allowed us to efficiently orchestrate machine learning (ML) training workloads on GPU-accelerated instance families such as P5 and P4. Additionally, we’ll provide a step-by-step guide on how to implement a similar system.

Machine Learning at Amazon Search

At Amazon Search, we rely on hundreds of GPU-accelerated instances to train and evaluate ML models that refine our customers’ product discovery experience. Our scientists often train multiple models simultaneously to identify the best features, model architectures, and hyperparameters that optimize performance.

Previously, we used a first-in-first-out (FIFO) queue to manage our model training and evaluation. This approach became increasingly cumbersome, and we recognized the need for a more sophisticated method of prioritizing jobs. We established three tiers of job priority: high priority for production models, medium priority for exploratory research, and low priority for hyperparameter sweeps and batch inference.

We also required a system capable of handling interruptions. For example, if a job failed or an instance type became saturated, it was crucial for the job to automatically switch to another compatible instance type while still adhering to our prioritization framework. Lastly, we sought a managed solution to allow our focus to remain on model development, rather than on managing our infrastructure.

After evaluating different options, we adopted AWS Batch for SageMaker training jobs, as it best addressed our needs. The integration between AWS Batch and SageMaker enabled us to run jobs according to our prioritization criteria, allowing our applied scientists to submit multiple jobs concurrently without needing to manage resources manually. This new system increased our peak utilization of GPU-accelerated instances from 40% to over 80%.

AWS Batch for SageMaker Training Job Implementation

We leveraged three key AWS technologies to establish our job queue.

  1. Service Environments: These defined the maximum GPU capacity available for each instance family (for example, P5 and P4), aligned with our reserved capacity.
  2. Share Identifiers: These prioritized our workloads, ensuring that production jobs had guaranteed access to their capacity.
  3. Amazon CloudWatch: We used this service to monitor our SageMaker training jobs, providing alerting capabilities for critical events or deviations from expected behavior.

Service Environments

We set up service environments representing the total GPU capacity available for each instance family, defining fixed limits that matched our reserved capacity. By allocating portions of this capacity to different production workloads through Share Identifiers, we created a controlled environment for resource usage.
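To make this concrete, here is a minimal sketch of creating a service environment with boto3. The environment name, instance count, and exact field names are assumptions on our part; check the current AWS Batch API reference for create_service_environment before relying on them.

    import boto3

    batch = boto3.client("batch")

    # Sketch: cap the number of P5 instances AWS Batch may use for SageMaker
    # training jobs. Field names are assumptions based on the AWS Batch
    # service-environment API; verify against the current documentation.
    response = batch.create_service_environment(
        serviceEnvironmentName="p5-training",        # placeholder name
        serviceEnvironmentType="SAGEMAKER_TRAINING",
        capacityLimits=[
            {
                "maxCapacity": 100,                  # reserved P5 instances
                "capacityUnit": "NUM_INSTANCES",
            }
        ],
    )
    print(response["serviceEnvironmentArn"])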

For instance, consider 100 GPU instances split evenly between two production experiments, ProdExp1 and ProdExp2, with 50 instances each. If ProdExp2 used only 25 of its instances, ProdExp1 could borrow the idle 25 and scale up to 75. If ProdExp2 later needed its full allocation, the scheduler would rebalance and reclaim the borrowed capacity. This assured both availability and high utilization of resources.

Share Identifiers

We used Share Identifiers to allocate fractions of each service environment’s capacity to production experiments. These string tags, applied at the time of job submission, allowed AWS Batch to track usage and enforce fair-share scheduling.

We defined preset Share Identifiers with quotas for initiatives requiring dedicated capacity. These quotas served as fairness targets, allowing unused capacity to be borrowed, while ensuring that contention for resources was managed through preemption.
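Concretely, these quotas live in an AWS Batch fair-share scheduling policy attached to the job queue. The sketch below uses the standard create_scheduling_policy call; the policy name, share identifiers, and weights are illustrative placeholders rather than our production values. Note that in AWS Batch, a lower weightFactor grants a share proportionally more capacity.

    import boto3

    batch = boto3.client("batch")

    # Illustrative fair-share policy: two production experiments with equal
    # weights. shareDecaySeconds controls how long past usage counts against
    # a share; computeReservation holds back capacity for inactive shares.
    policy = batch.create_scheduling_policy(
        name="gpu-fair-share",  # placeholder policy name
        fairsharePolicy={
            "shareDecaySeconds": 3600,
            "computeReservation": 0,
            "shareDistribution": [
                {"shareIdentifier": "ProdExp1", "weightFactor": 1.0},
                {"shareIdentifier": "ProdExp2", "weightFactor": 1.0},
            ],
        },
    )
    print(policy["arn"])

The job queue is then created with this policy attached, so that submissions carrying a Share Identifier are scheduled against their quota.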

Prioritization in Share Identifiers

Within each Share Identifier, job priorities (ranging from 0 to 99) influenced execution order. Priority-based preemption, however, activated only once the share's allocated capacity was reached. This ensured that higher-priority jobs always took precedence while the load stayed balanced across shares.
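As a sketch of what a prioritized submission looks like, the snippet below assumes the submit_service_job operation from the AWS Batch service-jobs API; the job name, queue name, and payload contents are placeholders, and the field names should be verified against the current documentation.

    import json

    import boto3

    batch = boto3.client("batch")

    # Hypothetical submission: shareIdentifier ties the job to its quota,
    # and schedulingPriority (0-99) orders jobs within that share.
    response = batch.submit_service_job(
        jobName="prod-ranking-model",         # placeholder
        jobQueue="gpu-training-queue",        # placeholder
        serviceJobType="SAGEMAKER_TRAINING",
        shareIdentifier="ProdExp1",
        schedulingPriority=90,                # high priority within the share
        serviceRequestPayload=json.dumps({
            # Abbreviated body of a SageMaker CreateTrainingJob request.
            "TrainingJobName": "prod-ranking-model-001",
        }),
    )
    print(response["jobArn"])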

Amazon CloudWatch

We utilized Amazon CloudWatch to monitor our SageMaker training jobs, tracking job status across various states such as SUBMITTED, RUNNING, and FAILED. By publishing these metrics and statuses to CloudWatch, we maintained operational efficiency without requiring custom monitoring systems.
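As a small illustration (the namespace, metric name, and threshold below are placeholders of our own, not metrics the service publishes by default), a poller that watches job states can publish a custom metric and alarm on failures:

    import boto3

    cloudwatch = boto3.client("cloudwatch")

    # Publish a data point whenever the poller observes a FAILED job.
    # Namespace and metric name are illustrative placeholders.
    cloudwatch.put_metric_data(
        Namespace="SearchML/TrainingQueue",
        MetricData=[{
            "MetricName": "FailedTrainingJobs",
            "Value": 1,
            "Unit": "Count",
        }],
    )

    # Alarm when more than 5 failures occur within a 5-minute window.
    cloudwatch.put_metric_alarm(
        AlarmName="training-job-failures",
        Namespace="SearchML/TrainingQueue",
        MetricName="FailedTrainingJobs",
        Statistic="Sum",
        Period=300,
        EvaluationPeriods=1,
        Threshold=5,
        ComparisonOperator="GreaterThanThreshold",
    )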

Operational Impact on Team Performance

Implementing AWS Batch for SageMaker training jobs significantly enhanced operational performance, allowing researchers to run experiments without resource availability concerns. This change not only resulted in shorter queue times but also increased GPU utilization and accelerated model training turnaround times.

Setting Up AWS Batch for SageMaker Training Jobs

To establish a similar environment, you can reference the tutorial that outlines how to orchestrate multiple GPU-intensive training jobs using AWS Batch with SageMaker. Here’s a brief overview of the setup process:

Prerequisites

  1. Clone the GitHub repository with the assets for deployment:

    git clone https://github.com/aws/amazon-sagemaker-examples/
    cd build_and_train_models/sm-training-queues-pytorch/

Create AWS Batch Resources

The example provides utility functions to automate the creation of the service environment, scheduling policy, and job queue. After executing the following commands, you can navigate to the AWS Batch dashboard to inspect the resources:

    cd smtj_batch_utils
    python create_resources.py

This creates two queues: a FIFO queue for CPU workloads and a fair-share queue for GPU workloads with predefined share weights.

Running LLM Fine-Tuning Jobs on SageMaker

To run fine-tuning workloads, execute the notebook, which submits SageMaker training jobs through AWS Batch, complete with input channels for the training and validation datasets.
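In outline, the notebook builds an ordinary SageMaker estimator and submits it through the Batch-backed queue instead of calling fit directly. The sketch below uses the standard PyTorch estimator; the TrainingQueue wrapper and its submit method are assumptions based on the example repository, and the role, framework versions, and S3 URIs are placeholders:

    from sagemaker.pytorch import PyTorch

    # Standard SageMaker estimator for a fine-tuning script; entry point,
    # role, versions, and hyperparameters are placeholders.
    estimator = PyTorch(
        entry_point="train.py",
        role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
        framework_version="2.3",
        py_version="py311",
        instance_count=1,
        instance_type="ml.p4d.24xlarge",
        hyperparameters={"epochs": 3, "lr": 2e-5},
    )

    # Instead of estimator.fit(...), submit through the Batch-backed queue.
    # TrainingQueue and submit() are assumed from the example repository.
    from sagemaker.aws_batch.queue import TrainingQueue  # assumed import path

    queue = TrainingQueue("gpu-training-queue")  # placeholder queue name
    queue.submit(
        estimator,
        inputs={
            "train": "s3://amzn-s3-demo-bucket/data/train/",
            "validation": "s3://amzn-s3-demo-bucket/data/validation/",
        },
        share_identifier="ProdExp1",
    )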

Conclusion

By implementing AWS Batch for SageMaker training jobs, Amazon Search dramatically improved GPU resource utilization and streamlined training job management. This sophisticated setup allowed for automatic prioritization of workloads, enhancing the model development process significantly.

Organizations facing similar challenges in their ML training infrastructure should explore AWS Batch’s integration with SageMaker. The solution not only simplifies job management but also ensures efficient resource usage.

To begin, access our implementation guide and sample code in the amazon-sagemaker-examples repository on GitHub.

About the Authors

Mona Mona is a Generative AI Specialist Solutions Architect at Amazon and an author who is passionate about leveraging AI for customer solutions.
Mayank Jha is a Senior Machine Learning Engineer at Amazon Search, focused on model training optimization.
Bruno Pistone is a Senior Generative AI and ML Specialist Solutions Architect at AWS.
James Park is a Solutions Architect at AWS, specializing in AI and ML implementations.

Thanks to everyone who collaborated to make this post possible!
