Efficiently Configuring Amazon EKS for Large-Scale Distributed Training of Large Language Models
Training Large Language Models with Amazon EKS: A Step-by-Step Guide
Training state-of-the-art large language models (LLMs) has become a frontier of machine learning, but it demands robust infrastructure and significant resources. A case in point is Meta's Llama 3.1 405B, which was trained on up to 16,000 NVIDIA H100 GPUs and consumed roughly 30.84 million GPU hours. To meet the need for scalable, efficient training infrastructure, Amazon Elastic Kubernetes Service (Amazon EKS) and AWS Deep Learning Containers (DLCs) provide powerful building blocks.
Why Use Amazon EKS and DLCs for LLM Training?
Amazon EKS simplifies the deployment, management, and scaling of Kubernetes clusters, making it easier to stand up the massive distributed compute infrastructure required for LLM training. AWS DLCs are prebuilt, performance-optimized container images for frameworks such as PyTorch, allowing teams to kick off training jobs quickly and avoid dependency and compatibility issues.
The Complexity of Configuring Training Clusters
While these services greatly streamline the process, configuring clusters for large training workloads is still complex. One major challenge is choosing and configuring the GPU instances themselves. Amazon EC2 offers two main families of GPU-powered instances:
- G Family (e.g., G6): Cost-efficient for lighter training and inference.
- P Family (e.g., P6): Designed for large distributed jobs, featuring high memory bandwidth and low-latency networking.
While G instances are more affordable, they lack the performance needed for extreme-scale training. P instances, on the other hand, require meticulous setup of networking, storage, and GPU topology, which leaves ample room for misconfiguration.
A Systematic Approach to Configuring EKS for LLM Training
In this post, we’ll outline a systematic approach to setting up an EKS cluster for training large models using AWS DLCs. This includes:
- Building a Docker Image with Dependencies
- Launching a Stable, GPU-Ready Cluster
- Installing Task-Specific Plugins
- Running Health Checks to Verify Configuration
- Launching a Small Training Job for Validation
Prerequisites
To deploy this solution, ensure you have the following:
- An AWS account with billing enabled
- Sufficient service quota for On-Demand G instances, or a capacity reservation (a quota check sketch follows this list)
- A Hugging Face token for accessing Meta Llama 2 7B
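As a quick sanity check before launching anything, you can confirm your quota and capacity reservation with the AWS CLI. This is only a sketch: it assumes the standard EC2 quota code for "Running On-Demand G and VT instances" (L-DB2E81BA) and the us-west-2 Region, both of which you should verify for your own account in the Service Quotas console.

```bash
# Check the On-Demand vCPU quota that covers G-family instances
# (confirm the quota code for your account in the Service Quotas console).
aws service-quotas get-service-quota \
  --service-code ec2 \
  --quota-code L-DB2E81BA \
  --region us-west-2 \
  --query 'Quota.{Name:QuotaName,vCPUs:Value}'

# If you are using a capacity reservation instead, confirm it is active in your Region.
aws ec2 describe-capacity-reservations \
  --region us-west-2 \
  --query 'CapacityReservations[].{Id:CapacityReservationId,Type:InstanceType,State:State}'
```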
Step 1: Build a Docker Image Using AWS DLC
AWS DLCs provide a fully integrated stack with compatible versions of CUDA, cuDNN, and NCCL, making it easier to run frameworks like PyTorch on AWS. Building from scratch can be tedious and prone to errors, so it’s best to use DLCs as a foundation and extend them for specific workloads.
Steps to Build the Docker Image:
- Launch an EC2 Instance: Use the “Deep Learning Base OSS Nvidia Driver GPU AMI (Ubuntu 24.04)” for building.
- Install Required Tools: Install AWS CLI, kubectl, and eksctl.
- Clone the GitHub Repository: It provides the Dockerfile and helper scripts used throughout this walkthrough.
- Authenticate and Build the Image: Log in to Amazon ECR, build the custom image with Docker, and push it to a private repository (a command sketch follows this list).
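The full Dockerfile lives in the companion repository; the commands below are a minimal sketch of the authenticate, build, and push flow. The account ID placeholder, the llm-training repository name, and the base-image tag are illustrative, not taken from the original post.

```bash
# Log in to the public AWS DLC registry (account 763104351884) and to your own private ECR registry.
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin 763104351884.dkr.ecr.us-east-1.amazonaws.com
aws ecr get-login-password --region us-east-1 | \
  docker login --username AWS --password-stdin <account-id>.dkr.ecr.us-east-1.amazonaws.com

# The repository's Dockerfile extends a DLC PyTorch training image, for example:
#   FROM 763104351884.dkr.ecr.us-east-1.amazonaws.com/pytorch-training:<current-gpu-tag>
# Build the custom image and push it to a private repository.
aws ecr create-repository --repository-name llm-training --region us-east-1 || true
docker build -t <account-id>.dkr.ecr.us-east-1.amazonaws.com/llm-training:latest .
docker push <account-id>.dkr.ecr.us-east-1.amazonaws.com/llm-training:latest
```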
Step 2: Launch EKS Cluster
Using a YAML file, we’ll launch an EKS cluster with the required infrastructure. This includes:
- System Node Group: CPU instances that run cluster add-ons and other system pods.
- GPU Node Group: GPU instances (p4d in this example) dedicated to distributed training.
Update your AWS Region, Availability Zones, and VPC/subnet IDs in the YAML file, then execute:

```bash
eksctl create cluster -f ./eks-p4d-odcr.yaml
```
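For orientation, here is a trimmed sketch of what such an eksctl ClusterConfig typically contains. The repository's eks-p4d-odcr.yaml is the authoritative version; the cluster name, Region, Availability Zone, Kubernetes version, and node counts below are placeholders.

```bash
# Illustrative ClusterConfig sketch (not the full file from the repository).
cat <<'EOF' > ./cluster-config-example.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: llm-training          # placeholder cluster name
  region: us-west-2           # your AWS Region
  version: "1.30"             # a currently supported EKS version
managedNodeGroups:
  # System node group: CPU instances for CoreDNS, operators, and other system pods
  - name: system
    instanceType: m5.xlarge
    desiredCapacity: 2
  # GPU node group: EFA-enabled instances dedicated to distributed training
  - name: gpu
    instanceType: p4d.24xlarge
    desiredCapacity: 2
    availabilityZones: ["us-west-2a"]  # keep GPU nodes in a single AZ
    efaEnabled: true                   # attach Elastic Fabric Adapter interfaces
    privateNetworking: true
EOF
```

Capacity reservation targeting and the VPC/subnet IDs mentioned above go in the same file; consult the eksctl documentation for the exact field names.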
Step 3: Install Training-Specific Plugins
To enable critical functionality for distributed training workloads, install the following plugins (example install commands follow the list):
- NVIDIA Device Plugin: Exposes the nodes' GPUs to the Kubernetes scheduler.
- EFA Plugin: Exposes Elastic Fabric Adapter (EFA) interfaces for high-speed, low-latency inter-node networking.
- Distributed Training Plugins: etcd and the Kubeflow Training Operator, which coordinate and launch multi-node training jobs.
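The exact manifests and versions are in the repository; the commands below are only a sketch of how these plugins are commonly installed, with version tags you should replace with current releases.

```bash
# NVIDIA device plugin: advertises nvidia.com/gpu resources to the scheduler
# (pin a current release tag instead of the example below).
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.1/nvidia-device-plugin.yml

# AWS EFA device plugin: advertises vpc.amazonaws.com/efa resources for high-speed networking.
helm repo add eks https://aws.github.io/eks-charts
helm upgrade --install aws-efa-k8s-device-plugin eks/aws-efa-k8s-device-plugin --namespace kube-system

# Kubeflow Training Operator: provides the PyTorchJob custom resource for multi-node jobs.
kubectl apply -k "github.com/kubeflow/training-operator/manifests/overlays/standalone?ref=v1.7.0"

# etcd: used as the rendezvous backend for multi-node PyTorch launches; the repository
# includes a lightweight etcd manifest to apply with kubectl.
```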
Step 4: Verify Plugins for Distributed Training
Before launching large-scale jobs, it's vital to validate the environment (example spot checks follow the list):
- Check GPU Drivers and NVIDIA-SMI: Verify GPU status and driver versions.
- Validate NCCL: Run the NCCL tests (for example, all_reduce_perf) to confirm multi-node GPU communication works over EFA.
- Run Health Checks: Confirm GPUs, drivers, and necessary plugins are operational.
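The repository ships dedicated health-check jobs; the commands below are a minimal sketch of spot checks you can run by hand. The CUDA image tag and pod name are illustrative.

```bash
# Confirm each GPU node advertises its GPUs and EFA interfaces to the scheduler.
kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu,EFA:.status.allocatable.vpc\.amazonaws\.com/efa"

# Spot-check drivers by running nvidia-smi in a throwaway pod on a GPU node.
cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: nvidia-smi-check
spec:
  restartPolicy: Never
  containers:
    - name: nvidia-smi
      image: nvidia/cuda:12.4.1-base-ubuntu22.04   # illustrative tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
kubectl wait --for=jsonpath='{.status.phase}'=Succeeded pod/nvidia-smi-check --timeout=10m
kubectl logs nvidia-smi-check     # should list the driver version and the node's GPUs
kubectl delete pod nvidia-smi-check
```

For NCCL, the usual approach is a two-node all_reduce_perf run from the nccl-tests suite, launched as a Kubernetes job so that traffic actually crosses the EFA fabric.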
Step 5: Validate with a Sample Workload
Finally, run a small training job to validate the environment, such as supervised fine-tuning of Meta Llama 2 7B. Verify successful execution by checking pod statuses and logs.
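The exact PyTorchJob manifest for the fine-tuning job lives in the repository; the manifest file name, job name, and label selector below are placeholders that follow the Training Operator's conventions.

```bash
# Make the Hugging Face token available to the job (how it is injected depends on the manifest).
export HF_TOKEN=<your-hugging-face-token>

# Submit the fine-tuning job (manifest name is illustrative).
kubectl apply -f ./llama2-sft-pytorchjob.yaml

# Watch the job and its worker pods, then follow the training logs.
kubectl get pytorchjobs
kubectl get pods -l training.kubeflow.org/job-name=llama2-sft -w
kubectl logs -f llama2-sft-worker-0
```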
Conclusion
A well-configured EKS cluster optimized for deep learning workloads unlocks significant potential for training large language models. By following a systematic approach and leveraging AWS's powerful services, teams can shift their focus from infrastructure management to advancing their models' performance.
For in-depth scripts and additional resources, refer to our GitHub repository. With this setup, you’ll be well-equipped to embark on your journey into large-scale machine learning.
About the Authors
Meryem Ozcelik, Pratik Yeole, Felipe Lopez, Jinyan Li, and Sirut “G” Buasai are specialists at AWS with extensive experience in AI/ML solutions, container services, and deep learning optimizations. Their insights guide teams in leveraging AWS for effective AI adoption and scalable ML infrastructure design.