Simplifying Distributed Training and Model Deployment with Amazon SageMaker HyperPod CLI and SDK
Training and deploying large AI models often demands sophisticated distributed computing capabilities. However, managing these distributed systems shouldn't be overly complicated for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod aim to simplify the use of the service's distributed training and inference capabilities.
What is Amazon SageMaker HyperPod?
SageMaker HyperPod is a service that streamlines distributed training and inference, allowing developers to easily run large-scale ML workloads on Kubernetes clusters. With the new CLI and SDK, data scientists can focus on model development rather than getting bogged down in infrastructure complexities.
Features of SageMaker HyperPod CLI
The SageMaker HyperPod CLI offers an intuitive command-line experience, abstracting the underlying complexities of distributed systems. Built on the HyperPod SDK, the CLI provides straightforward commands for common workflows, such as:
- Launching training or fine-tuning jobs
- Deploying inference endpoints
- Monitoring cluster performance
This clarity allows for quicker experimentation and iteration, making it ideal for those who need to validate concepts quickly.
Advanced Customization with SageMaker HyperPod SDK
For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize ML workflows. Developers can leverage the SDK’s Python interface to configure training and deployment parameters precisely, all while working with familiar Python objects.
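As a concrete illustration, the following sketch outlines how a training job could be assembled from Python objects and submitted with a single method call. The class and field names (HyperPodPytorchJob, ReplicaSpec, and so on) are patterned on the CLI example later in this post and should be treated as assumptions; consult the SDK reference for the exact signatures.
from sagemaker.hyperpod.training import (
    Containers,
    HyperPodPytorchJob,
    ReplicaSpec,
    Resources,
    Spec,
    Template,
)
from sagemaker.hyperpod.common.config import Metadata

# Describe the training pods: container image, launch command, and GPU resources
replica_specs = [
    ReplicaSpec(
        name="pod",
        replicas=8,
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        name="fsdp-training-container",
                        image="<registry>/fsdp:pytorch2.7.1",
                        command=["hyperpodrun"],
                        args=["--nnodes=8", "--nproc_per_node=1", "/fsdp/train.py"],
                        resources=Resources(
                            requests={"nvidia.com/gpu": "1"},
                            limits={"nvidia.com/gpu": "1"},
                        ),
                    )
                ]
            )
        ),
    )
]

# Assemble and submit the job to the cluster set in the current context
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="fsdp-llama3-1-8b"),
    nproc_per_node="1",
    replica_specs=replica_specs,
)
pytorch_job.create()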
Getting Started: Prerequisites
To follow the examples in this blog post, ensure you have:
- The latest version of the SageMaker HyperPod CLI and SDK installed (version 3.1.0 or above)
- The required Kubernetes operators installed in your cluster
Installing the SageMaker HyperPod CLI
To install the CLI and SDK, run the following command:
pip install sagemaker-hyperpod
After installation, verify the CLI is working with:
hyp
You should see a help message with available commands.
Setting the Cluster Context
The SageMaker HyperPod CLI and SDK interact with the cluster through the Kubernetes API, so you first need to set the cluster context. Use the CLI to list the clusters available in your AWS account:
hyp list-cluster
To set the cluster context, specify the cluster name as input:
hyp set-cluster-context --cluster-name ml-cluster
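The same operations are available programmatically. Here is a minimal sketch, assuming the package exposes list_clusters and set_cluster_context helpers that mirror the CLI commands above:
from sagemaker.hyperpod import list_clusters, set_cluster_context

# List the HyperPod clusters visible in the given AWS Region
list_clusters(region="us-west-2")

# Point subsequent CLI and SDK calls at a specific cluster
set_cluster_context("ml-cluster", region="us-west-2")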
Training Models with SageMaker HyperPod
You can submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster using the HyperPod CLI. For example, you can schedule a training job for a Meta Llama 3.1 8B model with a single command:
hyp create hyp-pytorch-job \
--job-name fsdp-llama3-1-8b \
--image ${REGISTRY}fsdp:pytorch2.7.1 \
--command '[hyperpodrun, --tee=3, --log_dir=/tmp/hyperpod, --nproc_per_node=1, --nnodes=8, /fsdp/train.py]' \
--args '[...]' \
--environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
...
Monitoring Training Jobs
You can monitor the status of your training job with:
hyp list hyp-pytorch-job
To check the logs of a specific pod:
hyp get-logs hyp-pytorch-job --job-name fsdp-llama3-1-8b --pod-name fsdp-llama3-1-8b-pod-0
Deploying Models with SageMaker HyperPod
The CLI also offers commands for quickly deploying models for inference, making it easy to deploy both foundation models (FMs) and custom models.
Deploying SageMaker JumpStart Models
To deploy an FM from SageMaker JumpStart, run:
hyp create hyp-jumpstart-endpoint \
--model-id deepseek-llm-r1-distill-qwen-1-5b \
--instance-type ml.g5.8xlarge \
...
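The SDK offers an equivalent object-based interface. In the hedged sketch below, the HPJumpStartEndpoint class and its Model and Server configuration objects are assumptions patterned on the CLI flags above; verify the import paths and signatures against the SDK documentation.
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

# Identify the JumpStart model and the instance type that will serve it
model = Model(model_id="deepseek-llm-r1-distill-qwen-1-5b")
server = Server(instance_type="ml.g5.8xlarge")

# Create the endpoint on the cluster set in the current context
js_endpoint = HPJumpStartEndpoint(model=model, server=server)
js_endpoint.create()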
Deploying Custom Models
To deploy custom models with artifacts stored on Amazon S3 or FSx for Lustre, you can use a command like:
hyp create hyp-custom-endpoint \
--endpoint-name my-custom-tinyllama-endpoint \
--model-name tinyllama \
--model-source-type s3 \
...
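A corresponding SDK sketch for a custom model follows. The HPEndpoint class and its fields are illustrative assumptions that mirror the CLI flags; a real configuration also needs details such as the model artifact location, serving container, and invocation port.
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Illustrative only: field names are assumptions patterned on the CLI flags
custom_endpoint = HPEndpoint(
    endpoint_name="my-custom-tinyllama-endpoint",
    model_name="tinyllama",
    model_source_type="s3",
)
custom_endpoint.create()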
Cleaning Up
To delete your training job and model deployments, use:
hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b
hyp delete hyp-jumpstart-endpoint --name deepseek-distill-qwen-endpoint-cli
Conclusion
The Amazon SageMaker HyperPod CLI and SDK significantly simplify training and deploying large-scale AI models, supporting both rapid experimentation and robust, production-ready deployments. With these tools, data scientists can streamline their workflows and focus on what truly matters: developing innovative AI solutions.
About the Authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for AWS, focusing on delivering AI and ML solutions.
Shweta Singh, a Senior Product Manager at AWS, specializes in the SageMaker Python SDK.
Nicolas Jourdan is a Specialist Solutions Architect at AWS, renowned for his expertise in AI and ML applications across diverse industries.
For further details and examples, check out the SageMaker HyperPod documentation and explore the resources available.