Simplifying Distributed Training and Model Deployment with Amazon SageMaker HyperPod CLI and SDK
Training and deploying large AI models often demands sophisticated distributed computing capabilities. However, managing these distributed systems shouldn't be overly complicated for data scientists and machine learning (ML) practitioners. The newly released command line interface (CLI) and software development kit (SDK) for Amazon SageMaker HyperPod aim to simplify the use of the service's distributed training and inference capabilities.
What is Amazon SageMaker HyperPod?
SageMaker HyperPod is a service that streamlines distributed training and inference, allowing developers to easily run large-scale ML workloads on Kubernetes clusters. With the new CLI and SDK, data scientists can focus on model development rather than getting bogged down in infrastructure complexities.
Features of SageMaker HyperPod CLI
The SageMaker HyperPod CLI offers an intuitive command-line experience, abstracting the underlying complexities of distributed systems. Built on the HyperPod SDK, the CLI provides straightforward commands for common workflows, such as:
- Launching training or fine-tuning jobs
- Deploying inference endpoints
- Monitoring cluster performance
This clarity allows for quicker experimentation and iteration, making it ideal for those who need to validate concepts quickly.
Advanced Customization with SageMaker HyperPod SDK
For more advanced use cases requiring fine-grained control, the SageMaker HyperPod SDK enables programmatic access to customize ML workflows. Developers can leverage the SDK’s Python interface to configure training and deployment parameters precisely, all while working with familiar Python objects.
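As a concrete illustration, the following sketch outlines how a training job could be assembled from Python objects and submitted with a single method call. The class and field names (HyperPodPytorchJob, ReplicaSpec, and so on) are patterned on the CLI example later in this post and should be treated as assumptions; consult the SDK reference for the exact signatures.
from sagemaker.hyperpod.training import (
    Containers,
    HyperPodPytorchJob,
    ReplicaSpec,
    Resources,
    Spec,
    Template,
)
from sagemaker.hyperpod.common.config import Metadata

# Describe the training pods: container image, launch command, and GPU resources
replica_specs = [
    ReplicaSpec(
        name="pod",
        replicas=8,
        template=Template(
            spec=Spec(
                containers=[
                    Containers(
                        name="fsdp-training-container",
                        image="<registry>/fsdp:pytorch2.7.1",
                        command=["hyperpodrun"],
                        args=["--nnodes=8", "--nproc_per_node=1", "/fsdp/train.py"],
                        resources=Resources(
                            requests={"nvidia.com/gpu": "1"},
                            limits={"nvidia.com/gpu": "1"},
                        ),
                    )
                ]
            )
        ),
    )
]

# Assemble and submit the job to the cluster set in the current context
pytorch_job = HyperPodPytorchJob(
    metadata=Metadata(name="fsdp-llama3-1-8b"),
    nproc_per_node="1",
    replica_specs=replica_specs,
)
pytorch_job.create()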
Getting Started: Prerequisites
To follow the examples in this blog post, ensure you have:
- The latest version of the SageMaker HyperPod CLI and SDK installed (version 3.1.0 or above)
- The required Kubernetes operators installed in your cluster
Installing the SageMaker HyperPod CLI
To install the CLI and SDK, run the following command:
pip install sagemaker-hyperpod
After installation, verify the CLI is working with:
hyp
You should see a help message with available commands.
Setting the Cluster Context
The SageMaker HyperPod CLI and SDK interact with the cluster through the Kubernetes API, so you first need to set the cluster context. Use the CLI to list the clusters available in your AWS account:
hyp list-cluster
To set the cluster context, specify the cluster name as input:
hyp set-cluster-context --cluster-name ml-cluster
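The same operations are available programmatically. Here is a minimal sketch, assuming the package exposes list_clusters and set_cluster_context helpers that mirror the CLI commands above:
from sagemaker.hyperpod import list_clusters, set_cluster_context

# List the HyperPod clusters visible in the given AWS Region
list_clusters(region="us-west-2")

# Point subsequent CLI and SDK calls at a specific cluster
set_cluster_context("ml-cluster", region="us-west-2")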
Training Models with SageMaker HyperPod
You can submit PyTorch model training and fine-tuning jobs to a SageMaker HyperPod cluster using the HyperPod CLI. For example, you can schedule a training job for a Meta Llama 3.1 8B model with a single command:
hyp create hyp-pytorch-job \
--job-name fsdp-llama3-1-8b \
--image ${REGISTRY}fsdp:pytorch2.7.1 \
--command '[hyperpodrun, --tee=3, --log_dir=/tmp/hyperpod, --nproc_per_node=1, --nnodes=8, /fsdp/train.py]' \
--args '[...]' \
--environment '{"PYTORCH_CUDA_ALLOC_CONF": "max_split_size_mb:32"}' \
...
Monitoring Training Jobs
You can monitor the status of your training job with:
hyp list hyp-pytorch-job
To check the logs of a specific pod:
hyp get-logs hyp-pytorch-job --job-name fsdp-llama3-1-8b --pod-name fsdp-llama3-1-8b-pod-0
Deploying Models with SageMaker HyperPod
The CLI also offers commands for quickly deploying models for inference, making it easy to deploy both foundation models (FMs) and custom models.
Deploying SageMaker JumpStart Models
To deploy an FM from SageMaker JumpStart, run:
hyp create hyp-jumpstart-endpoint \
--model-id deepseek-llm-r1-distill-qwen-1-5b \
--instance-type ml.g5.8xlarge \
...
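The SDK offers an equivalent object-based interface. In the hedged sketch below, the HPJumpStartEndpoint class and its Model and Server configuration objects are assumptions patterned on the CLI flags above; verify the import paths and signatures against the SDK documentation.
from sagemaker.hyperpod.inference.config.hp_jumpstart_endpoint_config import Model, Server
from sagemaker.hyperpod.inference.hp_jumpstart_endpoint import HPJumpStartEndpoint

# Identify the JumpStart model and the instance type that will serve it
model = Model(model_id="deepseek-llm-r1-distill-qwen-1-5b")
server = Server(instance_type="ml.g5.8xlarge")

# Create the endpoint on the cluster set in the current context
js_endpoint = HPJumpStartEndpoint(model=model, server=server)
js_endpoint.create()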
Deploying Custom Models
To deploy custom models with artifacts stored on Amazon S3 or FSx for Lustre, you can use a command like:
hyp create hyp-custom-endpoint \
--endpoint-name my-custom-tinyllama-endpoint \
--model-name tinyllama \
--model-source-type s3 \
...
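A corresponding SDK sketch for a custom model follows. The HPEndpoint class and its fields are illustrative assumptions that mirror the CLI flags; a real configuration also needs details such as the model artifact location, serving container, and invocation port.
from sagemaker.hyperpod.inference.hp_endpoint import HPEndpoint

# Illustrative only: field names are assumptions patterned on the CLI flags
custom_endpoint = HPEndpoint(
    endpoint_name="my-custom-tinyllama-endpoint",
    model_name="tinyllama",
    model_source_type="s3",
)
custom_endpoint.create()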
Cleaning Up
To delete your training job and model deployments, use:
hyp delete hyp-pytorch-job --job-name fsdp-llama3-1-8b
hyp delete hyp-jumpstart-endpoint --name deepseek-distill-qwen-endpoint-cli
Conclusion
The Amazon SageMaker HyperPod CLI and SDK significantly simplify training and deploying large-scale AI models, supporting both rapid experimentation and robust, production-ready deployments. With these tools, data scientists can streamline their workflows and focus on what truly matters: developing innovative AI solutions.
About the Authors
Giuseppe Angelo Porcelli is a Principal Machine Learning Specialist Solutions Architect for AWS, focusing on delivering AI and ML solutions.
Shweta Singh, a Senior Product Manager at AWS, specializes in the SageMaker Python SDK.
Nicolas Jourdan is a Specialist Solutions Architect at AWS, renowned for his expertise in AI and ML applications across diverse industries.
For further details and examples, check out the SageMaker HyperPod documentation and explore the resources available.