Streamlining AI Model Management with Amazon SageMaker HyperPod CLI and SDK
Simplifying Distributed Computing for Data Scientists
Overview of SageMaker HyperPod CLI and SDK
A Layered Architecture for Simplicity
Prerequisites for Using SageMaker HyperPod
Installing the SageMaker HyperPod CLI
Creating a New HyperPod Cluster
Monitoring the HyperPod Cluster Creation Process
Connecting to a Cluster
Modifying an Existing HyperPod Cluster
Deleting an Existing HyperPod Cluster
Exploring the SageMaker HyperPod SDK
Conclusion: Enhancing AI Workflow Efficiency
About the Authors
Simplifying Distributed AI Model Training with SageMaker HyperPod CLI and SDK
Training and deploying large AI models requires robust distributed computing, but managing that infrastructure shouldn’t be a burden for data scientists and machine learning (ML) practitioners. Amazon SageMaker HyperPod uses Amazon Elastic Kubernetes Service (Amazon EKS) orchestration to simplify cluster infrastructure management, so users can focus on building and optimizing models.
Simplified User Experience
The SageMaker HyperPod Command Line Interface (CLI) offers an intuitive command-line experience that abstracts the complexities of distributed systems. Built on top of the HyperPod SDK, the CLI provides straightforward commands that enable data scientists to manage HyperPod clusters effectively. Whether it’s launching training or fine-tuning jobs, deploying inference endpoints, or monitoring cluster performance, the HyperPod CLI facilitates quick experimentation and iteration.
A Layered Architecture for Simplicity
The HyperPod CLI and SDK share a multi-layered architecture. Both are user-facing entry points built on the same SDK components. This shared layer automates infrastructure by orchestrating cluster lifecycle management through AWS CloudFormation stack provisioning and AWS API calls. Workloads such as training jobs and integrated development environments are expressed as Kubernetes custom resources, defined through Custom Resource Definitions (CRDs) and managed through the Kubernetes API.
In this post, we’ll explore how to use the CLI and SDK to create and manage SageMaker HyperPod clusters within your AWS account. While this piece focuses on cluster creation and management, a companion post dives deeper into submitting training jobs and deploying inference endpoints.
Prerequisites
Before following the examples provided, ensure you have the necessary prerequisites, including access to an AWS account.
Installing the SageMaker HyperPod CLI
To begin, install the latest version of the SageMaker HyperPod CLI and SDK. The commands illustrated here are based on version 3.5.0. From your local environment, execute:
pip install sagemaker-hyperpod
This command installs the tools needed to work with SageMaker HyperPod clusters. To verify the installation, run:
hyp
You should see output detailing available commands, confirming that the CLI has been correctly installed.
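The same check can be scripted, for example in a CI job or notebook setup cell. The following is a minimal sketch using only the Python standard library; the helper name `hyperpod_install_status` is hypothetical, and it assumes the distribution name matches the `pip install` name above and that the CLI entry point is called `hyp`.

```python
import shutil
from importlib import metadata


def hyperpod_install_status(dist_name="sagemaker-hyperpod", executable="hyp"):
    """Report whether the package and its CLI entry point are visible locally."""
    try:
        version = metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        version = None  # package is not installed in this Python environment
    return {
        "version": version,                    # installed version string, or None
        "cli_path": shutil.which(executable),  # None if the executable is not on PATH
    }


print(hyperpod_install_status())
```

If `version` is set but `cli_path` is `None`, the package is probably installed in a different environment from the one your shell is using.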
Creating a New HyperPod Cluster
Both the AWS Management Console and HyperPod CLI provide streamlined experiences for cluster creation. The console offers a guided approach, while the CLI is favored for programmatic use, enabling reproducibility and automation.
To initialize a new cluster configuration via the CLI, run:
hyp init cluster-stack
This command sets up a cluster configuration in the current directory, generating a config.yaml file where you can define your cluster specifications.
Here’s a partial view of the config.yaml file:
resource_name_prefix: hyp-eks-stack
create_hyperpod_cluster_stack: True
hyperpod_cluster_name: hyperpod-cluster
create_eks_cluster_stack: True
kubernetes_version: 1.31
You can change these parameters either by editing config.yaml directly or by using the CLI’s hyp configure command.
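Before submitting, it can help to print the values you are about to deploy. The sketch below is an illustration only: `read_flat_config` is a hypothetical helper that reads top-level `key: value` lines and is not a full YAML parser. It keeps every value as a string, which sidesteps the fact that a real YAML loader would typically read an unquoted `1.31` as a floating-point number.

```python
def read_flat_config(text):
    """Parse top-level `key: value` lines; skips blanks, comments, and indented lines."""
    settings = {}
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped or stripped.startswith("#") or line[0].isspace():
            continue
        key, sep, value = stripped.partition(":")
        if sep:  # only keep lines that actually contain a colon
            settings[key.strip()] = value.strip()
    return settings


# The partial config.yaml shown above, inlined so the example is self-contained.
SAMPLE = """\
resource_name_prefix: hyp-eks-stack
create_hyperpod_cluster_stack: True
hyperpod_cluster_name: hyperpod-cluster
create_eks_cluster_stack: True
kubernetes_version: 1.31
"""

settings = read_flat_config(SAMPLE)
for key, value in settings.items():
    print(f"{key} = {value}")
```

In practice you would pass the contents of your own config.yaml instead of `SAMPLE`; nested sections are ignored by this sketch.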
Submitting the Cluster Creation Stack
Once your configuration is complete, validate it:
hyp validate
After validation, you can submit the creation stack to CloudFormation with:
hyp create --region <your-region>
This command initiates the stack creation and outputs the CloudFormation stack ID upon success.
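For automation, the validate-then-create sequence can be wrapped in a small script. This is a sketch under stated assumptions: the function names are hypothetical, `us-west-2` is only an example Region, and stopping after a failed validation relies on `hyp validate` returning a non-zero exit code, which is not confirmed here. The script defaults to a dry run that only prints the commands.

```python
import shlex
import subprocess


def creation_commands(region):
    """Return the validate and create commands in the order they should run."""
    return [
        ["hyp", "validate"],
        ["hyp", "create", "--region", region],
    ]


def run_all(commands, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in commands:
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True raises CalledProcessError on a non-zero exit code,
            # so later commands are skipped if an earlier one fails.
            subprocess.run(cmd, check=True)


run_all(creation_commands("us-west-2"))
```

Run it from the directory that contains config.yaml, and pass `dry_run=False` once the printed commands look right.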
Monitoring the HyperPod Cluster Creation Process
To list existing CloudFormation stacks, use:
hyp list cluster-stack --region <your-region>
If necessary, you can filter the output by stack status. To see further details about an individual stack, pass its name:
hyp describe cluster-stack <stack-name> --region <your-region>
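Stack creation takes a while, so scripts usually poll until the stack leaves its in-progress state. The helper below is a generic sketch, not part of the HyperPod CLI or SDK: `get_status` is a caller-supplied function (hypothetical here) that returns the current stack status, for example by querying CloudFormation. The status strings are standard CloudFormation values, and only the creation-related in-progress states are handled.

```python
import time

# CloudFormation statuses that mean "keep waiting" during stack creation.
IN_PROGRESS = {"CREATE_IN_PROGRESS", "ROLLBACK_IN_PROGRESS"}


def wait_for_stack(get_status, poll_seconds=30.0, timeout_seconds=3600.0, sleep=time.sleep):
    """Poll get_status() until the stack leaves an in-progress state.

    Returns the final status string; raises TimeoutError if the deadline passes.
    """
    deadline = time.monotonic() + timeout_seconds
    while True:
        status = get_status()
        if status not in IN_PROGRESS:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"stack still {status} after {timeout_seconds}s")
        sleep(poll_seconds)


# Demonstration with a canned sequence of statuses instead of a real stack.
fake_statuses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
print(wait_for_stack(lambda: next(fake_statuses), sleep=lambda seconds: None))  # CREATE_COMPLETE
```

The caller still has to inspect the returned status, because a result such as `ROLLBACK_COMPLETE` means the wait ended but creation failed.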
Connecting to a Cluster
Once the cluster has been created successfully, configure the CLI to communicate with your HyperPod cluster using:
hyp set-cluster-context --cluster-name <your-cluster-name> --region <your-region>
This command updates your local Kubernetes configuration, enabling you to use both the HyperPod CLI and Kubernetes utilities like kubectl for resource management.
Modifying an Existing HyperPod Cluster
The hyp update cluster command allows you to modify instance groups or change configurations such as instance types or node recovery modes.
For example:
hyp update cluster --cluster-name <your-cluster-name> --region <your-region> --instance-groups '[{"instance_count": 2, "instance_group_name": "worker", "instance_type": "ml.m5.large"}]'
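Hand-writing the JSON for `--instance-groups` inside shell quotes is error-prone. One option is to build the argument list in Python and let `json.dumps` produce the payload. This sketch uses only the flags shown in the command above; the function name, the cluster name `my-cluster`, and the Region are illustrative placeholders.

```python
import json
import shlex


def update_cluster_command(cluster_name, region, instance_groups):
    """Build the `hyp update cluster` argument list with a JSON-encoded payload."""
    return [
        "hyp", "update", "cluster",
        "--cluster-name", cluster_name,
        "--region", region,
        "--instance-groups", json.dumps(instance_groups),
    ]


groups = [
    {"instance_count": 2, "instance_group_name": "worker", "instance_type": "ml.m5.large"},
]
cmd = update_cluster_command("my-cluster", "us-west-2", groups)

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(cmd))
```

Passing `cmd` directly to `subprocess.run` avoids shell quoting altogether, because each list element is delivered to the CLI as a single argument.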
Deleting an Existing HyperPod Cluster
To remove a cluster, execute:
hyp delete cluster-stack <stack-name> --region <your-region>
The command prompts for confirmation before deleting anything, which gives you a chance to review which resources, if any, you want to retain.
SageMaker HyperPod SDK
For programmatic access, the SageMaker HyperPod SDK is installed along with the CLI. The SDK offers more control and flexibility, ideal for embedding HyperPod functionality directly into applications or integrating with other services.
Conclusion
The SageMaker HyperPod CLI and SDK make cluster creation and management more efficient for data scientists and ML engineers alike. With straightforward lifecycle management, integrated observability, and declarative control, these tools make it easier to experiment and iterate in distributed training environments.
If you want to learn how to submit training jobs and deploy models, check out our companion blog post: "Train and Deploy Models on Amazon SageMaker HyperPod using the New HyperPod CLI and SDK."
About the Authors
Nicolas Jourdan
A Specialist Solutions Architect at AWS, Nicolas brings extensive experience in AI and ML applications across diverse industries.
Andrew Brown
A Sr. Solutions Architect at AWS, Andrew focuses on deep learning and high-performance computing in the energy sector.
Giuseppe Angelo Porcelli
A Principal Machine Learning Specialist Solutions Architect at AWS, Giuseppe specializes in MLOps and various AI/ML domains.
Embrace the power of simplified distributed computing with SageMaker HyperPod, and unlock the potential of AI in your projects!