Manage Amazon SageMaker HyperPod Clusters with the HyperPod CLI and SDK

Streamlining AI Model Management with Amazon SageMaker HyperPod CLI and SDK

Simplified User Experience

A Layered Architecture for Simplicity

Prerequisites

Installing the SageMaker HyperPod CLI

Creating a New HyperPod Cluster

Submitting the Cluster Creation Stack

Monitoring the HyperPod Cluster Creation Process

Connecting to a Cluster

Modifying an Existing HyperPod Cluster

Deleting an Existing HyperPod Cluster

SageMaker HyperPod SDK

Conclusion

About the Authors

Simplifying Distributed AI Model Training with SageMaker HyperPod CLI and SDK

Training and deploying large AI models requires robust distributed computing capabilities, but managing that infrastructure should not be a burden for data scientists and machine learning (ML) practitioners. Amazon SageMaker HyperPod with Amazon Elastic Kubernetes Service (Amazon EKS) orchestration simplifies the management of cluster infrastructure, so users can focus on building and optimizing models.

Simplified User Experience

The SageMaker HyperPod Command Line Interface (CLI) offers an intuitive command-line experience that abstracts the complexities of distributed systems. Built on top of the HyperPod SDK, the CLI provides straightforward commands that enable data scientists to manage HyperPod clusters effectively. Whether it’s launching training or fine-tuning jobs, deploying inference endpoints, or monitoring cluster performance, the HyperPod CLI facilitates quick experimentation and iteration.

A Layered Architecture for Simplicity

The HyperPod CLI and SDK share a multi-layered architecture. Both are user-facing entry points built on the same SDK components. For infrastructure automation, this shared layer orchestrates cluster lifecycle management through AWS CloudFormation stack provisioning and AWS API calls. Workloads such as training jobs and integrated development environments are expressed as Kubernetes custom resources, defined by Custom Resource Definitions (CRDs), and managed through the Kubernetes API.

In this post, we’ll explore how to use the CLI and SDK to create and manage SageMaker HyperPod clusters within your AWS account. While this piece focuses on cluster creation and management, a companion post dives deeper into submitting training jobs and deploying inference endpoints.

Prerequisites

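To follow the examples, you need access to an AWS account and a local environment with Python and pip, since the CLI is installed from a Python package. Because cluster creation provisions resources through AWS CloudFormation, your AWS credentials also need permission to create and manage those stacks.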

Installing the SageMaker HyperPod CLI

To begin, install the latest version of the SageMaker HyperPod CLI and SDK. The commands illustrated here are based on version 3.5.0. From your local environment, execute:

pip install sagemaker-hyperpod

This installs the tools needed to interact with SageMaker HyperPod clusters. To verify the installation, run:

hyp

You should see output detailing available commands, confirming that the CLI has been correctly installed.
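
The installed package version can also be checked from Python, which is useful in scripts that should fail early when the tooling is missing. The following is a minimal sketch using only the standard library. It assumes the distribution name is sagemaker-hyperpod, as in the pip command above; the helper name installed_version is illustrative and not part of the HyperPod tooling.

```python
from importlib import metadata
from typing import Optional


def installed_version(distribution: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(distribution)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "sagemaker-hyperpod" is the distribution name used in the pip command above.
    version = installed_version("sagemaker-hyperpod")
    if version is None:
        print("sagemaker-hyperpod is not installed in this environment")
    else:
        print(f"sagemaker-hyperpod {version} is installed")
```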

Creating a New HyperPod Cluster

Both the AWS Management Console and HyperPod CLI provide streamlined experiences for cluster creation. The console offers a guided approach, while the CLI is favored for programmatic use, enabling reproducibility and automation.

To initialize a new cluster configuration via the CLI, run:

hyp init cluster-stack

This command sets up a cluster configuration in the current directory, generating a config.yaml file where you can define your cluster specifications.

Here’s a partial view of the config.yaml file:

resource_name_prefix: hyp-eks-stack
create_hyperpod_cluster_stack: True
hyperpod_cluster_name: hyperpod-cluster
create_eks_cluster_stack: True
kubernetes_version: 1.31

You can change these parameters either by editing config.yaml directly or by using the CLI’s hyp configure command.
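
Before running validation, a quick local check can confirm that the keys shown in the excerpt are present. The sketch below is an illustration only: it parses flat key: value lines with the standard library and keeps every value as a string. The real config.yaml contains more settings and may have nested structure that this reader does not handle, so a YAML library would be needed for the full file. One reason to look at the raw string is that a general YAML parser reads an unquoted value such as 1.30 as the number 1.3; whether that matters for the CLI depends on how it reads the file, which this post does not state. The function names are hypothetical.

```python
from typing import Dict, List

# The five keys shown in the excerpt above; the real file contains more.
EXPECTED_KEYS = (
    "resource_name_prefix",
    "create_hyperpod_cluster_stack",
    "hyperpod_cluster_name",
    "create_eks_cluster_stack",
    "kubernetes_version",
)


def read_flat_config(text: str) -> Dict[str, str]:
    """Parse flat 'key: value' lines, keeping every value as a string."""
    config: Dict[str, str] = {}
    for raw_line in text.splitlines():
        # Drop trailing comments and whitespace (values containing '#' are not supported).
        line = raw_line.split("#", 1)[0].strip()
        if not line or ":" not in line:
            continue
        key, value = line.split(":", 1)
        config[key.strip()] = value.strip()
    return config


def missing_keys(config: Dict[str, str]) -> List[str]:
    """Return the expected keys that are absent or empty."""
    return [key for key in EXPECTED_KEYS if not config.get(key)]


SAMPLE = """\
resource_name_prefix: hyp-eks-stack
create_hyperpod_cluster_stack: True
hyperpod_cluster_name: hyperpod-cluster
create_eks_cluster_stack: True
kubernetes_version: 1.31
"""

config = read_flat_config(SAMPLE)
print(missing_keys(config))          # expected keys that are not set
print(config["kubernetes_version"])  # the version, kept as a string
```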

Submitting the Cluster Creation Stack

Once your configuration is complete, validate it:

hyp validate

After validation, you can submit the creation stack to CloudFormation with:

hyp create --region <your-region>

This command initiates the stack creation and outputs the CloudFormation stack ID upon success.

Monitoring the HyperPod Cluster Creation Process

To list existing CloudFormation stacks, use:

hyp list cluster-stack --region <your-region>

If necessary, you can filter the output by stack status. Further details about individual stacks can be accessed via:

hyp describe cluster-stack --region <your-region>
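
Because stack creation takes a while, scripts often poll until the stack reaches a terminal state. The following sketch shows one way to structure that loop. It assumes you supply a get_status function that returns the current CloudFormation stack status string, for example by reading StackStatus from the boto3 CloudFormation describe_stacks response or from the CLI output; that function is a placeholder here, and the helper name wait_for_stack is hypothetical. The status names are standard CloudFormation stack statuses.

```python
import time
from typing import Callable

# CloudFormation stack statuses that end a create operation.
SUCCESS_STATES = {"CREATE_COMPLETE"}
FAILURE_STATES = {"CREATE_FAILED", "ROLLBACK_COMPLETE", "ROLLBACK_FAILED"}


def wait_for_stack(
    get_status: Callable[[], str],
    poll_seconds: float = 30.0,
    max_polls: int = 120,
) -> str:
    """Poll get_status until the stack reaches a terminal create state.

    get_status is a caller-supplied function (hypothetical here) that
    returns the current CloudFormation stack status string.
    """
    for _ in range(max_polls):
        status = get_status()
        if status in SUCCESS_STATES:
            return status
        if status in FAILURE_STATES:
            raise RuntimeError(f"stack creation failed with status {status}")
        time.sleep(poll_seconds)
    raise TimeoutError(f"stack did not finish within {max_polls} polls")


# Usage with a stand-in status source; replace it with a real lookup.
fake_statuses = iter(["CREATE_IN_PROGRESS", "CREATE_IN_PROGRESS", "CREATE_COMPLETE"])
print(wait_for_stack(lambda: next(fake_statuses), poll_seconds=0))
```

Treating rollback states as failures makes the loop stop as soon as CloudFormation gives up, rather than waiting for the timeout.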

Connecting to a Cluster

Once the cluster has been created successfully, configure the CLI to communicate with your HyperPod cluster using:

hyp set-cluster-context --cluster-name <your-cluster-name> --region <your-region>

This command updates your local Kubernetes configuration, enabling you to use both the HyperPod CLI and Kubernetes utilities like kubectl for resource management.

Modifying an Existing HyperPod Cluster

The hyp update cluster command allows you to modify instance groups or change configurations such as instance types or node recovery modes.

For example:

hyp update cluster --cluster-name <your-cluster-name> --region <your-region> --instance-groups '[{"instance_count": 2, "instance_group_name": "worker", "instance_type": "ml.m5.large"}]'
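
Hand-writing the JSON for --instance-groups inside shell quotes is error-prone. As a sketch, the argument can be generated from a Python data structure instead. The example uses only the flags that appear in the command above; the helper name build_update_command and the cluster name and Region values are placeholders.

```python
import json
import shlex
from typing import Dict, List


def build_update_command(
    cluster_name: str, region: str, instance_groups: List[Dict[str, object]]
) -> List[str]:
    """Return argv for 'hyp update cluster' with JSON-encoded instance groups."""
    return [
        "hyp", "update", "cluster",
        "--cluster-name", cluster_name,
        "--region", region,
        "--instance-groups", json.dumps(instance_groups),
    ]


groups = [
    {"instance_count": 2, "instance_group_name": "worker", "instance_type": "ml.m5.large"}
]
# "my-cluster" and "us-east-1" are placeholder values.
argv = build_update_command("my-cluster", "us-east-1", groups)

# shlex.join quotes the JSON so the line can be pasted into a POSIX shell.
print(shlex.join(argv))
# To execute directly instead: subprocess.run(argv, check=True)
```

Passing the list to subprocess.run without a shell avoids quoting issues entirely, since each argument is handed to the CLI as-is.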

Deleting an Existing HyperPod Cluster

To remove a cluster, execute:

hyp delete cluster-stack --region <your-region>

The command prompts for confirmation before deleting anything, giving you a chance to review which resources will be removed and which you want to retain.

SageMaker HyperPod SDK

For programmatic access, the SageMaker HyperPod SDK is installed along with the CLI. The SDK offers more control and flexibility, ideal for embedding HyperPod functionality directly into applications or integrating with other services.

Conclusion

The SageMaker HyperPod CLI and SDK streamline cluster creation and management for data scientists and ML engineers. Their straightforward lifecycle management, integrated observability, and declarative control make it easier to experiment and iterate in distributed training environments.

If you want to learn how to submit training jobs and deploy models, check out our companion blog post: "Train and Deploy Models on Amazon SageMaker HyperPod using the New HyperPod CLI and SDK."

About the Authors

Nicolas Jourdan

A Specialist Solutions Architect at AWS, Nicolas brings extensive experience in AI and ML applications across diverse industries.

Andrew Brown

A Sr. Solutions Architect with a focus on deep learning and high-performance computing in the energy sector.

Giuseppe Angelo Porcelli

A Principal Machine Learning Specialist Solutions Architect at AWS, Giuseppe specializes in MLOps and various AI/ML domains.


