Streamlining AI Workflows: Integrating SageMaker HyperPod with SkyPilot
This post is co-written with Zhanghao Wu, co-creator of SkyPilot.
The rapid advancement of generative AI and foundation models (FMs) has dramatically escalated the computational resource demands for machine learning (ML) workloads. As ML pipelines evolve, the need for efficient systems to distribute workloads across accelerated compute resources becomes paramount, all while ensuring high developer productivity. Organizations increasingly require infrastructure solutions that are not just powerful, but also flexible, resilient, and easy to manage.
Enter SkyPilot—an open-source framework that simplifies the execution of ML workloads. SkyPilot provides a unified abstraction layer, enabling ML engineers to run their workloads on various compute resources without the overhead of managing complex infrastructure. This framework offers a simple, high-level interface for provisioning resources, scheduling jobs, and managing distributed training across multiple nodes.
On the other hand, Amazon SageMaker HyperPod is specifically designed to develop and deploy large-scale FMs. It not only allows users to create a custom software stack but also optimizes performance through same-spine placement of instances, alongside built-in resiliency. The combination of SageMaker HyperPod’s resilience and SkyPilot’s efficiency creates a robust framework for scaling up generative AI workloads seamlessly.
In this post, we delve into how the integration of SageMaker HyperPod with SkyPilot is transforming AI development workflows. This collaboration makes our advanced GPU infrastructure more accessible to ML engineers, enhancing both productivity and resource utilization.
Challenges of Orchestrating Machine Learning Workloads
Kubernetes has gained traction for ML workloads due to its scalability and extensive open-source tooling. SageMaker HyperPod orchestrated on Amazon Elastic Kubernetes Service (Amazon EKS) merges the strengths of Kubernetes with the robust environment of SageMaker HyperPod, tailored for training large models. The Amazon EKS support in SageMaker HyperPod enhances resilience through deep health checks, automated node recovery, and job auto-resume features, ensuring uninterrupted training for extensive and long-running jobs.
Nonetheless, ML engineers transitioning from traditional VM or on-prem environments often grapple with a steep learning curve. The complexities of Kubernetes manifests and cluster management can hinder development cycles and resource utilization. Moreover, AI infrastructure teams face the dual challenge of providing advanced management tools while ensuring a user-friendly experience for ML engineers.
SageMaker HyperPod with SkyPilot
To overcome these obstacles, we’ve partnered with SkyPilot to showcase an integrated solution that harnesses the strengths of both platforms. SageMaker HyperPod excels at managing the underlying compute resources, offering a resilient infrastructure for demanding AI workloads. SkyPilot complements this by providing an intuitive layer for job management, interactive development, and team coordination.
This partnership enables us to deliver the best of both worlds: the scalable infrastructure of SageMaker HyperPod coupled with a user-friendly interface that minimizes the learning curve for ML engineers. For AI infrastructure teams, this integration offers advanced management capabilities while simplifying day-to-day operations, creating a win-win for all stakeholders.
SkyPilot empowers AI teams to efficiently run workloads across different infrastructures using a unified high-level interface. An AI engineer can specify the resource needs for their job, and SkyPilot intelligently schedules workloads on optimal infrastructure—finding available GPUs, provisioning them, running the job, and managing its lifecycle.
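For example, a minimal SkyPilot task declares only what the job needs. The field names below follow SkyPilot's task YAML schema; train.py is a placeholder for your own script:

# task.yaml: a minimal SkyPilot task (illustrative)
resources:
  accelerators: H100:8  # request eight H100 GPUs
run: |
  python train.py

Launching this with sky launch task.yaml provisions matching GPUs, runs the job, and manages its lifecycle from start to finish.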
Solution Overview
Implementing this solution is straightforward, whether you are using existing SageMaker HyperPod clusters or setting up a new deployment. For existing clusters, you can connect directly using AWS Command Line Interface (AWS CLI) commands to update your kubeconfig and verify the setup. For new deployments, we guide you through creating the EKS cluster (which provides the Kubernetes API server), creating the SageMaker HyperPod cluster, and configuring high-performance networking options such as Elastic Fabric Adapter (EFA).
The following steps show how to run SkyPilot jobs for multi-node distributed training on SageMaker HyperPod.
Prerequisites
Before diving in, ensure you have the following prerequisites:
- An existing SageMaker HyperPod cluster with Amazon EKS (to create one, refer to the "Deploy Your HyperPod Cluster" documentation). To follow the code samples in this post, provision a single ml.p5.48xlarge instance.
- Access to the AWS CLI and kubectl command line tools.
- A Python environment for installing SkyPilot.
Create a SageMaker HyperPod Cluster
You can create the EKS cluster with a single AWS CloudFormation stack that also provisions the virtual private cloud (VPC) and the required storage resources. SageMaker HyperPod clusters themselves can be managed through the AWS Management Console or the AWS CLI.
Example AWS CLI command to create the cluster:
aws sagemaker create-cluster --cli-input-json file://cluster-config.json
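A sketch of what cluster-config.json might contain is shown below; the fields follow the create-cluster API, and the EKS cluster ARN, IAM role, S3 bucket, and network IDs are placeholders you would replace with your own values:

{
  "ClusterName": "ml-cluster",
  "Orchestrator": {
    "Eks": { "ClusterArn": "arn:aws:eks:<region>:<account-id>:cluster/<eks-cluster-name>" }
  },
  "InstanceGroups": [
    {
      "InstanceGroupName": "worker-group-1",
      "InstanceType": "ml.p5.48xlarge",
      "InstanceCount": 1,
      "LifeCycleConfig": {
        "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
        "OnCreate": "on_create.sh"
      },
      "ExecutionRole": "arn:aws:iam::<account-id>:role/<execution-role>",
      "ThreadsPerCore": 1
    }
  ],
  "VpcConfig": {
    "SecurityGroupIds": ["sg-<id>"],
    "Subnets": ["subnet-<id>"]
  }
}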
Connect to Your SageMaker HyperPod EKS Cluster
Use the AWS CLI to update your local kubeconfig file with credentials needed to connect to your EKS cluster:
aws eks update-kubeconfig --name $EKS_CLUSTER_NAME
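You can then confirm that the connection works by listing the cluster's nodes:

kubectl get nodes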
Install SkyPilot with Kubernetes Support
Install SkyPilot with Kubernetes support using pip (quote the extra so your shell doesn't expand the brackets):
pip install "skypilot[kubernetes]"
Verify SkyPilot’s Connection to the EKS Cluster
Ensure SkyPilot can connect to your Kubernetes cluster:
sky check k8s
Discover Available GPUs
Check GPU resources in your SageMaker HyperPod cluster with:
sky show-gpus --cloud k8s
Launch an Interactive Development Environment
With SkyPilot, create an environment for interactive development:
sky launch -c dev --gpus H100
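SkyPilot adds the cluster to your SSH configuration, so once provisioning completes you can open a shell on the GPU node, or attach your IDE's remote extension to it, with:

ssh dev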
Run Training Jobs
To run a distributed training job, create a job configuration file (for example, train.yaml) and launch it with:
sky launch -c train train.yaml
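As a sketch, a multi-node train.yaml might look like the following. The torchrun invocation and train.py are placeholders for your own training setup, while the SKYPILOT_* environment variables are populated by SkyPilot on each node:

# train.yaml: illustrative two-node distributed training task
resources:
  accelerators: H100:8
num_nodes: 2
run: |
  # Use the first node in the list as the rendezvous host
  MASTER_ADDR=$(echo "$SKYPILOT_NODE_IPS" | head -n1)
  torchrun \
    --nnodes=$SKYPILOT_NUM_NODES \
    --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
    --node_rank=$SKYPILOT_NODE_RANK \
    --master_addr=$MASTER_ADDR \
    --master_port=29500 \
    train.py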
Running Multi-Node Training Jobs with EFA
To enable high-speed inter-node communication, incorporate EFA in your SkyPilot job configuration for distributed ML workloads.
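EFA devices are surfaced to pods as the extended resource vpc.amazonaws.com/efa by the AWS EFA Kubernetes device plugin. Assuming that plugin is installed on your HyperPod nodes, one way to request the devices is through SkyPilot's Kubernetes pod_config (for example, in ~/.sky/config.yaml); the count of 32 below matches the EFA interfaces on an ml.p5.48xlarge instance:

kubernetes:
  pod_config:
    spec:
      containers:
        - resources:
            limits:
              vpc.amazonaws.com/efa: 32  # one per EFA interface on ml.p5.48xlarge
            requests:
              vpc.amazonaws.com/efa: 32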
Conclusion
The integration of SageMaker HyperPod with SkyPilot not only simplifies operations but also enhances productivity and resource utilization across diverse organizations. By combining the robust infrastructure capabilities of SageMaker HyperPod with SkyPilot’s intuitive interface, we allow teams to focus on innovation rather than infrastructure complexity.
To get started, explore SkyPilot in the Amazon EKS Support in Amazon SageMaker HyperPod workshop.
About the Authors
Roy Allela is a Senior AI/ML Specialist Solutions Architect at AWS, dedicated to helping customers train and deploy foundation models efficiently on AWS.
Zhanghao Wu is a co-creator of the SkyPilot project and a PhD graduate from UC Berkeley, focused on improving the AI experience across diverse cloud infrastructures.
Ankit Anand is a Senior Foundation Models GTM Specialist at AWS, working at the intersection of generative AI and strategic partnerships.