Introducing Managed Node Auto Scaling for Amazon SageMaker HyperPod with Karpenter
Today, we’re thrilled to announce that Amazon SageMaker HyperPod now supports managed node automatic scaling with Karpenter! This integration enhances the ability of organizations to efficiently scale their SageMaker HyperPod clusters to meet the dynamic demands of inference and training workloads.
The Need for Auto Scaling in Real-Time Inference
In machine learning, real-time inference workloads are often subject to unpredictable traffic patterns. Businesses must quickly adapt their GPU compute capacity to maintain service-level agreements (SLAs) without compromising response times or cost-efficiency. This is where Karpenter shines, automatically scaling capacity in response to demand spikes while alleviating the operational burden of self-managed solutions.
What Makes this Feature Stand Out?
This service-managed solution dramatically reduces the complexity of installing, configuring, and maintaining Karpenter controllers, offering a seamless integration with the resilience capabilities of SageMaker HyperPod. One of the standout features is the ability to scale to zero, eliminating the need for dedicated compute resources when they are not in use, thus enhancing cost-efficiency.
An Infrastructure Built for Resilience
SageMaker HyperPod offers a high-performance, resilient infrastructure, complete with observability tools optimized for large-scale model training and deployment. Organizations such as Perplexity, HippocraticAI, H.AI, and Articul8 are already leveraging HyperPod for effective model training and deployment. As more businesses transition from training foundation models (FMs) to running operational inference at scale, the requirement for automatic scaling becomes critical.
Karpenter: A Game Changer
Karpenter is an open-source Kubernetes node lifecycle manager created by AWS, designed to optimize cluster auto scaling. It efficiently addresses the needs of organizations by offering:
- Service-Managed Lifecycle: Karpenter's installation, updates, and maintenance are all handled by SageMaker HyperPod.
- Just-in-Time Provisioning: Karpenter observes pending pods and provisions required compute resources as needed.
- Workload-Aware Node Selection: It chooses optimal instance types based on pod requirements and pricing.
- Automatic Node Consolidation: Regularly evaluates cluster status for optimization opportunities.
- Integrated Resilience: Utilizes the built-in fault tolerance mechanisms of SageMaker HyperPod.
This managed Karpenter solution is seamlessly integrated into SageMaker HyperPod EKS clusters, evolving static capacity into a dynamic, cost-optimized infrastructure that scales with demand.
Setting Up Automatic Scaling
Prerequisites
To get started, ensure you have the required quotas for the instances you’ll create in the SageMaker HyperPod cluster. Also, create the necessary AWS Identity and Access Management (IAM) permissions for Karpenter.
Creating a SageMaker HyperPod Cluster
- Sign in to the SageMaker AI console and navigate to HyperPod clusters.
- Choose "Create HyperPod cluster" and select Amazon EKS as the orchestrator.
- Choose "Custom setup," enter a cluster name, and configure instance recovery and provisioning modes.
- Submit your configuration.
Once your cluster is created, update it to enable Karpenter using the AWS SDK for Python (Boto3) or the AWS CLI, then verify that Karpenter is enabled using the DescribeCluster API.
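As an illustration, enabling Karpenter from the AWS CLI could look like the following. The exact shape of the auto scaling parameter on the UpdateCluster call is an assumption based on this release, so confirm the field names against the current API reference before using it:

```shell
# Enable managed Karpenter auto scaling on an existing HyperPod cluster.
# The --auto-scaling payload shown here is illustrative; confirm the exact
# field names against the current UpdateCluster API documentation.
aws sagemaker update-cluster \
  --cluster-name my-hyperpod-cluster \
  --auto-scaling '{"Mode": "Enable", "AutoScalerType": "Karpenter"}'

# Verify the setting with DescribeCluster.
aws sagemaker describe-cluster \
  --cluster-name my-hyperpod-cluster \
  --query 'AutoScaling'
```

The cluster name above is a placeholder; substitute the name you chose when creating the cluster.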
Creating HyperpodNodeClass
This custom resource defines constraints around instance types and Availability Zones. It maps to pre-created instance groups in SageMaker HyperPod, guiding Karpenter's scaling decisions.
apiVersion: karpenter.sagemaker.amazonaws.com/v1
kind: HyperpodNodeClass
metadata:
  name: multiazg6
spec:
  instanceGroups:
    - auto-g6-az1
    - auto-g6-4xaz2
Apply this configuration to your EKS cluster using kubectl.
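Assuming the manifest above is saved as hyperpod-nodeclass.yaml (a hypothetical filename), applying and verifying it might look like:

```shell
# Apply the HyperpodNodeClass manifest to the EKS cluster backing HyperPod.
kubectl apply -f hyperpod-nodeclass.yaml

# Confirm the custom resource was created.
kubectl get hyperpodnodeclass multiazg6
```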
Creating NodePool
The NodePool sets constraints on nodes that Karpenter can create. It allows you to define specific labels, taints, and instance types for optimal resource allocation.
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: gpunodepool
spec:
  template:
    spec:
      nodeClassRef:
        group: karpenter.sagemaker.amazonaws.com
        kind: HyperpodNodeClass
        name: multiazg6
      requirements:
        - key: "node.kubernetes.io/instance-type"
          operator: In
          values: ["ml.g6.xlarge"]
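Apply the NodePool the same way, then watch Karpenter react to pending pods. The filename below is hypothetical, and resource short names can vary by Karpenter version:

```shell
# Apply the NodePool manifest and confirm it was created.
kubectl apply -f gpu-nodepool.yaml
kubectl get nodepool gpunodepool

# As pods go pending, Karpenter records each provisioning decision as a NodeClaim.
kubectl get nodeclaims -w
```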
Launching a Simple Workload
Once your setup is complete, you can run a Kubernetes deployment that scales dynamically according to demand.
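For example, a minimal Deployment that would leave pods pending and trigger Karpenter to provision ml.g6.xlarge capacity might look like the following sketch. The deployment name, container image, and replica count are placeholders, not part of the original walkthrough:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: gpu-inference-demo    # hypothetical workload name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: gpu-inference-demo
  template:
    metadata:
      labels:
        app: gpu-inference-demo
    spec:
      containers:
        - name: inference
          image: public.ecr.aws/docker/library/nginx:latest  # placeholder image
          resources:
            limits:
              nvidia.com/gpu: 1   # one GPU per pod; pending pods drive node provisioning
      nodeSelector:
        node.kubernetes.io/instance-type: ml.g6.xlarge
```

Because the NodePool restricts Karpenter to ml.g6.xlarge, each pending replica prompts Karpenter to launch a matching node from the mapped instance groups.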
Advanced Auto Scaling with KEDA and Karpenter
Combining Kubernetes Event-driven Autoscaling (KEDA) with Karpenter can provide a robust two-tier auto-scaling solution. While KEDA adjusts the number of pods based on various metrics, Karpenter provisions the necessary nodes, ensuring optimal performance and cost-effectiveness.
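As a sketch of this two-tier pattern, a KEDA ScaledObject could scale a Deployment on queue depth while Karpenter supplies nodes for the resulting pods. The target Deployment name, queue URL, and thresholds below are hypothetical:

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: gpu-inference-scaler
spec:
  scaleTargetRef:
    name: gpu-inference-demo   # hypothetical Deployment to scale
  minReplicaCount: 0           # KEDA scales pods to zero when idle
  maxReplicaCount: 10
  triggers:
    - type: aws-sqs-queue
      metadata:
        queueURL: https://sqs.us-east-1.amazonaws.com/123456789012/inference-queue  # hypothetical queue
        queueLength: "5"       # target messages per replica
        awsRegion: us-east-1
```

Here KEDA adds pods as the queue grows, and Karpenter provisions or consolidates nodes to fit them; with minReplicaCount set to 0, both pods and nodes can drain away entirely during idle periods.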
Conclusion
With the launch of Karpenter node auto scaling on SageMaker HyperPod, machine learning workloads can now dynamically adapt to changing demands, optimizing resource utilization and cost. By enabling Karpenter in your SageMaker HyperPod clusters, you can easily scale your workloads to meet production traffic requirements.
To experience these benefits first-hand, implement Karpenter in your SageMaker HyperPod clusters today!
About the Authors
- Vivek Gangasani: Lead GenAI Specialist Solutions Architect focused on optimizing inference performance.
- Adam Stanley: Solution Architect at AWS, specialized in machine learning infrastructure.
- Kunal Jha: Principal Product Manager at AWS for SageMaker HyperPod.
- Ty Bergstrom: Software Engineer involved with HyperPod Clusters platform.
As they continue to innovate, these experts are dedicated to helping enterprises and startups scale their GenAI models effectively.