Deploying Small Language Models on Amazon SageMaker AI with Graviton Processors
Leveraging Amazon SageMaker AI with Graviton Processors for Efficient Language Model Deployment
As organizations increasingly incorporate AI capabilities into their applications, large language models (LLMs) have emerged as powerful tools for natural language processing tasks. Amazon SageMaker AI provides a fully managed service for deploying machine learning models with various inference options, enabling organizations to optimize for cost, latency, and throughput.
The Power of Choice in AI
AWS has always empowered customers with flexibility, including choices around models, hardware, and tooling. In addition to NVIDIA GPUs and AWS’s custom AI chips, CPU-based instances have become essential for running generative AI tasks, particularly with the evolution of CPU hardware. This allows organizations to host smaller language models and asynchronous agents without incurring substantial infrastructure costs.
Challenges with Traditional LLMs
Traditionally, LLMs with billions of parameters require substantial computational resources. For instance, a 7-billion-parameter model such as Meta's Llama 7B needs around 14 GB of GPU memory just to hold its weights in 16-bit precision, and total GPU memory requirements grow further at longer sequence lengths.
However, advances in model quantization and knowledge distillation have made it practical to run smaller, efficient language models directly on CPU infrastructure. Though these models may not rival the largest LLMs in capability, they serve as practical, cost-effective alternatives for many applications.
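To make these numbers concrete, here is a rough back-of-the-envelope estimate of weight memory for a 7B-parameter model; the 4-bit figure ignores quantization overhead such as scale factors and the runtime's working memory.

```python
# Approximate memory needed just to hold the weights of a 7B-parameter model.
params = 7e9

fp16_gb = params * 2 / 1e9   # 2 bytes per weight at 16-bit precision -> ~14 GB
q4_gb = params * 0.5 / 1e9   # 0.5 bytes per weight at 4-bit quantization -> ~3.5 GB

print(f"FP16 weights: {fp16_gb:.1f} GB, 4-bit quantized weights: {q4_gb:.1f} GB")
```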
Deploying a Small Language Model Using SageMaker AI
This post shows how to deploy a small language model on Amazon SageMaker AI by extending the pre-built containers that support AWS Graviton instances. The following sections walk through the solution.
Solution Overview
Our solution leverages SageMaker AI with Graviton3 processors to run small language models cost-efficiently. The key components include:
- SageMaker AI hosted endpoints for model serving
- Graviton3-based instances (ml.c7g series) for computation
- A container with llama.cpp installed for optimized inference
- Pre-quantized models in GGUF format
Graviton processors are tailored for cloud workloads, allowing for optimal performance of quantized models. These instances can yield up to 50% better price-performance compared to traditional x86 CPU instances for ML inference tasks.
Using llama.cpp for Inference
We use llama.cpp as our inference framework. It provides quantized general matrix multiplication (GEMM) kernels optimized for Graviton processors using Arm Neon and SVE instructions. Its GGUF model format is designed for efficient storage and fast loading and saving, which speeds up model initialization and inference.
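As an illustration of what llama.cpp inference looks like in Python, here is a minimal sketch using the llama-cpp-python bindings; the model file path, context size, and prompt are placeholders.

```python
import os

from llama_cpp import Llama

# Load a pre-quantized GGUF model; llama.cpp memory-maps it for fast startup.
llm = Llama(
    model_path="/path/to/model-q4_k_m.gguf",  # placeholder path to a GGUF file
    n_ctx=2048,                               # context window
    n_threads=os.cpu_count(),                 # use all vCPUs on the Graviton instance
)

# Run a single completion request.
output = llm("What is AWS Graviton?", max_tokens=64)
print(output["choices"][0]["text"])
```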
Implementation Steps
To deploy your model on SageMaker with Graviton, follow these steps:
- Create a Docker container compatible with ARM64 architecture.
- Prepare your model and inference code.
- Create a SageMaker model and deploy it to an endpoint with a Graviton instance.
Prerequisites
An AWS account is required with the necessary permissions to implement this solution.
Step 1: Create a Docker Container
SageMaker AI operates seamlessly with Docker containers. By packaging your algorithm within a container, you can bring a variety of code to the SageMaker environment. For more detailed instructions on building your Docker container, check out the AWS documentation.
Customizing a Pre-built Container
Instead of creating a new image from scratch, consider extending a pre-built container to suit your needs. This approach lets you build on the deep learning libraries already included in the image while making only the modifications you need.
Here’s a sample container directory setup:
```
.
|-- Dockerfile
|-- build_and_push.sh
`-- code
    |-- inference.py
    `-- requirements.txt
```
Hosting Your Container
When hosting, SageMaker’s containers respond to inference requests via specific endpoints:
- `/ping` confirms the container is active.
- `/invocations` accepts inference requests.
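To sanity-check these endpoints before deploying, you can run the container locally and probe both paths. The sketch below assumes the container is listening on port 8080 (the SageMaker default) and that the request schema matches your inference code.

```python
import requests

BASE_URL = "http://localhost:8080"  # assumes a locally running container, e.g. docker run -p 8080:8080 <image> serve

# /ping should return HTTP 200 once the model server is up.
print("ping:", requests.get(f"{BASE_URL}/ping").status_code)

# /invocations accepts inference requests.
response = requests.post(
    f"{BASE_URL}/invocations",
    json={"prompt": "Hello", "max_tokens": 16},
    headers={"Content-Type": "application/json"},
)
print(response.json())
```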
Building a Graviton-Compatible Image
The Dockerfile should start from the SageMaker PyTorch image that supports Graviton:
```dockerfile
FROM 763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker
```
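Building on that base image, the rest of the Dockerfile might look roughly like the following; the llama-cpp-python build flags, the /opt/ml/code layout, and the environment variables are assumptions based on the usual pattern for extending SageMaker inference containers.

```dockerfile
# Compile llama-cpp-python from source so it picks up Graviton-friendly flags
# (these compiler flags and the package version are illustrative assumptions).
RUN CFLAGS="-mcpu=native -fopenmp" CXXFLAGS="-mcpu=native -fopenmp" \
    pip install --no-cache-dir llama-cpp-python

# Copy the inference code and point the SageMaker inference toolkit at it.
COPY code/ /opt/ml/code/
ENV SAGEMAKER_SUBMIT_DIRECTORY=/opt/ml/code
ENV SAGEMAKER_PROGRAM=inference.py
```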
Optimization Tips
- Utilize the compile flags `-mcpu=native -fopenmp` for Graviton enhancements.
- Set `n_threads` in the inference code to maximize CPU usage.
- Use quantized models to minimize the memory footprint.
Step 2: Prepare Your Model and Inference Code
The inference script (inference.py in the directory layout above) defines four key handler functions:
- `model_fn()`: Loads the model weights from the specified directory.
- `input_fn()`: Deserializes and formats incoming user requests.
- `predict_fn()`: Executes the inference call.
- `output_fn()`: Serializes the response for delivery to the user.
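Here is a minimal sketch of what such a script could look like with llama.cpp; the environment variable name matches the MODEL_FILE_GGUF value set in the deployment code below, while the context size and generation defaults are illustrative assumptions.

```python
# code/inference.py - illustrative handler functions for the llama.cpp container.
import json
import os

from llama_cpp import Llama


def model_fn(model_dir):
    """Load the GGUF model from the directory where SageMaker extracts the model artifacts."""
    model_file = os.path.join(model_dir, os.environ["MODEL_FILE_GGUF"])
    return Llama(
        model_path=model_file,
        n_ctx=2048,                # context window (illustrative)
        n_threads=os.cpu_count(),  # use every vCPU on the Graviton instance
    )


def input_fn(request_body, request_content_type):
    """Parse the incoming JSON request."""
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(data, model):
    """Run the llama.cpp completion call."""
    return model(
        data["prompt"],
        max_tokens=data.get("max_tokens", 256),
        temperature=data.get("temperature", 0.7),
    )


def output_fn(prediction, accept):
    """Serialize the completion so it can be returned to the caller."""
    return json.dumps(prediction)
```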
Step 3: Create a SageMaker Model
Utilize the SageMaker Python SDK to create a model and deploy it with a Graviton instance:
```python
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data={"S3DataSource": {"S3Uri": model_path, "S3DataType": "S3Prefix", "CompressionType": "None"}},
    role=role,
    env={"MODEL_FILE_GGUF": file_name},  # name of the GGUF file within the model artifacts
    image_uri=f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama-cpp-python:latest",
    model_server_workers=2,
)

predictor = pytorch_model.deploy(instance_type="ml.c7g.12xlarge", initial_instance_count=1)
```
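Once the endpoint is in service, you can invoke it through the returned predictor; the request and response fields below assume the handler functions sketched earlier.

```python
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

# Send and receive JSON so the payload matches what inference.py expects.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

response = predictor.predict({
    "prompt": "Explain AWS Graviton processors in one sentence.",
    "max_tokens": 128,
})
print(response["choices"][0]["text"])
```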
Performance Optimization Metrics
When serving LLMs, focus on two key metrics:
- Latency: The time taken to process requests.
- Throughput: Number of tokens processed per second.
Consider utilizing techniques like request batching and prompt caching to optimize performance, which can significantly enhance throughput while managing latency effectively.
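As a starting point for measuring these metrics, the sketch below times a single request against the endpoint and derives tokens per second from the token counts llama.cpp includes in its response; it reuses the predictor from the previous step and is not a rigorous benchmark.

```python
import time

payload = {"prompt": "Summarize the benefits of CPU-based inference.", "max_tokens": 256}

start = time.perf_counter()
result = predictor.predict(payload)
latency_s = time.perf_counter() - start

# llama.cpp reports token usage in an OpenAI-style "usage" field.
generated = result["usage"]["completion_tokens"]
print(f"Latency: {latency_s:.2f} s | Throughput: {generated / latency_s:.1f} tokens/s")
```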
Conclusion
SageMaker AI with Graviton processors provides a compelling solution for organizations aiming to deploy AI capabilities efficiently. By employing CPU-based inference with quantized models, organizations can achieve substantial cost savings without sacrificing performance.
Explore our sample notebooks on GitHub and reference documentation to see if this approach aligns with your needs. To dive deeper, refer to the AWS Graviton Technical Guide for optimized libraries and best practices.
About the Authors
Vincent Wang, Andrew Smith, Melanie Li, PhD, Oussama Maxime Kandakji, and Romain Legret are experts at AWS, specializing in solutions architecture, generative AI, and efficient compute. Their insights help organizations navigate the complex landscape of AI and machine learning.