Deploying Small Language Models on Amazon SageMaker AI with Graviton Processors
Leveraging Amazon SageMaker AI with Graviton Processors for Efficient Language Model Deployment
As organizations increasingly incorporate AI capabilities into their applications, large language models (LLMs) have emerged as powerful tools for natural language processing tasks. Amazon SageMaker AI provides a fully managed service for deploying machine learning models with various inference options, enabling organizations to optimize for cost, latency, and throughput.
The Power of Choice in AI
AWS has always empowered customers with flexibility, including choices around models, hardware, and tooling. In addition to NVIDIA GPUs and AWS’s custom AI chips, CPU-based instances have become essential for running generative AI tasks, particularly with the evolution of CPU hardware. This allows organizations to host smaller language models and asynchronous agents without incurring substantial infrastructure costs.
Challenges with Traditional LLMs
Traditionally, LLMs with billions of parameters require substantial computational resources. For instance, a 7-billion-parameter model such as Meta's Llama 7B needs around 14 GB of GPU memory just to hold its weights in 16-bit precision, and total GPU memory requirements grow further at longer sequence lengths.
However, advances in model quantization and knowledge distillation have made it practical to run smaller, efficient language models directly on CPU infrastructure. Though these models may not rival the largest LLMs in capability, they serve as practical, cost-effective alternatives for many applications.
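To make these numbers concrete, here is a rough back-of-the-envelope estimate of weight memory for a 7B-parameter model; the 4-bit figure ignores quantization overhead such as scale factors and the runtime's working memory.

```python
# Approximate memory needed just to hold the weights of a 7B-parameter model.
params = 7e9

fp16_gb = params * 2 / 1e9   # 2 bytes per weight at 16-bit precision -> ~14 GB
q4_gb = params * 0.5 / 1e9   # 0.5 bytes per weight at 4-bit quantization -> ~3.5 GB

print(f"FP16 weights: {fp16_gb:.1f} GB, 4-bit quantized weights: {q4_gb:.1f} GB")
```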
Deploying a Small Language Model Using SageMaker AI
This post shows how to deploy a small language model on Amazon SageMaker AI by extending the pre-built containers that support AWS Graviton instances. The following sections walk through the solution.
Solution Overview
Our solution leverages SageMaker AI with Graviton3 processors to run small language models cost-efficiently. The key components include:
- SageMaker AI hosted endpoints for model serving
- Graviton3-based instances (ml.c7g series) for computation
- A container with llama.cpp installed for optimized inference
- Pre-quantized models in GGUF format
Graviton processors are tailored for cloud workloads, allowing for optimal performance of quantized models. These instances can yield up to 50% better price-performance compared to traditional x86 CPU instances for ML inference tasks.
Using llama.cpp for Inference
We use llama.cpp as our inference framework. It provides quantized general matrix multiplication (GEMM) kernels optimized for Graviton processors using Arm Neon and SVE instructions. Its GGUF model format is designed for efficient storage and fast loading and saving, which speeds up model initialization and inference.
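As an illustration of what llama.cpp inference looks like in Python, here is a minimal sketch using the llama-cpp-python bindings; the model file path, context size, and prompt are placeholders.

```python
import os

from llama_cpp import Llama

# Load a pre-quantized GGUF model; llama.cpp memory-maps it for fast startup.
llm = Llama(
    model_path="/path/to/model-q4_k_m.gguf",  # placeholder path to a GGUF file
    n_ctx=2048,                               # context window
    n_threads=os.cpu_count(),                 # use all vCPUs on the Graviton instance
)

# Run a single completion request.
output = llm("What is AWS Graviton?", max_tokens=64)
print(output["choices"][0]["text"])
```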
Implementation Steps
To deploy your model on SageMaker with Graviton, follow these steps:
- Create a Docker container compatible with ARM64 architecture.
- Prepare your model and inference code.
- Create a SageMaker model and deploy it to an endpoint with a Graviton instance.
Prerequisites
An AWS account is required with the necessary permissions to implement this solution.
Step 1: Create a Docker Container
SageMaker AI operates seamlessly with Docker containers. By packaging your algorithm within a container, you can bring a variety of code to the SageMaker environment. For more detailed instructions on building your Docker container, check out the AWS documentation.
Customizing a Pre-built Container
Instead of creating a new image from scratch, consider extending a pre-built container to suit your needs. This approach lets you build on the deep learning libraries already included in the image while making only the modifications you need.
Here’s a sample container directory setup:
```
.
|-- Dockerfile
|-- build_and_push.sh
`-- code
    |-- inference.py
    `-- requirements.txt
```
Hosting Your Container
When hosting, SageMaker’s containers respond to inference requests via specific endpoints:
- `/ping` confirms the container is active.
- `/invocations` accepts inference requests.
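To sanity-check these endpoints before deploying, you can run the container locally and probe both paths. The sketch below assumes the container is listening on port 8080 (the SageMaker default) and that the request schema matches your inference code.

```python
import requests

BASE_URL = "http://localhost:8080"  # assumes a locally running container, e.g. docker run -p 8080:8080 <image> serve

# /ping should return HTTP 200 once the model server is up.
print("ping:", requests.get(f"{BASE_URL}/ping").status_code)

# /invocations accepts inference requests.
response = requests.post(
    f"{BASE_URL}/invocations",
    json={"prompt": "Hello", "max_tokens": 16},
    headers={"Content-Type": "application/json"},
)
print(response.json())
```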
Building a Graviton-Compatible Image
The Dockerfile should start from the SageMaker PyTorch image that supports Graviton:
```dockerfile
FROM 763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker
```
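Building on that base image, the rest of the Dockerfile might look roughly like the following; the llama-cpp-python build flags, the /opt/ml/code layout, and the environment variables are assumptions based on the usual pattern for extending SageMaker inference containers.

```dockerfile
# Compile llama-cpp-python from source so it picks up Graviton-friendly flags
# (these compiler flags and the package version are illustrative assumptions).
RUN CFLAGS="-mcpu=native -fopenmp" CXXFLAGS="-mcpu=native -fopenmp" \
    pip install --no-cache-dir llama-cpp-python

# Copy the inference code and point the SageMaker inference toolkit at it.
COPY code/ /opt/ml/code/
ENV SAGEMAKER_SUBMIT_DIRECTORY=/opt/ml/code
ENV SAGEMAKER_PROGRAM=inference.py
```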
Optimization Tips
- Utilize the compile flags `-mcpu=native -fopenmp` for Graviton enhancements.
- Set `n_threads` in the inference code to maximize CPU usage.
- Use quantized models to minimize the memory footprint.
Step 2: Prepare Your Model and Inference Code
The inference script (inference.py in the directory layout above) defines four key handler functions:
- `model_fn()`: Loads the model weights from the specified directory.
- `input_fn()`: Deserializes and formats incoming user requests.
- `predict_fn()`: Executes the inference call.
- `output_fn()`: Serializes the response for delivery to the user.
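Here is a minimal sketch of what such a script could look like with llama.cpp; the environment variable name matches the MODEL_FILE_GGUF value set in the deployment code below, while the context size and generation defaults are illustrative assumptions.

```python
# code/inference.py - illustrative handler functions for the llama.cpp container.
import json
import os

from llama_cpp import Llama


def model_fn(model_dir):
    """Load the GGUF model from the directory where SageMaker extracts the model artifacts."""
    model_file = os.path.join(model_dir, os.environ["MODEL_FILE_GGUF"])
    return Llama(
        model_path=model_file,
        n_ctx=2048,                # context window (illustrative)
        n_threads=os.cpu_count(),  # use every vCPU on the Graviton instance
    )


def input_fn(request_body, request_content_type):
    """Parse the incoming JSON request."""
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(data, model):
    """Run the llama.cpp completion call."""
    return model(
        data["prompt"],
        max_tokens=data.get("max_tokens", 256),
        temperature=data.get("temperature", 0.7),
    )


def output_fn(prediction, accept):
    """Serialize the completion so it can be returned to the caller."""
    return json.dumps(prediction)
```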
Step 3: Create a SageMaker Model
Utilize the SageMaker Python SDK to create a model and deploy it with a Graviton instance:
```python
from sagemaker.pytorch import PyTorchModel

pytorch_model = PyTorchModel(
    model_data={"S3DataSource": {"S3Uri": model_path, "S3DataType": "S3Prefix", "CompressionType": "None"}},
    role=role,
    env={"MODEL_FILE_GGUF": file_name},  # name of the GGUF file within the model artifacts
    image_uri=f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama-cpp-python:latest",
    model_server_workers=2,
)

predictor = pytorch_model.deploy(instance_type="ml.c7g.12xlarge", initial_instance_count=1)
```
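Once the endpoint is in service, you can invoke it through the returned predictor; the request and response fields below assume the handler functions sketched earlier.

```python
from sagemaker.deserializers import JSONDeserializer
from sagemaker.serializers import JSONSerializer

# Send and receive JSON so the payload matches what inference.py expects.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

response = predictor.predict({
    "prompt": "Explain AWS Graviton processors in one sentence.",
    "max_tokens": 128,
})
print(response["choices"][0]["text"])
```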
Performance Optimization Metrics
When serving LLMs, focus on two key metrics:
- Latency: The time taken to process requests.
- Throughput: Number of tokens processed per second.
Consider utilizing techniques like request batching and prompt caching to optimize performance, which can significantly enhance throughput while managing latency effectively.
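As a starting point for measuring these metrics, the sketch below times a single request against the endpoint and derives tokens per second from the token counts llama.cpp includes in its response; it reuses the predictor from the previous step and is not a rigorous benchmark.

```python
import time

payload = {"prompt": "Summarize the benefits of CPU-based inference.", "max_tokens": 256}

start = time.perf_counter()
result = predictor.predict(payload)
latency_s = time.perf_counter() - start

# llama.cpp reports token usage in an OpenAI-style "usage" field.
generated = result["usage"]["completion_tokens"]
print(f"Latency: {latency_s:.2f} s | Throughput: {generated / latency_s:.1f} tokens/s")
```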
Conclusion
SageMaker AI with Graviton processors provides a compelling solution for organizations aiming to deploy AI capabilities efficiently. By employing CPU-based inference with quantized models, organizations can achieve substantial cost savings without sacrificing performance.
Explore our sample notebooks on GitHub and reference documentation to see if this approach aligns with your needs. To dive deeper, refer to the AWS Graviton Technical Guide for optimized libraries and best practices.
About the Authors
Vincent Wang, Andrew Smith, Melanie Li, PhD, Oussama Maxime Kandakji, and Romain Legret are experts at AWS, specializing in solutions architecture, generative AI, and efficient compute. Their insights help organizations navigate the complex landscape of AI and machine learning.