Leveraging Amazon SageMaker AI with Graviton Processors for Efficient Language Model Deployment

As organizations increasingly build AI capabilities into their applications, large language models (LLMs) have emerged as powerful tools for natural language processing tasks. Amazon SageMaker AI provides a fully managed service for deploying machine learning models with multiple inference options, letting organizations optimize for cost, latency, and throughput.

The Power of Choice in AI

AWS has always empowered customers with flexibility, including choices around models, hardware, and tooling. In addition to NVIDIA GPUs and AWS’s custom AI chips, CPU-based instances have become essential for running generative AI tasks, particularly with the evolution of CPU hardware. This allows organizations to host smaller language models and asynchronous agents without incurring substantial infrastructure costs.

Challenges with Traditional LLMs

Traditionally, LLMs with billions of parameters require substantial computational resources. For instance, a 7-billion-parameter model such as Meta's Llama 7B needs roughly 14 GB of GPU memory just for its weights in 16-bit precision (7 billion parameters × 2 bytes each), and total GPU memory requirements grow further at longer sequence lengths because of the attention key-value cache.

However, advances in model quantization and knowledge distillation have made smaller, efficient language models viable on CPU infrastructure. Although they cannot match the largest LLMs in capability, they are practical, cost-effective alternatives for many applications.

Deploying a Small Language Model Using SageMaker AI

This post shows how to deploy a small language model on Amazon SageMaker AI by extending a pre-built container to run on AWS Graviton instances. The following sections walk through the solution.

Solution Overview

Our solution leverages SageMaker AI with Graviton3 processors to run small language models cost-efficiently. The key components include:

  • SageMaker AI hosted endpoints for model serving
  • Graviton3-based instances (ml.c7g series) for compute
  • A container with llama.cpp installed for optimized inference
  • Pre-quantized GGUF format models

Graviton processors are designed for cloud workloads and run quantized models efficiently. These instances can deliver up to 50% better price-performance than comparable x86-based CPU instances for ML inference tasks.

Using Llama.cpp for Inference

We use llama.cpp as the inference framework. It includes quantized general matrix multiplication (GEMM) kernels optimized for Graviton processors using Arm Neon and SVE instructions. Models are stored in the GGUF format, which is designed for compact storage and fast loading, keeping model start-up time and inference overhead low.
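
For example, a pre-quantized GGUF model can be pulled from the Hugging Face Hub and staged in Amazon S3 before deployment. The sketch below is illustrative rather than the exact code from the original post; the repository and file names are assumptions, so substitute the model you intend to serve.

# Fetch a pre-quantized GGUF model and stage it in S3 for SageMaker.
import sagemaker
from huggingface_hub import hf_hub_download

file_name = "qwen2.5-0.5b-instruct-q4_k_m.gguf"    # hypothetical quantized model file
local_path = hf_hub_download(
    repo_id="Qwen/Qwen2.5-0.5B-Instruct-GGUF",     # hypothetical model repository
    filename=file_name,
)

sagemaker_session = sagemaker.Session()
model_path = sagemaker_session.upload_data(        # S3 location referenced later in the deploy step
    path=local_path,
    key_prefix="gguf-models",
)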

Implementation Steps

To deploy your model on SageMaker with Graviton, follow these steps:

  1. Create a Docker container compatible with ARM64 architecture.
  2. Prepare your model and inference code.
  3. Create a SageMaker model and deploy it to an endpoint with a Graviton instance.

Prerequisites

To implement this solution, you need an AWS account with permissions to create SageMaker AI, Amazon ECR, and Amazon S3 resources.

Step 1: Create a Docker Container

SageMaker AI works seamlessly with Docker containers. By packaging your algorithm and its dependencies in a container, you can bring almost any code to the SageMaker environment, regardless of framework or language. For more detailed instructions on building your Docker container, see the AWS documentation.

Customizing a Pre-built Container

Instead of creating a new image from scratch, you can extend a pre-built container to suit your needs. This approach lets you reuse the deep learning libraries and serving stack already included in the image while adding only the modifications you need.

Here’s a sample container directory setup:

.
|-- Dockerfile
|-- build_and_push.sh
|-- code
    |-- inference.py
    |-- requirements.txt
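
For this setup, requirements.txt may contain little more than the llama.cpp Python bindings; the version pin below is an assumption:

llama-cpp-python>=0.2.0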

Hosting Your Container

During hosting, SageMaker expects the container to serve two HTTP endpoints:

  • /ping confirms the container is up and healthy.
  • /invocations receives inference requests and returns predictions.

Building a Graviton-Compatible Image

The Dockerfile should start from the SageMaker PyTorch image that supports Graviton:

FROM 763104351884.dkr.ecr.{region}.amazonaws.com/pytorch-inference-arm64:2.5.1-cpu-py311-ubuntu22.04-sagemaker

Optimization Tips

  • Build llama.cpp with the compile flags -mcpu=native -fopenmp so it targets the Graviton CPU and uses OpenMP threading.
  • Set n_threads in the inference code to use all available vCPUs.
  • Use quantized models to minimize memory footprint.

Step 2: Prepare Your Model and Inference Code

The inference script defines the functions that the SageMaker inference toolkit calls when serving requests:

  • model_fn(): Loads the model from the model directory when the endpoint starts.
  • input_fn(): Deserializes incoming requests into the structure predict_fn expects.
  • predict_fn(): Runs inference against the loaded model.
  • output_fn(): Serializes the prediction into the response returned to the caller.
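
Below is a minimal sketch of what inference.py could look like. It assumes the llama-cpp-python bindings are installed in the container and that the GGUF file name is passed through the MODEL_FILE_GGUF environment variable (as in the deployment step that follows); the context size and sampling defaults are illustrative, not the exact values from the original post.

import json
import os

from llama_cpp import Llama


def model_fn(model_dir):
    # Load the quantized GGUF model with llama.cpp, using every vCPU on the Graviton instance.
    model_file = os.environ.get("MODEL_FILE_GGUF", "model.gguf")
    return Llama(
        model_path=os.path.join(model_dir, model_file),
        n_ctx=2048,                # context window size (assumption)
        n_threads=os.cpu_count(),  # maximize CPU usage, per the optimization tips above
    )


def input_fn(request_body, request_content_type):
    # Expect JSON requests such as {"prompt": "...", "max_tokens": 128}.
    if request_content_type == "application/json":
        return json.loads(request_body)
    raise ValueError(f"Unsupported content type: {request_content_type}")


def predict_fn(input_data, model):
    # Run the completion against the loaded llama.cpp model.
    return model(
        input_data["prompt"],
        max_tokens=input_data.get("max_tokens", 128),
        temperature=input_data.get("temperature", 0.7),
    )


def output_fn(prediction, response_content_type):
    # Return the generated text and llama.cpp's token usage counts as JSON.
    return json.dumps({
        "generated_text": prediction["choices"][0]["text"],
        "usage": prediction.get("usage", {}),
    })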

Step 3: Create a SageMaker Model

Utilize the SageMaker Python SDK to create a model and deploy it with a Graviton instance:

from sagemaker.pytorch import PyTorchModel

# Assumes role and region are defined for your account; model_path and file_name
# refer to the staged GGUF model from the preparation step.
pytorch_model = PyTorchModel(
    model_data={"S3DataSource": {"S3Uri": model_path, "S3DataType": "S3Prefix", "CompressionType": "None"}},
    role=role,
    env={"MODEL_FILE_GGUF": file_name},
    image_uri=f"{sagemaker_session.account_id()}.dkr.ecr.{region}.amazonaws.com/llama-cpp-python:latest",
    model_server_workers=2,
)

predictor = pytorch_model.deploy(instance_type="ml.c7g.12xlarge", initial_instance_count=1)
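
Once the endpoint is in service, you can invoke it from the same notebook. The serializer settings below assume the JSON request/response contract used in the inference script sketch above:

from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send JSON requests to the endpoint and parse JSON responses.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()

response = predictor.predict({"prompt": "Explain AWS Graviton in one sentence.", "max_tokens": 64})
print(response["generated_text"])

When you are finished experimenting, delete the endpoint (for example, with predictor.delete_endpoint()) to avoid ongoing charges.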

Performance Optimization Metrics

When serving LLMs, focus on two key metrics:

  • Latency: The time to serve a single request, often measured end to end and as time to first token.
  • Throughput: The number of tokens generated per second.

Techniques such as request batching and prompt caching can significantly improve throughput while keeping latency under control.
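
As a starting point, you can measure both metrics directly against the endpoint. The sketch below assumes the JSON contract from the earlier examples and uses the token counts reported by llama.cpp:

import time

payload = {"prompt": "Summarize the benefits of CPU-based inference.", "max_tokens": 128}

start = time.time()
result = predictor.predict(payload)
elapsed = time.time() - start

# Fall back to max_tokens if usage counts are missing from the response.
completion_tokens = result.get("usage", {}).get("completion_tokens", payload["max_tokens"])
print(f"Latency: {elapsed:.2f} s, throughput: {completion_tokens / elapsed:.1f} tokens/s")

Repeating this measurement across thread counts, instance sizes, and quantization levels gives a quick picture of the latency and throughput trade-offs on Graviton.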

Conclusion

SageMaker AI with Graviton processors provides a compelling option for organizations that want to deploy AI capabilities efficiently. By running quantized models on CPU-based instances, organizations can cut inference costs substantially while maintaining the performance that many applications need.

Explore our sample notebooks on GitHub and reference documentation to see if this approach aligns with your needs. To dive deeper, refer to the AWS Graviton Technical Guide for optimized libraries and best practices.

About the Authors

Vincent Wang, Andrew Smith, Melanie Li, PhD, Oussama Maxime Kandakji, and Romain Legret are experts at AWS, specializing in solutions architecture, generative AI, and efficient compute. Their insights help organizations navigate the complex landscape of AI and machine learning.
