Configuration Guide for Deploying Voxtral Models
Model Setup in code/serving.properties
Deployment Details
To deploy the Voxtral-Mini model, set the following in code/serving.properties:
option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1
To deploy the Voxtral-Small model:
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
Endpoint Deployment
Run the Voxtral-vLLM-BYOC-SageMaker.ipynb notebook to set up your endpoint and test various features, including text, audio, and function calling capabilities.
Docker Container Configuration
Overview
The complete Dockerfile is available in the GitHub repository, with key configurations highlighted below.
Dockerfile Snippet
# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest
# SageMaker Environment Setup
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Install dependencies for audio processing
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"
Explanation
This Dockerfile creates a specialized container that enhances the official vLLM server with Voxtral-specific capabilities while configuring the essential SageMaker environment variables and adding required audio processing libraries. It facilitates the seamless deployment of different Voxtral variants.
Model Configurations
Configuration File Overview
Detailed model configurations are specified in the serving.properties file located in the code folder.
Key Configuration Snippet
# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing specifications
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true
Description
This configuration file optimally sets up the Voxtral model according to Mistral’s recommendations, supporting various features like audio processing and advanced caching mechanisms for efficient inference.
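The custom handler described in the next section reads this file at startup via a load_serving_properties() helper, whose implementation is not shown in the snippets. A minimal sketch of such a parser, assuming SageMaker extracts the uploaded code folder into its standard /opt/ml/model directory (the exact path is an assumption):
# Minimal sketch of a serving.properties parser; the real helper lives in model.py
from pathlib import Path

def load_serving_properties(path="/opt/ml/model/serving.properties"):
    """Parse key=value pairs, ignoring blank lines and comments."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config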
Custom Inference Handler
Inference Handler Code Overview
The complete custom inference code lives in the model.py file, which uses FastAPI to bridge SageMaker's inference contract to the vLLM server.
Key Functions Snippet
# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
# Server Initialization
def start_vllm_server():
    config = load_serving_properties()
    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]
    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready

@app.post("/invocations")
async def invoke_model(request: Request):
    # Implementation for transcription and chat requests
Explanation
This FastAPI-based handler facilitates integration with the vLLM server and efficiently manages multimodal content, supporting advanced function calling features.
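Beyond /invocations, SageMaker also requires the container to answer health checks on /ping and to listen on port 8080. A minimal sketch of that contract, assuming the local vLLM server exposes its /health endpoint on port 8000 as configured above (vllm_is_healthy is an illustrative helper, not code from the repository):
import urllib.request
import uvicorn
from fastapi import FastAPI, Response

app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")

def vllm_is_healthy() -> bool:
    """Return True if the local vLLM server answers its /health endpoint."""
    try:
        with urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

@app.get("/ping")
async def ping():
    # SageMaker polls /ping and only routes traffic once it returns 200.
    return Response(status_code=200 if vllm_is_healthy() else 503)

if __name__ == "__main__":
    # SageMaker expects the inference server to listen on port 8080.
    uvicorn.run(app, host="0.0.0.0", port=8080)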
SageMaker Deployment Code
Deployment Code Overview
The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook handles the deployment for both Voxtral models.
Code Snippet for Deployment
import boto3
import sagemaker
from sagemaker.model import Model
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"
byoc_config_uri = sagemaker_session.upload_data(
    path="./code",
    bucket=bucket,
    key_prefix="voxtral-vllm-byoc/code"
)
# Model container configuration and deployment
Model Use Cases
Overview
Voxtral models support a variety of use cases, including text and audio processing.
Text-only Interaction Example
payload = {
    "messages": [
        {"role": "user", "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"}
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Transcription-only Example
payload = {
    "transcription": {
        "audio": "https://audio.url/example.mp3",
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)
Multimodal Processing Example
payload = {
    "messages": [{
        "role": "user",
        "content": [
            "Can you summarize this audio file?",
            {"type": "audio", "path": "https://audio.url/example.mp3"}
        ]
    }],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Tool Use Example
WEATHER_TOOL = {
    "type": "function",
    "function": {...}
}
payload = {
    "messages": [...],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
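The actual tool schema lives in the notebook. Purely as an illustration, a hypothetical weather tool in the standard function-calling format might look like this (the name and parameters are invented for the example):
# Hypothetical tool definition for illustration only; the notebook's real schema may differ.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}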
Clean Up Process
Endpoint Deletion
To delete your SageMaker endpoints and avoid added costs:
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")
Conclusion
This guide provides a comprehensive approach to deploying Voxtral models on SageMaker using the BYOC method, delivering a production-ready setup with state-of-the-art text and audio processing capabilities.
Authors
Ying Hou, PhD – Specialist Solution Architect for GenAI at AWS, with expertise in deploying intelligent AI models across various platforms.
Deploying Voxtral Models with SageMaker: A Comprehensive Guide
In today’s fast-paced world of artificial intelligence and machine learning, deploying advanced models effectively can give you a significant edge. Voxtral, an innovative family of models from Mistral, offers robust capabilities for multimodal processing, encompassing everything from text interaction to sophisticated audio handling. In this blog post, we will guide you through configuring, deploying, and utilizing Voxtral models, specifically Voxtral-Mini and Voxtral-Small, on AWS SageMaker.
Configuration in serving.properties
Before diving into deployment, it’s crucial to set up the configuration correctly. Below are the specific configurations for the two Voxtral variants:
Voxtral-Mini Configuration
To deploy Voxtral-Mini, use the following settings in your serving.properties:
option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1
Voxtral-Small Configuration
For Voxtral-Small, the configuration will be as follows:
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
These settings select which model to load and size the tensor parallelism to match: Voxtral-Mini runs on a single GPU, while Voxtral-Small is sharded across four GPUs.
Deploying the Endpoint
After configuring your model, the next step is to deploy your endpoint. You can do this effortlessly using the Voxtral-vLLM-BYOC-SageMaker.ipynb notebook, which guides you through deploying and testing with text, audio, and function calling capabilities.
Docker Container Configuration
The GitHub repository provides a Dockerfile critical for deploying the Voxtral models. Below are the essential components of the Dockerfile:
# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest
# Set environment variables for SageMaker
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
# Install audio processing dependencies
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"
Key Highlights:
- Custom Environment: This setup creates a generic container with the necessary audio processing libraries.
- Dynamic Model Injection: Your model-specific code (such as model.py and serving.properties) can be injected at runtime, so the same image serves both Voxtral variants.
Model Configuration and Optimizations
In the serving.properties, you’ll find full model configurations, including optimizations tailored to Voxtral models. For instance:
option.dtype=bfloat16
option.tokenizer_mode=mistral
option.limit_mm_per_prompt=audio:8
These configurations ensure efficient processing and enable advanced features, such as supporting up to eight audio files per prompt.
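For example, a single request can reference several recordings at once; with the setting above, up to eight audio items per prompt should be accepted. The URLs below are placeholders:
payload = {
    "messages": [{
        "role": "user",
        "content": [
            "Compare these two recordings and list the differences.",
            {"type": "audio", "path": "https://audio.url/recording-1.mp3"},
            {"type": "audio", "path": "https://audio.url/recording-2.mp3"}
        ]
    }],
    "max_tokens": 300
}
response = predictor.predict(payload)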
Custom Inference Handler
The inference handler is vital for processing requests effectively. Below is a snippet of the FastAPI-based server implementation:
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None
# vLLM Server Initialization for Voxtral
def start_vllm_server():
    """Start vLLM server with Voxtral-specific configuration"""
    config = load_serving_properties()
    # Command for vLLM server initiation
    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]
This snippet builds the launch command for the local vLLM server; the handler then waits for the server to become ready before accepting requests.
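Model weights can take several minutes to load, so the handler polls the server before reporting readiness. A minimal sketch of the wait_for_server() helper referenced earlier, assuming vLLM's /health endpoint on port 8000 (the timeout value is an assumption):
import time
import urllib.request

def wait_for_server(timeout_seconds: int = 900) -> bool:
    """Poll the local vLLM /health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except Exception:
            pass
        time.sleep(5)
    return False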
SageMaker Deployment Code
The provided notebook orchestrates the entire deployment process, ensuring a smooth transition from model development to production. A brief overview includes:
import boto3
import sagemaker
from sagemaker.model import Model
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
# Configure and deploy the model
voxtral_model = Model(
    image_uri=image_uri,
    model_data={"S3DataSource": {"S3Uri": f"{byoc_config_uri}/"}},
    role=role,
)
predictor = voxtral_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # For Voxtral-Small
)
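The examples that follow pass Python dictionaries to predictor.predict(), which requires JSON serialization on the predictor; depending on how the notebook constructs it, this may already be configured. A minimal sketch with the SageMaker Python SDK v2:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send dict payloads as JSON and parse JSON responses back into dicts.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()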
Exploring Use Cases
Voxtral models support a plethora of use cases, including:
Text-only Interaction
payload = {
    "messages": [{"role": "user", "content": "Hello! Can you tell me about vLLM?"}],
    "max_tokens": 200,
}
response = predictor.predict(payload)
Audio Transcription
For transcription tasks, set temperature to 0 for deterministic output:
payload = {
    "transcription": {
        "audio": "Audio URL here",
        "language": "fr",
        "temperature": 0.0,
    }
}
response = predictor.predict(payload)
Multimodal Processing
Both text and audio can be processed together for complex interactions:
payload = {
    "messages": [{
        "role": "user",
        "content": ["Can you summarize this audio file?", {"type": "audio", "path": "Audio URL here"}]
    }],
}
response = predictor.predict(payload)
Tool Utilization
Voxtral also supports function calling based on input commands:
# Function configuration and usage
payload = {
    "messages": [{"role": "user", "content": [{"type": "audio", "path": "Audio URL here"}]}],
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
Conclusion
Deploying Voxtral models on AWS SageMaker using the BYOC (Bring Your Own Container) approach offers a flexible architecture that can evolve with your project. From seamless text interactions to sophisticated audio processing, Voxtral empowers developers to create robust voice-enabled applications.
To explore the complete code and capabilities, visit the GitHub repository. By following this guide, you can harness the full power of voice and text, unlocking new possibilities in your AI applications.
About the Author
Ying Hou, PhD, is a Sr. Specialist Solution Architect for GenAI at AWS, dedicated to bringing the latest AI models to the AWS platform. With extensive expertise in various AI domains, Ying collaborates closely with customers to develop innovative machine learning applications.
If you’re ready to take your multimodal AI applications to the next level, dive into the Voxtral models and start building your own voice-enabled tools today!