Configuration Guide for Deploying Voxtral Models
Model Setup in code/serving.properties
Deployment Details
To deploy the Voxtral-Mini model, set the following in code/serving.properties:
option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1
To deploy the Voxtral-Small model:
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
Endpoint Deployment
Run the Voxtral-vLLM-BYOC-SageMaker.ipynb notebook to set up your endpoint and test various features, including text, audio, and function calling capabilities.
Docker Container Configuration
Overview
The complete Dockerfile is available in the GitHub repository, with key configurations highlighted below.
Dockerfile Snippet
# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest
# SageMaker Environment Setup
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
ENV HF_HOME=/tmp/hf_home
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Install dependencies for audio processing
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"
Explanation
This Dockerfile creates a specialized container that enhances the official vLLM server with Voxtral-specific capabilities while configuring the essential SageMaker environment variables and adding required audio processing libraries. It facilitates the seamless deployment of different Voxtral variants.
Model Configurations
Configuration File Overview
Detailed model configurations are specified in the serving.properties file located in the code folder.
Key Configuration Snippet
# Model configuration
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
option.dtype=bfloat16
# Voxtral-specific settings
option.tokenizer_mode=mistral
option.config_format=mistral
option.load_format=mistral
option.trust_remote_code=true
# Audio processing specifications
option.limit_mm_per_prompt=audio:8
option.mm_processor_kwargs={"audio_sampling_rate": 16000, "audio_max_length": 1800.0}
# Performance optimizations
option.enable_chunked_prefill=true
option.enable_prefix_caching=true
option.use_v2_block_manager=true
Description
This configuration file optimally sets up the Voxtral model according to Mistral’s recommendations, supporting various features like audio processing and advanced caching mechanisms for efficient inference.
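The custom handler described in the next section reads this file at startup via a load_serving_properties() helper, whose implementation is not shown in the snippets. A minimal sketch of such a parser, assuming SageMaker extracts the uploaded code folder into its standard /opt/ml/model directory (the exact path is an assumption):
# Minimal sketch of a serving.properties parser; the real helper lives in model.py
from pathlib import Path

def load_serving_properties(path="/opt/ml/model/serving.properties"):
    """Parse key=value pairs, ignoring blank lines and comments."""
    config = {}
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config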
Custom Inference Handler
Inference Handler Code Overview
The complete custom inference code lives in the model.py file, which uses FastAPI to bridge SageMaker's inference contract to the vLLM server.
Key Functions Snippet
# FastAPI app for SageMaker compatibility
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
# Server Initialization
def start_vllm_server():
    config = load_serving_properties()
    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--config-format", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]
    vllm_server_process = subprocess.Popen(cmd, env=vllm_env)
    server_ready = wait_for_server()
    return server_ready

@app.post("/invocations")
async def invoke_model(request: Request):
    # Implementation for transcription and chat requests
Explanation
This FastAPI-based handler facilitates integration with the vLLM server and efficiently manages multimodal content, supporting advanced function calling features.
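Beyond /invocations, SageMaker also requires the container to answer health checks on /ping and to listen on port 8080. A minimal sketch of that contract, assuming the local vLLM server exposes its /health endpoint on port 8000 as configured above (vllm_is_healthy is an illustrative helper, not code from the repository):
import urllib.request
import uvicorn
from fastapi import FastAPI, Response

app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")

def vllm_is_healthy() -> bool:
    """Return True if the local vLLM server answers its /health endpoint."""
    try:
        with urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=2) as resp:
            return resp.status == 200
    except Exception:
        return False

@app.get("/ping")
async def ping():
    # SageMaker polls /ping and only routes traffic once it returns 200.
    return Response(status_code=200 if vllm_is_healthy() else 503)

if __name__ == "__main__":
    # SageMaker expects the inference server to listen on port 8080.
    uvicorn.run(app, host="0.0.0.0", port=8080)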
SageMaker Deployment Code
Deployment Code Overview
The Voxtral-vLLM-BYOC-SageMaker.ipynb notebook handles the deployment for both Voxtral models.
Code Snippet for Deployment
import boto3
import sagemaker
from sagemaker.model import Model
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
bucket = "your-s3-bucket"
byoc_config_uri = sagemaker_session.upload_data(
    path="./code",
    bucket=bucket,
    key_prefix="voxtral-vllm-byoc/code"
)
# Model container configuration and deployment
Model Use Cases
Overview
Voxtral models support a variety of use cases, including text and audio processing.
Text-only Interaction Example
payload = {
    "messages": [
        {"role": "user", "content": "Hello! Can you tell me about the advantages of using vLLM for model inference?"}
    ],
    "max_tokens": 200,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Transcription-only Example
payload = {
    "transcription": {
        "audio": "https://audio.url/example.mp3",
        "language": "fr",
        "temperature": 0.0
    }
}
response = predictor.predict(payload)
Multimodal Processing Example
payload = {
    "messages": [{
        "role": "user",
        "content": [
            "Can you summarize this audio file?",
            {"type": "audio", "path": "https://audio.url/example.mp3"}
        ]
    }],
    "max_tokens": 300,
    "temperature": 0.2,
    "top_p": 0.95
}
response = predictor.predict(payload)
Tool Use Example
WEATHER_TOOL = {
    "type": "function",
    "function": {...}
}
payload = {
    "messages": [...],
    "temperature": 0.2,
    "top_p": 0.95,
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
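The actual tool schema lives in the notebook. Purely as an illustration, a hypothetical weather tool in the standard function-calling format might look like this (the name and parameters are invented for the example):
# Hypothetical tool definition for illustration only; the notebook's real schema may differ.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_current_weather",
        "description": "Get the current weather for a given city.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "City name, e.g. Paris"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}
            },
            "required": ["city"]
        }
    }
}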
Clean Up Process
Endpoint Deletion
To delete your SageMaker endpoints and avoid added costs:
print(f"Deleting endpoint: {endpoint_name}")
predictor.delete_endpoint(delete_endpoint_config=True)
print("Endpoint deleted successfully")
Conclusion
This guide provides a comprehensive approach to deploying Voxtral models on SageMaker using the BYOC method, delivering a production-ready setup with state-of-the-art text and audio processing capabilities.
Authors
Ying Hou, PhD – Specialist Solution Architect for GenAI at AWS, with expertise in deploying intelligent AI models across various platforms.
Deploying Voxtral Models with SageMaker: A Comprehensive Guide
In today’s fast-paced world of artificial intelligence and machine learning, deploying advanced models effectively can give you a significant edge. Voxtral, an innovative family of models from Mistral, offers robust capabilities for multimodal processing, encompassing everything from text interaction to sophisticated audio handling. In this blog post, we will guide you through configuring, deploying, and utilizing Voxtral models, specifically Voxtral-Mini and Voxtral-Small, on AWS SageMaker.
Configuration in serving.properties
Before diving into deployment, it’s crucial to set up the configuration correctly. Below are the specific configurations for the two Voxtral variants:
Voxtral-Mini Configuration
To deploy Voxtral-Mini, use the following settings in your serving.properties:
option.model_id=mistralai/Voxtral-Mini-3B-2507
option.tensor_parallel_degree=1
Voxtral-Small Configuration
For Voxtral-Small, the configuration will be as follows:
option.model_id=mistralai/Voxtral-Small-24B-2507
option.tensor_parallel_degree=4
These settings select which model to load and size the tensor parallelism to match: Voxtral-Mini runs on a single GPU, while Voxtral-Small is sharded across four GPUs.
Deploying the Endpoint
After configuring your model, the next step is to deploy your endpoint. You can do this effortlessly using the Voxtral-vLLM-BYOC-SageMaker.ipynb notebook, which guides you through deploying and testing with text, audio, and function calling capabilities.
Docker Container Configuration
The GitHub repository provides a Dockerfile critical for deploying the Voxtral models. Below are the essential components of the Dockerfile:
# Custom vLLM Container for Voxtral Model Deployment on SageMaker
FROM --platform=linux/amd64 vllm/vllm-openai:latest
# Set environment variables for SageMaker
ENV MODEL_CACHE_DIR=/opt/ml/model
ENV TRANSFORMERS_CACHE=/tmp/transformers_cache
# Install audio processing dependencies
RUN pip install --no-cache-dir \
    "mistral_common>=1.8.1" \
    "librosa>=0.10.2" \
    "soundfile>=0.12.1" \
    "pydub>=0.25.1"
Key Highlights:
- Custom Environment: This setup creates a generic container with the necessary audio processing libraries.
- Dynamic Model Injection: Your model-specific code (such as model.py and serving.properties) can be injected at runtime, so the same image serves both Voxtral variants.
Model Configuration and Optimizations
In the serving.properties, you’ll find full model configurations, including optimizations tailored to Voxtral models. For instance:
option.dtype=bfloat16
option.tokenizer_mode=mistral
option.limit_mm_per_prompt=audio:8
These configurations ensure efficient processing and enable advanced features, such as supporting up to eight audio files per prompt.
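For example, a single request can reference several recordings at once; with the setting above, up to eight audio items per prompt should be accepted. The URLs below are placeholders:
payload = {
    "messages": [{
        "role": "user",
        "content": [
            "Compare these two recordings and list the differences.",
            {"type": "audio", "path": "https://audio.url/recording-1.mp3"},
            {"type": "audio", "path": "https://audio.url/recording-2.mp3"}
        ]
    }],
    "max_tokens": 300
}
response = predictor.predict(payload)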
Custom Inference Handler
The inference handler is vital for processing requests effectively. Below is a snippet of the FastAPI-based server implementation:
app = FastAPI(title="Voxtral vLLM Inference Server", version="1.1.0")
model_engine = None
# vLLM Server Initialization for Voxtral
def start_vllm_server():
    """Start vLLM server with Voxtral-specific configuration"""
    config = load_serving_properties()
    # Command for vLLM server initiation
    cmd = [
        "vllm", "serve", config.get("option.model_id"),
        "--tokenizer-mode", "mistral",
        "--tensor-parallel-size", config.get("option.tensor_parallel_degree"),
        "--host", "127.0.0.1",
        "--port", "8000"
    ]
This snippet builds the launch command for the local vLLM server; the handler then waits for the server to become ready before accepting requests.
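Model weights can take several minutes to load, so the handler polls the server before reporting readiness. A minimal sketch of the wait_for_server() helper referenced earlier, assuming vLLM's /health endpoint on port 8000 (the timeout value is an assumption):
import time
import urllib.request

def wait_for_server(timeout_seconds: int = 900) -> bool:
    """Poll the local vLLM /health endpoint until it responds or the timeout expires."""
    deadline = time.time() + timeout_seconds
    while time.time() < deadline:
        try:
            with urllib.request.urlopen("http://127.0.0.1:8000/health", timeout=5) as resp:
                if resp.status == 200:
                    return True
        except Exception:
            pass
        time.sleep(5)
    return False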
SageMaker Deployment Code
The provided notebook orchestrates the entire deployment process, ensuring a smooth transition from model development to production. A brief overview includes:
import boto3
import sagemaker
from sagemaker.model import Model
# Initialize SageMaker session
sagemaker_session = sagemaker.Session()
role = sagemaker.get_execution_role()
# Configure and deploy the model
voxtral_model = Model(
    image_uri=image_uri,
    model_data={"S3DataSource": {"S3Uri": f"{byoc_config_uri}/"}},
    role=role,
)
predictor = voxtral_model.deploy(
    initial_instance_count=1,
    instance_type="ml.g6.12xlarge",  # For Voxtral-Small
)
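The examples that follow pass Python dictionaries to predictor.predict(), which requires JSON serialization on the predictor; depending on how the notebook constructs it, this may already be configured. A minimal sketch with the SageMaker Python SDK v2:
from sagemaker.serializers import JSONSerializer
from sagemaker.deserializers import JSONDeserializer

# Send dict payloads as JSON and parse JSON responses back into dicts.
predictor.serializer = JSONSerializer()
predictor.deserializer = JSONDeserializer()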
Exploring Use Cases
Voxtral models support a plethora of use cases, including:
Text-only Interaction
payload = {
    "messages": [{"role": "user", "content": "Hello! Can you tell me about vLLM?"}],
    "max_tokens": 200,
}
response = predictor.predict(payload)
Audio Transcription
For transcription tasks, set temperature to 0 for deterministic output:
payload = {
    "transcription": {
        "audio": "Audio URL here",
        "language": "fr",
        "temperature": 0.0,
    }
}
response = predictor.predict(payload)
Multimodal Processing
Both text and audio can be processed together for complex interactions:
payload = {
    "messages": [{
        "role": "user",
        "content": ["Can you summarize this audio file?", {"type": "audio", "path": "Audio URL here"}]
    }],
}
response = predictor.predict(payload)
Tool Utilization
Voxtral also supports function calling based on input commands:
# Function configuration and usage
payload = {
    "messages": [{"role": "user", "content": [{"type": "audio", "path": "Audio URL here"}]}],
    "tools": [WEATHER_TOOL]
}
response = predictor.predict(payload)
Conclusion
Deploying Voxtral models on AWS SageMaker using the BYOC (Bring Your Own Container) approach offers a flexible architecture that can evolve with your project. From seamless text interactions to sophisticated audio processing, Voxtral empowers developers to create robust voice-enabled applications.
To explore the complete code and capabilities, visit the GitHub repository. By following this guide, you can harness the full power of voice and text, unlocking new possibilities in your AI applications.
About the Author
Ying Hou, PhD, is a Sr. Specialist Solution Architect for GenAI at AWS, dedicated to bringing the latest AI models to the AWS platform. With extensive expertise in various AI domains, Ying collaborates closely with customers to develop innovative machine learning applications.
If you’re ready to take your multimodal AI applications to the next level, dive into the Voxtral models and start building your own voice-enabled tools today!