Amazon SageMaker AI Unveils OpenAI-Compatible API Support for Real-Time Inference
Overview
Today, Amazon SageMaker AI introduces OpenAI-compatible API support, enabling seamless integration with real-time inference endpoints for OpenAI SDK users.
Use Cases
Explore diverse applications such as agentic workflows, multi-model hosting, and serving fine-tuned models with minimal code changes.
Authentication and Security
Learn how to utilize bearer tokens for secure authentication with SageMaker AI endpoints, ensuring compliance with best security practices.
Deployment and Invocation
Step-by-step instructions for deploying single-model endpoints and inference components, including code examples for effective implementation.
Integration with Strands Agents
Discover how to integrate SageMaker AI with Strands Agents for building intelligent workflows that leverage OpenAI-compatible models.
Clean Up
Instructions on how to terminate resources to avoid ongoing charges after deployment.
Conclusion
With OpenAI compatibility, SageMaker AI bridges the gap between current AI applications and scalable infrastructure.
About the Authors
Meet the team behind the launch, including their roles and backgrounds in machine learning and AWS architecture.
Unlocking Real-time Inference: Amazon SageMaker AI Goes OpenAI-Compatible
Today, Amazon SageMaker AI introduces OpenAI-compatible API support for real-time inference endpoints. By changing only the endpoint URL, users of the OpenAI SDK, LangChain, or Strands Agents can invoke models hosted on SageMaker AI without custom clients, SigV4 wrappers, or code rewrites.
Overview
This launch adds an /openai/v1 path to SageMaker AI endpoints. The path accepts Chat Completions requests and returns responses directly from the container, including streamed responses. The OpenAI-compatible path is enabled automatically for all inference components, so no extra configuration is needed.
SageMaker AI routes each request based on the endpoint name in the URL, so any OpenAI-compatible client can call these endpoints once its base URL points at them. You can also create time-limited bearer tokens, which let existing OpenAI clients authenticate to SageMaker AI endpoints.
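Because routing depends on the endpoint name in the URL, building the base URL is the only endpoint-specific step on the client side. The following is a minimal sketch of that step, using the URL pattern shown later in this post. The helper name `openai_base_url` and the endpoint name `my-qwen3-endpoint` are hypothetical and used only for illustration.

```python
from urllib.parse import quote


def openai_base_url(region: str, endpoint_name: str) -> str:
    """Build the OpenAI-compatible base URL for a SageMaker AI endpoint.

    Hypothetical helper; the URL pattern follows the invocation example
    in this post.
    """
    if not region or not endpoint_name:
        raise ValueError("region and endpoint_name must be non-empty")
    # The endpoint name becomes a path segment, so percent-encode it defensively.
    return (
        f"https://runtime.sagemaker.{region}.amazonaws.com"
        f"/endpoints/{quote(endpoint_name, safe='')}/openai/v1"
    )


# "my-qwen3-endpoint" is a placeholder endpoint name.
print(openai_base_url("us-west-2", "my-qwen3-endpoint"))
```

The resulting string is what an OpenAI-compatible client would receive as its `base_url`.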
User Testimonials
Giorgio Piatti, an AI/ML Engineer at Caffeine.AI, describes the impact of this launch:
“We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol. The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint—no custom SigV4 signing—so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients.”
Use Cases
1. Agentic Workflows on Owned Infrastructure
If you build multi-step AI agents with frameworks like Strands Agents or LangChain, you can now run entire workflows on dedicated SageMaker AI endpoints. Your agents invoke models through the same OpenAI-compatible interface, while inference runs on dedicated GPU instances in your account.
2. Multi-Model Hosting with a Single Interface
If you manage multiple models (for example, Llama for general tasks, a fine-tuned Mistral for domain-specific tasks, and a smaller model for classification), you can host all of them on a single SageMaker AI endpoint. Each model has its own resource allocation and is reachable through the same OpenAI SDK. This removes the need for separate API clients and custom routing logic in your application code.
3. Serving Fine-Tuned Models Without Code Changes
If you fine-tune open-source models for specific use cases, deploying them on SageMaker AI lets your applications keep the OpenAI-compatible interface they already use. The only required change is the endpoint URL. SDK calls, streaming logic, and prompt formatting stay the same.
Implementation Walkthrough
In this post, we will cover the following:
- How bearer token authentication operates with SageMaker AI endpoints.
- Steps to deploy and invoke a single-model endpoint.
- How to set up and invoke inference components for multi-model deployments.
- Integration methods with the Strands Agents framework.
Prerequisites
To follow along, ensure you have:
- An AWS account with permissions to create SageMaker AI endpoints.
- The SageMaker Python SDK installed.
- The OpenAI Python SDK installed.
- A model stored in Amazon S3 (e.g., Qwen3-4B from Hugging Face).
- An IAM execution role with the necessary permissions to create endpoints.
Bearer Token Authentication
The SageMaker AI OpenAI-compatible endpoints use bearer token authentication. A token generator creates time-limited tokens, valid for up to 12 hours, from your existing AWS credentials. No extra secrets or API keys are necessary.
Example Token Generation Script
from sagemaker.core.token_generator import generate_token
from datetime import timedelta
token = generate_token(region="us-west-2", expiry=timedelta(minutes=5))
This script generates a bearer token for authentication, leveraging whatever AWS credentials are available in your environment.
Auto-refresh Tokens for Long-running Applications
Implement an auto-refresh pattern using httpx to ensure fresh tokens for long-running applications:
import httpx
from sagemaker.core.token_generator import generate_token
class SageMakerAuth(httpx.Auth):
    def __init__(self, region: str):
        self.region = region

    def auth_flow(self, request):
        request.headers["Authorization"] = f"Bearer {generate_token(region=self.region)}"
        yield request

http_client = httpx.Client(auth=SageMakerAuth(region="us-west-2"))
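The pattern above requests a token on every HTTP call. One possible refinement, sketched here under stated assumptions, is to reuse a token until it is close to expiry. `CachedToken`, `fetch`, `refresh_margin`, and `clock` are hypothetical names that are not part of the SageMaker SDK. The token source is passed in as a callable, so the sketch makes no assumption about the generator beyond the call shown earlier.

```python
import time
from datetime import timedelta
from typing import Callable, Optional


class CachedToken:
    """Reuse a bearer token until it nears expiry, then fetch a new one.

    Hypothetical helper, not part of any SDK. Not thread-safe as written.
    """

    def __init__(
        self,
        fetch: Callable[[], str],
        lifetime: timedelta,
        refresh_margin: timedelta = timedelta(seconds=60),
        clock: Callable[[], float] = time.monotonic,
    ):
        if refresh_margin >= lifetime:
            raise ValueError("refresh_margin must be shorter than lifetime")
        self._fetch = fetch
        self._lifetime = lifetime.total_seconds()
        self._margin = refresh_margin.total_seconds()
        self._clock = clock
        self._token: Optional[str] = None
        self._fetched_at = 0.0

    def get(self) -> str:
        now = self._clock()
        # Refresh when no token exists yet or the current one is within
        # refresh_margin of the lifetime it was requested with.
        expired = now - self._fetched_at >= self._lifetime - self._margin
        if self._token is None or expired:
            self._token = self._fetch()
            self._fetched_at = now
        return self._token


# Possible wiring with the generator from this post (requires AWS credentials):
# cached = CachedToken(
#     fetch=lambda: generate_token(region="us-west-2", expiry=timedelta(minutes=5)),
#     lifetime=timedelta(minutes=5),
# )
# request.headers["Authorization"] = f"Bearer {cached.get()}"
```

The `lifetime` passed to `CachedToken` should match the expiry requested from the generator. If the client is shared across threads, guard `get()` with a lock.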
Deploying a Single-Model Endpoint
Below is an example of deploying a Qwen3-4B model using the SageMaker AI vLLM Deep Learning Container on an ml.g6.2xlarge instance:
import boto3
import sagemaker
from time import sleep
from sagemaker.core.helper.session_helper import Session, get_execution_role
# AWS configuration
REGION = "us-west-2"
session = Session(boto_session=boto3.Session(region_name=REGION))
EXECUTION_ROLE = get_execution_role(sagemaker_session=session)
# Model details
MODEL_HF_ID = "Qwen/Qwen3-4B"
VLLM_IMAGE = f"763104351884.dkr.ecr.{REGION}.amazonaws.com/vllm:0.20.2-gpu-py312-cu130-ubuntu22.04-sagemaker"
# Create and deploy model, endpoint configuration, and endpoint
# ... (Insert the "create model", "create endpoint config", and "create endpoint" code blocks)
print("Waiting for endpoint to reach InService status (this could take 5-10 minutes)...")
waiter = session.sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=SME_ENDPOINT_NAME)
print(f"Endpoint is InService: {SME_ENDPOINT_NAME}")
After the endpoint is ready, it accepts both standard SageMaker AI API calls and OpenAI-compatible requests.
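The create steps are elided in the example above. As a rough sketch only, and not the code from the original walkthrough, the three requests could be assembled as plain dictionaries for the boto3 SageMaker client operations `create_model`, `create_endpoint_config`, and `create_endpoint`. The helper name `build_deploy_requests`, the shared resource name, and the variant name `AllTraffic` are assumptions. Container environment variables, including how the container locates model weights, are container-specific and left to the caller.

```python
from typing import Any, Dict, Optional


def build_deploy_requests(
    name: str,
    image_uri: str,
    role_arn: str,
    instance_type: str,
    environment: Optional[Dict[str, str]] = None,
) -> Dict[str, Dict[str, Any]]:
    """Return keyword arguments for the three create calls, in call order.

    Hypothetical helper: it reuses one name for the model, the endpoint
    configuration, and the endpoint to keep the sketch short.
    """
    return {
        "create_model": {
            "ModelName": name,
            "ExecutionRoleArn": role_arn,
            "PrimaryContainer": {
                "Image": image_uri,
                # Container-specific settings supplied by the caller.
                "Environment": dict(environment or {}),
            },
        },
        "create_endpoint_config": {
            "EndpointConfigName": name,
            "ProductionVariants": [
                {
                    "VariantName": "AllTraffic",  # assumed variant name
                    "ModelName": name,
                    "InstanceType": instance_type,
                    "InitialInstanceCount": 1,
                }
            ],
        },
        "create_endpoint": {
            "EndpointName": name,
            "EndpointConfigName": name,
        },
    }


# Possible use with the session from the example above (requires AWS access):
# requests = build_deploy_requests(
#     SME_ENDPOINT_NAME, VLLM_IMAGE, EXECUTION_ROLE, "ml.g6.2xlarge"
# )
# for operation, kwargs in requests.items():
#     getattr(session.sagemaker_client, operation)(**kwargs)
```

Building the requests as data keeps them easy to inspect or log before any resource is created.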
Invoking a Single-Model Endpoint
Once the endpoint is in service, it can be invoked using the OpenAI Python SDK. Here’s how:
from openai import OpenAI
from sagemaker.core.token_generator import generate_token
REGION = "us-west-2"
base_url = f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1"
client = OpenAI(base_url=base_url, api_key=generate_token(region=REGION))
stream = client.chat.completions.create(
    model="",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how transformers work in machine learning in three sentences."},
    ],
    stream=True,
)
for chunk in stream:
    if chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="")
print()
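When the streamed text is needed as a single string rather than printed as it arrives, the chunks can be collected with a small helper. The sketch below assumes chunks shaped like the ones in the loop above (`choices[0].delta.content`) and also skips chunks with an empty `choices` list, which some OpenAI-compatible servers emit, for example for usage statistics. `collect_stream_text` is a hypothetical name, and the fake chunks stand in for a real stream so the example runs without an endpoint.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(stream: Iterable) -> str:
    """Join the text deltas of a Chat Completions stream into one string."""
    parts = []
    for chunk in stream:
        # Skip chunks that carry no choices (for example, usage-only chunks).
        if not chunk.choices:
            continue
        content = chunk.choices[0].delta.content
        if content:
            parts.append(content)
    return "".join(parts)


def _fake_chunk(text):
    """Build a stand-in object with the same attribute shape as a stream chunk."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    _fake_chunk("Hello"),
    _fake_chunk(None),             # role-only or final chunk without text
    SimpleNamespace(choices=[]),   # chunk without choices
    _fake_chunk(", world"),
]
print(collect_stream_text(fake_stream))
```

With a real endpoint, the `stream` object returned by `client.chat.completions.create(..., stream=True)` can be passed in place of `fake_stream`.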
Deploy an Inference Component Endpoint
With inference components, you can host multiple models on a single endpoint, each with its own resource allocation.
Example of Deploying Inference Components
# Code to create the model, endpoint configuration, and endpoint goes here
# Followed by the inference component creation
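The inference component step is elided above. As a hedged sketch, the request for the boto3 SageMaker client operation `create_inference_component` could be assembled as follows. The helper name, the default variant name `AllTraffic`, and the resource values in the usage comment are assumptions for illustration, not values from the original walkthrough.

```python
from typing import Any, Dict


def build_inference_component_request(
    name: str,
    endpoint_name: str,
    model_name: str,
    accelerators: int,
    min_memory_mb: int,
    variant_name: str = "AllTraffic",  # assumed variant name
    copies: int = 1,
) -> Dict[str, Any]:
    """Return keyword arguments for create_inference_component.

    Hypothetical helper; each component reserves its own share of the
    endpoint's accelerators and memory.
    """
    if accelerators < 1 or min_memory_mb < 1 or copies < 1:
        raise ValueError("accelerators, min_memory_mb, and copies must be positive")
    return {
        "InferenceComponentName": name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copies},
    }


# Possible use (placeholder names and sizes, requires AWS access):
# request = build_inference_component_request(
#     "qwen3-4b-ic", "my-ic-endpoint", "qwen3-4b-model",
#     accelerators=1, min_memory_mb=8192,
# )
# sagemaker_client.create_inference_component(**request)
```

Calling the helper once per model gives each component its own resource reservation on the shared endpoint.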
Integrate with Strands Agents
Strands Agents is an open-source SDK for building AI agents. With OpenAI-compatible support, you can run multi-agent workflows entirely on your SageMaker AI infrastructure without sending your data to an external provider.
Example Integration Code
from openai import AsyncOpenAI
from strands import Agent, tool
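@tool
def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression and return the result as a string."""
    # Caution: eval() executes arbitrary Python; restrict or replace it for untrusted input.
    return str(eval(expression))
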
# Setup the agents using Strands
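The calculator tool above passes model-generated text to `eval`, which runs arbitrary Python. One alternative, sketched here and not taken from the original post, is to parse the expression with the standard `ast` module and evaluate only numeric literals and basic arithmetic operators. `safe_arithmetic` is a hypothetical name. Exponentiation is left out on purpose, because very large powers can consume unbounded time and memory.

```python
import ast
import operator

# Operators the evaluator accepts; anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_arithmetic(expression: str):
    """Evaluate +, -, *, /, % on int and float literals without eval()."""

    def walk(node):
        if isinstance(node, ast.Expression):
            return walk(node.body)
        # Accept only plain int/float literals (bool and str are rejected).
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](walk(node.left), walk(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](walk(node.operand))
        raise ValueError("unsupported expression")

    return walk(ast.parse(expression, mode="eval"))


print(safe_arithmetic("2 + 3 * 4"))
```

Inside the tool, `str(safe_arithmetic(expression))` could take the place of `str(eval(expression))`. Names, function calls, and attribute access all raise `ValueError`.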
Clean Up
To avoid incurring unnecessary charges, ensure you delete your endpoints and associated resources after use:
import boto3
# Cleanup code to delete endpoints, endpoint configurations, and models
Conclusion
The launch of OpenAI-compatible API support in Amazon SageMaker AI bridges the gap between existing AI applications and the scalable infrastructure they require. Developers can maintain their existing codebases while running inference on dedicated, reliable endpoints that meet GPU, scaling, and data residency demands.
To get started, simply deploy a model on a SageMaker AI real-time endpoint using a supported container, install the SageMaker Python SDK, and point your OpenAI client to the endpoint URL.
Ready to dive in? Check out the Amazon SageMaker AI Developer Guide for more details, or log into the Amazon SageMaker AI console to create your first endpoint.
About the Authors
Marc Karp
Marc is a Senior ML Architect with the Amazon SageMaker AI Service team, focusing on helping customers manage AI/ML workloads at scale.
Kareem Syed-Mohammed
Kareem is a Product Manager at AWS, specializing in generative AI model development on Amazon SageMaker.
Shrijeet Joshi
Shrijeet is a Senior Software Engineer at AWS, working on the core infrastructure of Amazon SageMaker AI’s real-time inference platform.
Dmitry Soldatkin
Dmitry is a Senior Machine Learning Solutions Architect at AWS, helping clients build AI/ML solutions across various industries.
Xu Deng
Xu is a Software Engineer Manager with the Amazon SageMaker AI team, passionate about optimizing AI/ML inference experiences.