Amazon SageMaker AI Unveils OpenAI-Compatible API Support for Real-Time Inference

Overview

Today, Amazon SageMaker AI introduces OpenAI-compatible API support, enabling seamless integration with real-time inference endpoints for OpenAI SDK users.

Use Cases

Explore diverse applications such as agentic workflows, multi-model hosting, and serving fine-tuned models with minimal code changes.

Authentication and Security

Learn how to utilize bearer tokens for secure authentication with SageMaker AI endpoints, ensuring compliance with best security practices.

Deployment and Invocation

Step-by-step instructions for deploying single-model endpoints and inference components, including code examples for effective implementation.

Integration with Strands Agents

Discover how to integrate SageMaker AI with Strands Agents for building intelligent workflows that leverage OpenAI-compatible models.

Clean Up

Instructions on how to terminate resources to avoid ongoing charges after deployment.

Conclusion

With OpenAI compatibility, SageMaker AI bridges the gap between current AI applications and scalable infrastructure.

About the Authors

Meet the team behind the launch, including their roles and backgrounds in machine learning and AWS architecture.

Unlocking Real-time Inference: Amazon SageMaker AI Goes OpenAI-Compatible

Today, Amazon SageMaker AI introduces OpenAI-compatible API support for real-time inference endpoints. By changing only the endpoint URL, users of the OpenAI SDK, LangChain, or Strands Agents can invoke models hosted on SageMaker AI without custom clients, SigV4 wrappers, or code rewrites.

Overview

This launch adds an /openai/v1 path to SageMaker AI endpoints. The path accepts Chat Completions requests and returns the container's response directly, including streamed responses. OpenAI endpoints are automatically enabled for all inference components, so no extra configuration is required.

SageMaker AI routes each request based on the endpoint name in the URL, so any OpenAI-compatible client can call these endpoints. You can also create time-limited bearer tokens, which let your existing OpenAI clients authenticate to SageMaker AI endpoints.
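Because routing depends on the endpoint name in the URL, it can help to build the base URL in one place. The following is a minimal sketch that follows the URL pattern used in the invocation example later in this post. The helper name `openai_base_url` is hypothetical and is not part of any SDK.

```python
from urllib.parse import quote


def openai_base_url(region: str, endpoint_name: str) -> str:
    """Build the OpenAI-compatible base URL for a SageMaker AI endpoint.

    Hypothetical helper: it only formats the URL pattern shown in this post.
    """
    if not region or not endpoint_name:
        raise ValueError("region and endpoint_name must be non-empty")
    # The endpoint name is part of the path, which is what SageMaker AI routes on.
    return (
        f"https://runtime.sagemaker.{region}.amazonaws.com"
        f"/endpoints/{quote(endpoint_name, safe='')}/openai/v1"
    )


example_url = openai_base_url("us-west-2", "my-endpoint")
print(example_url)
```

The returned string is what you would pass as `base_url` to an OpenAI-compatible client.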

User Testimonials

Giorgio Piatti, an AI/ML Engineer at Caffeine.AI, describes how the team uses the feature:

“We run AI coding agents that use multiple LLM providers through an LLM gateway (Bifrost) speaking the OpenAI chat completions protocol. The bearer token feature lets us add SageMaker as a drop-in OpenAI-compatible inference endpoint—no custom SigV4 signing—so it works natively with our gateway, Vercel AI SDK, and standard OpenAI clients.”

Use Cases

1. Agentic Workflows on Owned Infrastructure

If you build multi-step AI agents with frameworks such as Strands Agents or LangChain, you can now run the entire workflow on dedicated SageMaker AI endpoints. Your agents invoke models through the same OpenAI-compatible interface, while inference runs on dedicated GPU instances in your account.

2. Multi-Model Hosting with a Single Interface

If you manage multiple models, such as Llama for general tasks, a fine-tuned Mistral for domain-specific tasks, and a smaller model for classification, you can host all of them on a single SageMaker AI endpoint. Each model gets its own resource allocation and is reachable through the same OpenAI SDK. This removes the need for separate API clients and custom routing logic in your application code.

3. Serving Fine-Tuned Models Without Code Changes

If your business fine-tunes open-source models for specific use cases, you can deploy them on SageMaker AI and keep the OpenAI-compatible interface your applications already use. The only required change is the endpoint URL. SDK calls, streaming logic, and prompt formatting stay the same.

Implementation Walkthrough

In this post, we will cover the following:

How bearer token authentication operates with SageMaker AI endpoints.
Steps to deploy and invoke a single-model endpoint.
How to set up and invoke inference components for multi-model deployments.
Integration methods with the Strands Agents framework.

Prerequisites

To follow along, ensure you have:

An AWS account with permissions to create SageMaker AI endpoints.
The SageMaker Python SDK installed.
The OpenAI Python SDK installed.
A model stored in Amazon S3 (e.g., Qwen3-4B from Hugging Face).
An IAM execution role with the necessary permissions to create endpoints.

Bearer Token Authentication

The new SageMaker AI OpenAI-compatible endpoints use bearer token authentication. A token generator creates time-limited tokens, valid for up to 12 hours, from your existing AWS credentials. No extra secrets or API keys are necessary.

Example Token Generation Script

from sagemaker.core.token_generator import generate_token
from datetime import timedelta

token = generate_token(region="us-west-2", expiry=timedelta(minutes=5))

This script generates a bearer token that expires after five minutes, using whatever AWS credentials are available in your environment.

Auto-refresh Tokens for Long-running Applications

For long-running applications, a custom httpx authentication class can generate a fresh token for each request, so a stale token is never reused:

import httpx
from sagemaker.core.token_generator import generate_token

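class SageMakerAuth(httpx.Auth):
    """Attach a freshly generated SageMaker AI bearer token to every request."""

    def __init__(self, region: str):
        self.region = region

    def auth_flow(self, request):
        # Generate a new token per request so a long-running process never sends a stale one.
        request.headers["Authorization"] = f"Bearer {generate_token(region=self.region)}"
        yield request

# Every request sent through this client carries a fresh bearer token.
http_client = httpx.Client(auth=SageMakerAuth(region="us-west-2"))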
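The auth class above generates a token on every request. If that overhead matters, one alternative is to reuse a token until shortly before it expires. The sketch below is an illustration, not part of the SageMaker Python SDK: `CachedToken`, `fetch`, and `refresh_margin` are hypothetical names, and the token source is injected as a plain callable so the example runs without AWS credentials.

```python
import time
from datetime import timedelta
from typing import Callable, Optional


class CachedToken:
    """Reuse a bearer token until shortly before it expires (hypothetical helper)."""

    def __init__(
        self,
        fetch: Callable[[], str],
        lifetime: timedelta,
        refresh_margin: timedelta = timedelta(seconds=30),
        clock: Callable[[], float] = time.monotonic,
    ):
        if refresh_margin >= lifetime:
            raise ValueError("refresh_margin must be shorter than lifetime")
        self._fetch = fetch
        # Refresh slightly early so a token never expires while a request is in flight.
        self._ttl = (lifetime - refresh_margin).total_seconds()
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch()
            self._expires_at = now + self._ttl
        return self._token


# Demonstration with a fake token source and a fake clock.
calls = []


def fake_fetch() -> str:
    calls.append(1)
    return f"token-{len(calls)}"


fake_now = [0.0]
cached = CachedToken(fake_fetch, lifetime=timedelta(minutes=5), clock=lambda: fake_now[0])
first = cached.get()
fake_now[0] = 100.0
second = cached.get()  # still inside the 270-second window, so the token is reused
fake_now[0] = 271.0
third = cached.get()  # past the window, so a new token is fetched
print(first, second, third)
```

In a real application, `fetch` would wrap the `generate_token` call shown earlier, with `lifetime` set to the same expiry you request. This sketch is not thread-safe. Add a lock if several threads share one instance.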

Deploying a Single-Model Endpoint

Below is an example of deploying a Qwen3-4B model using the SageMaker AI vLLM Deep Learning Container on an ml.g6.2xlarge instance:

import boto3
import sagemaker
from time import sleep
from sagemaker.core.helper.session_helper import Session, get_execution_role

# AWS configuration
REGION = "us-west-2"
session = Session(boto_session=boto3.Session(region_name=REGION))
EXECUTION_ROLE = get_execution_role(sagemaker_session=session)

# Model details
MODEL_HF_ID = "Qwen/Qwen3-4B"
VLLM_IMAGE = f"763104351884.dkr.ecr.{REGION}.amazonaws.com/vllm:0.20.2-gpu-py312-cu130-ubuntu22.04-sagemaker"

# Create and deploy model, endpoint configuration, and endpoint
# ... (Insert the "create model", "create endpoint config", and "create endpoint" code blocks)
# Those blocks are expected to define SME_ENDPOINT_NAME, which is used below.

print("Waiting for endpoint to reach InService status (this could take 5-10 minutes)...")
waiter = session.sagemaker_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=SME_ENDPOINT_NAME)
print(f"Endpoint is InService: {SME_ENDPOINT_NAME}")

After the endpoint is ready, it accepts both standard SageMaker AI API calls and OpenAI-compatible requests.
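The listing above leaves out the create steps. As a rough illustration only, the sketch below shows one way those three calls could look with the boto3 SageMaker client (`create_model`, `create_endpoint_config`, `create_endpoint`). It is not the authors' code. The function name, resource names, S3 URI, image URI placeholder, and role ARN are hypothetical, and any container-specific environment settings are omitted because this post does not show them.

```python
from unittest.mock import MagicMock


def deploy_single_model_endpoint(
    sagemaker_client,
    *,
    name: str,
    image_uri: str,
    model_s3_uri: str,
    execution_role_arn: str,
    instance_type: str = "ml.g6.2xlarge",
) -> str:
    """Create a model, an endpoint configuration, and an endpoint (hypothetical helper)."""
    model_name = f"{name}-model"
    config_name = f"{name}-config"

    sagemaker_client.create_model(
        ModelName=model_name,
        ExecutionRoleArn=execution_role_arn,
        PrimaryContainer={
            "Image": image_uri,
            # Assumes uncompressed model files under an S3 prefix ending in "/".
            "ModelDataSource": {
                "S3DataSource": {
                    "S3Uri": model_s3_uri,
                    "S3DataType": "S3Prefix",
                    "CompressionType": "None",
                }
            },
        },
    )
    sagemaker_client.create_endpoint_config(
        EndpointConfigName=config_name,
        ProductionVariants=[
            {
                "VariantName": "AllTraffic",
                "ModelName": model_name,
                "InstanceType": instance_type,
                "InitialInstanceCount": 1,
            }
        ],
    )
    sagemaker_client.create_endpoint(EndpointName=name, EndpointConfigName=config_name)
    return name


# Dry run against a mock so the sketch runs without AWS access.
# With real credentials, pass boto3.client("sagemaker", region_name=...) instead.
mock_client = MagicMock()
endpoint_name = deploy_single_model_endpoint(
    mock_client,
    name="qwen3-4b-demo",
    image_uri="<vllm-image-uri>",
    model_s3_uri="s3://amzn-s3-demo-bucket/qwen3-4b/",
    execution_role_arn="arn:aws:iam::111122223333:role/ExampleRole",
)
deploy_calls = [call[0] for call in mock_client.method_calls]
print(deploy_calls)
```

The endpoint is created asynchronously, so a waiter such as the `endpoint_in_service` waiter used above is still needed before the first invocation.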

Invoking a Single-Model Endpoint

Once the endpoint is in service, it can be invoked using the OpenAI Python SDK. Here’s how:

from openai import OpenAI
from sagemaker.core.token_generator import generate_token

REGION = "us-west-2"
# SME_ENDPOINT_NAME is the endpoint name defined in the deployment step.
base_url = f"https://runtime.sagemaker.{REGION}.amazonaws.com/endpoints/{SME_ENDPOINT_NAME}/openai/v1"

client = OpenAI(base_url=base_url, api_key=generate_token(region=REGION))

stream = client.chat.completions.create(
    model="",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain how transformers work in machine learning in three sentences."},
    ],
    stream=True,
)

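for chunk in stream:
    # Guard against chunks with an empty choices list (for example, a final usage chunk).
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()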
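The loop above prints tokens as they arrive. When the application needs the full reply as a single string (for logging, or to hand to the next agent step), the deltas can be accumulated instead. This is a small sketch with a hypothetical helper name, `collect_stream_text`. The fake chunks only mimic the attributes the loop reads (`choices[0].delta.content`), so the example runs without an endpoint.

```python
from types import SimpleNamespace
from typing import Iterable


def collect_stream_text(chunks: Iterable) -> str:
    """Join the text deltas of a Chat Completions stream into one string."""
    parts = []
    for chunk in chunks:
        # Skip chunks without choices and deltas without text (such as role-only deltas).
        if not chunk.choices:
            continue
        piece = chunk.choices[0].delta.content
        if piece:
            parts.append(piece)
    return "".join(parts)


def fake_chunk(text):
    """Build a stand-in for a streamed chunk carrying one text delta."""
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [
    fake_chunk(None),  # a delta with no text
    fake_chunk("Hello, "),
    fake_chunk("world"),
    SimpleNamespace(choices=[]),  # a chunk with no choices
]
reply = collect_stream_text(fake_stream)
print(reply)
```

With a real client, the `stream` object returned by `client.chat.completions.create(..., stream=True)` would be passed in place of `fake_stream`.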

Deploy an Inference Component Endpoint

With inference components, you can host multiple models on a single endpoint, with each model assigned its own compute resources.

Example of Deploying Inference Components

# Code to create the model, endpoint configuration, and endpoint goes here
# Followed by the inference component creation
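As an illustration of the per-model resource allocation described above, the sketch below builds the keyword arguments for the boto3 `create_inference_component` call. It is an assumption-laden example rather than the authors' code: the function name, component names, model names, and resource numbers are hypothetical, and you would size them for your own models and instance type.

```python
def inference_component_request(
    *,
    component_name: str,
    endpoint_name: str,
    model_name: str,
    variant_name: str = "AllTraffic",
    accelerators: int = 1,
    cpu_cores: int = 2,
    min_memory_mb: int = 8192,
    copies: int = 1,
) -> dict:
    """Build kwargs for sagemaker_client.create_inference_component (hypothetical helper)."""
    return {
        "InferenceComponentName": component_name,
        "EndpointName": endpoint_name,
        "VariantName": variant_name,
        "Specification": {
            "ModelName": model_name,
            # Each component reserves its own share of the instance.
            "ComputeResourceRequirements": {
                "NumberOfAcceleratorDevicesRequired": accelerators,
                "NumberOfCpuCoresRequired": cpu_cores,
                "MinMemoryRequiredInMb": min_memory_mb,
            },
        },
        "RuntimeConfig": {"CopyCount": copies},
    }


# Two models sharing one endpoint, each with its own allocation.
general = inference_component_request(
    component_name="general-llm",
    endpoint_name="shared-endpoint",
    model_name="general-llm-model",
)
classifier = inference_component_request(
    component_name="classifier",
    endpoint_name="shared-endpoint",
    model_name="classifier-model",
    cpu_cores=1,
    min_memory_mb=4096,
)
print(general["InferenceComponentName"], classifier["InferenceComponentName"])
```

Each dictionary can then be passed as `sagemaker_client.create_inference_component(**general)`, once the endpoint itself is in service.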

Integrate with Strands Agents

Strands Agents is an open-source SDK for building AI agents. With OpenAI-compatible support, you can run multi-agent workflows entirely on your SageMaker AI infrastructure without sending your data to an external model provider.

Example Integration Code

from openai import AsyncOpenAI
from strands import Agent, tool

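@tool
def calculator(expression: str) -> str:
    """Evaluate an arithmetic expression and return the result as a string."""
    # Caution: eval executes arbitrary Python. Use it only with trusted input.
    return str(eval(expression))

# Set up the agents using Strands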
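The calculator tool above relies on `eval`, which executes whatever expression the model produces. For anything beyond a demo, a restricted evaluator is a safer choice. The sketch below is one possible replacement that walks the parsed expression and allows only numbers and basic arithmetic. The name `safe_calculate` is hypothetical, and exponentiation is deliberately left out to avoid very expensive computations.

```python
import ast
import operator

# Only these operators are permitted. Anything else is rejected.
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
    ast.Mod: operator.mod,
}
_UNARY_OPS = {ast.UAdd: operator.pos, ast.USub: operator.neg}


def safe_calculate(expression: str) -> str:
    """Evaluate basic arithmetic without executing arbitrary Python."""

    def evaluate(node):
        if isinstance(node, ast.Constant) and type(node.value) in (int, float):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            return _BINARY_OPS[type(node.op)](evaluate(node.left), evaluate(node.right))
        if isinstance(node, ast.UnaryOp) and type(node.op) in _UNARY_OPS:
            return _UNARY_OPS[type(node.op)](evaluate(node.operand))
        raise ValueError("unsupported expression")

    tree = ast.parse(expression, mode="eval")
    return str(evaluate(tree.body))


print(safe_calculate("2 + 3 * 4"))
print(safe_calculate("(1 + 2) / 4"))
```

The body of the `calculator` tool could return `safe_calculate(expression)` instead of calling `eval`. Function calls, attribute access, and names raise `ValueError` rather than running.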

Clean Up

To avoid incurring unnecessary charges, ensure you delete your endpoints and associated resources after use:

import boto3

# Cleanup code to delete endpoints, endpoint configurations, and models
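As a sketch of what the cleanup step might look like, the function below calls the boto3 SageMaker client delete operations in dependency order: inference components first, then the endpoint, the endpoint configuration, and the models. The function name and all resource names are hypothetical. Substitute the names you used during deployment.

```python
from typing import Iterable
from unittest.mock import MagicMock


def delete_endpoint_resources(
    sagemaker_client,
    *,
    endpoint_name: str,
    endpoint_config_name: str,
    model_names: Iterable[str],
    inference_component_names: Iterable[str] = (),
) -> None:
    """Delete an endpoint and its related resources (hypothetical helper)."""
    # Inference components live on the endpoint, so remove them first.
    for component_name in inference_component_names:
        sagemaker_client.delete_inference_component(InferenceComponentName=component_name)
    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=endpoint_config_name)
    for model_name in model_names:
        sagemaker_client.delete_model(ModelName=model_name)


# Dry run against a mock so the sketch runs without AWS access.
# With real credentials, pass boto3.client("sagemaker", region_name=...) instead.
cleanup_client = MagicMock()
delete_endpoint_resources(
    cleanup_client,
    endpoint_name="qwen3-4b-demo",
    endpoint_config_name="qwen3-4b-demo-config",
    model_names=["qwen3-4b-demo-model"],
    inference_component_names=["general-llm"],
)
cleanup_calls = [call[0] for call in cleanup_client.method_calls]
print(cleanup_calls)
```

Inference component deletion is asynchronous, so in practice you may need to wait for it to finish before deleting the endpoint.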

Conclusion

The launch of OpenAI-compatible API support in Amazon SageMaker AI bridges the gap between existing AI applications and the scalable infrastructure they require. Developers can maintain their existing codebases while running inference on dedicated, reliable endpoints that meet GPU, scaling, and data residency demands.

To get started, simply deploy a model on a SageMaker AI real-time endpoint using a supported container, install the SageMaker Python SDK, and point your OpenAI client to the endpoint URL.

Ready to dive in? Check out the Amazon SageMaker AI Developer Guide for more details, or log into the Amazon SageMaker AI console to create your first endpoint.

About the Authors

Marc Karp

Marc is a Senior ML Architect with the Amazon SageMaker AI Service team, focusing on helping customers manage AI/ML workloads at scale.

Kareem Syed-Mohammed

Kareem is a Product Manager at AWS, specializing in generative AI model development on Amazon SageMaker.

Shrijeet Joshi

Shrijeet is a Senior Software Engineer at AWS, working on the core infrastructure of Amazon SageMaker AI’s real-time inference platform.

Dmitry Soldatkin

Dmitry is a Senior Machine Learning Solutions Architect at AWS, helping clients build AI/ML solutions across various industries.

Xu Deng

Xu is a Software Engineer Manager with the Amazon SageMaker AI team, passionate about optimizing AI/ML inference experiences.
