Bridging the Gap: Creating Custom Model Parsers for Strands Agents on Amazon SageMaker
Organizations are increasingly harnessing the power of custom large language models (LLMs) hosted on Amazon SageMaker AI real-time endpoints. By leveraging preferred serving frameworks like SGLang, vLLM, or TorchServe, they’re optimizing costs and ensuring greater control over their deployments. However, this flexibility brings a notable challenge: response format compatibility with Strands agents.
The Challenge
While many custom serving frameworks return responses in OpenAI-compatible formats, Strands agents expect responses that align with the Bedrock Messages API. This misalignment causes integration issues even though both systems work correctly on their own. Because SageMaker lets you bring virtually any serving stack, different models and servers can introduce their own prompt and response formats, many of which do not conform to a standard API.
Bridging the Gap
The solution to this challenge lies in crafting custom model parsers. By extending the SageMakerAIModel class, organizations can translate their model server's responses into the format Strands agents expect. This approach lets them use their preferred serving frameworks while remaining compatible with the Strands Agents SDK.
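The overall shape of such a provider is sketched below. This is a minimal outline only: it assumes the Strands Agents SDK exposes SageMakerAIModel under strands.models.sagemaker (the import path may differ in your SDK version), and the method body is filled in later in this post.
# Minimal sketch of a custom provider; the import path for SageMakerAIModel
# is an assumption and may vary by Strands SDK version.
from typing import Any, Dict, List, Optional
from strands.models.sagemaker import SageMakerAIModel

class LlamaModelProvider(SageMakerAIModel):
    """Translates OpenAI-style responses from an SGLang endpoint into the
    event format Strands agents expect."""

    def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
        # Convert Strands messages into the model server's request format,
        # invoke the SageMaker endpoint, and yield parsed events (see Step 5).
        ...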
Implementation Overview
This blog will guide you through the process of building custom model parsers for Strands agents while deploying Llama 3.1 with SGLang on SageMaker using the awslabs/ml-container-creator tool.
Implementation Layers
Our implementation consists of three primary layers:
- Model Deployment Layer: Serving Llama 3.1 with SGLang to return OpenAI-compatible responses.
- Parser Layer: Creating a custom `LlamaModelProvider` class that extends `SageMakerAIModel` to handle Llama 3.1's response format.
- Agent Layer: Developing a Strands agent that uses the custom provider for conversational AI, parsing the model's responses.
Step 1: Install ml-container-creator
We’ll begin by installing the necessary tools to create the serving container for our model.
# Install Yeoman globally
npm install -g yo
# Clone and install ml-container-creator
git clone https://github.com/awslabs/ml-container-creator
cd ml-container-creator
npm install && npm link
# Verify installation
yo --generators # Should show ml-container-creator
Step 2: Generate Deployment Project
After the installation, we can generate a deployment project featuring our selected model and serving framework.
# Run the generator
yo ml-container-creator
# Configuration options:
# - Framework: transformers
# - Model Server: sglang
# - Model: meta-llama/Llama-3.1-8B-Instruct
# - Deploy Target: codebuild
# - Instance Type: ml.g6.12xlarge (GPU)
# - Region: us-east-1
This will create a structured project with necessary components, such as the Dockerfile, build configuration, and deployment scripts.
Step 3: Build and Deploy
Now, we can build and deploy the created container to SageMaker.
cd llama-31-deployment
# Build container with CodeBuild
./deploy/submit_build.sh
# Deploy to SageMaker
./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole
This process builds the Docker image, pushes it to Amazon Elastic Container Registry (ECR), and finally creates a real-time endpoint on SageMaker.
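Before wiring up the agent, it can help to confirm the endpoint is in service and responds to a raw request. The snippet below is an illustrative check using boto3; it assumes the deployment created an endpoint named llama-31-deployment-endpoint (the name used later in this post), so adjust it to your setup.
import json
import boto3

# Confirm the endpoint reached the InService state (endpoint name is an
# assumption; use whatever name your deployment scripts created).
sm = boto3.client("sagemaker", region_name="us-east-1")
status = sm.describe_endpoint(EndpointName="llama-31-deployment-endpoint")["EndpointStatus"]
print(f"Endpoint status: {status}")

# Send a minimal OpenAI-style chat request to sanity-check the SGLang server.
runtime = boto3.client("sagemaker-runtime", region_name="us-east-1")
response = runtime.invoke_endpoint(
    EndpointName="llama-31-deployment-endpoint",
    ContentType="application/json",
    Accept="application/json",
    Body=json.dumps({
        "messages": [{"role": "user", "content": "Hello, how are you?"}],
        "max_tokens": 64,
    }),
)
print(json.loads(response["Body"].read()))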
Step 4: Understanding the Response Format
Llama 3.1 returns responses in an OpenAI-compatible format, while Strands requires adherence to the Bedrock Messages API format. Here’s an example of Llama’s response:
{
"id": "cmpl-abc123",
"object": "chat.completion",
"created": 1704067200,
"model": "meta-llama/Llama-3.1-8B-Instruct",
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
"finish_reason": "stop"
}],
"usage": {
"prompt_tokens": 23,
"completion_tokens": 12,
"total_tokens": 35
}
}
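For comparison, the same reply expressed in the Bedrock Messages API shape that Strands works with looks roughly like this (illustrative only; field names follow the Bedrock Converse response format):
{
  "output": {
    "message": {
      "role": "assistant",
      "content": [{"text": "I'm doing well, thank you for asking!"}]
    }
  },
  "stopReason": "end_turn",
  "usage": {
    "inputTokens": 23,
    "outputTokens": 12,
    "totalTokens": 35
  }
}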
With the difference in formats established, we need to implement a custom model parser to ensure smooth interaction.
Step 5: Implementing a Custom Model Parser
The following is a simplified version of how to create a stream method in the custom model parser:
def stream(self, messages: List[Dict[str, Any]], tool_specs: list, system_prompt: Optional[str], **kwargs):
    # Build the OpenAI-style message list expected by the SGLang server
    payload_messages = []
    if system_prompt:
        payload_messages.append({"role": "system", "content": system_prompt})

    # Flatten Strands messages (Bedrock-style content blocks) into plain text
    for msg in messages:
        payload_messages.append({
            "role": msg.get("role", "user"),
            "content": msg["content"][0]["text"],
        })

    payload = {
        "messages": payload_messages,
        "max_tokens": kwargs.get("max_tokens", self.max_tokens),
        "temperature": kwargs.get("temperature", self.temperature),
        "stream": True,
    }

    try:
        # Invoke the endpoint and stream the response back
        response = self.runtime_client.invoke_endpoint_with_response_stream(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Accept="application/json",
            Body=json.dumps(payload),
        )

        # Process the streaming response chunk by chunk
        for event in response["Body"]:
            chunk = event["PayloadPart"]["Bytes"].decode("utf-8")
            # Extract and yield data...
    except Exception as e:
        # Surface endpoint failures to the agent as an error event
        yield {
            "type": "error",
            "error": {
                "message": f"Endpoint invocation failed: {str(e)}",
                "type": "EndpointInvocationError",
            },
        }
This stream method lets the Strands agent consume the model's streamed output and parse it into the format it expects.
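The elided "extract and yield" step depends on exactly what your model server emits. SGLang's OpenAI-compatible streaming typically sends server-sent-event lines of the form data: {...} with text deltas under choices[0].delta.content, terminated by data: [DONE]. The helper below is a sketch under that assumption (the name _extract_text_chunks is hypothetical, and it reuses the module's json import); the text pieces it returns still need to be wrapped into the event objects your Strands version expects.
def _extract_text_chunks(self, chunk: str) -> list:
    """Pull text deltas out of one OpenAI-style SSE chunk (sketch; assumes
    lines look like 'data: {...}' with deltas at choices[0].delta.content)."""
    texts = []
    for line in chunk.splitlines():
        line = line.strip()
        if not line.startswith("data:"):
            continue
        data = line[len("data:"):].strip()
        if data == "[DONE]":
            break
        try:
            parsed = json.loads(data)
        except json.JSONDecodeError:
            continue  # partial JSON can arrive split across payload parts
        delta = parsed.get("choices", [{}])[0].get("delta", {})
        if delta.get("content"):
            texts.append(delta["content"])
    return texts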
Step 6: Initialize and Test Your Agent
Once the custom parser is implemented, initializing a Strands agent becomes straightforward:
from strands.agent import Agent
# Initialize custom provider
provider = LlamaModelProvider(
endpoint_name="llama-31-deployment-endpoint",
region_name="us-east-1",
max_tokens=1000,
temperature=0.7
)
# Create the agent
agent = Agent(
name="llama-assistant",
model=provider,
system_prompt="You are a helpful AI assistant powered by Llama 3.1, deployed on Amazon SageMaker."
)
# Test the agent
response = agent("What are the key benefits of deploying LLMs on SageMaker?")
print(response)
The complete implementation, including a Jupyter notebook in the accompanying GitHub repository, provides detailed explanations and a hands-on walkthrough for building your own custom model parser.
Conclusion
Creating custom model parsers for Strands agents enables seamless integration of various LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the necessary parsing logic, organizations can leverage their chosen serving frameworks without sacrificing compatibility.
Key Takeaways
- The `awslabs/ml-container-creator` tool simplifies the deployment of BYOC models on SageMaker.
- Custom parsers are essential for bridging the gap between diverse model server response formats and Strands' expectations.
- The `stream()` method is the pivotal integration point for custom providers.
By following this guide, you're better equipped to deploy custom LLMs on SageMaker and integrate them with Strands agents in your applications.
About the Author
Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan supports customers in effectively integrating ML workflows to achieve sustainable solutions.