Building Custom Model Parsers for Strands Agents with Amazon SageMaker

Organizations are increasingly hosting custom large language models (LLMs) on Amazon SageMaker AI real-time endpoints, using preferred serving frameworks such as SGLang, vLLM, or TorchServe to optimize costs and retain control over their deployments. This flexibility, however, brings a notable challenge: response format compatibility with Strands agents.

The Challenge

While many custom serving frameworks return responses in OpenAI-compatible formats, Strands agents expect responses that align with the Bedrock Messages API. This mismatch breaks the integration even though each system works correctly on its own. And although the Amazon Bedrock Mantle distributed inference engine has supported OpenAI messaging formats since December 2025, SageMaker's flexibility means diverse models can introduce unique prompt and response formats, many of which do not conform to standard APIs.

Bridging the Gap

The solution to this challenge lies in crafting custom model parsers. By extending the SageMakerAIModel class, organizations can translate their model server's responses into the format Strands agents expect, keeping their chosen serving frameworks while remaining compatible with the Strands Agents SDK. A skeleton of this pattern is sketched below.
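
In outline, the extension looks roughly like this, assuming SageMakerAIModel can be imported from strands.models.sagemaker (the import path and constructor signature may differ across SDK versions, so treat this as a sketch rather than the SDK's definitive API):

import boto3

# Import path is an assumption; check your Strands Agents SDK version
from strands.models.sagemaker import SageMakerAIModel


class LlamaModelProvider(SageMakerAIModel):
    # Illustrative skeleton: adapts an SGLang endpoint's OpenAI-style
    # responses for Strands; Step 5 fills in the stream() logic.

    def __init__(self, endpoint_name, region_name, max_tokens=1000, temperature=0.7):
        # Constructor arguments are illustrative; align them with the
        # actual base-class signature in your installed SDK version
        self.endpoint_name = endpoint_name
        self.max_tokens = max_tokens
        self.temperature = temperature
        self.runtime_client = boto3.client("sagemaker-runtime", region_name=region_name)

    def stream(self, messages, tool_specs=None, system_prompt=None, **kwargs):
        # Translate the model server's responses into the events Strands
        # expects; the full implementation appears in Step 5
        raise NotImplementedError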

Implementation Overview

This blog will guide you through the process of building custom model parsers for Strands agents while deploying Llama 3.1 with SGLang on SageMaker using the awslabs/ml-container-creator tool.

Implementation Layers

Our implementation consists of three primary layers:

  1. Model Deployment Layer: Serving Llama 3.1 with SGLang to return OpenAI-compatible responses.
  2. Parser Layer: Creating a custom LlamaModelProvider class that extends SageMakerAIModel to handle Llama 3.1’s response format.
  3. Agent Layer: Developing a Strands agent that utilizes the custom provider for conversational AI, effectively parsing the model’s responses.

Step 1: Install ml-container-creator

We’ll begin by installing the necessary tools to create the serving container for our model.

# Install Yeoman globally
npm install -g yo

# Clone and install ml-container-creator
git clone https://github.com/awslabs/ml-container-creator
cd ml-container-creator
npm install && npm link

# Verify installation
yo --generators # Should show ml-container-creator

Step 2: Generate Deployment Project

After the installation, we can generate a deployment project featuring our selected model and serving framework.

# Run the generator
yo ml-container-creator

# Configuration options:
# - Framework: transformers
# - Model Server: sglang
# - Model: meta-llama/Llama-3.1-8B-Instruct
# - Deploy Target: codebuild
# - Instance Type: ml.g6.12xlarge (GPU)
# - Region: us-east-1

This will create a structured project with necessary components, such as the Dockerfile, build configuration, and deployment scripts.
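
The exact layout depends on the generator version, but the generated project looks roughly like this (buildspec.yml is an assumed name for the build configuration):

llama-31-deployment/
├── Dockerfile             # serving container definition
├── buildspec.yml          # CodeBuild configuration (name assumed)
└── deploy/
    ├── submit_build.sh    # submits the image build to CodeBuild
    └── deploy.sh          # creates the SageMaker real-time endpoint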

Step 3: Build and Deploy

Now, we can build and deploy the created container to SageMaker.

cd llama-31-deployment

# Build container with CodeBuild
./deploy/submit_build.sh

# Deploy to SageMaker
./deploy/deploy.sh arn:aws:iam::ACCOUNT:role/SageMakerExecutionRole

This process builds the Docker image, pushes it to Amazon Elastic Container Registry (ECR), and finally creates a real-time endpoint on SageMaker.
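
Deployment can take several minutes. One way to confirm the endpoint is ready is to poll its status with boto3, sketched here with a placeholder endpoint name matching the deployment above:

import boto3

# Poll the endpoint until its status reaches "InService"
sagemaker = boto3.client("sagemaker", region_name="us-east-1")
status = sagemaker.describe_endpoint(
    EndpointName="llama-31-deployment-endpoint"
)["EndpointStatus"]
print(status)  # "Creating" while deploying, "InService" when ready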

Step 4: Understanding the Response Format

Served through SGLang, Llama 3.1 returns responses in an OpenAI-compatible format, while Strands expects the Bedrock Messages API format. Here's an example of Llama's response:

{
  "id": "cmpl-abc123",
  "object": "chat.completion",
  "created": 1704067200,
  "model": "meta-llama/Llama-3.1-8B-Instruct",
  "choices": [{
    "index": 0,
    "message": {"role": "assistant", "content": "I'm doing well, thank you for asking!"},
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 23,
    "completion_tokens": 12,
    "total_tokens": 35
  }
}

With the difference in formats established, we need to implement a custom model parser to ensure smooth interaction.
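
To make the mapping concrete, here is a minimal sketch (not the Strands SDK's own conversion) that translates a non-streaming OpenAI-style completion like the one above into the Bedrock Messages shape, where content is a list of blocks:

def openai_to_bedrock_message(openai_response: dict) -> dict:
    # Map an OpenAI chat.completion payload to a Bedrock Messages-style dict
    choice = openai_response["choices"][0]
    return {
        "role": choice["message"]["role"],
        # Bedrock Messages carry content as a list of content blocks
        "content": [{"text": choice["message"]["content"]}],
    }

# Applied to the example above, this yields:
# {"role": "assistant", "content": [{"text": "I'm doing well, thank you for asking!"}]}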

Step 5: Implementing a Custom Model Parser

The following is a simplified implementation of the stream method in the custom model parser; comments flag where a production version would need more care:

import json
from typing import Any, Dict, List, Optional

def stream(self, messages: List[Dict[str, Any]], tool_specs: Optional[list] = None,
           system_prompt: Optional[str] = None, **kwargs):
    # Build the OpenAI-style messages payload
    payload_messages = []
    if system_prompt:
        payload_messages.append({"role": "system", "content": system_prompt})

    # Add conversation messages (Strands stores content as a list of blocks;
    # this simplified version takes only the first text block)
    for msg in messages:
        payload_messages.append({
            "role": msg.get("role", "user"),
            "content": msg["content"][0]["text"],
        })

    payload = {
        "messages": payload_messages,
        "max_tokens": kwargs.get("max_tokens", self.max_tokens),
        "temperature": kwargs.get("temperature", self.temperature),
        "stream": True,
    }

    try:
        response = self.runtime_client.invoke_endpoint_with_response_stream(
            EndpointName=self.endpoint_name,
            ContentType="application/json",
            Accept="application/json",
            Body=json.dumps(payload),
        )

        # The endpoint streams server-sent events: "data: {...}" lines, each
        # carrying an OpenAI chat.completion.chunk. A production parser would
        # buffer partial lines that span PayloadPart boundaries.
        for event in response["Body"]:
            chunk = event["PayloadPart"]["Bytes"].decode("utf-8")
            for line in chunk.splitlines():
                line = line.strip()
                if not line.startswith("data:"):
                    continue
                data = line[len("data:"):].strip()
                if data == "[DONE]":
                    return
                delta = json.loads(data)["choices"][0].get("delta", {})
                if delta.get("content"):
                    # Yield a Bedrock Converse-style stream event; match the
                    # exact event type your Strands SDK version expects
                    yield {"contentBlockDelta": {"delta": {"text": delta["content"]}}}
    except Exception as e:
        yield {
            "type": "error",
            "error": {
                "message": f"Endpoint invocation failed: {str(e)}",
                "type": "EndpointInvocationError",
            },
        }

This stream method lets the Strands agent consume the model's streaming output in the event format it expects.

Step 6: Initialize and Test Your Agent

Once the custom parser is implemented, initializing a Strands agent becomes straightforward:

from strands import Agent  # top-level import per the Strands Agents SDK

# Initialize the custom provider
provider = LlamaModelProvider(
    endpoint_name="llama-31-deployment-endpoint",
    region_name="us-east-1",
    max_tokens=1000,
    temperature=0.7,
)

# Create the agent
agent = Agent(
    name="llama-assistant",
    model=provider,
    system_prompt="You are a helpful AI assistant powered by Llama 3.1, deployed on Amazon SageMaker."
)

# Test the agent
response = agent("What are the key benefits of deploying LLMs on SageMaker?")
print(response)

The complete implementation, including a Jupyter notebook in the associated GitHub repository, offers detailed explanations and a hands-on path to building your own custom model parser.

Conclusion

Creating custom model parsers for Strands agents enables seamless integration of various LLM deployments on SageMaker, regardless of their response formats. By extending SageMakerAIModel and implementing the necessary parsing logic, organizations can leverage their chosen serving frameworks without sacrificing compatibility.

Key Takeaways

  • The awslabs/ml-container-creator tool simplifies bring-your-own-container (BYOC) model deployments on SageMaker.
  • Custom parsers are essential for bridging the gap between diverse model server response formats and Strands’ expectations.
  • The stream() method is a pivotal integration point for custom providers.

By following this guide, you're better equipped to deploy custom LLMs on SageMaker and integrate them with Strands agents, unlocking their potential in your applications.

About the Author

Dan Ferguson is a Sr. Solutions Architect at AWS, based in New York, USA. As a machine learning services expert, Dan supports customers in effectively integrating ML workflows to achieve sustainable solutions.
