Transforming AI Agents into Autonomous Workers with Persistent State Management
Introduction
Artificial Intelligence (AI) agents are rapidly evolving from simple chat interfaces into sophisticated autonomous workers capable of handling complex, time-intensive tasks across various sectors. As organizations increasingly deploy AI agents to train machine learning (ML) models, process large datasets, and run intricate simulations, a new standard for agent-server integration—the Model Context Protocol (MCP)—has emerged. However, a significant challenge persists: many of these operations can take minutes or hours to complete, far exceeding conventional session timeframes.
Imagine your AI agent initiating a multi-hour data processing job: you close your laptop, and days later you return to find the completed results waiting. This kind of seamless interaction requires managing task state across sessions. By using Amazon Bedrock AgentCore and Strands Agents for persistent state management, organizations can enable reliable task execution in production environments. But how can we achieve this?
Achieving Persistent Task Execution: An Overview
This blog post outlines a comprehensive approach to ensure seamless, cross-session task execution. We will:
- Introduce a context messaging strategy that maintains ongoing communication between servers and clients during extended operations.
- Develop an asynchronous task management framework for AI agents to initiate long-running processes without blocking other operations.
- Showcase how to combine these strategies with Amazon Bedrock AgentCore and Strands Agents for robust, production-ready AI agents.
Common Approaches for Handling Long-Running Tasks
When designing MCP servers for long-running tasks, a fundamental architectural decision arises: should the server maintain an active connection with real-time updates, or should it decouple task execution from the initial request? This decision leads to two distinct approaches: context messaging and asynchronous task management.
Using Context Messaging
The context messaging approach maintains continuous communication between the MCP server and client during task execution, using MCP’s built-in context object to send periodic updates to the client. This method works well for tasks expected to complete within 10–15 minutes, providing several advantages:
- Straightforward implementation
- No additional polling logic needed
- Minimal overhead
- Simple client integration
Using Asynchronous Task Management
In contrast, the asynchronous task management approach separates task initiation from execution and result retrieval. When invoked, the MCP tool immediately returns a task ID while the task runs in the background. This model is ideal for enterprise scenarios where tasks might run for hours and users need the flexibility to disconnect and reconnect. The benefits include:
- True fire-and-forget operation
- Support for long-running tasks (hours)
- Data loss prevention via persistent storage
- Resilience against network interruptions
Implementing Context Messaging
Context messaging serves as a solution for moderately long operations by keeping the connection active. For instance, if a data scientist uses an MCP server to train a complex ML model that takes 10–15 minutes, the server must ensure the connection doesn’t drop due to idle timeouts. Here’s the workflow:
from mcp.server.fastmcp import Context, FastMCP
import asyncio

mcp = FastMCP(host="0.0.0.0", stateless_http=True)

@mcp.tool()
async def model_training(model_name: str, epochs: int, ctx: Context) -> str:
    for i in range(epochs):
        progress = (i + 1) / epochs
        await asyncio.sleep(5)  # stand-in for one epoch of real training work
        # Each progress notification keeps the client connection alive
        await ctx.report_progress(progress=progress, total=1.0,
                                  message=f"Step {i + 1}/{epochs}")
    return f"{model_name} training completed."

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
In this code sample, the Context object enables progress updates during model training, effectively keeping the connection alive.
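On the client side, these updates surface through a callback. The following is a minimal sketch, assuming the server above runs locally on the default port and that your version of the mcp Python SDK supports the progress_callback parameter of call_tool; the tool arguments are illustrative.

import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def on_progress(progress: float, total: float | None, message: str | None) -> None:
    # Invoked for every notification the server sends via ctx.report_progress
    print(f"progress: {progress:.0%} {message or ''}")

async def main():
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "model_training",
                {"model_name": "demo-model", "epochs": 3},
                progress_callback=on_progress,
            )
            print(result.content[0].text)

asyncio.run(main())

Because the connection stays open for the duration of the call, the client sees progress in real time, but it must also stay connected until the tool returns.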
Limitations of Context Messaging
While context messaging has its benefits, it comes with limitations, including:
- Continuous connection required
- Resource consumption for open connections
- Vulnerability to network instability
For truly long-running operations, consider transitioning to asynchronous task management.
Implementing Asynchronous Task Management
The asynchronous task management pattern enables a "fire-and-forget" model, where tasks are initiated, processed in the background, and results can be checked later. The workflow includes:
- Task initiation: Client requests a task and receives a task ID.
- Background processing: Server executes the task without requiring an active client connection.
- Status checking: Clients can reconnect and check progress using the task ID.
- Result retrieval: Results can be fetched whenever needed.
from mcp.server.fastmcp import FastMCP
import asyncio
import uuid

mcp = FastMCP(host="0.0.0.0", stateless_http=True)

# In-memory registry mapping task IDs to status records
tasks = {}

async def _execute_model_training(task_id: str, model_name: str, epochs: int):
    for i in range(epochs):
        tasks[task_id]["progress"] = (i + 1) / epochs
        await asyncio.sleep(2)  # stand-in for one epoch of real training work
    tasks[task_id]["status"] = "completed"
    tasks[task_id]["result"] = f"{model_name} training completed."

@mcp.tool()
async def model_training(model_name: str, epochs: int = 10) -> str:
    # The tool must be async so create_task can schedule the background
    # coroutine on the server's running event loop
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "started", "progress": 0.0}
    asyncio.create_task(_execute_model_training(task_id, model_name, epochs))
    return f"Model training initiated with task ID: {task_id}."

@mcp.tool()
async def check_task_status(task_id: str) -> dict:
    return tasks.get(task_id, {"error": "Task not found"})

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
Tasks are stored in memory, allowing clients to check task status independently of the original request.
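To see the fire-and-forget flow end to end, here is a hypothetical client sketch (the server URL and argument values are assumed). Each tool call opens a fresh connection, so the status check works even though the session that started the task is long gone.

import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "http://localhost:8000/mcp"  # assumed local address of the server above

async def call_tool(name: str, args: dict) -> str:
    # Open a fresh connection per call, mimicking a client that disconnects between steps
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(name, args)
            return result.content[0].text

async def main():
    reply = await call_tool("model_training", {"model_name": "demo-model", "epochs": 5})
    task_id = reply.split("task ID: ")[1].rstrip(".")
    await asyncio.sleep(6)  # the client can go away entirely during this window
    print(await call_tool("check_task_status", {"task_id": task_id}))

asyncio.run(main())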
Limitations and Moving Toward Solutions
In-memory task management is fragile, however: if the server restarts, all task information is lost. Integrating external persistent storage, such as Amazon Bedrock AgentCore Memory, ensures task data survives server issues.
Amazon Bedrock AgentCore and Strands Agents Implementation
Persistent State Management
By integrating Amazon Bedrock AgentCore with Strands Agents, we can manage persistent states effectively. Here’s how the MCP server uses AgentCore Memory:
import asyncio
from bedrock_agentcore.memory import MemoryClient

agentcore_memory_client = MemoryClient()  # client for AgentCore Memory (configuration elided)

async def _execute_model_training(model_name: str, epochs: int, memory_id: str):
    for i in range(epochs):
        await asyncio.sleep(2)
        # Persist each update to AgentCore Memory instead of an in-memory dict
        response = agentcore_memory_client.create_event(memory_id=memory_id, ...)
This approach allows users to retrieve task results even after a disconnection by storing task outcomes directly in AgentCore Memory.
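To make the pattern more concrete, here is a hedged sketch of the background worker writing updates to AgentCore Memory; the memory ID, actor ID, and region are placeholders you would supply, and the exact parameters should be checked against the AgentCore Memory documentation. Each update becomes an event keyed by the task ID.

import asyncio
from bedrock_agentcore.memory import MemoryClient

memory_client = MemoryClient(region_name="us-east-1")  # assumed region
MEMORY_ID = "<your-memory-id>"  # an AgentCore Memory resource created beforehand

async def _execute_model_training(task_id: str, model_name: str, epochs: int):
    for i in range(epochs):
        await asyncio.sleep(2)  # stand-in for real training work
        # Record each progress update as an event; the task ID doubles as the
        # session ID, so any future session can look the task up again
        memory_client.create_event(
            memory_id=MEMORY_ID,
            actor_id="training-server",
            session_id=task_id,
            messages=[(f"progress: {(i + 1) / epochs:.0%}", "TOOL")],
        )
    memory_client.create_event(
        memory_id=MEMORY_ID,
        actor_id="training-server",
        session_id=task_id,
        messages=[(f"{model_name} training completed.", "TOOL")],
    )

A companion status tool could then call list_events with the same memory_id, actor_id, and session_id to reconstruct the task’s history from any session, even after a server restart.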
Workflow with Strands Agents
Integrating with Strands Agents enhances conversational context management. Users provide session identifiers for each interaction, facilitating a continuous experience even after disconnections.
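As a sketch of what this integration might look like, the following connects a Strands agent to the MCP server defined earlier. The URL is assumed, and FileSessionManager stands in for whichever session store you use; the same idea applies with an AgentCore Memory-backed session.

from strands import Agent
from strands.session.file_session_manager import FileSessionManager
from strands.tools.mcp import MCPClient
from mcp.client.streamable_http import streamablehttp_client

# Connect to the MCP server defined earlier (URL assumed)
mcp_client = MCPClient(lambda: streamablehttp_client("http://localhost:8000/mcp"))

with mcp_client:
    agent = Agent(
        tools=mcp_client.list_tools_sync(),
        # Reusing the same session_id across runs restores the conversation,
        # including any task IDs the agent was told about before disconnecting
        session_manager=FileSessionManager(session_id="user-42"),
    )
    agent("Start training demo-model for 5 epochs and tell me the task ID.")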
Conclusion
In this post, we explored practical approaches for AI agents to manage long-running tasks effectively. By leveraging context messaging and asynchronous task management combined with persistent state management, organizations can build reliable AI agents capable of performing complex tasks without losing data or frustrating users.
We encourage you to try implementing these strategies in your own AI projects. Think about the enhancements they could bring to your AI assistants and how they could transform user experiences.
To further enhance your understanding, check out the official Amazon Bedrock AgentCore documentation and explore the sample notebooks.
About the Authors
Haochen Xie, Flora Wang, Yuan Tian, and Hari Prasanna Das are experts at the AWS Generative AI Innovation Center, focusing on making generative AI solutions robust and user-friendly across various industries.