Transforming AI Agents into Autonomous Workers with Persistent State Management
Introduction
Artificial Intelligence (AI) agents are rapidly evolving from simple chat interfaces into sophisticated autonomous workers capable of handling complex, time-intensive tasks across various sectors. As organizations increasingly deploy AI agents to train machine learning (ML) models, process large datasets, and run intricate simulations, a new standard for agent-server integration—the Model Context Protocol (MCP)—has emerged. However, a significant challenge persists: many of these operations can take minutes or hours to complete, far exceeding conventional session timeframes.
Imagine your AI agent initiating a multi-hour data processing job: you close your laptop, and days later you return to find the completed results waiting. This kind of seamless interaction requires managing task state across sessions. By using Amazon Bedrock AgentCore and Strands Agents for persistent state management, organizations can enable reliable task execution in production environments. But how can we achieve this?
Achieving Persistent Task Execution: An Overview
This blog post outlines a comprehensive approach to ensure seamless, cross-session task execution. We will:
- Introduce a context messaging strategy that maintains ongoing communication between servers and clients during extended operations.
- Develop an asynchronous task management framework for AI agents to initiate long-running processes without blocking other operations.
- Showcase how to combine these strategies with Amazon Bedrock AgentCore and Strands Agents for robust, production-ready AI agents.
Common Approaches for Handling Long-Running Tasks
When designing MCP servers for long-running tasks, a fundamental architectural decision arises: should the server maintain an active connection with real-time updates, or should it decouple task execution from the initial request? This decision leads to two distinct approaches: context messaging and asynchronous task management.
Using Context Messaging
The context messaging approach maintains continuous communication between the MCP server and client during task execution, using MCP’s built-in context object to send periodic updates to the client. This method works well for tasks expected to complete within 10–15 minutes, providing several advantages:
- Straightforward implementation
- No additional polling logic needed
- Minimal overhead
- Simple client integration
Using Asynchronous Task Management
In contrast, the asynchronous task management approach separates task initiation from execution and result retrieval. When invoked, the MCP tool immediately returns a task ID while the task runs in the background. This model is ideal for enterprise scenarios where tasks might run for hours and users need the flexibility to disconnect and reconnect. The benefits include:
- True fire-and-forget operation
- Support for long-running tasks (hours)
- Data loss prevention via persistent storage
- Resilience against network interruptions
Implementing Context Messaging
Context messaging serves as a solution for moderately long operations by keeping the connection active. For instance, if a data scientist uses an MCP server to train a complex ML model that takes 10–15 minutes, the server must ensure the connection doesn’t drop due to idle timeouts. Here’s the workflow:
from mcp.server.fastmcp import Context, FastMCP
import asyncio

mcp = FastMCP(host="0.0.0.0", stateless_http=True)

@mcp.tool()
async def model_training(model_name: str, epochs: int, ctx: Context) -> str:
    for i in range(epochs):
        progress = (i + 1) / epochs
        await asyncio.sleep(5)  # stand-in for one epoch of real training work
        # Each progress notification keeps the client connection alive
        await ctx.report_progress(progress=progress, total=1.0,
                                  message=f"Step {i + 1}/{epochs}")
    return f"{model_name} training completed."

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
In this code sample, the Context object enables progress updates during model training, effectively keeping the connection alive.
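On the client side, these updates surface through a callback. The following is a minimal sketch, assuming the server above runs locally on the default port and that your version of the mcp Python SDK supports the progress_callback parameter of call_tool; the tool arguments are illustrative.

import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

async def on_progress(progress: float, total: float | None, message: str | None) -> None:
    # Invoked for every notification the server sends via ctx.report_progress
    print(f"progress: {progress:.0%} {message or ''}")

async def main():
    async with streamablehttp_client("http://localhost:8000/mcp") as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(
                "model_training",
                {"model_name": "demo-model", "epochs": 3},
                progress_callback=on_progress,
            )
            print(result.content[0].text)

asyncio.run(main())

Because the connection stays open for the duration of the call, the client sees progress in real time, but it must also stay connected until the tool returns.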
Limitations of Context Messaging
While context messaging has its benefits, it comes with limitations, including:
- Continuous connection required
- Resource consumption for open connections
- Vulnerability to network instability
For truly long-running operations, consider transitioning to asynchronous task management.
Implementing Asynchronous Task Management
The asynchronous task management pattern enables a "fire-and-forget" model, where tasks are initiated, processed in the background, and results can be checked later. The workflow includes:
- Task initiation: Client requests a task and receives a task ID.
- Background processing: Server executes the task without requiring an active client connection.
- Status checking: Clients can reconnect and check progress using the task ID.
- Result retrieval: Results can be fetched whenever needed.
from mcp.server.fastmcp import FastMCP
import asyncio
import uuid

mcp = FastMCP(host="0.0.0.0", stateless_http=True)

# In-memory registry mapping task IDs to status records
tasks = {}

async def _execute_model_training(task_id: str, model_name: str, epochs: int):
    for i in range(epochs):
        tasks[task_id]["progress"] = (i + 1) / epochs
        await asyncio.sleep(2)  # stand-in for one epoch of real training work
    tasks[task_id]["status"] = "completed"
    tasks[task_id]["result"] = f"{model_name} training completed."

@mcp.tool()
async def model_training(model_name: str, epochs: int = 10) -> str:
    # The tool must be async so create_task can schedule the background
    # coroutine on the server's running event loop
    task_id = str(uuid.uuid4())
    tasks[task_id] = {"status": "started", "progress": 0.0}
    asyncio.create_task(_execute_model_training(task_id, model_name, epochs))
    return f"Model training initiated with task ID: {task_id}."

@mcp.tool()
async def check_task_status(task_id: str) -> dict:
    return tasks.get(task_id, {"error": "Task not found"})

if __name__ == "__main__":
    mcp.run(transport="streamable-http")
Tasks are stored in memory, allowing clients to check task status independently of the original request.
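To see the fire-and-forget flow end to end, here is a hypothetical client sketch (the server URL and argument values are assumed). Each tool call opens a fresh connection, so the status check works even though the session that started the task is long gone.

import asyncio
from mcp import ClientSession
from mcp.client.streamable_http import streamablehttp_client

SERVER_URL = "http://localhost:8000/mcp"  # assumed local address of the server above

async def call_tool(name: str, args: dict) -> str:
    # Open a fresh connection per call, mimicking a client that disconnects between steps
    async with streamablehttp_client(SERVER_URL) as (read, write, _):
        async with ClientSession(read, write) as session:
            await session.initialize()
            result = await session.call_tool(name, args)
            return result.content[0].text

async def main():
    reply = await call_tool("model_training", {"model_name": "demo-model", "epochs": 5})
    task_id = reply.split("task ID: ")[1].rstrip(".")
    await asyncio.sleep(6)  # the client can go away entirely during this window
    print(await call_tool("check_task_status", {"task_id": task_id}))

asyncio.run(main())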
Limitations and Moving Toward Solutions
In-memory task management is fragile, however: if the server restarts, all task information is lost. Integrating external persistent storage, such as Amazon Bedrock AgentCore Memory, ensures task data survives server issues.
Amazon Bedrock AgentCore and Strands Agents Implementation
Persistent State Management
By integrating Amazon Bedrock AgentCore with Strands Agents, we can manage persistent states effectively. Here’s how the MCP server uses AgentCore Memory:
import asyncio
from bedrock_agentcore.memory import MemoryClient

agentcore_memory_client = MemoryClient()  # client for AgentCore Memory (configuration elided)

async def _execute_model_training(model_name: str, epochs: int, memory_id: str):
    for i in range(epochs):
        await asyncio.sleep(2)
        # Persist each update to AgentCore Memory instead of an in-memory dict
        response = agentcore_memory_client.create_event(memory_id=memory_id, ...)
This approach allows users to retrieve task results even after a disconnection by storing task outcomes directly in AgentCore Memory.
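To make the pattern more concrete, here is a hedged sketch of the background worker writing updates to AgentCore Memory; the memory ID, actor ID, and region are placeholders you would supply, and the exact parameters should be checked against the AgentCore Memory documentation. Each update becomes an event keyed by the task ID.

import asyncio
from bedrock_agentcore.memory import MemoryClient

memory_client = MemoryClient(region_name="us-east-1")  # assumed region
MEMORY_ID = "<your-memory-id>"  # an AgentCore Memory resource created beforehand

async def _execute_model_training(task_id: str, model_name: str, epochs: int):
    for i in range(epochs):
        await asyncio.sleep(2)  # stand-in for real training work
        # Record each progress update as an event; the task ID doubles as the
        # session ID, so any future session can look the task up again
        memory_client.create_event(
            memory_id=MEMORY_ID,
            actor_id="training-server",
            session_id=task_id,
            messages=[(f"progress: {(i + 1) / epochs:.0%}", "TOOL")],
        )
    memory_client.create_event(
        memory_id=MEMORY_ID,
        actor_id="training-server",
        session_id=task_id,
        messages=[(f"{model_name} training completed.", "TOOL")],
    )

A companion status tool could then call list_events with the same memory_id, actor_id, and session_id to reconstruct the task’s history from any session, even after a server restart.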
Workflow with Strands Agents
Integrating with Strands Agents enhances conversational context management. Users provide session identifiers for each interaction, facilitating a continuous experience even after disconnections.
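As a sketch of what this integration might look like, the following connects a Strands agent to the MCP server defined earlier. The URL is assumed, and FileSessionManager stands in for whichever session store you use; the same idea applies with an AgentCore Memory-backed session.

from strands import Agent
from strands.session.file_session_manager import FileSessionManager
from strands.tools.mcp import MCPClient
from mcp.client.streamable_http import streamablehttp_client

# Connect to the MCP server defined earlier (URL assumed)
mcp_client = MCPClient(lambda: streamablehttp_client("http://localhost:8000/mcp"))

with mcp_client:
    agent = Agent(
        tools=mcp_client.list_tools_sync(),
        # Reusing the same session_id across runs restores the conversation,
        # including any task IDs the agent was told about before disconnecting
        session_manager=FileSessionManager(session_id="user-42"),
    )
    agent("Start training demo-model for 5 epochs and tell me the task ID.")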
Conclusion
In this post, we explored practical approaches for AI agents to manage long-running tasks effectively. By leveraging context messaging and asynchronous task management combined with persistent state management, organizations can build reliable AI agents capable of performing complex tasks without losing data or frustrating users.
We encourage you to try implementing these strategies in your own AI projects. Think about the enhancements they could bring to your AI assistants and how they could transform user experiences.
To further enhance your understanding, check out the official Amazon Bedrock AgentCore documentation and explore the sample notebooks.
About the Authors
Haochen Xie, Flora Wang, Yuan Tian, and Hari Prasanna Das are experts at the AWS Generative AI Innovation Center, focusing on making generative AI solutions robust and user-friendly across various industries.