Evaluating AI Agents: Moving from Prototypes to Production with Strands Evals
Moving AI agents from prototyping to production presents challenges that traditional software testing simply cannot address. The inherent flexibility, adaptability, and context-awareness of these agents make them powerful but also difficult to evaluate systematically. Traditional software testing methodologies depend on deterministic outputs, which means that for the same input, the output should always be the same. AI agents, however, break this mold; they generate natural language responses and make context-dependent decisions, leading to varied outputs even from identical queries.
So, how do we evaluate something as unpredictable as an AI agent? In this post, we’ll explore how Strands Evals enables systematic evaluation of AI agents, providing a structured framework for monitoring their performance.
Why Evaluating AI Agents is Different
Let’s take a simple example—asking an AI agent, “What is the weather like in Tokyo?” There are numerous valid responses; the agent might give the temperature in Celsius or Fahrenheit, detail humidity and wind conditions, or simply mention the current weather. The challenge is that many different responses can all be correct, and judging which one is best is inherently subjective. Traditional assertion-based testing fails here because an exact-match check cannot accommodate this variability.
Moreover, AI agents take actions—they retrieve information, call tools, and make decisions in real time. Simply evaluating the final output ignores the intermediate steps that led to that conclusion. Correctness becomes somewhat subjective and multifaceted. Even a factually accurate answer might not be helpful or align with the user’s needs.
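Checking those intermediate steps can start very simply: verify that the expected tools appear, in order, within the tools the agent actually called. The sketch below is a framework-free illustration of that idea, not a Strands Evals API:

```python
def trajectory_matches(expected: list, actual: list) -> bool:
    """True if `expected` appears as an ordered subsequence of `actual`.

    Extra tool calls in between are allowed; only order and presence
    of the expected calls are checked.
    """
    it = iter(actual)
    # Membership tests on an iterator consume it, so each expected tool
    # must be found after the previously matched one.
    return all(tool in it for tool in expected)
```

For example, an agent that called `["search", "weather_api"]` still satisfies an expected trajectory of `["weather_api"]`, while swapping the order of two required calls would fail.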
Conversations with AI agents further complicate evaluation; they unfold over multiple exchanges, meaning early responses impact later ones. An agent might excel at single-turn interactions yet fail to maintain coherent context across dialogues.
The Need for Judgment-Based Evaluation
This is where frameworks like Strands Evals come in. They allow for nuanced evaluation of AI agents by employing large language models (LLMs) as evaluators, which can assess qualities such as helpfulness, coherence, and faithfulness—qualities that aren’t easily quantifiable with exact-match checks.
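To make the LLM-as-judge pattern concrete, here is a minimal, framework-free sketch: a rubric is rendered into a prompt for the judge model, and the judge’s structured reply is parsed and validated. The helper names (`build_judge_prompt`, `parse_verdict`) and the rubric wording are illustrative, not part of the Strands Evals API:

```python
import json

# Hypothetical rubric; real criteria would be tailored to your agent.
RUBRIC = """Rate the agent's answer on a 1-5 scale for each criterion:
- helpfulness: does it address the user's actual need?
- coherence: is it internally consistent and well organized?
- faithfulness: is every claim grounded in the provided context?
Return JSON: {"helpfulness": n, "coherence": n, "faithfulness": n}"""

def build_judge_prompt(question: str, answer: str) -> str:
    """Assemble the prompt sent to the judge LLM."""
    return f"{RUBRIC}\n\nQuestion: {question}\nAnswer: {answer}"

def parse_verdict(raw: str) -> dict:
    """Parse the judge's JSON reply and validate the score range."""
    scores = json.loads(raw)
    if not all(1 <= v <= 5 for v in scores.values()):
        raise ValueError("scores must be between 1 and 5")
    return scores
```

The key design point is that the judge returns structured scores rather than free text, so results can be aggregated across many test cases.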
Core Concepts of Strands Evals
Strands Evals introduces a framework that will feel familiar to anyone experienced in writing unit tests, but is tailored for AI agents’ requirements. It operates on three foundational concepts:
- Cases: These represent individual test scenarios, complete with the input and optional expected outputs and tool trajectories.
- Experiments: These bundle multiple cases along with one or more evaluators, representing the orchestration of the evaluation process.
- Evaluators: These are the judges that assess the output generated by the agent, leveraging LLMs to offer nuanced evaluations rather than simple assertion checks.
Foundation of Evaluation
- Case: Represents a single evaluation scenario.

```python
from strands_evals import Case

case = Case(
    name="Weather Query",
    input="What is the weather like in Tokyo?",
    expected_output="Should include temperature and conditions",
    expected_trajectory=["weather_api"],
)
```

- Experiment: A collection of cases intended to be evaluated together.
- Evaluators: These provide context-aware judgments about the agent’s performance.
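The three concepts compose in a straightforward way: an experiment runs every case through a task and scores each output with every evaluator. The dataclasses below are simplified stand-ins for illustration—they mirror the field names used above but are not the real Strands Evals classes:

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Case:
    """Simplified stand-in for a single evaluation scenario."""
    name: str
    input: str
    expected_output: str = ""
    expected_trajectory: list = field(default_factory=list)

@dataclass
class Experiment:
    """Simplified stand-in: bundles cases with evaluators."""
    cases: list
    evaluators: list  # callables: (case, output) -> score

    def run(self, task: Callable) -> dict:
        """Run every case through the task, then score with every evaluator."""
        report = {}
        for case in self.cases:
            output = task(case)
            report[case.name] = {
                ev.__name__: ev(case, output) for ev in self.evaluators
            }
        return report
```

A toy run might pair a keyword-checking evaluator with a canned task; the real framework swaps in LLM-backed evaluators and a live agent without changing this overall shape.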
Connecting Agents to Evaluation with Task Functions
So how do agents actually connect to this evaluation framework? Enter the task function: a callable that receives a Case, runs it through the agent, and returns the result. It supports two evaluation patterns:
- Online Evaluation: Involves live invocation of the agent.

```python
from strands import Agent

def online_task(case):
    agent = Agent(tools=[search_tool, calculator_tool])
    result = agent(case.input)
    return {"output": str(result), "trajectory": agent.session}
```

- Offline Evaluation: Works with historical data by retrieving previously recorded traces.

```python
def offline_task(case):
    trace = load_trace_from_database(case.session_id)
    session = session_mapper.map_to_session(trace)
    return {"output": extract_final_response(trace), "trajectory": session}
```
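The offline pattern depends on helpers that pull the final answer and the tool trajectory out of a stored trace. Assuming traces are stored as ordered event lists (a common observability shape, not a Strands Evals requirement), those helpers might look like this:

```python
# Assumed trace shape: a list of event dicts such as
#   {"type": "user", "content": ...}
#   {"type": "tool_call", "tool": ...}
#   {"type": "assistant", "content": ...}
# Both helper names are illustrative, not framework APIs.

def extract_final_response(trace: list) -> str:
    """Return the last assistant message recorded in the trace."""
    replies = [e["content"] for e in trace if e["type"] == "assistant"]
    return replies[-1] if replies else ""

def extract_trajectory(trace: list) -> list:
    """Return the ordered tool names the agent invoked."""
    return [e["tool"] for e in trace if e["type"] == "tool_call"]
```

Because nothing here invokes a live agent, offline evaluation can replay thousands of historical sessions cheaply and deterministically.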
Built-in Evaluators for Comprehensive Assessment
Strands Evals comes with ten built-in evaluators covering different quality dimensions:
- Rubric-based Evaluators: Define custom criteria through natural language rubrics.
- Semantic Evaluators: Evaluate common dimensions like helpfulness, faithfulness, and harmfulness.
- Tool-level Evaluators: Assess individual tool invocations.
- Session-level Evaluators: Look at entire conversation sessions to gauge overall goal achievement.
Choosing the Right Evaluators
Your evaluator choice should match your specific needs. For a customer service agent, you might prioritize helpfulness and goal success, while a research assistant may emphasize faithfulness.
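One simple way to encode those priorities is a weighted aggregate over evaluator scores. The profiles and weights below are purely illustrative assumptions, not Strands Evals features:

```python
# Hypothetical per-use-case weightings over evaluator dimensions.
PROFILES = {
    "customer_service": {"helpfulness": 0.5, "goal_success": 0.4, "faithfulness": 0.1},
    "research_assistant": {"faithfulness": 0.6, "helpfulness": 0.3, "coherence": 0.1},
}

def aggregate(scores: dict, profile: str) -> float:
    """Weighted average of evaluator scores for the chosen agent profile.

    `scores` maps evaluator name -> score in [0, 1]; missing dimensions
    raise KeyError so gaps in coverage surface loudly.
    """
    weights = PROFILES[profile]
    return sum(scores[name] * w for name, w in weights.items())
```

Tuning the weights makes the trade-off explicit: the same agent output can pass a customer-service bar while failing a research-assistant one.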
Simulating Users for Multi-Turn Testing
Real conversations aren’t scripted; they often involve follow-up questions, confusion, and unexpected directions. Strands Evals includes an ActorSimulator to drive multi-turn conversations, thereby testing your agent in scenarios that mimic actual user interactions.
```python
from strands_evals import Case, ActorSimulator
from strands import Agent

case = Case(
    input="I need help setting up a new bank account",
    metadata={"task_description": "Successfully open a checking account"},
)
user_sim = ActorSimulator.from_case_for_user_simulator(case=case, max_turns=10)
agent = Agent(system_prompt="You are a helpful banking assistant.")

user_message = case.input
while user_sim.has_next():
    agent_response = agent(user_message)
    user_result = user_sim.act(str(agent_response))
    user_message = str(user_result.structured_output.message)
```
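To see why this loop terminates and how the turn-taking works, here is a toy stand-in with the same `has_next()`/`act()` surface. The real ActorSimulator is LLM-driven; this stub just replays scripted user turns and is purely illustrative:

```python
class StubSimulator:
    """Minimal stand-in mirroring the has_next()/act() loop contract."""

    def __init__(self, scripted: list):
        self.scripted = scripted  # canned user utterances, one per turn
        self.turn = 0

    def has_next(self) -> bool:
        """More turns remain while scripted utterances are unconsumed."""
        return self.turn < len(self.scripted)

    def act(self, agent_response: str) -> str:
        """Consume one turn and return the simulated user's next message."""
        reply = self.scripted[self.turn]
        self.turn += 1
        return reply

sim = StubSimulator([
    "I want a checking account",
    "What documents do I need?",
])
transcript = []
message = "Hi, how can I help?"  # stands in for the agent's opening reply
while sim.has_next():
    transcript.append(message)
    message = sim.act(message)  # user's next utterance drives the next turn
```

The loop runs exactly once per scripted turn, which is the same bounded behavior `max_turns` enforces in the real simulator.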
Conclusion
As AI agents transition from prototypes to live environments, the structured evaluation provided by frameworks like Strands Evals will become increasingly vital. By allowing nuanced, multi-granular judgments of agent performance, Strands Evals empowers developers to create more robust, context-aware AI systems.
By combining online and offline evaluation, task functions, and a diverse set of evaluators, teams can make agent quality measurable and keep it monitored over time.
Stay tuned to further explore the potential of Strands Evals in refining your AI agents and enhancing user experiences.
About the Authors
- Ishan Singh: Senior Applied Scientist at Amazon Web Services specializing in generative AI solutions.
- Akarsha Sehwag: Generative AI Data Scientist focused on enterprise-level solutions in generative AI.
- Po-Shin Chen: Software Developer dedicated to building agentic AI evaluation frameworks.
- Jonathan Buck: Senior Software Engineer working on agent environments and evaluation infrastructures.
- Smeet Dhakecha: Research Engineer specializing in agent simulations and evaluation systems.
For further reading, please check out the Strands Evals repository to get practical examples and start integrating systematic evaluations into your AI development workflow.