Evaluating AI Agents: A Comprehensive Guide to Reliable Assessment
This post was co-authored with Karan Singh, Head of Partnerships at LangChain.
Understanding the Challenges of AI Agent Evaluation
Validating AI agent behavior before production is one of the hardest problems in applied AI. Agents are non-deterministic, multi-step systems where errors in early steps can affect downstream results.
The Path to Improvement: LangSmith on AWS
LangSmith on AWS provides an evaluation framework to identify issues early, track them in production, and improve the reliability of your agents over their lifecycle.
Learnings from Industry Leaders
This post synthesizes insights from LangChain’s work on evaluating deep agents and Anthropic’s guide to demystifying evaluations for AI agents, offering a practical guide on how to apply five evaluation patterns for deep agents.
Key Takeaways: Evaluation Patterns and Frameworks
You will learn how to:
- Implement offline evaluations with pytest and LangSmith.
- Set up online monitoring for production.
- Evaluate a text-to-SQL deep agent using Amazon Bedrock.
Introducing Amazon Nova 2 Lite
Amazon Nova 2 Lite is an adaptable reasoning model tailored for agentic workloads, supporting various input types with a 1 million-token context window.
The Structure of Agent Evaluations
An evaluation is a test for an AI system. Because agent behavior is complex, evaluating it well requires precise terminology and an understanding of the challenges that agents add.
Validating AI Agents: A Comprehensive Approach with LangSmith on AWS
Validating AI agent behavior before production is one of the most challenging problems in applied AI. Agents are non-deterministic and operate through multi-step processes; errors in early steps can significantly affect downstream results. Even a single bad tool call can cascade through an entire workflow, leading to undesirable outcomes. Fortunately, LangSmith on AWS provides an evaluation framework designed to catch these issues early, track them in production, and continuously improve the agent’s reliability throughout its lifecycle.
This post synthesizes learnings from LangChain’s work on evaluating deep agents and insights from Anthropic’s guide to demystifying evals for AI agents. You’ll learn how to:
- Apply five evaluation patterns for deep agents.
- Build offline evaluations using pytest and LangSmith.
- Configure online monitoring for production.
Our walkthrough will focus on a text-to-SQL deep agent integrated with Amazon Bedrock to showcase the complete development-to-production lifecycle.
Understanding Evaluation Structures
An evaluation tests the performance of an AI system: it gives the system an input, applies grading logic to the output, and measures success. This is straightforward for a single large language model (LLM) call, but it becomes much harder for agents, where tool calls, reasoning steps, and intermediate results all influence the outcome.
Key Terminology
- Task: A single test with defined inputs and success criteria. E.g., asking, “How many customers are from Canada?” with the expected answer being eight.
- Trial: A single attempt at a task. Because agent outputs vary between runs, a task is usually run for several trials to get a reliable result.
- Grader: Logic that evaluates and scores performance. Tasks can have multiple graders for different dimensions.
- Transcript: The complete record of a trial, including tool calls, reasoning steps, and intermediate results. LangSmith provides full traceability for debugging.
- Outcome: The final state of the environment after a trial—did the agent execute the correct SQL query against the database?
- Evaluation Harness: The infrastructure that conducts evaluations end-to-end.
- Evaluation Suite: A collection of tasks targeting specific agent capabilities or behaviors.
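The terms above can be sketched as a few small data structures. This is a minimal illustration, not LangSmith's data model: the names Task, Trial, run_task, and fake_agent are hypothetical, the SQL text is a placeholder, and the stand-in agent always returns the same answer so the sketch runs offline.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Trial:
    """One attempt at a task: its transcript, outcome, and grader scores."""
    transcript: list[dict]
    outcome: str
    scores: dict[str, float] = field(default_factory=dict)


@dataclass
class Task:
    """A single test: an input plus the graders that define success."""
    question: str
    graders: dict[str, Callable[[Trial], float]]


def run_task(task: Task, agent: Callable[[str], Trial], n_trials: int = 3) -> list[Trial]:
    """A tiny harness: run several trials of one task and grade each one."""
    trials = []
    for _ in range(n_trials):
        trial = agent(task.question)
        trial.scores = {name: grade(trial) for name, grade in task.graders.items()}
        trials.append(trial)
    return trials


def fake_agent(question: str) -> Trial:
    """Hypothetical stand-in agent; a real agent would call tools and an LLM."""
    return Trial(
        transcript=[{"tool": "sql_db_query",
                     "args": {"query": "SELECT COUNT(*) FROM customers WHERE country = 'Canada'"}}],
        outcome="8",
    )


task = Task(
    question="How many customers are from Canada?",
    graders={
        # Two graders scoring different dimensions of the same trial
        "correct_answer": lambda t: 1.0 if t.outcome == "8" else 0.0,
        "used_sql_tool": lambda t: 1.0 if any(
            step["tool"] == "sql_db_query" for step in t.transcript) else 0.0,
    },
)
trials = run_task(task, fake_agent)
```

In this framing, an evaluation suite is simply a list of Task objects that target one capability, and the harness is whatever loops over them.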
Why Agent Evaluations are Harder
Three factors complicate agent evaluations compared to straightforward LLM outputs:
- Non-Determinism: Agent behavior can vary. A single pass/fail result is insufficient, and you may need multiple trials to gauge actual performance.
- Error Propagation: Mistakes made in earlier steps can cascade, highlighting the need for evaluating individual steps rather than just the final output.
- Creative Solutions: Advanced models may find valid solutions that the evaluation’s designers did not anticipate, so overly rigid graders can fail correct answers.
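To make the non-determinism point concrete, here is a small sketch of how multiple trials can be summarized. The function names are hypothetical, and the two derived metrics assume trials are independent with the same pass probability, which is a simplification for real agents.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of trials that passed."""
    if not results:
        raise ValueError("need at least one trial")
    return sum(results) / len(results)


def pass_at_k(p: float, k: int) -> float:
    """Chance that at least one of k independent trials passes."""
    return 1 - (1 - p) ** k


def pass_all_k(p: float, k: int) -> float:
    """Chance that all k independent trials pass (a consistency measure)."""
    return p ** k


# Illustrative results from four trials of the same task
results = [True, True, False, True]
p = pass_rate(results)           # 0.75
at_least_once = pass_at_k(p, 3)  # 0.984375
every_time = pass_all_k(p, 3)    # 0.421875
```

The gap between the last two numbers is the reason a single pass/fail result can mislead: an agent that usually succeeds on some attempt may still fail often for a user who only gets one attempt.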
What You Can Evaluate
You can test three categories for an agent run:
- Trajectory: The sequence of tools called and the arguments generated by the agent.
- Final Response: The output returned to the user, assessing correctness and formatting.
- Other State: Additional artifacts the agent produced, like files or intermediate results.
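As a rough sketch, the trajectory and final response can be pulled out of an agent's message list as shown here. The message shape (a tool_calls list of dicts with name and args keys) and the tool name sql_db_list_tables are assumptions for illustration; the helper is a hypothetical implementation, not the one from the companion repository.

```python
def extract_tool_calls(messages: list) -> list[dict]:
    """Collect tool calls, in order, from a list of chat messages."""
    calls = []
    for msg in messages:
        # Accept plain dicts or message objects exposing a .tool_calls attribute
        if isinstance(msg, dict):
            tool_calls = msg.get("tool_calls") or []
        else:
            tool_calls = getattr(msg, "tool_calls", None) or []
        for tc in tool_calls:
            calls.append({"name": tc["name"], "args": tc.get("args", {})})
    return calls


def final_response(messages: list) -> str:
    """Return the content of the last message, which is what the user sees."""
    last = messages[-1]
    return last["content"] if isinstance(last, dict) else last.content


# Illustrative transcript of one agent run
messages = [
    {"role": "user", "content": "How many customers are from Canada?"},
    {"role": "assistant", "content": "",
     "tool_calls": [{"name": "sql_db_list_tables", "args": {}}]},
    {"role": "assistant", "content": "",
     "tool_calls": [{"name": "sql_db_query", "args": {"query": "SELECT ..."}}]},
    {"role": "assistant", "content": "There are 8 customers from Canada."},
]

trajectory = [tc["name"] for tc in extract_tool_calls(messages)]
answer = final_response(messages)
```

Trajectory checks then become assertions over the trajectory list, and final-response checks become assertions or judge calls over answer.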
Evaluation Patterns for AI Agents
To effectively evaluate agents, combining different types of graders is crucial. Here are three primary grader types:
1. Code-Based Graders
Code-based graders use deterministic logic, such as string matching or regex patterns, to verify specific conditions. They are fast, cheap, and easy to debug, but they can be brittle when valid outputs vary in wording or format.
Example:
# Check if the agent executed a SQL query
tool_names = [tc["name"] for tc in tool_calls]
assert "sql_db_query" in tool_names, "Agent must execute sql_db_query"
2. Model-Based Graders (LLM-as-Judge)
Employ another LLM to evaluate the agent’s output. They can capture nuance and complexity but are non-deterministic and more resource-intensive.
Example:
import json

# Literal JSON braces are doubled so str.format() only fills in {answer}
rubric = """Score the agent's answer on these dimensions (0.0 to 1.0):
1. correctness: Does it identify the right top employee?
2. completeness: Does it include revenue broken down by country?
3. clarity: Is the answer well-formatted and easy to understand?
Return JSON: {{"correctness": float, "completeness": float, "clarity": float}}

Agent answer:
{answer}"""
# `model` is the judge LLM client and `answer` is the agent's final response
judge_response = model.invoke(rubric.format(answer=answer))
scores = json.loads(judge_response.content)
3. Human Graders
Human review is considered the gold standard for subjective quality assessments, though it is slow and expensive compared to programmatic evaluators. It is essential for calibrating LLM-based grading.
Combining Graders: Practical Recommendations
For a text-to-SQL agent evaluation, use a mix of grading types:
- Code-based: Verify tool calls and basic correctness.
- LLM-as-judge: Assess complex queries requiring nuanced interpretation.
- Human: Conduct periodic spot-checks to calibrate LLM grading.
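One simple way to use human spot-checks for calibration is to measure how often the LLM judge agrees with the human verdict on the same samples. The sketch below assumes binary pass/fail labels; the function names and sample IDs are hypothetical.

```python
def agreement_rate(judge: list[bool], human: list[bool]) -> float:
    """Fraction of samples where the LLM judge matches the human label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def disagreements(ids: list[str], judge: list[bool], human: list[bool]) -> list[str]:
    """IDs of samples to review when refining the judge rubric."""
    return [i for i, j, h in zip(ids, judge, human) if j != h]


# Illustrative spot-check of five graded answers
sample_ids = ["q1", "q2", "q3", "q4", "q5"]
judge_labels = [True, True, False, True, False]
human_labels = [True, False, False, True, False]

rate = agreement_rate(judge_labels, human_labels)                 # 0.8
to_review = disagreements(sample_ids, judge_labels, human_labels)  # ["q2"]
```

A low agreement rate suggests revising the rubric before trusting the judge's scores at scale.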
Capability vs. Regression Evaluations
There are two primary evaluation types:
- Capability Evaluations: Explore what an agent can do well.
- Regression Evaluations: Confirm that the agent still handles tasks it previously handled. These suites should hold a pass rate near 100%.
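The difference in how the two types are interpreted can be sketched as a release gate. The SuiteResult class, the suite names, and the 0.98 threshold are all assumptions chosen for illustration; pick a threshold that fits your own tolerance.

```python
from dataclasses import dataclass


@dataclass
class SuiteResult:
    name: str
    kind: str  # "capability" or "regression"
    passed: int
    total: int

    @property
    def pass_rate(self) -> float:
        return self.passed / self.total


# Assumed stand-in for "nearly 100%"
REGRESSION_THRESHOLD = 0.98


def blocks_release(result: SuiteResult) -> bool:
    """Only regression suites gate a release; capability suites are informational."""
    return result.kind == "regression" and result.pass_rate < REGRESSION_THRESHOLD


results = [
    SuiteResult("basic_counts", "regression", passed=50, total=50),
    SuiteResult("joins", "regression", passed=45, total=50),
    SuiteResult("multi_step_analysis", "capability", passed=12, total=40),
]
blocking = [r.name for r in results if blocks_release(r)]
```

Here the low-scoring capability suite does not block anything; it only shows where the agent has room to improve.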
Evaluating Deep Agents
Deep agents, which combine planning with tool use, call for different evaluation strategies. Four of the patterns from LangChain’s work with deep agent architectures are:
- Custom Test Logic Per Data Point: Each test case might require unique assertions against trajectories and states.
- Single-Step Evaluations: Validate individual decision points, such as the first action taken by an agent after user input.
- Full Agent Turns: Test the agent in an end-to-end fashion, assessing both trajectory and output.
- Multi-Turn Evaluations: Evaluate how an agent responds across extended conversational contexts.
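The first pattern, custom test logic per data point, can be sketched by pairing each input with its own check function instead of one shared grader. Everything here is hypothetical: the run format (a dict with tools and answer keys) and the stand-in agent exist only so the sketch runs offline.

```python
from typing import Callable

Run = dict  # hypothetical run record: {"tools": [...], "answer": "..."}


def fake_agent(question: str) -> Run:
    """Stand-in for a real agent so the example needs no network access."""
    if "Canada" in question:
        return {"tools": ["sql_db_query"], "answer": "8 customers are from Canada."}
    return {"tools": [], "answer": "Hello! Ask me about the database."}


def check_canada(run: Run) -> None:
    # Data question: the agent must query the database and report the count
    assert "sql_db_query" in run["tools"], "expected a SQL query"
    assert "8" in run["answer"], "expected the count in the answer"


def check_greeting(run: Run) -> None:
    # Small talk: the agent should not touch the database at all
    assert run["tools"] == [], "expected no tool calls"


CASES: list[tuple[str, Callable[[Run], None]]] = [
    ("How many customers are from Canada?", check_canada),
    ("Hi there", check_greeting),
]


def run_suite(agent: Callable[[str], Run], cases) -> dict[str, bool]:
    """Run each case with its own assertions and record pass/fail."""
    results = {}
    for question, check in cases:
        try:
            check(agent(question))
            results[question] = True
        except AssertionError:
            results[question] = False
    return results


suite_results = run_suite(fake_agent, CASES)
```

In a pytest suite, the same idea is usually expressed as one test function per case, so that each can assert on the trajectory and state that matter for that input.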
End-to-End Example: Evaluating a Text-to-SQL Deep Agent on AWS
Let’s walk through an example that evaluates LangChain’s text-to-SQL deep agent with Amazon Bedrock.
Prerequisites
- An AWS account with Amazon Bedrock access.
- A LangSmith account and API key.
- Python 3.12 or higher.
Setup
Clone the repository and install the necessary dependencies:
git clone https://github.com/aws-samples/sample-text2sql-deep-agent-evalulation
cd sample-text2sql-deep-agent-evalulation
python -m venv .venv
source .venv/bin/activate # On Windows: .venv\Scripts\activate
pip install -e .
Building the Evaluation Suite
Applying these evaluation patterns, here is an example test that uses LangSmith’s pytest integration:
import pytest

@pytest.mark.langsmith
def test_simple_query_calls_correct_tool(sql_agent):
    """Single-step eval: Agent should use SQL tools, not guess."""
    question = "How many customers are from Canada?"
    result = sql_agent.invoke({
        "messages": [{"role": "user", "content": question}]
    })
    # sql_agent (a fixture) and extract_tool_calls (a helper) are assumed to be defined elsewhere in the test suite
    tool_names = [tc["name"] for tc in extract_tool_calls(result["messages"])]
    assert "sql_db_query" in tool_names, "Agent must use SQL tools."
Viewing Results in LangSmith
Every test case is logged automatically as an experiment in LangSmith, allowing users to inspect full traces, track feedback scores, and monitor token usage.
From Offline to Online: Production Monitoring with LangSmith
After developing and running offline evaluations, the next step is online monitoring. In production, real users may introduce queries that were never anticipated.
Types of Online Evaluators
- Code Evaluators: Fast checks for safety violations.
- LLM-as-Judge Evaluators: Assess answer quality based on internal consistency and clarity.
- Composite Evaluators: Aggregate multiple scores into a single, actionable metric.
Conclusion
AI agents demand specialized evaluation strategies. The five patterns provided by LangChain serve as a comprehensive framework. By employing these methods throughout an agent’s lifecycle—from developing offline evaluations to consistent online monitoring—you can significantly enhance your agent’s behavior and reliability.
To get started, explore the companion repository for the complete working example and dive deeper into the services used in this post, including Amazon Bedrock and Amazon Nova.
About the Authors
Jagdeep Singh Soni is a Senior AI/ML Solutions Architect at AWS based in the Netherlands, specializing in generative AI.
Ajeet Tewari is a Senior Solutions Architect for Amazon Web Services, helping enterprise customers navigate their AWS journeys.
Anuj Jauhari is a Senior Product Marketing Manager Technical for Amazon Nova, combining technical depth with strategic storytelling.
Karan Singh is Head of Partnerships at LangChain, leading the company’s partner ecosystem across cloud providers and technology ISVs.