Enhancing Conversational AI Evaluation: The Shift to Multi-Turn Interaction
Navigating the Complexities of Multi-Turn Conversations in AI Agent Evaluation
Evaluating single-turn agent interactions has always been relatively straightforward. You provide an input, gather the output, and assess the outcome. Frameworks like the Strands Evaluation SDK facilitate this process through evaluators that analyze aspects such as helpfulness, faithfulness, and tool usage. In our previous blog post, we discussed how to construct comprehensive evaluation suites for AI agents leveraging these capabilities. However, in real-world production, conversations rarely stop at a single turn.
The Realities of Multi-Turn Conversations
Real users engage in extended dialogues, which evolve over multiple exchanges. They often ask follow-up questions if they find answers lacking, shift topics as new information arises, or express frustration when their needs go unaddressed. For instance, a travel assistant might successfully respond to the query, “Book me a flight to Paris,” but may falter when the user subsequently asks for train options or hotels near the Eiffel Tower. Testing these fluid patterns requires more than static test cases with fixed inputs and outputs.
The Core Challenge: Scale
The primary obstacle is scale; manually conducting hundreds of multi-turn conversations every time your agent changes is impractical. Furthermore, scripting conversation flows limits your exploration to predefined paths, which often miss the unpredictable nature of real user behavior. Thus, evaluation teams need a way to programmatically generate realistic, goal-driven users who can interact naturally with an agent over multiple turns. This is where ActorSimulator in the Strands Evaluation SDK comes in—a structured user simulation tool that seamlessly integrates into your evaluation pipeline.
Why Multi-Turn Evaluation is Harder
Single-turn evaluation boasts a simple structure: the input is predetermined, the output is self-contained, and the context is confined to a single exchange. Multi-turn conversations, however, defy these assumptions. Each message relies on prior interactions; how the agent responds in one turn shapes the user’s next question. In this dynamic landscape, a static dataset of input/output pairs fails to capture the essence of real-world conversations, since the "correct" next user message hinges on the agent’s previous response.
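To make that contrast concrete, here is a minimal sketch in plain Python (not the Strands API) of why static pairs break down: the second user turn cannot be fixed in advance because it depends on what the agent actually said. The reply policy below is a toy stand-in for what a simulated user does.

```python
# Static single-turn test case: input and expected output are fixed up front.
static_case = {
    "input": "Book me a flight to Paris",
    "expected": "flight booked",
}

# Multi-turn: the next user message must be computed from the agent's reply,
# so it cannot live in a static dataset of input/output pairs.
def next_user_message(agent_reply: str) -> str:
    """Toy policy: follow up differently depending on what the agent asked."""
    if "which date" in agent_reply.lower():
        return "Next Tuesday, returning Sunday."
    if "budget" in agent_reply.lower():
        return "Under 500 dollars, please."
    return "Great, also find a hotel near the Eiffel Tower."

# Two different agent replies lead to two different "correct" user turns.
print(next_user_message("Sure - which date works for you?"))
print(next_user_message("Done! Anything else?"))
```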
While manual testing theoretically fills this gap, it often falls short in practice. Testers might replicate realistic multi-turn conversations, but doing so for every scenario and persona after every agent update is unsustainable. As an agent’s features expand, the number of conversation paths grows exponentially, quickly surpassing what teams can feasibly test manually. Some teams resort to prompt engineering, instructing a large language model (LLM) to “act like a user” during evaluations. Yet, without structured persona definitions and explicit goal tracking, results can vary, creating inconsistencies over time.
Defining Robust Simulated Users
Simulation-based testing is a well-trodden practice in other engineering fields. For example, flight simulators recreate scenarios that are either too dangerous or impossible to replicate in real life, while game engines utilize AI agents to explore countless player behavior pathways before launching. The same principle applies to conversational AI. In this context, it’s crucial to create a controlled environment where realistic actors engage with your system under defined conditions.
For effective AI agent evaluation, a well-structured simulated user must possess consistent traits. An actor behaving as an expert one moment and a confused novice the next yields unreliable evaluation data. Consistency encompasses maintaining the same communication style, expertise level, and personality traits throughout the conversation.
Equally vital is adopting goal-driven behavior. Real users approach agents to accomplish specific tasks and persist until they achieve their goals, adapting their methods if challenges arise. A lack of explicit goals can lead to simulated users either cutting conversations short or dragging them out indefinitely, neither of which reflects actual usage patterns.
Moreover, simulated users must respond adaptively to the agent’s messages instead of adhering to a scripted flow. If the agent poses a clarifying question, for instance, the simulated user should respond in character. This adaptability transforms simulated conversations into valuable evaluation datasets, mirroring the conversational dynamics agents encounter in real-world applications.
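The three properties above — consistent traits, an explicit goal, and adaptive responses — can be sketched as a minimal data structure. This is an illustrative shape only, not the Strands ActorSimulator API:

```python
# Illustrative sketch of a simulated user; not the Strands ActorSimulator API.
from dataclasses import dataclass, field

@dataclass
class SimulatedUser:
    # Fixed for the whole conversation -> behavioral consistency.
    traits: dict
    # Explicit goal -> the user persists until it is met or judged infeasible.
    goal: str
    history: list = field(default_factory=list)

    def respond(self, agent_message: str) -> str:
        """Adaptive: the reply is conditioned on the agent's last message."""
        self.history.append(("assistant", agent_message))
        if "?" in agent_message:
            reply = f"({self.traits['style']}) Here are the details you asked for."
        else:
            reply = f"({self.traits['style']}) That doesn't cover my goal yet: {self.goal}"
        self.history.append(("user", reply))
        return reply

user = SimulatedUser(traits={"style": "terse"}, goal="book a flight under budget")
print(user.respond("Which dates work for you?"))
```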
Introducing ActorSimulator
ActorSimulator addresses these simulation needs by wrapping a Strands Agent so that it emulates a realistic user persona. The process begins with profile generation: given a test case with an input query and an optional task description, ActorSimulator employs an LLM to create a comprehensive actor profile. For example, given the request “I need help booking a flight to Paris” with the task description “Complete flight booking under budget,” it might generate a budget-conscious traveler with a beginner’s experience level and a relaxed communication style.
Once the profile is established, the simulator manages the conversation turn by turn, incorporating the full conversation history and generating contextually pertinent responses while keeping the simulated user’s behavior in line with their defined profile and objectives. If the agent only partially addresses a request, the simulated user naturally follows up to seek the missing information.
Goal Tracking and Adaptive Responses
ActorSimulator also includes goal completion assessment tools, allowing the simulated user to evaluate whether their original task remains unmet. Conversations conclude once a goal is satisfied or the simulated user determines that their request cannot be fulfilled—or when the maximum turn limit is reached. This design ensures natural conversation endpoints, avoiding arbitrary cuts or endless dialogue.
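The stopping logic described above reduces to three conditions. The standalone sketch below encodes one way to express it; the function name and parameters are illustrative, not the SDK's internals:

```python
# Illustrative termination check; names are not part of the Strands SDK.
def should_continue(goal_met: bool, judged_infeasible: bool,
                    turn: int, max_turns: int) -> bool:
    """Continue only while the goal is open, still believed achievable,
    and the turn budget is not exhausted."""
    return not goal_met and not judged_infeasible and turn < max_turns

# Conversation ends naturally once the goal is satisfied...
assert should_continue(goal_met=True, judged_infeasible=False, turn=2, max_turns=5) is False
# ...or when the turn limit acts as a safety net.
assert should_continue(goal_met=False, judged_infeasible=False, turn=5, max_turns=5) is False
print("termination checks pass")
```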
Each user message is accompanied by structured reasoning, providing insights into the simulated user’s thought process. This transparency facilitates easier evaluation development, allowing evaluators to trace why conversations succeed or falter.
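A result that pairs the user message with its reasoning might look like the following. This is a hypothetical shape for illustration; the field names in the actual SDK may differ:

```python
# Hypothetical result shape for illustration; actual SDK fields may differ.
from dataclasses import dataclass

@dataclass
class SimulatedTurn:
    message: str    # what the simulated user says next
    reasoning: str  # why, given the profile, goal, and agent reply

turn = SimulatedTurn(
    message="That flight works, but I still need a hotel near the Eiffel Tower.",
    reasoning="Agent completed the flight sub-goal; the hotel sub-goal is still "
              "open, so the user follows up rather than ending the conversation.",
)
print(turn.reasoning)
```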
Getting Started with ActorSimulator
To get started with ActorSimulator, install the Strands Evaluation SDK:
pip install strands-agents-evals
Refer to our documentation or our prior blog post for detailed setup instructions. Getting started requires minimal code: define a test case containing an input query and a task description that outlines the user’s goal, and ActorSimulator handles profile generation, conversation management, and goal tracking.
Here’s a quick example evaluating a travel assistant through a simulated multi-turn conversation:
from strands import Agent
from strands_evals import ActorSimulator, Case

# Define your test case
case = Case(
    input="I want to plan a trip to Tokyo with hotel and activities",
    metadata={"task_description": "Complete travel package arranged"}
)

# Create the agent you want to evaluate
agent = Agent(
    system_prompt="You are a helpful travel assistant.",
    callback_handler=None
)

# Create user simulator from the test case
user_sim = ActorSimulator.from_case_for_user_simulator(
    case=case,
    max_turns=5
)

# Run the multi-turn conversation
user_message = case.input
conversation_history = []
while user_sim.has_next():
    agent_response = agent(user_message)
    agent_message = str(agent_response)
    conversation_history.append({"role": "assistant", "content": agent_message})

    user_result = user_sim.act(agent_message)
    user_message = str(user_result.structured_output.message)
    conversation_history.append({"role": "user", "content": user_message})

print(f"Conversation completed in {len(conversation_history) // 2} turns")
The conversation continues until has_next() returns False, indicating the simulated user’s goals have been met or the conversation limit has been reached. The resulting conversation_history retains the complete multi-turn transcript for evaluation.
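Once you have conversation_history, you typically reshape it into units your evaluators can score. A small helper like the one below (illustrative, not part of the SDK) pairs each assistant reply with the simulated user's reaction to it, which is a convenient unit for per-turn scoring:

```python
# Illustrative helper, not part of the Strands SDK.
def to_turn_pairs(history: list[dict]) -> list[tuple[str, str]]:
    """Pair each assistant message with the user message that followed it."""
    pairs = []
    for i in range(0, len(history) - 1, 2):
        if history[i]["role"] == "assistant" and history[i + 1]["role"] == "user":
            pairs.append((history[i]["content"], history[i + 1]["content"]))
    return pairs

history = [
    {"role": "assistant", "content": "I found three flights to Tokyo."},
    {"role": "user", "content": "Great - what about hotels?"},
    {"role": "assistant", "content": "Here are hotels in Shinjuku."},
    {"role": "user", "content": "Perfect, book the first one."},
]
print(to_turn_pairs(history))
```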
Integrating with Evaluation Pipelines
While a standalone conversation loop is useful for rapid experimentation, production-grade evaluation requires capturing traces and feeding them into your evaluator pipeline. You can combine ActorSimulator with OpenTelemetry collection and Strands Evals session mapping so that the task function simulates a conversation while collecting spans at each turn.
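As a lightweight stand-in for a full tracing setup, the sketch below records span-like metadata (timing and message sizes) for each turn alongside the transcript. In a real pipeline these records would come from OpenTelemetry spans emitted during each turn; all names here are illustrative rather than part of the SDK:

```python
# Stand-in for per-turn tracing; in production these records would be
# OpenTelemetry spans. All names here are illustrative.
import time

def run_traced_turn(turn_index: int, agent_fn, user_message: str, records: list) -> str:
    """Invoke the agent for one turn and record span-like metadata."""
    start = time.perf_counter()
    agent_reply = agent_fn(user_message)
    records.append({
        "name": "agent_turn",
        "turn": turn_index,
        "duration_s": time.perf_counter() - start,
        "input_chars": len(user_message),
        "output_chars": len(agent_reply),
    })
    return agent_reply

records: list = []
reply = run_traced_turn(0, lambda msg: f"Echo: {msg}", "Plan a trip to Tokyo", records)
print(records[0]["turn"], records[0]["output_chars"])
```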
Custom Actor Profiles for Focused Testing
Although automatic profile generation suffices for most scenarios, certain testing objectives demand specialized personas. You might want to evaluate how your agent handles an impatient expert versus a patient novice, or a user with specific domain expertise. For such cases, ActorSimulator lets you supply a fully customized actor profile.
from strands_evals.types.simulation import ActorProfile
from strands_evals import ActorSimulator

actor_profile = ActorProfile(
    traits={
        "personality": "analytical and detail-oriented",
        "communication_style": "direct and technical",
        "expertise_level": "expert",
        "patience_level": "low"
    },
    context="Experienced business traveler with elite status who values efficiency",
    actor_goal="Book business class flight with specific seat preferences and lounge access"
)

# Initialize the simulator with a custom profile
user_sim = ActorSimulator(
    actor_profile=actor_profile,
    initial_query="I need to book a business class flight to London next Tuesday",
    max_turns=10
)
Defining traits such as patience level, communication style, and expertise allows systematic testing of how your agent performs across diverse user profiles. An agent that performs well with novice users but struggles with impatient experts has a specific, actionable area for improvement.
Best Practices for Simulation-Based Evaluation
To maximize your gains from simulation-based evaluation, consider these best practices:
- Choose max_turns based on task complexity: use 3-5 for straightforward tasks and 8-10 for complex workflows, and adjust if conversations frequently hit the limit without achieving their goals.
- Write precise task descriptions: a generic description like "Help the user book a flight" gives the simulator no concrete target, and vague success criteria prevent reliable goal assessment. Prefer something specific, such as "Complete flight booking under budget."
- Leverage auto-generated profiles, but also define custom profiles that reproduce patterns from your production logs, such as interactions with impatient experts.
- Analyze trends across your test suite: consistent redirection can indicate an agent drifting off-topic, while declining goal-completion rates after an agent update may signal a regression.
- Start with a focused set of test cases and expand gradually: begin with representative scenarios, then branch out into edge cases and diverse personas as your evaluation practice matures.
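The first and fourth practices above can be combined into a simple feedback loop: track how often conversations exhaust the turn limit without reaching their goal, and raise max_turns when that rate climbs. A toy version, where the threshold and step size are arbitrary illustrative choices rather than SDK defaults:

```python
# Toy max_turns tuning loop; threshold and step size are illustrative choices.
def tune_max_turns(results: list[dict], current_max: int,
                   hit_rate_threshold: float = 0.2) -> int:
    """Raise max_turns if too many runs exhaust the limit without success."""
    if not results:
        return current_max
    hit_limit = sum(
        1 for r in results if r["turns"] >= current_max and not r["goal_met"]
    )
    if hit_limit / len(results) > hit_rate_threshold:
        return current_max + 2  # arbitrary step size for illustration
    return current_max

runs = [
    {"turns": 5, "goal_met": False},
    {"turns": 3, "goal_met": True},
    {"turns": 5, "goal_met": False},
    {"turns": 4, "goal_met": True},
]
print(tune_max_turns(runs, current_max=5))
```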
Conclusion
In this blog post, we illustrated how ActorSimulator within Strands Evals provides a robust framework for systematic, multi-turn evaluation of conversational AI agents through realistic user simulations. Rather than relying on static test cases that capture isolated interactions, you can define goals and personas, enabling simulated users to engage in dynamic, adaptive conversations with your agent. The resulting transcripts can seamlessly feed into your existing evaluation pipeline, providing valuable scores, success rates, and comprehensive traces for each interaction turn.
To begin experimenting, check out our working examples in the Strands Agents samples repository. For teams using Amazon Bedrock AgentCore, our evaluations sample demonstrates how to simulate interactions with deployed agents. Start with a few test cases that embody your most common user scenarios, run them through ActorSimulator, and evaluate the outputs. Over time, as your evaluation practices mature, broaden your coverage to include varied personas, edge cases, and intricate conversation patterns.
Meet the Authors
Ishan Singh
Ishan is a Sr. Applied Scientist at Amazon Web Services, specializing in generative AI solutions. Outside work, he enjoys volleyball and exploring trails with his dog, Beau.
Jonathan Buck
A Senior Software Engineer at AWS, Jonathan focuses on agent environments and evaluation infrastructure.
Vinayak Arannil
Vinayak, a Sr. Applied Scientist from the Amazon Bedrock AgentCore team, has extensive experience in various AI domains.
Abhishek Kumar
Abhishek is an Applied Scientist at AWS, engaged with agent observability, simulation, and evaluation, previously contributing to Alexa’s core capabilities.