Introducing Multimodal Evaluators: Enhancing Image-to-Text Assessment in Strands Evals
Unlocking the Power of Automated Image-Grounded Evaluation
Text-only evaluation leaves significant gaps when validating multimodal AI, because the evaluator never sees the image. This post introduces the newly launched multimodal evaluators, designed for image-to-text use cases such as visual shopping, document understanding, and chart analysis. With these tools, you can check that your model's responses are grounded in the image data.
Key Features of Our New Evaluators
- Overall Quality
- Correctness
- Faithfulness
- Instruction Following
These evaluators provide metrics for assessing image-to-text tasks, helping you detect visual hallucinations, factual errors, and violations of instruction constraints.
Step-by-Step Guide to Integration
- Set Up: Learn how to add the evaluators to your existing workflow.
- Evaluate: Switch with ease between reference-based and reference-free methods.
- Customize: Develop domain-specific rubrics for tailored evaluation.
Recommendations & Best Practices
Discover our practical tips for maximizing the efficacy of your evaluations, including selecting the right judge model and effective prompt-design strategies.
Conclusion
These evaluators are a step toward reliable, automated assessment of multimodal AI systems. Get started with image-to-text evaluation in Strands Evals today!
Grounding AI Evaluation in Reality: Introducing Multimodal Evaluators for Image-to-Text Tasks
As AI evolves rapidly, ensuring that models produce reliable and accurate outputs is paramount, especially for applications in visual shopping, image understanding, and document analysis. Gartner predicts that by 2030, 80% of enterprise software will be multimodal, which makes verifying the outputs of these systems a significant challenge.
The Limitations of Text-Only Evaluators
Imagine you’ve developed a model designed to read invoices or summarize complex charts. A text-only evaluator may give positive feedback on the fluency and structure of its output, but it cannot check whether a caption accurately represents an image or whether an extracted figure matches the source document. Key failures that go undetected include:
- Misinterpretation of data, such as naming a trend that the chart does not reflect.
- Hallucinated elements, such as products or labels that aren’t actually present.
- Responses that veer off the intended question or format.
Because a text-only evaluator cannot see the image, it cannot verify any of the details that ground the output in reality.
New Multimodal Evaluators: A Practical Solution
Today, we’re excited to announce the addition of four new multimodal large language model (MLLM)-as-a-Judge evaluators within the Strands Evals Software Development Kit (SDK): Overall Quality, Correctness, Faithfulness, and Instruction Following. These evaluators provide a robust way to score image-to-text outputs based on the actual content of the source image.
Each evaluator scores the response against the original image. The evaluator passes the image, query, response, and an optional reference answer to a multimodal judge model, which returns a grounded score along with a rationale you can use for debugging. Because the scoring is automated, you can run it in continuous integration (CI) to catch visual hallucinations, factual inaccuracies, and deviations from instructions.
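As a rough illustration of the CI use case, the sketch below turns per-evaluator pass flags into a list of failing evaluators that a build script could act on. The function name `failing_evaluators`, the `min_pass_rate` threshold, and the input dictionary are assumptions made for this example; they are not part of the Strands Evals SDK.

```python
from typing import Mapping, Sequence


def failing_evaluators(
    passes_by_evaluator: Mapping[str, Sequence[bool]],
    min_pass_rate: float = 1.0,
) -> list[str]:
    """Return the evaluators whose pass rate is below min_pass_rate.

    Hypothetical helper: the input maps an evaluator name to one
    pass/fail flag per evaluated case.
    """
    failing = []
    for name, passes in passes_by_evaluator.items():
        if not passes:
            # No cases were scored; report it so the gap is visible.
            failing.append(name)
            continue
        rate = sum(passes) / len(passes)
        if rate < min_pass_rate:
            failing.append(name)
    return failing


if __name__ == "__main__":
    # Made-up flags for illustration only.
    results = {
        "Correctness": [True, True, False],
        "Faithfulness": [True, True, True],
    }
    bad = failing_evaluators(results, min_pass_rate=0.9)
    print("Failing evaluators:", bad)
```

A CI job could exit with a non-zero status whenever the returned list is non-empty, which blocks the merge until the regression is investigated.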
What You Will Learn
In this blog post, you will discover how to:
- Set up and implement the four multimodal evaluators in an image-to-text task.
- Toggle between reference-based and reference-free evaluation.
- Create custom multimodal rubrics tailored to specific domain needs.
- Select a judge model from Amazon Bedrock that balances cost, accuracy, and latency.
- Utilize effective prompt-design choices that enhance the alignment of judge outputs with human evaluations.
Evaluators at a Glance
Each of our four evaluators targets different failure modes in image-to-text tasks:
| Evaluator | Score | Core Question | What It Catches |
|---|---|---|---|
| Overall Quality | Likert 1-5 | How good is the response? | Poor relevance, inaccuracies, lack of depth |
| Correctness | Binary | Is the response factually correct? | Factual errors, omissions, wrong attributes |
| Faithfulness | Binary | Is the response grounded in the image? | Hallucinated objects, unsupported claims |
| Instruction Following | Binary | Does the response adhere to instructions? | Format errors, off-topic content |
Each evaluator supports both reference-based evaluation (comparing against a gold-standard answer) and reference-free evaluation, so you can also score live images that have no ground truth.
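To make the difference between the two modes concrete, here is a minimal sketch of how a judge request might be assembled, with the reference answer included only when one exists. This is not the SDK's internal prompt: the function `build_judge_content` and all of the prompt wording are assumptions for illustration. The image block follows the same shape used in the task function of Step 2.

```python
from typing import Optional


def build_judge_content(
    image_bytes: bytes,
    image_format: str,
    instruction: str,
    response: str,
    reference: Optional[str] = None,
) -> list[dict]:
    """Assemble content blocks for a multimodal judge (illustrative only)."""
    parts = [
        f"Instruction given to the model:\n{instruction}",
        f"Model response to evaluate:\n{response}",
    ]
    if reference is not None:
        # Reference-based mode: the judge also sees a gold-standard answer.
        parts.append(f"Reference answer:\n{reference}")
    # Hypothetical wording that asks for a rationale before the score.
    parts.append("Explain your reasoning first, then give a score.")
    return [
        {"image": {"format": image_format, "source": {"bytes": image_bytes}}},
        {"text": "\n\n".join(parts)},
    ]


if __name__ == "__main__":
    blocks = build_judge_content(
        b"<image bytes>", "jpeg", "Which region is highest?", "U.S. and Canada."
    )
    print(blocks[1]["text"])
```

Omitting `reference` yields the reference-free variant, where the judge has only the image, the instruction, and the response to work from.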
Step-by-Step: Evaluating a Chart-Reading Task
Step 1: Define the Case and Evaluators
Begin by defining a case that wraps an image and an instruction in a MultimodalInput. Providing expected_output activates reference-based judging.
from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import (
    MultimodalOverallQualityEvaluator,
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
    MultimodalInstructionFollowingEvaluator,
)
from strands_evals.types import ImageData, MultimodalInput

cases = [
    Case[MultimodalInput, str](
        name="revenue-chart-1",
        input=MultimodalInput(
            media=ImageData(source="revenue_chart.jpeg"),
            instruction="Which region has the highest average revenue? "
            "State the region name and the dollar amount shown in the chart.",
        ),
        expected_output="U.S. and Canada has the highest at $13.32.",
        metadata={"dataset": "ChartQA"},
    ),
]

evaluators = [
    MultimodalOverallQualityEvaluator(),
    MultimodalCorrectnessEvaluator(),
    MultimodalFaithfulnessEvaluator(),
    MultimodalInstructionFollowingEvaluator(),
]
Step 2: Run the Experiment
Next, wire up the task function that the experiment calls for each case. It runs your model on the input image and instruction and returns the response for the evaluators to score.
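# callback_handler=None turns off the agent's default console output.
agent = Agent(callback_handler=None)
task_output = None  # Keeps the latest model response for printing in Step 3.

def run_task(case):
    """Run the model on the case's image and instruction; return its text response."""
    global task_output
    image = case.input.media
    # Content blocks: the image first, then the instruction text.
    messages = [
        {"image": {"format": image.format or "png", "source": {"bytes": image.to_bytes()}}},
        {"text": case.input.instruction},
    ]
    task_output = str(agent(messages))
    return task_output

# Top-level await works in a notebook; in a script, call this from an async function.
reports = await Experiment(cases=cases, evaluators=evaluators).run_evaluations_async(
    task=run_task, max_workers=1,
)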
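The task function above falls back to "png" when the image carries no format, even though the example file is a JPEG. If you prefer to derive the format from the file name, a small helper along these lines could be used. It is a sketch, not part of the SDK: `infer_image_format` is a hypothetical name, and the set of accepted formats assumes the image block of the Amazon Bedrock Converse API (png, jpeg, gif, webp); check the documentation for your model provider.

```python
from pathlib import Path

# Assumed set of formats accepted by the model provider's image block.
SUPPORTED_FORMATS = {"png", "jpeg", "gif", "webp"}


def infer_image_format(path: str, default: str = "png") -> str:
    """Guess the image format from the file extension (hypothetical helper)."""
    ext = Path(path).suffix.lower().lstrip(".")
    if ext == "jpg":
        # Normalize the common short extension to the canonical name.
        ext = "jpeg"
    return ext if ext in SUPPORTED_FORMATS else default


if __name__ == "__main__":
    print(infer_image_format("revenue_chart.jpeg"))
```

Unknown or missing extensions fall back to the default, which mirrors the `image.format or "png"` behavior in the task function.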
Step 3: Analyze the Results
After executing the evaluations, inspect the results to determine how well the model performed.
print(f"Task Output:\n{task_output}\n")
print("=" * 50)
for name, report in zip(
    ["Quality", "Correctness", "Faithfulness", "Instruction"], reports,
):
    reason = report.reasons[0] if report.reasons else ""
    status = "PASS" if report.test_passes[0] else "FAIL"
    print(f"{name}: {report.scores[0]:.2f} [{status}]")
    print(f"  Reason: {reason}\n")
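The loop above reads only the first case of each report. With more cases, you may want an aggregate view per evaluator. The sketch below assumes a report exposes the same three list attributes used above (`scores`, `test_passes`, `reasons`); `ReportLike` is a stand-in defined here so the example runs on its own, and `summarize` is a hypothetical helper rather than an SDK function.

```python
from dataclasses import dataclass, field


@dataclass
class ReportLike:
    """Stand-in with only the fields used in Step 3 (not the SDK class)."""
    scores: list[float]
    test_passes: list[bool]
    reasons: list[str] = field(default_factory=list)


def summarize(report) -> dict:
    """Compute the mean score, pass rate, and reasons for failed cases."""
    n = len(report.scores)
    failed = [i for i, ok in enumerate(report.test_passes) if not ok]
    return {
        "cases": n,
        "mean_score": sum(report.scores) / n if n else 0.0,
        "pass_rate": (n - len(failed)) / n if n else 0.0,
        # Keep only the rationales of failing cases for debugging.
        "failed_reasons": [
            report.reasons[i] for i in failed if i < len(report.reasons)
        ],
    }


if __name__ == "__main__":
    # Made-up values for illustration only.
    demo = ReportLike(
        scores=[1.0, 0.0, 1.0, 1.0],
        test_passes=[True, False, True, True],
        reasons=["ok", "label not present in image", "ok", "ok"],
    )
    print(summarize(demo))
```

Collecting the rationales of failed cases in one place makes it easier to spot recurring failure patterns, such as a particular chart type the model misreads.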
Best Practices for Effective Evaluation
- Start with MultimodalOverallQualityEvaluator for quick insights, then integrate targeted evaluators based on findings.
- Select Claude Sonnet 4.6 as your judge model unless specific constraints dictate otherwise.
- Keep the reason-and-score output format: the rationale supports debugging and improves alignment with human judgments.
- Use references for correctness, faithfulness, and overall quality; skip them for instruction-following evaluations.
Conclusion
The introduction of the four MLLM-as-a-Judge evaluators in Strands Evals dramatically enhances the accuracy and reliability of image-to-text evaluations. By providing grounded assessments through Overall Quality, Correctness, Faithfulness, and Instruction Following, we are paving the way for more robust multimodal evaluation, moving beyond costly human review and unreliable text-only proxies.
To get started, install Strands Evals and begin your journey into more precise AI evaluation:
pip install strands-agents-evals
For further resources and tools, check out the additional materials linked below.
Authored by Sangmin Woo, Sungyeon Kim, Vinayak Arannil, and Haibo Ding, experts in applied AI and machine learning frameworks at AWS.