Introducing Multimodal Evaluators: Enhancing Image-to-Text Assessment in Strands Evals

Unlocking the Power of Automated Image-Grounded Evaluation

In multimodal AI, relying solely on text-based evaluation leaves significant gaps in validation. This post introduces the newly launched multimodal evaluators, built for use cases such as visual shopping, document understanding, and chart analysis. They score a model's response against the source image, so you can check that the response is grounded in what the image actually shows.

Key Features of Our New Evaluators

Overall Quality
Correctness
Faithfulness
Instruction Following

These evaluators score image-to-text outputs so that you can detect visual hallucinations, factual errors, and violations of instructions automatically.

Step-by-Step Guide to Integration

Set Up: Learn how to implement the evaluators in your existing workflow seamlessly.
Evaluate: Switch with ease between reference-based and reference-free methods.
Customize: Develop domain-specific rubrics for tailored evaluation.

Recommendations & Best Practices

Discover our practical tips for maximizing the efficacy of your evaluations, including selecting the right judge model and effective prompt-design strategies.

Conclusion

As we stand at the forefront of multimodal AI advancements, these evaluators offer a vital leap towards reliable and automated performance assessments. Begin your journey toward enhanced image-to-text evaluations with Strands Evals today!

Grounding AI Evaluation in Reality: Introducing Multimodal Evaluators for Image-to-Text Tasks

Models used for visual shopping, image understanding, and document analysis need to produce reliable and accurate outputs. Gartner predicts that by 2030, 80% of enterprise software will be multimodal, which makes verifying the outputs of these systems a significant challenge.

The Limitations of Text-Only Evaluators

Imagine you’ve developed a model designed to read invoices or summarize complex charts. A text-only evaluator may give positive feedback on the fluency and structure of its output. However, this method is fundamentally flawed: it cannot check whether a caption accurately describes an image or whether an extracted figure matches the source document. Typical failures it misses include:

Misinterpretation of data, such as naming a trend that the chart does not reflect.
Hallucinated elements, such as products or labels that aren’t actually present.
Responses that veer off the intended question or format.

A text-only evaluator cannot see the image, so it cannot verify any of the details that ground the output in reality.

New Multimodal Evaluators: A Practical Solution

Today, we’re excited to announce the addition of four new multimodal large language model (MLLM)-as-a-Judge evaluators within the Strands Evals Software Development Kit (SDK): Overall Quality, Correctness, Faithfulness, and Instruction Following. These evaluators provide a robust way to score image-to-text outputs based on the actual content of the source image.

Each evaluator scores responses against the original image rather than against text alone. The evaluator feeds the image, query, response, and an optional reference answer to a multimodal judge model, which returns a grounded score along with a rationale you can use for debugging. Because scoring is automated, you can run the evaluators in continuous integration (CI) to catch visual hallucinations, factual inaccuracies, and deviations from instructions.
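As one way to picture the CI use case, the sketch below turns per-case judgments into a pass or fail decision for a build. The EvalResult record, the gate function, and the 0.9 threshold are hypothetical illustrations and are not part of the Strands Evals SDK; in practice you would fill the records from the reports the SDK returns.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class EvalResult:
    """Hypothetical record of one judged case (not an SDK type)."""
    evaluator: str
    passed: bool
    reason: str


def gate(results, min_pass_rate=0.9):
    """Return (ok, rates); ok is False if any evaluator's pass rate is below the threshold."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for r in results:
        totals[r.evaluator] += 1
        passes[r.evaluator] += int(r.passed)
    rates = {name: passes[name] / totals[name] for name in totals}
    ok = all(rate >= min_pass_rate for rate in rates.values())
    return ok, rates


# Made-up judgments for illustration only.
results = [
    EvalResult("Faithfulness", True, "All claims are visible in the chart."),
    EvalResult("Faithfulness", False, "Mentions a legend entry that is not in the image."),
    EvalResult("Correctness", True, "Region and amount match the chart."),
]
ok, rates = gate(results, min_pass_rate=0.9)
print("gate passed:", ok)
print(rates)
# In a CI job, a non-zero exit code fails the build, e.g. sys.exit(0 if ok else 1).
```

Keeping the rationale on each record means a failed build can print why the judge rejected a response, not only that it did.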

What You Will Learn

In this blog post, you will discover how to:

Set up and implement the four multimodal evaluators in an image-to-text task.
Toggle between reference-based and reference-free evaluation.
Create custom multimodal rubrics tailored to specific domain needs.
Select a judge model from Amazon Bedrock that balances cost, accuracy, and latency.
Utilize effective prompt-design choices that enhance the alignment of judge outputs with human evaluations.

Evaluators at a Glance

Each of our four evaluators targets different failure modes in image-to-text tasks:

Evaluator	Score	Core Question	What It Catches
Overall Quality	Likert 1-5	How good is the response?	Poor relevance, inaccuracies, lack of depth
Correctness	Binary	Is the response factually correct?	Factual errors, omissions, wrong attributes
Faithfulness	Binary	Is the response grounded in the image?	Hallucinated objects, unsupported claims
Instruction Following	Binary	Does the response adhere to instructions?	Format errors, off-topic content

Each evaluator supports both reference-based evaluation (comparing the response against a gold-standard answer) and reference-free evaluation, so you can also score live images for which no ground truth exists.
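To make the difference between the two modes concrete, here is a minimal sketch of how the text part of a judge request could change when a reference answer is present. The build_judge_prompt function and its wording are assumptions made for illustration; they are not the prompts that Strands Evals uses, and the image would be attached to the request separately.

```python
def build_judge_prompt(instruction, response, reference=None):
    """Assemble the text part of a judge request (hypothetical layout)."""
    parts = [
        "You are grading a response to a question about the attached image.",
        f"Question: {instruction}",
        f"Response: {response}",
    ]
    if reference is not None:
        # Reference-based: the judge can compare against a gold answer.
        parts.append(f"Reference answer: {reference}")
    else:
        # Reference-free: the judge has only the image to check against.
        parts.append("No reference answer is available; judge against the image only.")
    return "\n".join(parts)


question = "Which region has the highest average revenue?"
answer = "U.S. and Canada, at $13.32."

with_ref = build_judge_prompt(
    question, answer, reference="U.S. and Canada has the highest at $13.32."
)
without_ref = build_judge_prompt(question, answer)

print(with_ref)
print("---")
print(without_ref)
```

The same case can therefore be scored either way; the only input that changes is whether a gold answer exists.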

Step-by-Step: Evaluating a Chart-Reading Task

Step 1: Define the Case and Evaluators

Start by defining a Case that wraps an image and an instruction in a MultimodalInput. Providing an expected output activates reference-based judging.

from strands import Agent
from strands_evals import Case, Experiment
from strands_evals.evaluators import (
    MultimodalOverallQualityEvaluator,
    MultimodalCorrectnessEvaluator,
    MultimodalFaithfulnessEvaluator,
    MultimodalInstructionFollowingEvaluator,
)
from strands_evals.types import ImageData, MultimodalInput

cases = [
    Case[MultimodalInput, str](
        name="revenue-chart-1",
        input=MultimodalInput(
            media=ImageData(source="revenue_chart.jpeg"),
            instruction="Which region has the highest average revenue? "
                        "State the region name and the dollar amount shown in the chart.",
        ),
        # Providing expected_output enables reference-based judging for this case.
        expected_output="U.S. and Canada has the highest at $13.32.",
        metadata={"dataset": "ChartQA"},
    ),
]

evaluators = [
    MultimodalOverallQualityEvaluator(),
    MultimodalCorrectnessEvaluator(),
    MultimodalFaithfulnessEvaluator(),
    MultimodalInstructionFollowingEvaluator(),
]

Step 2: Run the Experiment

Next, define the task function that runs your model on each case's image and instruction, then pass it to the experiment to execute the evaluation.

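# callback_handler=None suppresses the agent's default console output.
agent = Agent(callback_handler=None)
task_output = None  # holds the most recent model output so Step 3 can print it

def run_task(case):
    global task_output
    image = case.input.media
    # Pass the image and the instruction together as content blocks.
    messages = [
        {"image": {"format": image.format or "png", "source": {"bytes": image.to_bytes()}}},
        {"text": case.input.instruction},
    ]
    task_output = str(agent(messages))
    return task_output

# Top-level await works in a notebook; in a plain script, put this call inside
# an async function and start it with asyncio.run().
reports = await Experiment(cases=cases, evaluators=evaluators).run_evaluations_async(
    task=run_task, max_workers=1,
)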
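A single global variable keeps only the last output, which is fine for one case but loses information once you add more cases. A possible alternative, sketched here with stand-in objects, is to store each output under its case name. The make_task helper, the fake_model function, and the demo_cases list are hypothetical and exist only to make the sketch runnable without the SDK.

```python
from types import SimpleNamespace

outputs = {}  # case name -> model output


def make_task(model):
    """Wrap a callable model so every case's output is stored under its name."""
    def run_task(case):
        result = str(model(case.input.instruction))
        outputs[case.name] = result
        return result
    return run_task


def fake_model(instruction):
    """Stand-in for a real agent call; returns a canned string."""
    return f"answer to: {instruction}"


# Minimal stand-ins for cases, with only the attributes the task reads.
demo_cases = [
    SimpleNamespace(name="chart-1", input=SimpleNamespace(instruction="Highest region?")),
    SimpleNamespace(name="chart-2", input=SimpleNamespace(instruction="Lowest region?")),
]

run_task = make_task(fake_model)
for case in demo_cases:
    run_task(case)

for name, text in outputs.items():
    print(f"{name}: {text}")
```

With real cases, the returned run_task can be passed as the task argument in the same way as the function above.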

Step 3: Analyze the Results

After executing the evaluations, inspect the results to determine how well the model performed.

print(f"Task Output:\n{task_output}\n")
print("=" * 50)
# Assumes the reports come back in the same order as the evaluators list.
for name, report in zip(
    ["Quality", "Correctness", "Faithfulness", "Instruction"], reports,
):
    reason = report.reasons[0] if report.reasons else ""
    status = "PASS" if report.test_passes[0] else "FAIL"
    print(f"{name}: {report.scores[0]:.2f} [{status}]")
    print(f"  Reason: {reason}\n")
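With more than one case, printing only the first entry of each report hides the rest. The sketch below summarizes a whole report instead. ReportStub is a stand-in that mirrors only the three list attributes the loop above reads (scores, test_passes, reasons); it is not the SDK's report class, and the summarize helper is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReportStub:
    """Stand-in exposing the list attributes used by the printing loop."""
    scores: list
    test_passes: list
    reasons: list = field(default_factory=list)


def summarize(name, report):
    """Return the mean score, pass rate, and reasons of failing cases for one report."""
    n_scores = len(report.scores)
    n_cases = len(report.test_passes)
    mean_score = sum(report.scores) / n_scores if n_scores else 0.0
    pass_rate = sum(bool(p) for p in report.test_passes) / n_cases if n_cases else 0.0
    failures = [
        reason
        for passed, reason in zip(report.test_passes, report.reasons)
        if not passed
    ]
    return {
        "name": name,
        "mean_score": mean_score,
        "pass_rate": pass_rate,
        "failures": failures,
    }


# Made-up values for illustration only.
demo_report = ReportStub(
    scores=[1.0, 0.0, 1.0, 1.0],
    test_passes=[True, False, True, True],
    reasons=["ok", "Names a region that is not on the chart.", "ok", "ok"],
)
summary = summarize("Faithfulness", demo_report)
print(f"{summary['name']}: mean={summary['mean_score']:.2f}, pass rate={summary['pass_rate']:.0%}")
for reason in summary["failures"]:
    print(f"  failed: {reason}")
```

Listing the reasons for failing cases gives you a starting point for debugging without rereading every passing result.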

Best Practices for Effective Evaluation

Start with MultimodalOverallQualityEvaluator for quick insights, then integrate targeted evaluators based on findings.
Select Claude Sonnet 4.6 as your judge model unless specific constraints dictate otherwise.
Keep the reason-and-score output format: the rationale makes failures easier to debug, and it improves alignment with human judgments.
Use references for correctness, faithfulness, and overall quality; skip them for instruction-following evaluations.
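To illustrate the reason-and-score format, the sketch below parses a judge reply that states its reasoning alongside a score. The JSON shape with "reasoning" and "score" keys is an assumption made for this example; the actual output format of the Strands Evals judges is not shown in this post, and parse_judgment is a hypothetical helper.

```python
import json


def parse_judgment(raw, low=1, high=5):
    """Parse a judge reply shaped like {"reasoning": str, "score": int}.

    Raises ValueError when the reply is malformed or the score is out of range.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"judge reply is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("judge reply must be a JSON object")

    reasoning = data.get("reasoning")
    score = data.get("score")
    if not isinstance(reasoning, str) or not reasoning.strip():
        raise ValueError("judge reply is missing a non-empty 'reasoning' field")
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(score, bool) or not isinstance(score, int):
        raise ValueError(f"score must be an integer, got {score!r}")
    if not low <= score <= high:
        raise ValueError(f"score must be in [{low}, {high}], got {score}")
    return reasoning, score


raw_reply = (
    '{"reasoning": "The chart shows U.S. and Canada at $13.32, '
    'matching the response.", "score": 5}'
)
reasoning, score = parse_judgment(raw_reply)
print(score, "-", reasoning)
```

Rejecting replies that lack a rationale keeps every stored score traceable to an explanation you can review.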

Conclusion

The four MLLM-as-a-Judge evaluators in Strands Evals make image-to-text evaluation more accurate and reliable. By grounding assessments in the source image through Overall Quality, Correctness, Faithfulness, and Instruction Following, they offer an alternative to costly human review and unreliable text-only proxies.

To get started, install Strands Evals and begin your journey into more precise AI evaluation:

pip install strands-agents-evals

For further resources and tools, check out the additional materials linked below.

Authored by Sangmin Woo, Sungyeon Kim, Vinayak Arannil, and Haibo Ding, experts in applied AI and machine learning frameworks at AWS.
