Speed Up Enterprise AI Development with Weights & Biases and Amazon Bedrock AgentCore

This post is co-written by Thomas Capelle and Ray Strickland from Weights & Biases (W&B).

Generative artificial intelligence (AI) is evolving rapidly, moving from basic foundation model interactions to advanced workflows that are becoming integral to enterprise operations. As organizations move from proof of concept to full-scale deployment, robust tools for developing, evaluating, and monitoring AI applications are essential.

In this post, we show how to use foundation models (FMs) from Amazon Bedrock and the recently launched Amazon Bedrock AgentCore, together with W&B Weave, to build, evaluate, and monitor enterprise AI solutions. We cover the development lifecycle end to end, from tracking individual FM calls to overseeing complex agent workflows in production.

Overview of W&B Weave

Weights & Biases (W&B) is an AI developer platform that provides tools for training models, fine-tuning, and building applications on top of foundation models, serving enterprises across a wide range of industries.

Key Features of W&B Weave

  • Tracing & Monitoring: Track large language model (LLM) calls and application logic to debug and analyze your production systems efficiently.
  • Systematic Iteration: Refine prompts, datasets, and models seamlessly.
  • Experimentation: Test various models and prompts using the LLM Playground.
  • Evaluation: Utilize custom or pre-built evaluation tools to assess application performance rigorously, and gather real-world user and expert feedback.
  • Guardrails: Safeguard your application with content moderation and prompt safety tools, using both custom and third-party guardrails, including those from Amazon Bedrock (a minimal pattern is sketched after this list).
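
As an illustration, here is a minimal sketch of a guardrail-style check built from plain @weave.op functions. The contains_blocked_terms helper and its blocklist are hypothetical; in practice you would swap in Weave's pre-built guardrail scorers or Amazon Bedrock Guardrails.

import weave

weave.init("guardrails-demo")  # illustrative project name

BLOCKED_TERMS = {"password", "credit card"}  # hypothetical blocklist

@weave.op()
def contains_blocked_terms(text: str) -> bool:
    # Naive substring check; real moderation would use a model or managed service
    return any(term in text.lower() for term in BLOCKED_TERMS)

@weave.op()
def guarded_respond(prompt: str, generate) -> str:
    # Screen the input, call the model, then screen the output
    if contains_blocked_terms(prompt):
        return "Request declined by input guardrail."
    response = generate(prompt)
    if contains_blocked_terms(response):
        return "Response withheld by output guardrail."
    return response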

W&B Weave is available in both multi-tenant and single-tenant environments and can also be deployed directly in a customer’s Amazon Virtual Private Cloud (VPC). Through its integration with the W&B Development Platform, it offers a seamless experience across model training and agentic AI workflows.

To get started, subscribe to the W&B AI Development Platform on AWS Marketplace, which is free for individuals and academic teams.

Tracking Amazon Bedrock FMs with W&B Weave SDK

W&B Weave integrates smoothly with Amazon Bedrock via Python and TypeScript SDKs. With just a few lines of code, you can automatically track LLM calls. Here’s a quick setup:

# Install the Weave SDK (shown in notebook form; use pip install weave in a shell)
!pip install weave

import json

import boto3
import weave
from weave.integrations.bedrock.bedrock_sdk import patch_client

# Initialize Weave with a project name; traced calls are logged to this project
weave.init("my_bedrock_app")

# Patch the boto3 Bedrock Runtime client so each model call is traced automatically
client = boto3.client("bedrock-runtime")
patch_client(client)

response = client.invoke_model(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    body=json.dumps({
        "anthropic_version": "bedrock-2023-05-31",
        "max_tokens": 100,
        "messages": [
            {"role": "user", "content": "What is the capital of France?"}
        ]
    }),
    contentType="application/json",
    accept="application/json"
)

# Read the streaming body and print the model's answer
response_dict = json.loads(response.get("body").read())
print(response_dict["content"][0]["text"])

This integration not only automates versioning and tracks configurations but also provides complete visibility into your applications without needing to alter existing logic.
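
The same patched client can also be used with the Bedrock Converse API, which offers a uniform request shape across FMs. A minimal sketch, reusing the client from above and assuming the Weave integration traces converse calls as well as invoke_model:

response = client.converse(
    modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
    messages=[
        {"role": "user", "content": [{"text": "What is the capital of Japan?"}]}
    ],
    inferenceConfig={"maxTokens": 100},
)
print(response["output"]["message"]["content"][0]["text"])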

Experimenting with Amazon Bedrock FMs in W&B Weave Playground

The W&B Weave Playground is designed to expedite prompt engineering through an intuitive interface that facilitates testing and comparison of Bedrock models. Notable features include:

  • Directly editing prompts and retrying messages.
  • Side-by-side model comparisons.
  • Quick iterations accessible from trace views.

To dive in, simply add your AWS credentials in the Playground settings, select your Amazon Bedrock FMs, and begin exploring. The platform enables rapid experimentation while ensuring full traceability of your efforts.

Evaluating Amazon Bedrock FMs with W&B Weave Evaluations

Effective evaluation of generative AI models is simplified with W&B Weave Evaluations. By utilizing these tools with Amazon Bedrock, users can evaluate models, analyze outputs, and visualize performance across various metrics. Leveraging built-in scorers, third-party or custom scorers, and expert feedback allows for a better understanding of model trade-offs such as cost, accuracy, and output quality.

Setting Up an Evaluation Job

Here’s an example of how to orchestrate an evaluation job:

import asyncio

import weave
from weave import Evaluation

# Initialize Weave before running the evaluation so results land in one project
weave.init("intro-example")

# Each row supplies the model input ("question") and the ground truth ("expected")
examples = [
    {"question": "What is the capital of France?", "expected": "Paris"},
    {"question": "Who wrote 'To Kill a Mockingbird'?", "expected": "Harper Lee"},
]

# Scorers receive dataset columns (expected) plus the model output
@weave.op()
def match_score1(expected: str, output: dict) -> dict:
    return {"match": expected == output["generated_text"]}

# Stand-in model for illustration; in practice this would call a Bedrock FM
@weave.op()
def function_to_evaluate(question: str):
    return {"generated_text": "Paris"}

evaluation = Evaluation(dataset=examples, scorers=[match_score1])

# Run the evaluation loop; per-example scores appear in the Weave dashboard
asyncio.run(evaluation.evaluate(function_to_evaluate))

The evaluation dashboard visualizes performance metrics, guiding informed decisions regarding model selection and configurations.
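
To evaluate an actual Bedrock model rather than the stand-in above, the evaluated op can call the patched client from the earlier tracking section. A sketch under that assumption (model ID and response parsing as in the earlier snippet):

@weave.op()
def bedrock_model(question: str) -> dict:
    # Each evaluation row triggers a fully traced Bedrock call
    response = client.invoke_model(
        modelId="anthropic.claude-3-5-sonnet-20240620-v1:0",
        body=json.dumps({
            "anthropic_version": "bedrock-2023-05-31",
            "max_tokens": 100,
            "messages": [{"role": "user", "content": question}],
        }),
        contentType="application/json",
        accept="application/json",
    )
    text = json.loads(response["body"].read())["content"][0]["text"]
    return {"generated_text": text.strip()}

asyncio.run(evaluation.evaluate(bedrock_model))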

Enhancing Amazon Bedrock AgentCore Observability with W&B Weave

Amazon Bedrock AgentCore provides enterprise-scale operational controls and runtime environments. It integrates seamlessly with popular frameworks to ensure secure and efficient agent operations. Built-in observability through Amazon CloudWatch helps track essential metrics while W&B Weave enhances this capability with detailed execution visualizations.

Integrating W&B Weave Observability

There are two effective ways to add observability to your AgentCore agents:

  1. Native W&B Weave SDK: Decorate functions with the @weave.op decorator and tracking becomes automatic.

import os

import weave
from strands import Agent  # assuming the Strands Agents SDK commonly used with AgentCore

os.environ["WANDB_API_KEY"] = "your_api_key"
weave.init("your_project_name")

@weave.op()
def word_count_op(text: str) -> int:
    return len(text.split())

@weave.op()
def run_agent(agent: Agent, user_message: str) -> dict:
    # Invoke the agent and log its reply and model ID to Weave
    result = agent(user_message)
    return {"message": result.message, "model": agent.model.config["model_id"]}
  2. OpenTelemetry Integration: For teams with existing OpenTelemetry infrastructure, W&B Weave accepts traces over OTLP for seamless integration.

from opentelemetry import trace
from opentelemetry.exporter.otlp.proto.http.trace_exporter import OTLPSpanExporter
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Endpoint and auth headers for your Weave OTLP endpoint (elided)
exporter = OTLPSpanExporter(...)

provider = TracerProvider()
provider.add_span_processor(BatchSpanProcessor(exporter))
trace.set_tracer_provider(provider)
tracer = trace.get_tracer(__name__)

with tracer.start_as_current_span("invoke_agent") as span:
    ...
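
For completeness, here is how the decorated run_agent from option 1 might be invoked, assuming a Strands Agent backed by a Bedrock model (the model ID is illustrative):

from strands import Agent
from strands.models import BedrockModel  # assuming the Strands Agents SDK

model = BedrockModel(model_id="anthropic.claude-3-5-sonnet-20240620-v1:0")
agent = Agent(model=model)

# Both the op and the agent's underlying model calls show up as traces in Weave
print(run_agent(agent, "Summarize the benefits of agent observability."))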

This dual observability approach allows teams to utilize both CloudWatch and W&B Weave, giving them the flexibility to monitor production SLAs, debug complex behaviors, and optimize performance across their workflows.

Conclusion

In this post, we’ve illustrated how to effectively build and optimize enterprise-grade agentic AI solutions by integrating Amazon Bedrock’s FMs and AgentCore with W&B Weave’s observability toolkit. We explored the extensive capabilities of W&B Weave to enhance the LLM development lifecycle from experimentation to evaluation and production monitoring.

Key Takeaways:

  • Automatic Tracking: Easily track Amazon Bedrock FM calls with minimal code adjustments using the W&B Weave SDK.
  • Rapid Experimentation: Engage with the Playground for quick testing and model comparisons.
  • Systematic Evaluation: Compare model performance through the Evaluation framework.
  • Comprehensive Observability: Blend CloudWatch monitoring with W&B Weave’s visualization tools for enhanced operational insights.

To kickstart your journey:

  1. Request a free trial or subscribe to W&B’s AI Development Platform via AWS Marketplace.
  2. Install the W&B Weave SDK to start tracking your Bedrock FM calls.
  3. Experiment with models in the W&B Weave Playground.
  4. Set up evaluations for comprehensive model comparisons.
  5. Enhance AgentCore agents with observability features.

Begin with simple integrations and progressively adopt advanced features as your AI applications evolve. The synergy between Amazon Bedrock’s capabilities and W&B Weave provides the foundation for building, evaluating, and maintaining robust AI solutions at scale.

About the Authors

James Yi is a Senior AI/ML Partner Solutions Architect at AWS, collaborating with teams to design advanced joint solutions in the generative AI space.

Ray Strickland is a Senior Partner Solutions Architect at AWS specializing in AI/ML and intelligent document processing, driving innovation and scalable solutions.

Thomas Capelle is a Machine Learning Engineer at W&B, focusing on MLOps and application development to enhance enterprise performance.

Scott Juang is the Director of Alliances at W&B, previously leading strategic partnerships in cloud technologies.


