Unlocking the Power of Unstructured Data: Fine-Tuning Llama 3.2 for Visual Question Answering with Amazon SageMaker and S3
Last year, AWS made significant strides by integrating Amazon SageMaker Unified Studio with Amazon Simple Storage Service (S3) general-purpose buckets. This integration streamlines the process for teams to leverage unstructured data stored in S3 for machine learning (ML) and data analytics purposes. In this blog post, we will explore how to harness this capability to fine-tune the Llama 3.2 11B Vision Instruct model for visual question answering (VQA).
What You’ll Learn
We will guide you through integrating S3 general-purpose buckets with Amazon SageMaker to enhance the performance of a large language model (LLM). Using a practical example, we will demonstrate how to ask the model a question based on an input image, such as identifying the transaction date from an itemized receipt.
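To make the receipt example concrete, here is a minimal sketch of how such a question and image might be packaged into a request payload. The exact schema depends on how the model is deployed; the messages-style field names below are assumptions, not the definitive endpoint contract.

```python
# Hypothetical sketch: packaging an image and a question into a VQA
# request body. Field names follow a common messages-style convention
# and may differ from the actual endpoint schema.
import base64
import json

def build_vqa_payload(image_bytes: bytes, question: str) -> str:
    """Encode the image and wrap it with the question in a JSON body."""
    encoded = base64.b64encode(image_bytes).decode("utf-8")
    payload = {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image", "image": encoded},
                    {"type": "text", "text": question},
                ],
            }
        ],
        "max_tokens": 128,
    }
    return json.dumps(payload)

body = build_vqa_payload(b"\x89PNG...", "What is the transaction date on this receipt?")
```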
For this demonstration, we’ll utilize Amazon SageMaker JumpStart to access the Llama 3.2 11B Vision Instruct model. This model has an impressive baseline performance, achieving an Average Normalized Levenshtein Similarity (ANLS) score of 85.3% on the DocVQA dataset. However, we aim to further improve these metrics through fine-tuning with varying dataset sizes (1,000, 5,000, and 10,000 images) sourced from Hugging Face’s DocVQA dataset.
Prerequisites
To get started, ensure you complete the following prerequisites:
- Create an AWS Account: If you don’t have one, set up your AWS account.
- Set Up Amazon SageMaker Unified Studio: Create a domain using the quick setup option.
- Create Two Projects:
- Data Producer Persona: For discovering and cataloging the dataset in an S3 bucket.
- Data Consumer Persona: For consuming the dataset to fine-tune the LLM.
- Access to a Running SageMaker Managed MLflow Application: For experimentation and evaluation purposes.
- Pre-Populated Amazon S3 Bucket: Populate this with the raw DocVQA dataset.
- Service Quota Increase: Request an increase to use ml.p4de.24xlarge instances for SageMaker training jobs.
Architecture Overview
The architecture for this project can be broken down into six high-level steps:
- Create and configure an IAM access role for Amazon S3 bucket permissions.
- Discover and catalog the dataset in the data producer project.
- Enrich the dataset with optional metadata and publish it to the SageMaker Catalog.
- Subscribe to the published dataset in the data consumer project.
- Preprocess the data to create training datasets of varying sizes.
- Utilize MLflow to track experimentation and evaluate results.
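Step 5 above, creating training subsets of varying sizes, can be sketched as follows. The nesting trick (a single shuffle so smaller subsets are contained in larger ones) is one reasonable design choice for comparable runs, not the post's prescribed method.

```python
# Illustrative sketch: carve fixed-size training subsets out of a
# larger dataset so each fine-tuning run sees a different data volume.
import random

def make_subsets(records, sizes=(1000, 5000, 10000), seed=42):
    """Return {size: records} subsets, sampled without replacement.

    A single shuffle up front means the 1,000-image subset is nested
    inside the 5,000-image one, keeping runs directly comparable.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return {n: shuffled[:n] for n in sizes if n <= len(shuffled)}

subsets = make_subsets(range(10000))
```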
Solution Walkthrough
In this example, we will work with the DocVQA dataset, but the same approach applies to any unstructured data relevant to your ML use case, such as chat logs, internal documents, or product reviews.
Step 1: Load and Sync Data
First, we use the Datasets API from Hugging Face to load and save the relevant dataset for our task:
import os
from datasets import load_dataset
# Create data directory
os.makedirs("data", exist_ok=True)
# Load and save train split
train_data = load_dataset("HuggingFaceM4/DocumentVQA", split="train[:10000]", cache_dir="./data")
train_data.save_to_disk("data/train")
Once we’ve loaded the dataset, we synchronize it to our S3 bucket.
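The synchronization step can also be done programmatically. Below is a minimal boto3 sketch (the bucket name and key prefix are hypothetical); in practice, `aws s3 sync` accomplishes the same thing. The SDK import is kept inside the upload function so the key-mapping logic works even without AWS credentials configured.

```python
# Sketch: mirror a local directory into an S3 bucket. The bucket name
# and prefix passed by callers are placeholders, not values from the post.
import os

def local_files(root):
    """Yield (local_path, s3_key) pairs for every file under root,
    preserving the directory layout as the key prefix."""
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            key = os.path.relpath(path, root).replace(os.sep, "/")
            yield path, key

def sync_to_s3(root, bucket, prefix="docvqa/train"):
    """Upload every file under root to s3://bucket/prefix/...
    boto3 is imported lazily; it assumes AWS credentials are configured."""
    import boto3
    s3 = boto3.client("s3")
    for path, key in local_files(root):
        s3.upload_file(path, bucket, f"{prefix}/{key}")
```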
Step 2: Discover and Catalog the Dataset
After synchronizing the dataset, we need to grant access to our data and catalog it. Navigate to the Data section of the data producer project and add the S3 bucket as a data source, providing the bucket name and the IAM access role created earlier.
Step 3: Publish to SageMaker Catalog
Once your dataset is cataloged, you can publish it for consumption.
Step 4: Data Consumer Project
Switching to the data consumer project, you’ll subscribe to the dataset, enabling your team to access it for ML model development.
Step 5: Model Development Workflow
Now, let’s begin the model development process, which consists of fetching the dataset, preparing it for fine-tuning, and proceeding with training.
Accessing Amazon S3 Data
You’ll configure temporary access credentials and sync the data using the AWS CLI.
aws s3 sync s3://{S3_BUCKET_NAME} ./ --profile access-grants-consumer-access-profile
Step 6: Fine-Tuning the LLM
With the preprocessed dataset ready, you can initiate the fine-tuning process using SageMaker JumpStart.
def train(name, instance_type, training_data_path, experiment_name, run):
...
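The elided train helper above might wrap a JumpStart estimator roughly as sketched below. The model ID and hyperparameter names are assumptions; consult the SageMaker JumpStart catalog for the exact values, and note that the SDK import is deferred so the configuration logic loads without SageMaker installed.

```python
# Hypothetical sketch of what the elided train() helper could wrap.
# The model_id and hyperparameter names are assumptions -- check the
# SageMaker JumpStart catalog for the exact identifiers.
def build_hyperparameters(epochs=1, learning_rate=2e-5):
    """Pure-Python config builder (JumpStart expects string values)."""
    return {"epochs": str(epochs), "learning_rate": str(learning_rate)}

def train(name, instance_type, training_data_path, experiment_name, run):
    # sagemaker is imported lazily so this module loads without the SDK.
    from sagemaker.jumpstart.estimator import JumpStartEstimator

    estimator = JumpStartEstimator(
        model_id="meta-vlm-llama-3-2-11b-vision-instruct",  # assumed ID
        instance_type=instance_type,  # e.g. ml.p4de.24xlarge
        hyperparameters=build_hyperparameters(),
    )
    # JumpStart fine-tuning reads the training channel from S3.
    estimator.fit({"training": training_data_path}, job_name=name)
    return estimator
```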
You can evaluate your fine-tuned models using the ANLS metric, which measures the accuracy of the predicted answers against the ground truth answers.
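ANLS can be implemented directly from its definition: for each question, take the best normalized Levenshtein similarity over the acceptable reference answers, zero out scores below the 0.5 threshold (per the DocVQA protocol), and average over questions. A self-contained sketch:

```python
def levenshtein(a: str, b: str) -> int:
    """Classic edit distance via dynamic programming (two-row variant)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + (ca != cb)))  # substitution
        prev = curr
    return prev[-1]

def anls(predictions, references, threshold=0.5):
    """Average Normalized Levenshtein Similarity.

    predictions: list of predicted answer strings.
    references: list of lists of acceptable ground-truth answers.
    Per-question scores below the threshold count as 0.
    """
    total = 0.0
    for pred, refs in zip(predictions, references):
        best = 0.0
        for ref in refs:
            p, r = pred.strip().lower(), ref.strip().lower()
            denom = max(len(p), len(r)) or 1
            best = max(best, 1.0 - levenshtein(p, r) / denom)
        total += best if best >= threshold else 0.0
    return total / len(predictions) if predictions else 0.0
```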
Results and Clean-Up
After testing your models, you’ll review the ANLS scores to identify improvements. Finally, don’t forget to clean up by deleting the resources you created to avoid ongoing charges.
Conclusion
We have shown how the integration between Amazon SageMaker Unified Studio and S3 general-purpose buckets simplifies the journey from unstructured data to high-performing ML models. The direct relationship we observed between dataset size and ANLS improvement underscores the value of fine-tuning.
As a next step, consider exploring additional dataset preprocessing techniques or experimenting with different model architectures available through SageMaker JumpStart to maximize your performance outcomes.
For the complete solution code used in this blog post, please refer to this GitHub repository.
About the Author
Hazim Qudah is an AI/ML Specialist Solutions Architect at Amazon Web Services. In his role, he assists customers in building and adopting AI/ML solutions using AWS technologies. When he’s not doing that, you can find him running or playing with his dogs, Nala and Chai!
This post serves as a comprehensive guide, empowering teams to leverage existing data in S3 for machine learning tasks efficiently. Happy fine-tuning!