

Unlocking the Power of Unstructured Data: Fine-Tuning Llama 3.2 for Visual Question Answering with Amazon SageMaker and S3

Last year, AWS made significant strides by integrating Amazon SageMaker Unified Studio with Amazon Simple Storage Service (S3) general-purpose buckets. This integration streamlines the process for teams to leverage unstructured data stored in S3 for machine learning (ML) and data analytics purposes. In this blog post, we will explore how to harness this capability to fine-tune the Llama 3.2 11B Vision Instruct model for visual question answering (VQA).

What You’ll Learn

We will guide you through integrating S3 general-purpose buckets with Amazon SageMaker to enhance the performance of a large language model (LLM). Using a practical example, we will demonstrate how to ask the model a question based on an input image, such as identifying the transaction date from an itemized receipt.

For this demonstration, we’ll utilize Amazon SageMaker JumpStart to access the Llama 3.2 11B model. This model has an impressive baseline performance, achieving an Average Normalized Levenshtein Similarity (ANLS) score of 85.3% on the DocVQA dataset. However, we aim to further improve these metrics through fine-tuning with varying dataset sizes (1,000, 5,000, and 10,000 images) sourced from Hugging Face’s DocVQA dataset.

Prerequisites

To get started, ensure you complete the following prerequisites:

  1. Create an AWS Account: If you don’t have one, set up your AWS account.
  2. Set Up Amazon SageMaker Unified Studio: Create a domain using the quick setup option.
  3. Create Two Projects:
    • Data Producer Persona: For discovering and cataloging the dataset in an S3 bucket.
    • Data Consumer Persona: For consuming the dataset to fine-tune the LLM.
  4. Access to a Running SageMaker Managed MLflow Application: For experimentation and evaluation purposes.
  5. Pre-Populated Amazon S3 Bucket: Populate this with the raw DocVQA dataset.
  6. Service Quota Increase: Request to use p4de.24xlarge compute for training jobs.

Architecture Overview

The architecture for this project can be broken down into six high-level steps:

  1. Create and configure an IAM access role for Amazon S3 bucket permissions.
  2. Discover and catalog the dataset in the data producer project.
  3. Enrich the dataset with optional metadata and publish it to the SageMaker Catalog.
  4. Subscribe to the published dataset in the data consumer project.
  5. Preprocess the data to create training datasets of varying sizes.
  6. Utilize MLflow to track experimentation and evaluate results.

Solution Walkthrough

In this example, we work with the DocVQA dataset, but the same approach applies to any unstructured data relevant to your ML use case, such as chat logs, internal documents, or product reviews.

Step 1: Load and Sync Data

First, we use the Datasets API from Hugging Face to load and save the relevant dataset for our task:

import os
from datasets import load_dataset

# Create data directory
os.makedirs("data", exist_ok=True)

# Load and save train split
train_data = load_dataset("HuggingFaceM4/DocumentVQA", split="train[:10000]", cache_dir="./data")
train_data.save_to_disk("data/train")

Once we’ve loaded the dataset, we synchronize it to our S3 bucket.
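One way to script that synchronization is with boto3. This is a sketch, not the exact code from the post: the bucket name and key prefix are placeholders, and the helper simply mirrors the local directory layout into S3.

```python
# Hypothetical helper to mirror the local dataset directory to an S3 bucket.
# Bucket name and prefix are placeholders; adjust to your environment.
import os


def files_to_upload(local_dir: str, prefix: str = "docvqa/train"):
    """Return (local_path, s3_key) pairs preserving the directory layout."""
    pairs = []
    for root, _dirs, files in os.walk(local_dir):
        for fname in sorted(files):
            local_path = os.path.join(root, fname)
            rel = os.path.relpath(local_path, local_dir)
            # S3 keys always use forward slashes regardless of OS
            pairs.append((local_path, f"{prefix}/{rel}".replace(os.sep, "/")))
    return pairs


def sync_to_s3(local_dir: str, bucket: str, prefix: str = "docvqa/train"):
    import boto3  # requires AWS credentials; not exercised in this sketch

    s3 = boto3.client("s3")
    for local_path, key in files_to_upload(local_dir, prefix):
        s3.upload_file(local_path, bucket, key)
```

The AWS CLI's `aws s3 sync` (used later in this post) achieves the same result; the programmatic form is useful when the upload is part of a pipeline.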

Step 2: Discover and Catalog the Dataset

After synchronizing the dataset, we need to add access to our data and catalog it. By navigating to the Data section in our data producer project, we can add our S3 bucket location.

When adding the Amazon S3 location, provide the bucket name and the access role.

Step 3: Publish to SageMaker Catalog

Once your dataset is cataloged, you can publish it for consumption.

Step 4: Data Consumer Project

Switching to the data consumer project, you’ll subscribe to the dataset, enabling your team to access it for ML model development.

Step 5: Model Development Workflow

Now, let’s begin the model development process, which consists of fetching the dataset, preparing it for fine-tuning, and proceeding with training.
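Preparing the data for fine-tuning means converting each record into a prompt/target pair the model can train on. The sketch below assumes the field names of the HuggingFaceM4/DocumentVQA schema (`question`, `answers`, `image`) and a hypothetical prompt template; the real preprocessing lives in the linked repository.

```python
# Sketch: turn a DocVQA-style record into a VQA training example.
# Field names follow the HuggingFaceM4/DocumentVQA schema (an assumption);
# the prompt template is illustrative, not the one used in the post.

def to_training_example(record: dict) -> dict:
    """Build a prompt/completion pair for visual question answering."""
    question = record["question"].strip()
    # DocVQA provides several reference answers; take the first as the target.
    answer = record["answers"][0].strip()
    return {
        "prompt": f"Answer the question using the document image.\nQuestion: {question}",
        "completion": answer,
        "image": record.get("image"),  # PIL image or path, passed through as-is
    }


record = {
    "question": "What is the transaction date?",
    "answers": ["15/03/2021", "15 March 2021"],
    "image": "receipt_001.png",
}
example = to_training_example(record)
```

Mapping this function over the train split (for example with `datasets.Dataset.map`) yields the 1,000-, 5,000-, and 10,000-image training sets used in the experiments.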

Accessing Amazon S3 Data

You’ll configure temporary access credentials and sync the data using the AWS CLI.

aws s3 sync s3://{S3_BUCKET_NAME} ./ --profile access-grants-consumer-access-profile

Step 6: Fine-Tuning the LLM

With the preprocessed dataset ready, you can initiate the fine-tuning process using SageMaker JumpStart.

def train(name, instance_type, training_data_path, experiment_name, run):
    ...
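The post keeps `train` as a stub; one hedged sketch of what it might wrap is shown below, using the SageMaker Python SDK's `JumpStartEstimator`. The model ID, hyperparameters, and the decision to omit the MLflow wiring are assumptions here; consult the GitHub repository for the actual implementation.

```python
# Hedged sketch of a train() implementation via SageMaker JumpStart.
# Model ID and hyperparameters are assumptions, not the post's exact values.

def build_hyperparameters(epochs: int = 3, learning_rate: float = 2e-4) -> dict:
    # SageMaker passes hyperparameters to the training container as strings.
    return {"epoch": str(epochs), "learning_rate": str(learning_rate)}


def train(name, instance_type, training_data_path, experiment_name, run):
    # experiment_name and run would feed MLflow tracking (omitted in this sketch).
    from sagemaker.jumpstart.estimator import JumpStartEstimator  # needs AWS creds

    estimator = JumpStartEstimator(
        model_id="meta-vlm-llama-3-2-11b-vision-instruct",  # assumed JumpStart ID
        instance_type=instance_type,
        hyperparameters=build_hyperparameters(),
        environment={"accept_eula": "true"},
    )
    estimator.fit({"training": training_data_path}, job_name=name)
    return estimator
```

Calling `train` once per dataset size (1,000, 5,000, 10,000 images) on the p4de.24xlarge quota requested in the prerequisites produces the models compared later.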

You can evaluate your fine-tuned models using the ANLS metric, which measures the accuracy of the predicted answers against the ground truth answers.
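For readers unfamiliar with the metric, ANLS averages a thresholded normalized-Levenshtein similarity over samples, taking the best match across each sample's reference answers. A minimal implementation, assuming the conventional 0.5 threshold:

```python
# Minimal ANLS (Average Normalized Levenshtein Similarity) implementation,
# as commonly defined for DocVQA. Similarities below the 0.5 threshold score 0.

def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def anls(predictions, ground_truths, threshold=0.5):
    """Mean per-sample score; each sample keeps its best reference-answer match."""
    scores = []
    for pred, answers in zip(predictions, ground_truths):
        best = 0.0
        for ans in answers:
            p, a = pred.strip().lower(), ans.strip().lower()
            if not p and not a:
                sim = 1.0
            else:
                sim = 1.0 - levenshtein(p, a) / max(len(p), len(a))
            best = max(best, sim if sim >= threshold else 0.0)
        scores.append(best)
    return sum(scores) / len(scores)
```

An exact match scores 1.0, a near-miss scores its normalized similarity, and anything below the threshold scores 0, which keeps superficially similar but wrong answers from inflating the metric.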

Results and Clean-Up

After testing your models, you’ll review the ANLS scores to identify improvements. Finally, don’t forget to clean up by deleting the resources you created to avoid ongoing charges.

Conclusion

This post showed how the integration between Amazon SageMaker Unified Studio and S3 general-purpose buckets simplifies the path from unstructured data to a high-performing ML model. In our experiments, ANLS improved as the fine-tuning dataset grew, underscoring the value of fine-tuning.

As a next step, consider exploring additional dataset preprocessing techniques or experimenting with different model architectures available through SageMaker JumpStart to maximize your performance outcomes.

For the complete solution code used in this blog post, please refer to this GitHub repository.

About the Author

Hazim Qudah is an AI/ML Specialist Solutions Architect at Amazon Web Services. In his role, he assists customers in building and adopting AI/ML solutions using AWS technologies. When he’s not doing that, you can find him running or playing with his dogs, Nala and Chai!


This post serves as a comprehensive guide, empowering teams to leverage existing data in S3 for machine learning tasks efficiently. Happy fine-tuning!
