Fine-Tuning an Amazon Nova Model: A Hands-On Guide with Data Mixing Techniques
This guide provides a comprehensive overview for fine-tuning Amazon Nova models using the Nova Forge SDK. It brings together data preparation, training with innovative data mixing strategies, and evaluation, offering a structured playbook for your customization needs. Ideal for anyone looking to improve domain-specific applications while maintaining general model capabilities.
Fine-Tuning an Amazon Nova Model with the Nova Forge SDK
Welcome to the second part of our series on fine-tuning Amazon Nova models using the Nova Forge SDK! In this post, we will delve deep into the crucial technique of data mixing, which allows you to enhance model performance on domain-specific tasks without sacrificing the model’s general capabilities.
Why Data Mixing Matters
In our previous installment, we highlighted how blending customer data with Amazon-curated datasets can significantly enhance model performance. Combining these datasets preserved near-baseline Massive Multitask Language Understanding (MMLU) scores and delivered a 12-point F1 improvement on a Voice of Customer classification task with 1,420 categories. In contrast, training an open-source model solely on customer data significantly degraded its general capabilities.
Today, we’ll guide you through a repeatable workflow for efficiently fine-tuning your Amazon Nova model from data preparation to evaluation, with special emphasis on data mixing.
Solution Overview
This workflow comprises five key stages:
- Environment Setup: Install the Nova Forge SDK and configure AWS resources.
- Data Preparation: Load, sanitize, transform, validate, and split your training data.
- Training Configuration: Set up the SageMaker HyperPod runtime, MLflow tracking, and data mixing ratios.
- Model Training: Launch and monitor a supervised fine-tuning job with Low-Rank Adaptation (LoRA).
- Model Evaluation: Run public benchmarks and domain-specific tests against the fine-tuned model.
Prerequisites
Before proceeding, ensure you have:
- An AWS account with access to Amazon Nova Forge.
- A SageMaker HyperPod cluster provisioned with GPU instances (ml.p5.48xlarge recommended).
- An Amazon SageMaker MLflow application set up for experiment tracking.
- Appropriate IAM permissions for SageMaker, Amazon S3, and Amazon CloudWatch.
- A SageMaker Studio notebook or an equivalent Jupyter environment.
Cost Note: High-end GPU instances can be expensive. We suggest starting with a brief validation run (e.g., max_steps=5) before committing to a full training execution.
Step 1: Install the Nova Forge SDK and Dependencies
Begin by downloading the necessary SDK and CLI tools:
curl -O https://raw.githubusercontent.com/aws-samples/amazon-nova-samples/main/customization/nova-forge-hyperpod-cli-installation/install_hp_cli.sh
bash install_hp_cli.sh
Next, install the Nova Forge SDK and dependencies in your virtual environment:
pip install --upgrade botocore awscli amzn-nova-forge datasets huggingface_hub pandas pyarrow
Activate your virtual environment and set it up in your Jupyter notebook:
source ~/hyperpod-cli-venv/bin/activate
pip install ipykernel
python -m ipykernel install --user --name=hyperpod-cli-venv --display-name="Forge (hyperpod-cli-venv)"
Verify your installation:
from amzn_nova_forge import *
print("SDK imported successfully")
Step 2: Configure AWS Resources
Create an S3 bucket for storing your training data and model outputs. Grant your HyperPod execution role access to this bucket.
Here’s an example to set this up:
import boto3
import time
import json

TIMESTAMP = int(time.time())
S3_BUCKET = f"nova-forge-customisation-{TIMESTAMP}"

s3 = boto3.client("s3")

# Create the S3 bucket (outside us-east-1, also pass
# CreateBucketConfiguration={"LocationConstraint": "<your-region>"})
s3.create_bucket(Bucket=S3_BUCKET)

# Grant the HyperPod execution role access (replace HYPERPOD_ROLE_ARN with your role's ARN)
bucket_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowHyperPodAccess",
            "Effect": "Allow",
            "Principal": {"AWS": "HYPERPOD_ROLE_ARN"},
            "Action": ["s3:*"],  # Scope this down for production use
            "Resource": [
                f"arn:aws:s3:::{S3_BUCKET}",
                f"arn:aws:s3:::{S3_BUCKET}/*",
            ],
        }
    ],
}
s3.put_bucket_policy(Bucket=S3_BUCKET, Policy=json.dumps(bucket_policy))
Step 3: Prepare Your Training Dataset
The Nova Forge SDK accepts several formats, including JSONL, JSON, and CSV. For this guide, let’s use the MedReason dataset from Hugging Face.
Download and Sanitize the Data
Given the model’s internal chat template, sanitize your data to prevent misinterpretation of tokens. Here’s how to download and sanitize your dataset:
from huggingface_hub import hf_hub_download
import pandas as pd
import json

# Download the dataset
jsonl_path = hf_hub_download(repo_id="UCSC-VLAA/MedReason", filename="ours_quality_33000.jsonl")
df = pd.read_json(jsonl_path, lines=True)

def sanitize_text(text):
    # Adjust token replacements here
    return text.strip()  # Placeholder for your logic

# Create sanitized JSONL
with open("training_data.jsonl", "w") as f:
    for _, row in df.iterrows():
        f.write(json.dumps({
            "question": sanitize_text(row["question"]),
            "answer": sanitize_text(row["answer"]),
        }) + "\n")
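The exact replacements depend on your target model's internal chat template. As a minimal sketch of what a concrete `sanitize_text` might look like, assuming the template reserves bracketed control tokens (the token names below are hypothetical; substitute the actual reserved tokens for your model):

```python
import re

# Hypothetical control tokens reserved by the chat template; replace with
# the actual reserved tokens for your target model.
RESERVED_TOKENS = ["[INST]", "[/INST]", "<|system|>", "<|end|>"]

def sanitize_text(text: str) -> str:
    """Strip reserved template tokens and collapse the whitespace left behind."""
    for token in RESERVED_TOKENS:
        text = text.replace(token, " ")
    # Collapse runs of whitespace created by the removals
    return re.sub(r"\s+", " ", text).strip()
```

Removing rather than escaping reserved tokens is the simpler choice here; if the tokens carry meaning in your data, escaping them may be more appropriate.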
Next, validate your data with the Nova Forge SDK to ensure it conforms to the expected structure.
Load, Transform, and Validate with the SDK
# S3_DATA_PATH is your dataset prefix, for example f"s3://{S3_BUCKET}/data"
loader = JSONLDatasetLoader(question="question", answer="answer")
loader.load("training_data.jsonl")
loader.transform(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)
loader.validate(method=TrainingMethod.SFT_LORA, model=Model.NOVA_LITE_2)
train_path = loader.save(f"{S3_DATA_PATH}/train.jsonl")
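The solution overview also calls for splitting your data. As a minimal sketch using pandas (the 95/5 split ratio and output file names are assumptions, not SDK requirements):

```python
import json
import pandas as pd

def split_jsonl(path: str, val_fraction: float = 0.05, seed: int = 42):
    """Shuffle a JSONL dataset and write separate train/validation files."""
    df = pd.read_json(path, lines=True)
    # Sample a held-out validation set, then train on the remainder
    val = df.sample(frac=val_fraction, random_state=seed)
    train = df.drop(val.index)
    train.to_json("train.jsonl", orient="records", lines=True)
    val.to_json("val.jsonl", orient="records", lines=True)
    return len(train), len(val)
```

Fixing `random_state` keeps the split reproducible across runs, which matters when you compare mixing configurations later.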
Step 4: Configure and Launch Training with Data Mixing
Data mixing blends your domain-specific training data with Amazon-curated datasets during fine-tuning, preserving the model’s general intelligence.
Configure Training Methods
We will utilize supervised fine-tuning (SFT) with LoRA, which updates only a small set of low-rank adapter weights and is therefore far more parameter-efficient than full-rank fine-tuning:
customizer = NovaModelCustomizer(
    model=Model.NOVA_LITE_2,
    method=TrainingMethod.SFT_LORA,
    infra=runtime,
    data_s3_path=f"{S3_DATA_PATH}/train.jsonl",
    output_s3_path=f"{S3_OUTPUT_PATH}/",
    mlflow_monitor=mlflow_monitor,
    data_mixing_enabled=True,  # Enables data mixing
)
Tune the Data Mixing Configuration
Adjusting the data mixing ratios is crucial. For example:
customizer.set_data_mixing_config({
    "customer_data_percent": 50,
    ...  # Other configurations defined here
})
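The SDK applies the mix internally, but the intuition behind `customer_data_percent` can be sketched as weighted sampling from two pools. This is an illustrative model only, not the SDK's actual implementation:

```python
import random

def mix_datasets(customer, curated, customer_percent=50, seed=0):
    """Draw a blended sequence where roughly `customer_percent` of
    examples come from the customer pool (illustrative only)."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(len(customer) + len(curated)):
        # Each draw picks a pool with probability proportional to the mix ratio
        pool = customer if rng.random() < customer_percent / 100 else curated
        mixed.append(pool[rng.randrange(len(pool))])
    return mixed
```

Raising `customer_percent` shifts training signal toward your domain at the cost of general-capability retention, which is exactly the trade-off the evaluation step below measures.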
Launch Training Job
Set key training hyperparameters and launch the job:
training_config = {
    "lr": 1e-5,
    "warmup_steps": 2,
    "global_batch_size": 32,
    "max_length": 65536,
    "max_steps": 5,
}
training_result = customizer.train(
    job_name="nova-forge-sft-datamix",
    overrides=training_config,
)
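The `warmup_steps` setting ramps the learning rate up gradually before holding it at `lr`. The SDK's exact schedule isn't documented here; a common linear-warmup scheme looks like this sketch:

```python
def lr_at_step(step: int, base_lr: float = 1e-5, warmup_steps: int = 2) -> float:
    """Linear warmup to base_lr over warmup_steps, then constant (illustrative)."""
    if step < warmup_steps:
        # Ramp linearly: step 0 gets base_lr/warmup_steps, step 1 gets 2x that, ...
        return base_lr * (step + 1) / warmup_steps
    return base_lr
```

With `warmup_steps=2` and `lr=1e-5`, the first optimizer step runs at 5e-6 and every step from the second onward at the full 1e-5.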
Monitor Training Progress
Track the training status and logs:
print(training_result.get_job_status())
customizer.get_logs(limit=50)
Step 5: Evaluate the Fine-Tuned Model
Evaluation helps determine whether the fine-tuned model maintains the balance between domain performance and general intelligence.
Run Evaluations
Utilize various evaluation methods, including public benchmarks and your own test set, to assess model performance:
mmlu_result = evaluator.evaluate(job_name="eval-mmlu", eval_task=EvaluationTask.MMLU, model_path=checkpoint_path)
Check Results and Retrieve Outputs
After evaluations, retrieve S3 paths for detailed results:
print(mmlu_result.eval_output_path)
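Once both the benchmark and domain evaluations have run, it helps to tabulate the score change on each axis against the base model. A small sketch (the scores below are placeholders, not measured results):

```python
def score_delta(baseline: dict, tuned: dict) -> dict:
    """Per-benchmark score change between the base and fine-tuned models."""
    return {k: round(tuned[k] - baseline[k], 2) for k in baseline}

deltas = score_delta(
    {"mmlu": 70.1, "domain_f1": 55.0},  # placeholder baseline scores
    {"mmlu": 69.8, "domain_f1": 67.0},  # placeholder fine-tuned scores
)
```

A healthy data-mixing result looks like this shape: a large positive delta on the domain metric with only a small movement on MMLU.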
Best Practices
- Start with Default Mixing Ratios: They provide a balanced trade-off initially.
- Evaluate on Both Axes: Always conduct MMLU benchmarks alongside domain-specific evaluations.
- Utilize MLflow: Compare experiments easily and document successful configurations.
- Iterate the Mix: Tuning data mixing often yields more significant improvements than adjusting hyperparameters.
- Use LoRA: Its efficiency makes it the preferred starting point; switch to full-rank SFT as needed.
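The efficiency argument behind the last point is easy to quantify: for a weight matrix of shape d x k, LoRA trains two low-rank factors of r*(d+k) parameters instead of the full d*k update. A quick sketch (the 4096-dimension and rank-16 values are illustrative, not Nova's actual configuration):

```python
def lora_param_counts(d: int, k: int, r: int):
    """Trainable parameters: full update (d*k) vs. LoRA factors (r*(d+k))."""
    full = d * k
    lora = r * (d + k)
    return full, lora

full, lora = lora_param_counts(4096, 4096, 16)
# Full matrix: 16,777,216 trainable params; LoRA at rank 16: 131,072
```

At rank 16 on a 4096x4096 matrix, LoRA trains under 1% of the parameters of a full-rank update, which is why it's the recommended starting point.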
Conclusion
This guide empowers you to effectively fine-tune Amazon Nova models using the Nova Forge SDK with data mixing. With a structured workflow and actionable insights, you can maximize model performance while maintaining its general capabilities.
For a deeper dive, refer to the Nova Forge Developer Guide for comprehensive details and check out the full API reference!
About the Authors
Gideon Teo, Andrew Smith, Timothy Downs, and Krishna Neupane are AWS experts who specialize in AI/ML technologies. They collaborate to assist businesses in leveraging cutting-edge AI solutions for impactful outcomes.
Feel free to reach out for any queries, and stay tuned for the next installment in our series!