Crafting Specialized AI While Preserving Intelligence: Nova Forge Data Mixing Unleashed

The Challenge of Customer Feedback Classification

Evaluation Methodology

Test Overview

In-Domain Task Evaluation: Voice of Customer Classification

Key Findings and Practical Recommendations

Conclusion

About the Authors

Unlocking the Power of Language Models: Nova Forge’s Solution for Specialized Tasks

Large Language Models (LLMs) have transformed the landscape of artificial intelligence, providing remarkable capabilities for general tasks. However, they often falter in specialized roles that demand a nuanced understanding of proprietary data, internal processes, and industry-specific terminology. This is where Supervised Fine-Tuning (SFT) comes into play, adapting LLMs to meet organizational needs.

The Challenge of Specialization: Balancing Expertise and General Intelligence

SFT can be implemented through two methodologies:

  1. Parameter-Efficient Fine-Tuning (PEFT): This approach updates only a subset of model parameters, allowing for faster training and reduced computational costs while still delivering reasonable performance improvements.

  2. Full-rank SFT: In contrast, this method updates all model parameters, integrating more domain knowledge but often leading to a phenomenon known as catastrophic forgetting: as models specialize, they risk losing foundational capabilities such as reasoning, instruction following, and broad knowledge. This forces organizations to choose between domain expertise and general intelligence, limiting a model’s utility across diverse enterprise use cases.
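The parameter-count gap between these two approaches can be made concrete. The sketch below is illustrative only (it is not the Nova Forge API): it compares the trainable parameters of full-rank SFT against a LoRA-style PEFT adapter on a single weight matrix, using a hypothetical 4096x4096 layer and rank 16.

```python
# Illustrative sketch: trainable-parameter counts for full-rank SFT vs. a
# LoRA-style PEFT adapter on one weight matrix. Layer size and rank are
# assumptions chosen for illustration, not Nova Forge internals.
d_in, d_out, rank = 4096, 4096, 16

# Full-rank SFT updates every entry of the weight matrix W.
full_trainable = d_in * d_out

# LoRA freezes W and trains a low-rank update W' = W + B @ A,
# where A has shape (rank, d_in) and B has shape (d_out, rank).
lora_trainable = rank * d_in + d_out * rank

print(f"full-rank trainable params: {full_trainable:,}")   # 16,777,216
print(f"LoRA trainable params:      {lora_trainable:,}")   # 131,072
print(f"reduction factor:           {full_trainable / lora_trainable:.0f}x")  # 128x
```

The 128x reduction in trainable parameters is what buys PEFT its speed and lower cost; the trade-off, as noted above, is that less domain knowledge can be absorbed than with a full-rank update.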

Enter Nova Forge: A Game Changer for Custom Model Development

Amazon Nova Forge offers a compelling solution to the aforementioned challenges. This service enables organizations to develop custom frontier models using initial model checkpoints, seamlessly blending proprietary data with Amazon Nova-curated training data. Moreover, customers can securely host their custom models on AWS, providing a robust framework for building solutions tailored to unique business needs.

A Real-World Test: Voice of Customer (VOC) Classification Task

To showcase Nova Forge’s efficacy, the AWS China Applied Science team conducted a comprehensive evaluation using a VOC classification task: classifying over 16,000 customer comments into a four-level label hierarchy with 1,420 leaf categories. The evaluation revealed two key advantages of Nova Forge’s data mixing approach:

  • In-Domain Performance Gains: An improvement of roughly 17 percentage points in F1 score (0.387 to 0.5537).
  • Preserved General Capabilities: Maintained near-baseline scores for Massive Multitask Language Understanding (MMLU) and instruction-following abilities post-fine-tuning.

The Challenge: Real-World Customer Feedback Classification

Consider a typical scenario in a large ecommerce company. The customer experience team receives thousands of comments daily that cover feedback on product quality, delivery experiences, and customer service decisions. Efficient operation requires an LLM capable of automatically classifying each comment into actionable categories, ensuring that issues reach the appropriate teams. This demands domain specialization.

Simultaneously, the same LLM must possess:

  • The ability to generate customer-facing responses with effective communication skills.
  • The capability to perform data analysis requiring logical reasoning.
  • The skill to draft documentation in a specific format.

This dual requirement underscores the necessity for an LLM to uphold broad general capabilities while also being specialized.
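To make the routing scenario concrete, here is a hypothetical sketch of how a four-level VOC leaf label might be validated and routed to an owning team. The taxonomy entries and team names are invented for illustration; the actual evaluation used a hierarchy with 1,420 leaf categories.

```python
# Hypothetical four-level VOC taxonomy (labels and teams invented for
# illustration; the real taxonomy in the evaluation has 1,420 leaves).
VOC_TAXONOMY = {
    "Product > Quality > Defect > Broken on arrival": "quality-team",
    "Logistics > Delivery > Delay > Late by 3+ days": "logistics-team",
    "Service > Support > Refund > Refund not received": "service-team",
}

def route_comment(predicted_label: str) -> str:
    """Validate an LLM-predicted leaf label and return the owning team."""
    if predicted_label not in VOC_TAXONOMY:
        # Unparseable or hallucinated labels fall back to human triage.
        return "manual-review"
    return VOC_TAXONOMY[predicted_label]

print(route_comment("Logistics > Delivery > Delay > Late by 3+ days"))  # logistics-team
print(route_comment("Product > Nonexistent > X > Y"))                   # manual-review
```

The fallback branch matters in practice: an LLM that has forgotten instruction following may emit labels outside the taxonomy, which is exactly the failure mode the data-mixing evaluation below probes.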

Evaluation Methodology: Measuring Both Specialization and General Abilities

To assess Nova Forge’s effectiveness, a dual-evaluation framework was established to gauge performance across two dimensions:

  1. Domain-Specific Performance: A VOC dataset reflecting real-world customer reviews was employed, comprising 14,511 training samples and 861 test samples. The extreme class imbalance typical in real-world scenarios added complexity to classification accuracy.

  2. General-Purpose Capabilities: The public test set of the MMLU benchmark was used to check whether domain-specific gains came at the cost of foundational model behaviors such as broad knowledge and instruction following.
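The domain-specific dimension is scored with precision, recall, and F1 over predicted versus gold leaf labels. A minimal sketch of macro-averaged versions of these metrics is below; the data here is a toy example, whereas the real evaluation used 861 held-out VOC test samples (the post does not specify the averaging scheme, so macro-averaging is an assumption).

```python
# Macro-averaged precision/recall/F1 for a multi-class classification task.
# Toy data; averaging scheme is an assumption for illustration.
from collections import Counter

def macro_prf(gold, pred):
    labels = set(gold) | set(pred)
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label counts a false positive
            fn[g] += 1  # gold label counts a false negative
    precisions, recalls, f1s = [], [], []
    for label in labels:
        p = tp[label] / (tp[label] + fp[label]) if (tp[label] + fp[label]) else 0.0
        r = tp[label] / (tp[label] + fn[label]) if (tp[label] + fn[label]) else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p); recalls.append(r); f1s.append(f)
    n = len(labels)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

gold = ["delay", "defect", "refund", "delay"]
pred = ["delay", "refund", "refund", "delay"]
p, r, f1 = macro_prf(gold, pred)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

With extreme class imbalance across 1,420 leaves, macro-averaged scores are harsh: a class with zero correct predictions (like "defect" above) drags the average down regardless of how rare it is, which helps explain the sub-0.4 baseline F1 scores reported below.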

In-Domain Task Evaluation

We first established baseline performance on the VOC classification task. The base models compared as follows:

| Model | Precision | Recall | F1-Score |
| --- | --- | --- | --- |
| Nova 2 Lite | 0.4596 | 0.3627 | 0.387 |
| Qwen3-30B-A3B | 0.4567 | 0.3864 | 0.394 |

Both base models performed comparably on this fine-grained classification task, illustrating its inherent difficulty.

Supervised Fine-Tuning

Both models improved markedly after full-parameter SFT on the customer VOC data:

| Model | Training Data | Precision | Recall | F1-Score |
| --- | --- | --- | --- | --- |
| Nova 2 Lite | None (baseline) | 0.4596 | 0.3627 | 0.387 |
| Nova 2 Lite | Customer data only | 0.6048 | 0.5266 | 0.5537 |
| Qwen3-30B | Customer data only | 0.5933 | 0.5333 | 0.5552 |

After fine-tuning, Nova 2 Lite achieved a remarkable F1 improvement from 0.387 to 0.5537, validating the effectiveness of Nova’s full-parameter SFT for complex enterprise tasks.

Preserving General Capabilities: The MMLU Benchmark

While fine-tuning can yield significant domain-specific gains, it often comes at the expense of general capabilities. The evaluation of Nova 2 Lite revealed the following:

| Model | Training Data | VOC F1-Score | MMLU Accuracy |
| --- | --- | --- | --- |
| Nova 2 Lite | None (baseline) | 0.38 | 0.75 |
| Nova 2 Lite | Customer data only | 0.55 | 0.47 |
| Nova 2 Lite | 75% customer + 25% Nova data | 0.50 | 0.74 |
| Qwen3-30B | Customer data only | 0.55 | 0.0038 |

Fine-tuning on customer data alone caused MMLU accuracy to drop sharply for both models (to 0.47 for Nova 2 Lite, and to near zero for Qwen3-30B), while mixing in 25% Nova-curated data kept general performance close to the 0.75 baseline at only a modest cost in VOC F1.
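The 75%/25% mixing idea can be sketched as a simple sampling scheme. This is a hedged illustration of the concept only: the actual composition of the Nova-curated corpus and the sampling strategy Nova Forge uses internally are not disclosed, so the function below is an assumption.

```python
# Illustrative sketch of mixing customer training data with a curated corpus
# at a fixed ratio. The sampling scheme and "curated" corpus are assumptions;
# Nova Forge performs its own data mixing internally.
import random

def mix_datasets(customer, curated, customer_ratio=0.75, total=None, seed=0):
    """Build a shuffled training set with the requested source proportions."""
    rng = random.Random(seed)
    total = total or len(customer)
    n_customer = round(total * customer_ratio)
    n_curated = total - n_customer
    mixed = rng.choices(customer, k=n_customer) + rng.choices(curated, k=n_curated)
    rng.shuffle(mixed)
    return mixed

customer = [{"src": "customer", "id": i} for i in range(1000)]
curated = [{"src": "curated", "id": i} for i in range(1000)]
mixed = mix_datasets(customer, curated, total=400)
n_cust = sum(1 for ex in mixed if ex["src"] == "customer")
print(f"{n_cust}/{len(mixed)} examples from customer data")  # 300/400
```

Interleaving general-purpose examples alongside domain data during full-parameter SFT is a common mitigation for catastrophic forgetting, and the table above suggests even a 25% share is enough to keep MMLU near baseline.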

Key Findings and Practical Recommendations

The evaluation of Nova Forge illustrates that when foundational models are strong, full-parameter SFT can yield impressive gains for enterprise classification tasks. However, it’s essential to consider how fine-tuning can lead to catastrophic forgetting, diminishing general-purpose capabilities.

Recommendations for Using Nova Forge:

  1. Use Supervised Fine-Tuning: Maximize domain performance for complex tasks.
  2. Apply Nova Data Mixing: Especially when anticipating multi-functional workflows, to mitigate the risk of catastrophic forgetting.

These practices can strike the right balance between model specialization and broader functionality, enabling effective deployment in enterprise contexts.

Conclusion

This post has demonstrated how organizations can leverage Nova Forge’s data mixing capabilities to create specialized AI models while preserving general intelligence. Nova Forge not only enhances task-specific performance but also ensures models remain stable and reliable across various enterprise applications. For those looking to embark on this journey, the Nova Forge Developer Guide is an excellent resource to get started.


In the world of artificial intelligence and machine learning, the fusion of specialization and broad capability is essential for driving impactful results. By taking advantage of innovative solutions like Nova Forge, organizations can position themselves for success in an increasingly complex landscape.
