Enhancing Large Language Models: Addressing Specialized Task Limitations with Supervised Fine-Tuning and Nova Forge
- The Challenge of Customer Feedback Classification
- Evaluation Methodology
- Test Overview
- In-Domain Task Evaluation: Voice of Customer Classification
- Key Findings and Practical Recommendations
- Conclusion
- About the Authors
Unlocking the Power of Language Models: Nova Forge’s Solution for Specialized Tasks
Large Language Models (LLMs) have transformed the landscape of artificial intelligence, providing remarkable capabilities for general tasks. However, they often falter in specialized roles that demand a nuanced understanding of proprietary data, internal processes, and industry-specific terminology. This is where Supervised Fine-Tuning (SFT) comes into play, adapting LLMs to meet organizational needs.
The Challenge of Specialization: Balancing Expertise and General Intelligence
SFT can be implemented through two methodologies:
- Parameter-Efficient Fine-Tuning (PEFT): This approach updates only a subset of model parameters, allowing for faster training and reduced computational costs while still delivering reasonable performance improvements.
- Full-rank SFT: In contrast, this method updates all model parameters, integrating more domain knowledge but often leading to a phenomenon known as catastrophic forgetting. As models specialize, they risk losing foundational capabilities such as reasoning, instruction-following, and broad knowledge. This creates a dilemma for organizations: choosing between domain expertise and general intelligence limits a model’s utility across diverse enterprise use cases.
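The parameter-count gap between the two approaches can be made concrete with a quick back-of-the-envelope sketch. In a LoRA-style PEFT setup, the base weight matrix W stays frozen and only two low-rank factors are trained, with the effective weight W + B @ A. The dimensions below are invented for illustration, not taken from any Nova model:

```python
# Compare trainable-parameter counts for full-rank SFT versus a
# LoRA-style PEFT adapter on a single weight matrix.
# W is (d_out x d_in); the adapter factors are B (d_out x r) and A (r x d_in).
d_in, d_out, rank = 4096, 4096, 16

full_params = d_out * d_in                 # full-rank SFT: every entry of W is trainable
peft_params = d_out * rank + rank * d_in   # PEFT: only the adapter factors train

print(full_params, peft_params)                               # 16777216 131072
print(f"adapter is {peft_params / full_params:.2%} of full")  # adapter is 0.78% of full
```

Even at a modest rank, the adapter trains well under 1% of the parameters that full-rank SFT touches, which is the source of PEFT's speed and cost advantage.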
Enter Nova Forge: A Game Changer for Custom Model Development
Amazon Nova Forge offers a compelling solution to the aforementioned challenges. This service enables organizations to develop custom frontier models using initial model checkpoints, seamlessly blending proprietary data with Amazon Nova-curated training data. Moreover, customers can securely host their custom models on AWS, providing a robust framework for building solutions tailored to unique business needs.
A Real-World Test: Voice of Customer (VOC) Classification Task
To showcase Nova Forge’s efficacy, the AWS China Applied Science team undertook a comprehensive evaluation utilizing a VOC classification task. This task involved classifying over 16,000 customer comments into a meticulously structured four-level label hierarchy encompassing 1,420 leaf categories. The evaluation revealed two key advantages of Nova Forge’s data mixing approach:
- In-Domain Performance Gains: An improvement of roughly 17 F1 points (0.387 to 0.5537) on the VOC classification task.
- Preserved General Capabilities: Maintained near-baseline scores for Massive Multitask Language Understanding (MMLU) and instruction-following abilities post-fine-tuning.
The Challenge: Real-World Customer Feedback Classification
Consider a typical scenario in a large ecommerce company. The customer experience team receives thousands of comments daily that cover feedback on product quality, delivery experiences, and customer service decisions. Efficient operation requires an LLM capable of automatically classifying each comment into actionable categories, ensuring that issues reach the appropriate teams. This demands domain specialization.
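To make the classification target concrete, a toy slice of such a four-level label hierarchy might look like the following. The category names here are invented for illustration; the real taxonomy described above has 1,420 leaf categories:

```python
# Hypothetical slice of a four-level VOC label hierarchy.
# Each classification maps a comment to a full level-1 > level-2 > level-3 > leaf path.
taxonomy = {
    "Delivery": {
        "Speed": {
            "Late arrival": ["Delayed by carrier", "Delayed at warehouse"],
        },
    },
    "Product": {
        "Quality": {
            "Damaged item": ["Broken on arrival", "Missing parts"],
        },
    },
}

def leaf_paths(tree):
    """Enumerate every four-level path down to a leaf category."""
    for l1, l2s in tree.items():
        for l2, l3s in l2s.items():
            for l3, leaves in l3s.items():
                for leaf in leaves:
                    yield (l1, l2, l3, leaf)

paths = list(leaf_paths(taxonomy))
print(len(paths))  # 4 leaf paths in this toy slice
```

Routing a comment to "the appropriate team" then amounts to predicting one of these full paths, which is why the task is so much harder than flat classification over a handful of labels.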
Simultaneously, the same LLM must possess:
- The ability to generate customer-facing responses with effective communication skills.
- The capability to perform data analysis requiring logical reasoning.
- The skill to draft documentation in a specific format.
This dual requirement underscores the necessity for an LLM to uphold broad general capabilities while also being specialized.
Evaluation Methodology: Measuring Both Specialization and General Abilities
To assess Nova Forge’s effectiveness, a dual-evaluation framework was established to gauge performance across two dimensions:
- Domain-Specific Performance: A VOC dataset reflecting real-world customer reviews was employed, comprising 14,511 training samples and 861 test samples. The extreme class imbalance typical of real-world feedback added further difficulty.
- General-Purpose Capabilities: Using the public test set of the MMLU benchmark, we weighed domain-performance gains against potential degradation of foundational model behaviors.
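The two dimensions call for different metrics: F1 for the imbalanced VOC task and plain accuracy for MMLU. A minimal sketch of both, in pure Python (the macro variant averages per-class F1 so that rare classes count equally, which matters under heavy class imbalance; the post's tables do not specify which averaging was used):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average over classes."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1s = []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def accuracy(y_true, y_pred):
    """Fraction of exact matches — the usual MMLU metric."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# toy check on three samples
print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))  # ~0.667
```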
In-Domain Task Evaluation
We first established baseline performance for the VOC classification task. Here’s a snapshot of how the evaluated models compare:
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| Nova 2 Lite | 0.4596 | 0.3627 | 0.387 |
| Qwen3-30B-A3B | 0.4567 | 0.3864 | 0.394 |
Both models showed comparable performance on this fine-grained classification task, illustrating its inherent difficulty.
Supervised Fine-Tuning
Applying full-parameter SFT on the customer VOC data yielded clear improvements:
| Model | Training Data | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Nova 2 Lite | None (baseline) | 0.4596 | 0.3627 | 0.387 |
| Nova 2 Lite | Customer data only | 0.6048 | 0.5266 | 0.5537 |
| Qwen3-30B-A3B | Customer data only | 0.5933 | 0.5333 | 0.5552 |
After fine-tuning, Nova 2 Lite achieved a remarkable F1 improvement from 0.387 to 0.5537, validating the effectiveness of Nova’s full-parameter SFT for complex enterprise tasks.
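A quick check on those numbers clarifies how the headline gain should be read: about 17 F1 points in absolute terms, which is roughly a 43% relative improvement over the baseline:

```python
# Absolute vs. relative F1 gain for Nova 2 Lite after full-parameter SFT,
# using the scores reported in the table above.
baseline_f1, tuned_f1 = 0.387, 0.5537

abs_gain = tuned_f1 - baseline_f1   # absolute gain in F1 points
rel_gain = abs_gain / baseline_f1   # gain relative to the baseline score

print(round(abs_gain, 4))  # 0.1667  (~17 points)
print(round(rel_gain, 2))  # 0.43    (~43% relative)
```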
Preserving General Capabilities: The MMLU Benchmark
While fine-tuning can yield significant domain-specific gains, it often comes at the expense of general capabilities. The evaluation of Nova 2 Lite revealed the following:
| Model | Training Data | VOC F1-Score | MMLU Accuracy |
|---|---|---|---|
| Nova 2 Lite | None (baseline) | 0.38 | 0.75 |
| Nova 2 Lite | Customer data only | 0.55 | 0.47 |
| Nova 2 Lite | 75% customer + 25% Nova data | 0.5 | 0.74 |
| Qwen3-30B-A3B | Customer data only | 0.55 | 0.0038 |
Notably, while fine-tuning on customer data alone caused MMLU accuracy to drop sharply for both the Nova and Qwen models, mixing in Nova-curated data kept general performance essentially at baseline (0.74 versus 0.75).
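The 75%/25% mix in the table can be sketched as a simple sampling step. This is only an illustration of the idea of blending proprietary and curated data; Nova Forge's actual mixing strategy is internal and may differ:

```python
import random

def mix_training_data(customer, nova, customer_frac=0.75, seed=0):
    """Build a training set that is customer_frac customer samples,
    topping up the remainder with curated general-purpose samples."""
    rng = random.Random(seed)
    # Number of curated samples needed so customer data is customer_frac of the mix.
    n_nova = round(len(customer) * (1 - customer_frac) / customer_frac)
    mixed = list(customer) + rng.sample(nova, min(n_nova, len(nova)))
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Toy corpora with placeholder sample IDs.
customer = [f"voc_{i}" for i in range(75)]
nova = [f"nova_{i}" for i in range(100)]
mixed = mix_training_data(customer, nova)
print(len(mixed))  # 100: 75 customer + 25 curated
```

The curated portion acts as a rehearsal set: by continuing to see general-purpose examples during fine-tuning, the model is less prone to the catastrophic forgetting seen in the customer-data-only rows.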
Key Findings and Practical Recommendations
The evaluation of Nova Forge illustrates that when foundational models are strong, full-parameter SFT can yield impressive gains for enterprise classification tasks. However, it’s essential to consider how fine-tuning can lead to catastrophic forgetting, diminishing general-purpose capabilities.
Recommendations for Using Nova Forge:
- Use Supervised Fine-Tuning: Maximize domain performance for complex tasks.
- Apply Nova Data Mixing: Especially when anticipating multi-functional workflows, to mitigate the risk of catastrophic forgetting.
These practices can strike the right balance between model specialization and broader functionality, enabling effective deployment in enterprise contexts.
Conclusion
This post has demonstrated how organizations can leverage Nova Forge’s data mixing capabilities to create specialized AI models while preserving general intelligence. Nova Forge not only enhances task-specific performance but also ensures models remain stable and reliable across various enterprise applications. For those looking to embark on this journey, the Nova Forge Developer Guide is an excellent resource to get started.
In the world of artificial intelligence and machine learning, the fusion of specialization and broad capability is essential for driving impactful results. By taking advantage of innovative solutions like Nova Forge, organizations can position themselves for success in an increasingly complex landscape.