Enhancing Large Language Models: Addressing Specialized Task Limitations with Supervised Fine-Tuning and Nova Forge
- The Challenge of Customer Feedback Classification
- Evaluation Methodology
- Test Overview
- In-Domain Task Evaluation: Voice of Customer Classification
- Key Findings and Practical Recommendations
- Conclusion
- About the Authors
Unlocking the Power of Language Models: Nova Forge’s Solution for Specialized Tasks
Large Language Models (LLMs) have transformed the landscape of artificial intelligence, providing remarkable capabilities for general tasks. However, they often falter in specialized roles that demand a nuanced understanding of proprietary data, internal processes, and industry-specific terminology. This is where Supervised Fine-Tuning (SFT) comes into play, adapting LLMs to meet organizational needs.
The Challenge of Specialization: Balancing Expertise and General Intelligence
SFT can be implemented through two methodologies:
- Parameter-Efficient Fine-Tuning (PEFT): This approach updates only a subset of model parameters, allowing for faster training and reduced computational costs while still delivering reasonable performance improvements.
- Full-rank SFT: In contrast, this method updates all model parameters, integrating more domain knowledge but often leading to a phenomenon known as catastrophic forgetting. As models specialize, they risk losing foundational capabilities such as reasoning, instruction-following, and broad knowledge. This creates a dilemma for organizations: choosing between domain expertise and general intelligence limits a model’s utility across diverse enterprise use cases.
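The parameter-count gap between the two approaches can be made concrete with a quick back-of-the-envelope sketch. In a LoRA-style PEFT setup, the base weight matrix W stays frozen and only two low-rank factors are trained, with the effective weight W + B @ A. The dimensions below are invented for illustration, not taken from any Nova model:

```python
# Compare trainable-parameter counts for full-rank SFT versus a
# LoRA-style PEFT adapter on a single weight matrix.
# W is (d_out x d_in); the adapter factors are B (d_out x r) and A (r x d_in).
d_in, d_out, rank = 4096, 4096, 16

full_params = d_out * d_in                 # full-rank SFT: every entry of W is trainable
peft_params = d_out * rank + rank * d_in   # PEFT: only the adapter factors train

print(full_params, peft_params)                               # 16777216 131072
print(f"adapter is {peft_params / full_params:.2%} of full")  # adapter is 0.78% of full
```

Even at a modest rank, the adapter trains well under 1% of the parameters that full-rank SFT touches, which is the source of PEFT's speed and cost advantage.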
Enter Nova Forge: A Game Changer for Custom Model Development
Amazon Nova Forge offers a compelling solution to the aforementioned challenges. This service enables organizations to develop custom frontier models using initial model checkpoints, seamlessly blending proprietary data with Amazon Nova-curated training data. Moreover, customers can securely host their custom models on AWS, providing a robust framework for building solutions tailored to unique business needs.
A Real-World Test: Voice of Customer (VOC) Classification Task
To showcase Nova Forge’s efficacy, the AWS China Applied Science team undertook a comprehensive evaluation utilizing a VOC classification task. This task involved classifying over 16,000 customer comments into a meticulously structured four-level label hierarchy encompassing 1,420 leaf categories. The evaluation revealed two key advantages of Nova Forge’s data mixing approach:
- In-Domain Performance Gains: An improvement of roughly 17 F1 points (0.387 to 0.5537) on the VOC classification task.
- Preserved General Capabilities: Maintained near-baseline scores for Massive Multitask Language Understanding (MMLU) and instruction-following abilities post-fine-tuning.
The Challenge: Real-World Customer Feedback Classification
Consider a typical scenario in a large ecommerce company. The customer experience team receives thousands of comments daily that cover feedback on product quality, delivery experiences, and customer service decisions. Efficient operation requires an LLM capable of automatically classifying each comment into actionable categories, ensuring that issues reach the appropriate teams. This demands domain specialization.
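To make the classification target concrete, a toy slice of such a four-level label hierarchy might look like the following. The category names here are invented for illustration; the real taxonomy described above has 1,420 leaf categories:

```python
# Hypothetical slice of a four-level VOC label hierarchy.
# Each classification maps a comment to a full level-1 > level-2 > level-3 > leaf path.
taxonomy = {
    "Delivery": {
        "Speed": {
            "Late arrival": ["Delayed by carrier", "Delayed at warehouse"],
        },
    },
    "Product": {
        "Quality": {
            "Damaged item": ["Broken on arrival", "Missing parts"],
        },
    },
}

def leaf_paths(tree):
    """Enumerate every four-level path down to a leaf category."""
    for l1, l2s in tree.items():
        for l2, l3s in l2s.items():
            for l3, leaves in l3s.items():
                for leaf in leaves:
                    yield (l1, l2, l3, leaf)

paths = list(leaf_paths(taxonomy))
print(len(paths))  # 4 leaf paths in this toy slice
```

Routing a comment to "the appropriate team" then amounts to predicting one of these full paths, which is why the task is so much harder than flat classification over a handful of labels.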
Simultaneously, the same LLM must possess:
- The ability to generate customer-facing responses with effective communication skills.
- The capability to perform data analysis requiring logical reasoning.
- The skill to draft documentation in a specific format.
This dual requirement underscores the necessity for an LLM to uphold broad general capabilities while also being specialized.
Evaluation Methodology: Measuring Both Specialization and General Abilities
To assess Nova Forge’s effectiveness, a dual-evaluation framework was established to gauge performance across two dimensions:
- Domain-Specific Performance: A VOC dataset reflecting real-world customer reviews was employed, comprising 14,511 training samples and 861 test samples. The extreme class imbalance typical of real-world feedback added further difficulty.
- General-Purpose Capabilities: Using the public test set of the MMLU benchmark, we weighed domain-performance gains against potential degradation of foundational model behaviors.
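The two dimensions call for different metrics: F1 for the imbalanced VOC task and plain accuracy for MMLU. A minimal sketch of both, in pure Python (the macro variant averages per-class F1 so that rare classes count equally, which matters under heavy class imbalance; the post's tables do not specify which averaging was used):

```python
from collections import defaultdict

def macro_f1(y_true, y_pred):
    """Macro-averaged F1: compute F1 per class, then average over classes."""
    tp, fp, fn = defaultdict(int), defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    f1s = []
    for c in set(y_true) | set(y_pred):
        prec = tp[c] / (tp[c] + fp[c]) if tp[c] + fp[c] else 0.0
        rec = tp[c] / (tp[c] + fn[c]) if tp[c] + fn[c] else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

def accuracy(y_true, y_pred):
    """Fraction of exact matches — the usual MMLU metric."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# toy check on three samples
print(macro_f1(["a", "a", "b"], ["a", "b", "b"]))  # ~0.667
```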
In-Domain Task Evaluation
We first established baseline performance for the VOC classification task. Here’s a snapshot of how the evaluated models compare:
| Model | Precision | Recall | F1-Score |
|---|---|---|---|
| Nova 2 Lite | 0.4596 | 0.3627 | 0.387 |
| Qwen3-30B-A3B | 0.4567 | 0.3864 | 0.394 |
Both models showed comparable performance on this fine-grained classification task, illustrating its inherent difficulty.
Supervised Fine-Tuning
Applying full-parameter SFT on the customer VOC data yielded clear improvements:
| Model | Training Data | Precision | Recall | F1-Score |
|---|---|---|---|---|
| Nova 2 Lite | None (baseline) | 0.4596 | 0.3627 | 0.387 |
| Nova 2 Lite | Customer data only | 0.6048 | 0.5266 | 0.5537 |
| Qwen3-30B-A3B | Customer data only | 0.5933 | 0.5333 | 0.5552 |
After fine-tuning, Nova 2 Lite achieved a remarkable F1 improvement from 0.387 to 0.5537, validating the effectiveness of Nova’s full-parameter SFT for complex enterprise tasks.
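A quick check on those numbers clarifies how the headline gain should be read: about 17 F1 points in absolute terms, which is roughly a 43% relative improvement over the baseline:

```python
# Absolute vs. relative F1 gain for Nova 2 Lite after full-parameter SFT,
# using the scores reported in the table above.
baseline_f1, tuned_f1 = 0.387, 0.5537

abs_gain = tuned_f1 - baseline_f1   # absolute gain in F1 points
rel_gain = abs_gain / baseline_f1   # gain relative to the baseline score

print(round(abs_gain, 4))  # 0.1667  (~17 points)
print(round(rel_gain, 2))  # 0.43    (~43% relative)
```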
Preserving General Capabilities: The MMLU Benchmark
While fine-tuning can yield significant domain-specific gains, it often comes at the expense of general capabilities. The evaluation of Nova 2 Lite revealed the following:
| Model | Training Data | VOC F1-Score | MMLU Accuracy |
|---|---|---|---|
| Nova 2 Lite | None (baseline) | 0.38 | 0.75 |
| Nova 2 Lite | Customer data only | 0.55 | 0.47 |
| Nova 2 Lite | 75% customer + 25% Nova data | 0.5 | 0.74 |
| Qwen3-30B-A3B | Customer data only | 0.55 | 0.0038 |
Notably, while fine-tuning on customer data alone caused MMLU accuracy to drop sharply for both the Nova and Qwen models, mixing in Nova-curated data kept general performance essentially at baseline (0.74 versus 0.75).
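The 75%/25% mix in the table can be sketched as a simple sampling step. This is only an illustration of the idea of blending proprietary and curated data; Nova Forge's actual mixing strategy is internal and may differ:

```python
import random

def mix_training_data(customer, nova, customer_frac=0.75, seed=0):
    """Build a training set that is customer_frac customer samples,
    topping up the remainder with curated general-purpose samples."""
    rng = random.Random(seed)
    # Number of curated samples needed so customer data is customer_frac of the mix.
    n_nova = round(len(customer) * (1 - customer_frac) / customer_frac)
    mixed = list(customer) + rng.sample(nova, min(n_nova, len(nova)))
    rng.shuffle(mixed)  # interleave the two sources
    return mixed

# Toy corpora with placeholder sample IDs.
customer = [f"voc_{i}" for i in range(75)]
nova = [f"nova_{i}" for i in range(100)]
mixed = mix_training_data(customer, nova)
print(len(mixed))  # 100: 75 customer + 25 curated
```

The curated portion acts as a rehearsal set: by continuing to see general-purpose examples during fine-tuning, the model is less prone to the catastrophic forgetting seen in the customer-data-only rows.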
Key Findings and Practical Recommendations
The evaluation of Nova Forge illustrates that when foundational models are strong, full-parameter SFT can yield impressive gains for enterprise classification tasks. However, it’s essential to consider how fine-tuning can lead to catastrophic forgetting, diminishing general-purpose capabilities.
Recommendations for Using Nova Forge:
- Use Supervised Fine-Tuning: Maximize domain performance for complex tasks.
- Apply Nova Data Mixing: Especially when anticipating multi-functional workflows, to mitigate the risk of catastrophic forgetting.
These practices can strike the right balance between model specialization and broader functionality, enabling effective deployment in enterprise contexts.
Conclusion
This post has demonstrated how organizations can leverage Nova Forge’s data mixing capabilities to create specialized AI models while preserving general intelligence. Nova Forge not only enhances task-specific performance but also ensures models remain stable and reliable across various enterprise applications. For those looking to embark on this journey, the Nova Forge Developer Guide is an excellent resource to get started.
In the world of artificial intelligence and machine learning, the fusion of specialization and broad capability is essential for driving impactful results. By taking advantage of innovative solutions like Nova Forge, organizations can position themselves for success in an increasingly complex landscape.