Unveiling Llama 3.3 Swallow: Advancements in Japanese Language Processing with a 70-Billion-Parameter Model
A Technical Report Overview by Kazuki Fujii, Lead Developer
The development of Llama 3.3 Swallow, led by Kazuki Fujii, marks a significant milestone in Japanese language processing. This blog post summarizes a technical report on the project, spearheaded by the Institute of Science Tokyo, which employed Amazon SageMaker HyperPod to train a 70-billion-parameter large language model (LLM). The model notably enhances Japanese language capabilities, outperforming several industry-leading models, including GPT-4o-mini, on Japanese tasks.
Overview of Llama 3.3 Swallow
Llama 3.3 Swallow builds on Meta’s Llama 3.3 architecture with enhancements tailored for Japanese. It was developed through a collaboration between the Okazaki Laboratory and the Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST), and is available on Hugging Face in two variants: a base model and an instruction-tuned model.
Training Methodology
Training the base model involved continual pre-training from Meta’s Llama 3.3 70B Instruct model, using the Swallow Corpus Version 2, a curated Japanese web corpus derived from Common Crawl. The team applied the Swallow Education Classifier to extract high-quality documents from the corpus, yielding approximately 314 billion tokens of training data.
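The report does not include the filtering code itself, so the snippet below is only a minimal sketch of classifier-based corpus filtering: filter_corpus and the stand-in scoring function are hypothetical names, and the real Swallow Education Classifier and its threshold are not reproduced here.

```python
# Hypothetical sketch of classifier-based corpus filtering.
# score_fn stands in for the Swallow Education Classifier; the actual
# classifier, prompt handling, and threshold used by the team may differ.
from typing import Callable, Iterable, Iterator


def filter_corpus(
    documents: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents whose educational-quality score meets the threshold."""
    for doc in documents:
        if score_fn(doc) >= threshold:
            yield doc


# Example with a trivial stand-in scorer (longer documents score higher).
docs = ["short text", "a much longer document " * 50]
kept = list(filter_corpus(docs, score_fn=lambda d: min(len(d) / 1000, 1.0)))
print(len(kept), "of", len(docs), "documents kept")
```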
For the instruction-tuned variant, fine-tuning focused exclusively on Japanese dialogue and code-generation tasks. By deliberately excluding English dialogue data, the team kept the emphasis on strengthening Japanese capabilities.
Performance and Benchmarks
In evaluations, the base model demonstrated strong understanding and generation of Japanese text, consistently outperforming leading models such as OpenAI’s GPT-4o and GPT-3.5. The instruction-tuned model performed especially well on the Japanese MT-Bench.
Training Infrastructure Architecture
The training infrastructure for Llama 3.3 Swallow was built on Amazon SageMaker HyperPod, with an emphasis on performance, scalability, and observability. Using 32 ml.p5.48xlarge Amazon EC2 instances (8 NVIDIA H100 80 GB GPUs each, for 256 GPUs in total), the team completed continual pre-training in 16 days and 6 hours.
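For readers new to SageMaker HyperPod, the sketch below shows roughly how such a cluster can be requested with boto3. The cluster name, instance group name, lifecycle-script location, and IAM role are placeholders, and the request fields should be checked against the current SageMaker CreateCluster documentation rather than taken as the project’s actual setup.

```python
# Illustrative sketch of provisioning a HyperPod cluster with boto3.
# All names, ARNs, and S3 paths are placeholders; consult the SageMaker
# CreateCluster documentation for the authoritative request shape.
import boto3

sagemaker = boto3.client("sagemaker")

response = sagemaker.create_cluster(
    ClusterName="llm-training-cluster",            # placeholder
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",    # placeholder
            "InstanceType": "ml.p5.48xlarge",      # 8x H100 80 GB per instance
            "InstanceCount": 32,                   # 32 x 8 = 256 GPUs
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::<account-id>:role/<hyperpod-role>",
        }
    ],
)
print(response["ClusterArn"])
```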
High-Performance Networking: The deployment used NCCL over Elastic Fabric Adapter (EFA) for fast inter-GPU communication, which is essential for distributed training at this scale (see the configuration sketch after this list).
Storage Architecture: A hierarchical storage approach combined Amazon S3 for long-term storage with Amazon FSx for Lustre as a high-performance parallel file system, providing the fast data access that large-scale training requires.
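As a rough illustration of the networking setup, the snippet below sets commonly cited EFA-related environment variables before initializing an NCCL process group in PyTorch. Exactly which variables are needed depends on the AMI, libfabric, and NCCL plugin versions, so treat this as an assumption rather than the project’s verified configuration.

```python
# Sketch: steering NCCL traffic over EFA before initializing distributed training.
# The required environment variables vary by driver and plugin version; these are
# commonly documented settings, not the Swallow team's verified configuration.
import os

os.environ.setdefault("FI_PROVIDER", "efa")           # use the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # enable GPUDirect RDMA where supported
os.environ.setdefault("NCCL_DEBUG", "INFO")           # log which transport NCCL selects

import torch
import torch.distributed as dist

# torchrun (or the Slurm launcher on HyperPod) provides RANK, WORLD_SIZE, and
# MASTER_ADDR/MASTER_PORT; NCCL then routes inter-node traffic over EFA via
# the AWS-OFI-NCCL plugin installed on the nodes.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```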
Software Stack and Optimizations
Built on the SageMaker HyperPod DLAMI, the software stack integrated CUDA drivers, NCCL, and the AWS-OFI-NCCL plugin for optimal performance. Megatron-LM served as the primary training framework, and the project capitalized on its advanced features for scaling LLM training, including sophisticated model-parallelism techniques.
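A quick way to confirm that this stack is wired together on each node is to print the CUDA and NCCL versions that PyTorch sees, as in the short check below (an illustrative sanity check, not part of the published training code).

```python
# Illustrative sanity check of the software stack on a training node.
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
print("CUDA version seen by PyTorch:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
print("GPUs on this node:", torch.cuda.device_count())
print("Device 0:", torch.cuda.get_device_name(0))
```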
Advanced Parallelism and Communication
The 4D parallelism strategy maximized GPU utilization by combining data, tensor, pipeline, and sequence parallelism. Overlapping communication with computation across these dimensions significantly reduced blocking time and improved overall efficiency.
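The report’s exact parallelism degrees are not restated here, so the arithmetic below uses assumed tensor- and pipeline-parallel sizes purely to show how the four dimensions relate on a 256-GPU cluster.

```python
# Sketch of how 4D parallelism carves up the 256-GPU cluster.
# The TP and PP degrees below are assumptions for illustration, not the
# configuration reported by the Swallow team.
WORLD_SIZE = 256          # 32 nodes x 8 H100 GPUs
TENSOR_PARALLEL = 8       # assumed: shard each layer across the 8 GPUs of a node
PIPELINE_PARALLEL = 4     # assumed: split the 70B model into 4 pipeline stages

assert WORLD_SIZE % (TENSOR_PARALLEL * PIPELINE_PARALLEL) == 0
DATA_PARALLEL = WORLD_SIZE // (TENSOR_PARALLEL * PIPELINE_PARALLEL)

print(f"GPUs per model replica: {TENSOR_PARALLEL * PIPELINE_PARALLEL}")  # 32
print(f"Data-parallel replicas: {DATA_PARALLEL}")                        # 8

# Sequence parallelism reuses the tensor-parallel groups: activations around
# LayerNorm and dropout are sharded across the same 8 GPUs, so it saves memory
# without introducing a new process-group dimension.
```

In Megatron-LM, these degrees map onto flags such as --tensor-model-parallel-size, --pipeline-model-parallel-size, and --sequence-parallel, with additional options controlling how gradient and parameter communication is overlapped with computation.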
Checkpointing and Experiment Management
An optimized checkpointing strategy shortened save times and minimized training interruptions. With a newly developed memory-prediction tool, the team could estimate GPU memory usage ahead of time and tune configuration settings accordingly.
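The memory-prediction tool has not yet been released, but the kind of estimate it produces can be approximated with a back-of-the-envelope calculation like the one below. It counts only weights, gradients, and Adam optimizer states under bf16 mixed precision with a ZeRO-1-style distributed optimizer, ignores activations and buffers, and reuses the assumed parallelism degrees from the previous sketch.

```python
# Rough per-GPU memory estimate for model and optimizer states only.
# A simplified approximation, not the team's actual prediction tool: it ignores
# activation memory, temporary buffers, and allocator fragmentation.
def estimate_model_state_gib(
    total_params: float,
    tensor_parallel: int,
    pipeline_parallel: int,
    data_parallel: int,
) -> float:
    params_per_gpu = total_params / (tensor_parallel * pipeline_parallel)
    weights_and_grads = params_per_gpu * (2 + 2)  # bf16 weights + bf16 gradients
    # fp32 master weights plus Adam momentum/variance (4 + 4 + 4 bytes per param),
    # sharded across data-parallel ranks by the distributed optimizer.
    optimizer_states = params_per_gpu * 12 / data_parallel
    return (weights_and_grads + optimizer_states) / 1024**3


# 70B parameters with the assumed TP=8, PP=4, DP=8 layout: roughly 11 GiB per GPU.
print(f"{estimate_model_state_gib(70e9, 8, 4, 8):.1f} GiB per GPU")
```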
Conclusion
The Llama 3.3 Swallow project showcases innovative methods in large language model training and cloud infrastructure, pushing the boundaries of AI capabilities in the Japanese language. The insights gained from this endeavor offer valuable lessons for future research, development, and applications in various domains.
As the team continues to refine its training pipelines and enhance Japanese language capabilities, it plans to open source the optimization tools developed during the project, fostering collaboration and innovation within the AI community.
Resources and References
For further reading and access to the model, visit Hugging Face.
About the Authors
The development team includes Kazuki Fujii, a master’s student at the Institute of Science Tokyo, and senior specialists from Amazon Web Services, each contributing expertise in machine learning and high-performance computing.
This post serves not only as an overview of the technical report but also as a call to action for researchers and engineers enthusiastic about advancing machine learning in Japanese language applications.