Unveiling Llama 3.3 Swallow: Advancements in Japanese Language Processing with a 70-Billion-Parameter Model
A Technical Report Overview by Kazuki Fujii, Lead Developer
The development of Llama 3.3 Swallow, led by Kazuki Fujii, marks a significant milestone in Japanese language processing. This blog post summarizes a technical report on the project, spearheaded by the Institute of Science Tokyo, which employed Amazon SageMaker HyperPod to train a 70-billion-parameter large language model (LLM). The model notably enhances Japanese language capabilities, outperforming several industry-leading models, including GPT-4o-mini, on Japanese tasks.
Overview of Llama 3.3 Swallow
Llama 3.3 Swallow builds on Meta’s Llama 3.3 architecture with enhancements tailored for Japanese. It was developed through a collaboration between the Okazaki Laboratory and the Yokota Laboratory at the School of Computing, Institute of Science Tokyo, and the National Institute of Advanced Industrial Science and Technology (AIST), and is available on Hugging Face in two variants: a base model and an instruction-tuned model.
Training Methodology
Training the base model involved continual pre-training from Meta’s Llama 3.3 70B Instruct model, using the Swallow Corpus Version 2, a curated Japanese web corpus derived from Common Crawl. The team applied the Swallow Education Classifier to extract high-quality documents from the corpus, yielding approximately 314 billion tokens of training data.
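The report does not include the filtering code itself, so the snippet below is only a minimal sketch of classifier-based corpus filtering: filter_corpus and the stand-in scoring function are hypothetical names, and the real Swallow Education Classifier and its threshold are not reproduced here.

```python
# Hypothetical sketch of classifier-based corpus filtering.
# score_fn stands in for the Swallow Education Classifier; the actual
# classifier, prompt handling, and threshold used by the team may differ.
from typing import Callable, Iterable, Iterator


def filter_corpus(
    documents: Iterable[str],
    score_fn: Callable[[str], float],
    threshold: float = 0.5,
) -> Iterator[str]:
    """Yield only documents whose educational-quality score meets the threshold."""
    for doc in documents:
        if score_fn(doc) >= threshold:
            yield doc


# Example with a trivial stand-in scorer (longer documents score higher).
docs = ["short text", "a much longer document " * 50]
kept = list(filter_corpus(docs, score_fn=lambda d: min(len(d) / 1000, 1.0)))
print(len(kept), "of", len(docs), "documents kept")
```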
For the instruction-tuned variant, fine-tuning focused exclusively on Japanese dialogue and code-generation tasks. By deliberately excluding English dialogue data, the team kept the emphasis on strengthening Japanese capabilities.
Performance and Benchmarks
In evaluations, the base model demonstrated strong understanding and generation of Japanese text, consistently outperforming leading models such as OpenAI’s GPT-4o and GPT-3.5. The instruction-tuned model performed especially well on the Japanese MT-Bench.
Training Infrastructure Architecture
The training infrastructure for Llama 3.3 Swallow was built on Amazon SageMaker HyperPod, with an emphasis on performance, scalability, and observability. Using 32 ml.p5.48xlarge Amazon EC2 instances (8 NVIDIA H100 80 GB GPUs each, for 256 GPUs in total), the team completed continual pre-training in 16 days and 6 hours.
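For readers new to SageMaker HyperPod, the sketch below shows roughly how such a cluster can be requested with boto3. The cluster name, instance group name, lifecycle-script location, and IAM role are placeholders, and the request fields should be checked against the current SageMaker CreateCluster documentation rather than taken as the project’s actual setup.

```python
# Illustrative sketch of provisioning a HyperPod cluster with boto3.
# All names, ARNs, and S3 paths are placeholders; consult the SageMaker
# CreateCluster documentation for the authoritative request shape.
import boto3

sagemaker = boto3.client("sagemaker")

response = sagemaker.create_cluster(
    ClusterName="llm-training-cluster",            # placeholder
    InstanceGroups=[
        {
            "InstanceGroupName": "gpu-workers",    # placeholder
            "InstanceType": "ml.p5.48xlarge",      # 8x H100 80 GB per instance
            "InstanceCount": 32,                   # 32 x 8 = 256 GPUs
            "LifeCycleConfig": {
                "SourceS3Uri": "s3://<bucket>/lifecycle-scripts/",
                "OnCreate": "on_create.sh",
            },
            "ExecutionRole": "arn:aws:iam::<account-id>:role/<hyperpod-role>",
        }
    ],
)
print(response["ClusterArn"])
```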
High-Performance Networking: The deployment used NCCL over Elastic Fabric Adapter (EFA) for fast inter-GPU communication, which is essential for distributed training at this scale (see the configuration sketch after this list).
Storage Architecture: A hierarchical storage approach combined Amazon S3 for long-term storage with Amazon FSx for Lustre as a high-performance parallel file system, providing the fast data access that large-scale training requires.
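As a rough illustration of the networking setup, the snippet below sets commonly cited EFA-related environment variables before initializing an NCCL process group in PyTorch. Exactly which variables are needed depends on the AMI, libfabric, and NCCL plugin versions, so treat this as an assumption rather than the project’s verified configuration.

```python
# Sketch: steering NCCL traffic over EFA before initializing distributed training.
# The required environment variables vary by driver and plugin version; these are
# commonly documented settings, not the Swallow team's verified configuration.
import os

os.environ.setdefault("FI_PROVIDER", "efa")           # use the EFA libfabric provider
os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # enable GPUDirect RDMA where supported
os.environ.setdefault("NCCL_DEBUG", "INFO")           # log which transport NCCL selects

import torch
import torch.distributed as dist

# torchrun (or the Slurm launcher on HyperPod) provides RANK, WORLD_SIZE, and
# MASTER_ADDR/MASTER_PORT; NCCL then routes inter-node traffic over EFA via
# the AWS-OFI-NCCL plugin installed on the nodes.
dist.init_process_group(backend="nccl")
torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))
```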
Software Stack and Optimizations
Built on the SageMaker HyperPod DLAMI, the software stack integrated CUDA drivers, NCCL, and the AWS-OFI-NCCL plugin for optimal performance. Megatron-LM served as the primary training framework, and the project capitalized on its advanced features for scaling LLM training, including sophisticated model-parallelism techniques.
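A quick way to confirm that this stack is wired together on each node is to print the CUDA and NCCL versions that PyTorch sees, as in the short check below (an illustrative sanity check, not part of the published training code).

```python
# Illustrative sanity check of the software stack on a training node.
import torch

assert torch.cuda.is_available(), "No CUDA device visible"
print("CUDA version seen by PyTorch:", torch.version.cuda)
print("NCCL version:", torch.cuda.nccl.version())
print("GPUs on this node:", torch.cuda.device_count())
print("Device 0:", torch.cuda.get_device_name(0))
```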
Advanced Parallelism and Communication
The 4D parallelism strategy maximized GPU utilization by combining data, tensor, pipeline, and sequence parallelism. Overlapping communication with computation across these dimensions significantly reduced blocking time and improved overall efficiency.
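The report’s exact parallelism degrees are not restated here, so the arithmetic below uses assumed tensor- and pipeline-parallel sizes purely to show how the four dimensions relate on a 256-GPU cluster.

```python
# Sketch of how 4D parallelism carves up the 256-GPU cluster.
# The TP and PP degrees below are assumptions for illustration, not the
# configuration reported by the Swallow team.
WORLD_SIZE = 256          # 32 nodes x 8 H100 GPUs
TENSOR_PARALLEL = 8       # assumed: shard each layer across the 8 GPUs of a node
PIPELINE_PARALLEL = 4     # assumed: split the 70B model into 4 pipeline stages

assert WORLD_SIZE % (TENSOR_PARALLEL * PIPELINE_PARALLEL) == 0
DATA_PARALLEL = WORLD_SIZE // (TENSOR_PARALLEL * PIPELINE_PARALLEL)

print(f"GPUs per model replica: {TENSOR_PARALLEL * PIPELINE_PARALLEL}")  # 32
print(f"Data-parallel replicas: {DATA_PARALLEL}")                        # 8

# Sequence parallelism reuses the tensor-parallel groups: activations around
# LayerNorm and dropout are sharded across the same 8 GPUs, so it saves memory
# without introducing a new process-group dimension.
```

In Megatron-LM, these degrees map onto flags such as --tensor-model-parallel-size, --pipeline-model-parallel-size, and --sequence-parallel, with additional options controlling how gradient and parameter communication is overlapped with computation.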
Checkpointing and Experiment Management
An optimized checkpointing strategy shortened save times and minimized training interruptions. With a newly developed memory-prediction tool, the team could estimate GPU memory usage ahead of time and tune configuration settings accordingly.
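The memory-prediction tool has not yet been released, but the kind of estimate it produces can be approximated with a back-of-the-envelope calculation like the one below. It counts only weights, gradients, and Adam optimizer states under bf16 mixed precision with a ZeRO-1-style distributed optimizer, ignores activations and buffers, and reuses the assumed parallelism degrees from the previous sketch.

```python
# Rough per-GPU memory estimate for model and optimizer states only.
# A simplified approximation, not the team's actual prediction tool: it ignores
# activation memory, temporary buffers, and allocator fragmentation.
def estimate_model_state_gib(
    total_params: float,
    tensor_parallel: int,
    pipeline_parallel: int,
    data_parallel: int,
) -> float:
    params_per_gpu = total_params / (tensor_parallel * pipeline_parallel)
    weights_and_grads = params_per_gpu * (2 + 2)  # bf16 weights + bf16 gradients
    # fp32 master weights plus Adam momentum/variance (4 + 4 + 4 bytes per param),
    # sharded across data-parallel ranks by the distributed optimizer.
    optimizer_states = params_per_gpu * 12 / data_parallel
    return (weights_and_grads + optimizer_states) / 1024**3


# 70B parameters with the assumed TP=8, PP=4, DP=8 layout: roughly 11 GiB per GPU.
print(f"{estimate_model_state_gib(70e9, 8, 4, 8):.1f} GiB per GPU")
```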
Conclusion
The Llama 3.3 Swallow project showcases innovative methods in large language model training and cloud infrastructure, pushing the boundaries of AI capabilities in the Japanese language. The insights gained from this endeavor offer valuable lessons for future research, development, and applications in various domains.
As the team continues to refine its training pipelines and enhance Japanese language capabilities, it plans to open source the optimization tools developed during the project, fostering collaboration and innovation within the AI community.
Resources and References
For further reading and access to the model, visit Hugging Face.
About the Authors
The development team includes Kazuki Fujii, a master’s student at the Institute of Science Tokyo, and senior specialists from Amazon Web Services, each contributing expertise in machine learning and high-performance computing.
This post serves not only as an overview of the technical report but also as a call to action for researchers and engineers enthusiastic about advancing machine learning in Japanese language applications.