Training Llama 3.3 Swallow: A Japanese Sovereign LLM Using Amazon SageMaker HyperPod

Unveiling Llama 3.3 Swallow: Advancements in Japanese Language Processing with a 70-Billion-Parameter Model

A Technical Report Overview by Kazuki Fujii, Lead Developer

The development of Llama 3.3 Swallow, led by Kazuki Fujii, marks a significant milestone in Japanese language processing. This blog post summarizes the technical report on the project, spearheaded by the Institute of Science Tokyo, which used Amazon SageMaker HyperPod to train a 70-billion-parameter large language model (LLM). The model substantially improves Japanese language capabilities, outperforming several industry models, including GPT-4o-mini.

Overview of Llama 3.3 Swallow

Llama 3.3 Swallow builds upon Meta's Llama 3.3 architecture, with specialized enhancements tailored for Japanese. Developed through collaboration between the Okazaki Laboratory and the Yokota Laboratory at the School of Computing, Institute of Science Tokyo, alongside the National Institute of Advanced Industrial Science and Technology (AIST), the model is available in two variants on Hugging Face.

Training Methodology

Training the base model involved continual pre-training from Meta's Llama 3.3 70B Instruct, using the Swallow Corpus Version 2, a curated Japanese web corpus derived from Common Crawl. The team applied the Swallow Education Classifier to extract high-quality training data, totaling approximately 314 billion tokens.
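The classifier-based curation step can be pictured as a score-and-threshold filter over web documents. The sketch below is illustrative only: the real Swallow Education Classifier is a trained model whose interface is not described in this post, so `classify` here is a hypothetical stand-in.

```python
# Hypothetical sketch of classifier-based corpus filtering. The actual
# Swallow Education Classifier is a trained model; `classify` below is
# a placeholder with the same shape (document -> quality score).

def filter_corpus(documents, classify, threshold=0.5):
    """Keep documents whose educational-quality score passes the threshold."""
    kept = []
    for doc in documents:
        score = classify(doc)  # assumed to return a score in [0, 1]
        if score >= threshold:
            kept.append(doc)
    return kept

# Toy stand-in for the real classifier: favor longer documents.
toy_classify = lambda doc: min(len(doc) / 100.0, 1.0)

docs = ["short", "a" * 120, "b" * 80]
print(len(filter_corpus(docs, toy_classify, threshold=0.7)))  # 2
```

In the real pipeline the filtering is what reduces raw Common Crawl text down to the roughly 314 billion high-quality tokens cited above.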

For the instruction-tuned variant, fine-tuning focused solely on Japanese dialogue and code generation tasks. By deliberately excluding English dialogue data, the team maintained a firm focus on enhancing Japanese capabilities.

Performance and Benchmarks

In evaluations, the base model demonstrated strong understanding and generation of Japanese text, consistently outperforming leading models such as OpenAI's GPT-4o and GPT-3.5. The instruction-tuned model excelled particularly on Japanese MT-Bench.

Training Infrastructure Architecture

The training infrastructure for Llama 3.3 Swallow was built on Amazon SageMaker HyperPod, with an emphasis on performance, scalability, and observability. Using 32 ml.p5.48xlarge Amazon EC2 instances (8 NVIDIA H100 80 GB GPUs each, 256 GPUs in total), the team ran continual pre-training for 16 days and 6 hours.

High-Performance Networking: The deployment leveraged NCCL over Elastic Fabric Adapter (EFA) for rapid inter-GPU communication, essential for distributed training.
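Jobs typically steer Libfabric and NCCL toward EFA through a handful of environment variables set before the distributed run starts. The values below are typical settings for p5 instances, not taken from the report; on SageMaker HyperPod DLAMIs much of this is preconfigured by the aws-ofi-nccl plugin.

```python
import os

# Illustrative environment setup for NCCL-over-EFA. These are common
# settings for EFA-equipped instances, not values from the report.
def configure_efa_env():
    os.environ.setdefault("FI_PROVIDER", "efa")           # route Libfabric over EFA
    os.environ.setdefault("FI_EFA_USE_DEVICE_RDMA", "1")  # GPUDirect RDMA on p5
    os.environ.setdefault("NCCL_DEBUG", "INFO")           # log transport selection
    return {k: os.environ[k] for k in
            ("FI_PROVIDER", "FI_EFA_USE_DEVICE_RDMA", "NCCL_DEBUG")}

print(configure_efa_env())
```

With `NCCL_DEBUG=INFO`, NCCL's startup logs confirm whether the EFA transport was actually selected, a useful sanity check before a multi-week run.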

Storage Architecture: A hierarchical storage approach was implemented, combining Amazon S3 for long-term storage with FSx for Lustre as a high-performance parallel file system, ensuring efficient data access crucial for training tasks.

Software Stack and Optimizations

Built on SageMaker HyperPod DLAMI, the software stack integrated CUDA drivers, NCCL, and AWS-OFI-NCCL for optimal performance. Using Megatron-LM as the primary framework, the project capitalized on advanced features for scaling LLM training, incorporating sophisticated model parallelism techniques.

Advanced Parallelism and Communication

The 4D parallelism strategy maximized GPU utilization through data, tensor, pipeline, and sequence parallelism. Additionally, overlapping communication across these domains significantly reduced blocking time, enhancing overall efficiency.
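The four dimensions have to multiply out to the cluster size. The post does not state the exact tensor/pipeline/data split, so the TP=8, PP=4 values below are illustrative choices that are common for 70B-class models on 8-GPU H100 nodes; sequence parallelism reuses the tensor-parallel group, so it does not consume an extra factor.

```python
# Sketch of how 4D parallelism factorizes the 256-GPU cluster.
# TP=8 and PP=4 are illustrative, not values from the report.
WORLD_SIZE = 256

def data_parallel_size(world, tensor_parallel, pipeline_parallel):
    """Data-parallel replicas left over once TP and PP are fixed.
    Sequence parallelism splits activations along the existing
    tensor-parallel group, so it adds no extra factor here."""
    assert world % (tensor_parallel * pipeline_parallel) == 0
    return world // (tensor_parallel * pipeline_parallel)

dp = data_parallel_size(WORLD_SIZE, tensor_parallel=8, pipeline_parallel=4)
print(dp)  # 8 data-parallel replicas of an 8x4 TP/PP grid
```

Overlapping communication then means, for example, launching the tensor-parallel all-reduces asynchronously so they run concurrently with the next layer's computation instead of blocking it.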

Checkpointing and Experiment Management

An optimized checkpointing strategy facilitated faster save times and minimized training interruptions. With a newly developed memory prediction tool, the team effectively monitored GPU memory usage and optimized configuration settings.
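The team's memory prediction tool is not public in this post, but the kind of estimate such a tool makes can be sketched from the standard per-parameter costs of mixed-precision Adam training. Everything below is a simplified model (it ignores activations and framework overhead), and the TP/PP/DP values are the same illustrative ones as above.

```python
# Minimal sketch of a GPU-memory estimate for mixed-precision training
# with a ZeRO-style distributed Adam optimizer. This is NOT the team's
# tool; the constants are the standard per-parameter byte costs.
def model_state_gib(params_b, tp, pp, dp, shard_optimizer=True):
    params = params_b * 1e9 / (tp * pp)     # parameters held per GPU
    weights = 2 * params                    # bf16 weights: 2 bytes/param
    grads = 2 * params                      # bf16 gradients: 2 bytes/param
    optim = 12 * params                     # fp32 master copy + Adam m and v
    if shard_optimizer:
        optim /= dp                         # optimizer states sharded over DP
    return (weights + grads + optim) / 2**30

# ~11.2 GiB of model state per GPU for a 70B model at TP=8, PP=4, DP=8,
# leaving the rest of the 80 GB for activations, buffers, and overhead.
print(round(model_state_gib(70, tp=8, pp=4, dp=8), 1))
```

Comparing such predictions against observed usage is what lets a team pick parallelism and micro-batch settings that fit in 80 GB without trial-and-error restarts.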

Conclusion

The Llama 3.3 Swallow project showcases innovative methods in large language model training and cloud infrastructure, pushing the boundaries of AI capabilities in the Japanese language. The insights gained from this endeavor offer valuable lessons for future research, development, and applications in various domains.

As the team continues to refine training pipelines and enhance Japanese language capabilities, they plan to open source optimization tools developed during the project, fostering collaboration and innovation within the AI community.


Resources and References

For further reading and access to the model, visit Hugging Face.

About the Authors

The development team includes Kazuki Fujii, a master’s student at the Tokyo Institute of Technology, and senior specialists from Amazon Web Services, each contributing their unique expertise in machine learning and high-performance computing.


This post serves not only as an overview of the technical report but also as a call to action for researchers and engineers enthusiastic about advancing machine learning in Japanese language applications.
