Building an Azerbaijani Language Model: Optimizing Training with Open Source Tools and AWS

Acknowledgments

Introduction to the Challenge

Solution Overview

Stage 1: Tokenizer Development

Stage 2: Continued Pre-training (CPT)

Stage 3: Supervised Fine-tuning with LoRA

Modular Architecture of the Training Pipeline

Developing an Azerbaijani Tokenizer

Continued Pre-training with Optimizations

Distributed Training with Fully Sharded Data Parallel (FSDP)

Liger Kernel Integration

Pre-training Setup

Supervised Fine-tuning with LoRA

Results and Validation

Conclusion

Author Backgrounds

Building an Azerbaijani Large Language Model: A Collaborative Journey

As AI adoption grows, so does the demand for language models that work well beyond English. Azercell Telecom LLC, Azerbaijan’s leading telecommunications provider, set out to develop an Azerbaijani large language model (LLM) using open-source tools such as PyTorch, Hugging Face Transformers, and Liger Kernel. The project, carried out in collaboration with the AWS Generative AI Innovation Center, addressed the challenges of training LLMs for a morphologically rich language with limited training data.

Grateful Acknowledgment

Before diving into the technical details, it is essential to acknowledge the talented individuals who contributed to this project’s success. Special thanks go to: Aiham Taleb, Arefeh Ghahvechi, Manav Choudhary, Rohit Thekkanal, Daz Akbarov, Jamila Jamilova, Ross Povelikin, Almas Moldakanov, Christelle Xu, and Ivan Khvostishkov for their invaluable contributions.

Addressing Challenges in Azerbaijani Language Processing

The primary challenge was adapting foundation models (FMs) to a language with rich morphology and limited training data. Azerbaijani is agglutinative: it conveys grammatical meaning largely by attaching suffixes to a stem, so tokenizers optimized for English tend to split its words into many small fragments. There was also no existing blueprint for efficient LLM training in this context. Over six weeks, Azercell and AWS built a production-ready training framework on Amazon SageMaker AI, achieving a 23% increase in training throughput and a 58% reduction in peak GPU memory usage.

Solution Overview

The training framework was organized into three sequential stages, each building on the previous one to effectively create a functional Azerbaijani conversational assistant.

Stage 1: Tokenizer Development

The first stage was to develop a tokenizer tailored to Azerbaijani. The team evaluated three approaches: a baseline English-optimized tokenizer, vocabulary extension, and a custom monolingual tokenizer. The custom tokenizer encoded text most efficiently, roughly halving the number of tokens generated per word, which improved the model’s ability to process Azerbaijani text without compromising quality.
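The "tokens per word" comparison can be made concrete with a small helper. The sketch below is only an illustration, not the team's evaluation code: `tokens_per_word` and the two toy tokenizers are hypothetical names, and a real comparison would pass each candidate tokenizer's encode function instead.

```python
from typing import Callable, Iterable, List


def tokens_per_word(tokenize: Callable[[str], List], texts: Iterable[str]) -> float:
    """Average number of tokens emitted per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("texts contain no words")
    return total_tokens / total_words


# Toy stand-ins for real tokenizers (assumptions for illustration only).
def byte_tokenizer(text: str) -> List[int]:
    # Worst case: one token per UTF-8 byte, whitespace dropped.
    return [b for word in text.split() for b in word.encode("utf-8")]


def word_tokenizer(text: str) -> List[str]:
    # Best case: one token per word.
    return text.split()


sample = ["Azərbaycan dili", "kitablarımızdan"]
print("bytes as tokens:", tokens_per_word(byte_tokenizer, sample))
print("words as tokens:", tokens_per_word(word_tokenizer, sample))
```

A real subword tokenizer lands between these two extremes, and the lower its value on held-out Azerbaijani text, the more text fits into a fixed number of tokens.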

Encoding Efficiency

The tokenizer uses Byte-Level Byte-Pair Encoding (BBPE), which operates on UTF-8 bytes and can therefore represent every Azerbaijani character without falling back to unknown tokens. The final configuration, with a vocabulary of 100,000 tokens, significantly outperformed the baselines. Bits-Per-Byte (BPB), a metric that normalizes model loss by the byte length of the text and so stays comparable across tokenizers, showed that the custom tokenizer maintained modeling quality while improving efficiency.
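To see why a byte-normalized metric is useful when tokenizers differ, consider the minimal sketch below. It assumes the per-token negative log-likelihoods are available in nats (the usual unit of a cross-entropy loss); the function name `bits_per_byte` and the loss values are made up for illustration.

```python
import math
from typing import Sequence


def bits_per_byte(token_nll_nats: Sequence[float], text: str) -> float:
    """Summed per-token negative log-likelihood (nats), in bits per UTF-8 byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text is empty")
    total_bits = sum(token_nll_nats) / math.log(2)  # nats -> bits
    return total_bits / n_bytes


text = "Salam, dünya"  # 12 characters, 13 UTF-8 bytes ("ü" takes two bytes)

# Hypothetical losses from two models with different tokenizers on the same text.
coarse = [2.0, 2.5, 2.0]  # 3 tokens, higher loss per token
fine = [1.0] * 7          # 7 tokens, lower loss per token

for name, nll in [("coarse", coarse), ("fine", fine)]:
    per_token = sum(nll) / len(nll)
    print(f"{name}: loss/token={per_token:.3f}  BPB={bits_per_byte(nll, text):.3f}")
```

In this made-up case the fine-grained tokenizer looks better per token but worse per byte, which is why per-token loss alone cannot rank models that use different tokenizers.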

Stage 2: Continued Pre-Training (CPT)

Next, the team employed continued pre-training (CPT) using the Llama 3.2 1B model. To enhance throughput and allow larger batch sizes, the training process incorporated distributed training techniques and Liger Kernel optimizations.

GPU Memory Optimization

Fully Sharded Data Parallel (FSDP) training was pivotal in reducing memory overhead and improving training efficiency: it shards model parameters, gradients, and optimizer states across GPUs instead of replicating them on every device. Liger Kernel further optimized performance by fusing sequences of operations into single GPU kernel launches, improving memory efficiency without sacrificing computational accuracy.
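A back-of-the-envelope estimate shows where sharding saves memory. The sketch below rests on assumptions not stated in the article: roughly 1.2 billion parameters for the 1B model, 8 GPUs, bf16 parameters and gradients, and fp32 Adam state (master weights, momentum, variance). It ignores activations and the temporary all-gather buffers FSDP needs, so it is not a derivation of the measured 58% figure.

```python
def training_state_gib(
    n_params: float,
    n_gpus: int,
    shard: bool,
    bytes_params: int = 2,      # bf16 weights (assumption)
    bytes_grads: int = 2,       # bf16 gradients (assumption)
    bytes_optimizer: int = 12,  # fp32 master weights + Adam momentum + variance
) -> float:
    """Per-GPU memory in GiB for parameters, gradients, and optimizer state."""
    total_bytes = n_params * (bytes_params + bytes_grads + bytes_optimizer)
    if shard:
        # Full sharding: each GPU keeps only 1/n_gpus of every tensor.
        total_bytes /= n_gpus
    return total_bytes / 2**30


N_PARAMS = 1.2e9  # approximate size, assumed for illustration
N_GPUS = 8        # hypothetical cluster size

replicated = training_state_gib(N_PARAMS, N_GPUS, shard=False)
sharded = training_state_gib(N_PARAMS, N_GPUS, shard=True)
print(f"replicated (plain data parallel): {replicated:.2f} GiB per GPU")
print(f"fully sharded:                    {sharded:.2f} GiB per GPU")
```

The memory freed this way can go toward larger batch sizes, which is one route to higher throughput.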

Stage 3: Supervised Fine-Tuning with Low-Rank Adaptation (LoRA)

The final stage fine-tuned the pre-trained model with LoRA for efficient adaptation to conversational use. Rather than updating all weights, LoRA trains small low-rank decomposition matrices that are added to the frozen base weights. After this stage, the model generated coherent, context-aware Azerbaijani responses, turning a next-token predictor into a conversational assistant.
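The low-rank idea can be shown in a few lines of plain Python. This is a sketch of the general LoRA formulation, not the project's training code: it uses a row-vector convention (output = x·W + (alpha/r)·x·A·B), and the 2048-wide layer and rank 16 in the parameter count are assumed sizes for illustration.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def lora_forward(x: Matrix, w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Frozen base projection plus a scaled low-rank update.

    x: (n, d_in), w: (d_in, d_out) frozen, a: (d_in, r), b: (r, d_out).
    """
    r = len(b)                        # rank = number of rows in B
    base = matmul(x, w)               # frozen path
    update = matmul(matmul(x, a), b)  # trainable low-rank path
    scale = alpha / r
    return [
        [bv + scale * uv for bv, uv in zip(base_row, upd_row)]
        for base_row, upd_row in zip(base, update)
    ]


def lora_trainable_params(d_in: int, d_out: int, r: int) -> int:
    return r * (d_in + d_out)


x = [[1.0, 2.0, 3.0]]
w = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
a = [[0.5], [0.5], [0.5]]
b_zero = [[0.0, 0.0]]  # B starts at zero, so the adapter initially changes nothing
b_trained = [[1.0, -1.0]]

print(lora_forward(x, w, a, b_zero, alpha=2.0))
print(lora_forward(x, w, a, b_trained, alpha=2.0))

full = 2048 * 2048  # assumed projection size
lora = lora_trainable_params(2048, 2048, r=16)
print(f"trainable fraction for this layer: {lora / full:.4%}")
```

Because B is initialized to zero, fine-tuning starts from exactly the pre-trained model's behavior and only the small A and B matrices receive gradient updates.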

Results and Achievements

The outcomes of the project were substantial, demonstrating significant improvements in several key dimensions:

Encoding Efficiency: The custom tokenizer roughly doubled the amount of Azerbaijani text the model can process at once and notably reduced token fragmentation.
Memory and Throughput Optimization: Distributed training with FSDP and Liger Kernel integration allowed larger batch sizes and lower peak memory usage, with a 23% increase in training throughput and a 58% reduction in peak GPU memory on Amazon SageMaker AI.
Production-Ready Infrastructure: With a scalable training framework in place, Azercell can continue to expand its applications and model capabilities.
Enhanced Language Generation: The fine-tuned Llama model produced coherent and relevant Azerbaijani outputs, avoiding the repetitive errors common in models for which the language is underrepresented in the training data.

Conclusion

The collaboration between Azercell and the AWS Generative AI Innovation Center showcases the potential of harnessing open-source tools to confront real-world challenges in language processing. By developing a structured framework that champions custom solutions, they have laid the foundation for AI applications in low-resource languages.

If you are interested in implementing similar techniques or have questions regarding low-resource language processing, we encourage you to connect with your AWS account team or reach out to the AWS Generative AI Innovation Center. We look forward to hearing your thoughts and experiences!

About the Authors

Explore the expertise of the team behind this innovative project:

Aleksei Iancheruk, Data Scientist at AWS GenAIIC, focuses on search and retrieval systems.
Debby Wehner, Machine Learning Engineer at AWS GenAIIC, specializes in LLM optimization.
Hanno Bever, Senior Machine Learning Engineer at AWS GenAIIC, excels in scaling model training.
Sabir Mardanov, leader of Azercell’s Data & AI organization, is transforming traditional telco operations into a tech-centered model.
Irada Bunyatova, Senior Data Scientist at Azercell, integrates speech tech and agentic AI systems into business applications.

Through collaboration, innovation, and a commitment to quality, this project exemplifies the future of AI language processing.
