Training Azerbaijani Language Models Using Amazon SageMaker AI

Building an Azerbaijani Language Model: Optimizing Training with Open Source Tools and AWS

Building an Azerbaijani Large Language Model: A Collaborative Journey

Demand is growing for language models that handle languages beyond English. Azercell Telecom LLC, Azerbaijan’s leading telecommunications provider, set out to develop an Azerbaijani large language model (LLM) using open-source tools: PyTorch, Hugging Face Transformers, and Liger Kernel. The project, carried out with the AWS Generative AI Innovation Center, addressed the main difficulties of training LLMs for a morphologically rich language.

Grateful Acknowledgment

Before diving into the technical details, it is essential to acknowledge the talented individuals who contributed to this project’s success. Special thanks go to: Aiham Taleb, Arefeh Ghahvechi, Manav Choudhary, Rohit Thekkanal, Daz Akbarov, Jamila Jamilova, Ross Povelikin, Almas Moldakanov, Christelle Xu, and Ivan Khvostishkov for their invaluable contributions.

Addressing Challenges in Azerbaijani Language Processing

The primary challenge was adapting foundation models (FMs) to a language with rich morphology and limited training data. Azerbaijani is agglutinative: it attaches chains of suffixes to a word stem to convey grammatical meaning, so tokenizers optimized for English split its words into many small fragments. There was also no existing blueprint for efficient LLM training in this setting. Over six weeks, Azercell and AWS built a production-ready framework on Amazon SageMaker AI that increased training throughput by 23% and reduced peak GPU memory usage by 58%.

Solution Overview

The training framework was organized into three sequential stages, each building on the previous one to effectively create a functional Azerbaijani conversational assistant.

Stage 1: Tokenizer Development

The first stage developed a custom tokenizer for Azerbaijani. The team evaluated three approaches: baseline English-optimized tokenizers, vocabulary extension, and custom monolingual tokenizers. The custom tokenizer encoded text most efficiently, halving the number of tokens generated per word without compromising modeling quality.
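
The tokens-per-word comparison can be sketched in a few lines of Python. This is a minimal illustration, not the project's evaluation code: `tokens_per_word` and `chunk_tokenizer` are hypothetical names, and the two toy tokenizers simply cut words into fixed-size pieces to stand in for a fragmenting baseline and a more compact custom tokenizer.

```python
from typing import Callable, List


def tokens_per_word(texts: List[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word."""
    total_tokens = 0
    total_words = 0
    for text in texts:
        total_tokens += len(tokenize(text))
        total_words += len(text.split())
    if total_words == 0:
        raise ValueError("input contains no words")
    return total_tokens / total_words


def chunk_tokenizer(max_len: int) -> Callable[[str], List[str]]:
    """Toy tokenizer (assumption): cuts each word into pieces of at most max_len characters."""
    def tokenize(text: str) -> List[str]:
        pieces: List[str] = []
        for word in text.split():
            pieces.extend(word[i:i + max_len] for i in range(0, len(word), max_len))
        return pieces
    return tokenize


# Suffix-heavy Azerbaijani words, chosen only for illustration.
sample = ["kitablarımızdan oxuyuruq", "evlərimizdə yaşayırıq"]

fragmenting = chunk_tokenizer(3)  # stands in for an English-optimized baseline
compact = chunk_tokenizer(6)      # stands in for a custom monolingual tokenizer

print("baseline tokens/word:", tokens_per_word(sample, fragmenting))
print("custom tokens/word:  ", tokens_per_word(sample, compact))
```

To compare real tokenizers, pass any callable that maps a string to a list of tokens in place of the toy ones.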

Encoding Efficiency

Because Byte-Level Byte-Pair Encoding (BBPE) builds its vocabulary from raw UTF-8 bytes, the tokenizer can represent every Azerbaijani character, including letters such as ə, ğ, and ı, without falling back to unknown tokens. The final configuration, with a vocabulary of 100,000 tokens, significantly outperformed the baselines. Bits-Per-Byte (BPB), which normalizes model loss by the byte length of the text and therefore stays comparable across tokenizers, showed that the custom tokenizer maintained modeling quality while improving efficiency.
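
As a rough sketch of how BPB is usually computed: total negative log-likelihood is converted from nats to bits and divided by the UTF-8 byte length of the evaluated text. The function name `bits_per_byte` and all loss values and token counts below are invented for illustration; they are not results from this project.

```python
import math


def bits_per_byte(mean_loss_nats: float, num_tokens: int, text: str) -> float:
    """BPB = (mean loss in nats * token count / ln 2) / UTF-8 byte count."""
    num_bytes = len(text.encode("utf-8"))
    if num_bytes == 0:
        raise ValueError("text is empty")
    total_bits = mean_loss_nats * num_tokens / math.log(2)
    return total_bits / num_bytes


text = "Azərbaycan dili zəngin morfologiyaya malikdir."

# Hypothetical numbers: the baseline splits the text into twice as many tokens.
baseline_bpb = bits_per_byte(mean_loss_nats=1.1, num_tokens=30, text=text)
custom_bpb = bits_per_byte(mean_loss_nats=2.0, num_tokens=15, text=text)

# Per-token loss is higher for the custom tokenizer here, yet its BPB is lower,
# because each of its tokens covers more bytes of text.
print("baseline BPB:", round(baseline_bpb, 3))
print("custom BPB:  ", round(custom_bpb, 3))
```

This is why per-token loss alone is a misleading yardstick when the tokenizers differ: the byte count of the text is the only denominator both models share.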

Stage 2: Continued Pre-Training (CPT)

Next, the team ran continued pre-training (CPT) on the Llama 3.2 1B model. To raise throughput and allow larger batch sizes, training combined distributed training techniques with Liger Kernel optimizations.

GPU Memory Optimization

Fully Sharded Data Parallel (FSDP) training shards model parameters, gradients, and optimizer states across GPUs, which reduced per-GPU memory overhead and improved training efficiency. Liger Kernel further optimized performance by fusing operations into single GPU kernel launches, improving memory efficiency without sacrificing numerical accuracy.
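
A back-of-the-envelope calculation shows why sharding and fused kernels matter for memory. The sketch below rests on stated assumptions rather than on this project's measurements: 16 bytes of model state per parameter (a common estimate for mixed-precision Adam), a rough 1.2 billion parameter count, and a hypothetical batch of 8 sequences of 2,048 tokens. It ignores activations and the temporary full-parameter buffers FSDP gathers during forward and backward passes. The logits estimate illustrates one place where fused kernels can help, since a large vocabulary makes the output tensor expensive to materialize in full.

```python
GIB = 2 ** 30


def model_state_gib(num_params: float, num_gpus: int,
                    bytes_per_param: int = 16, sharded: bool = True) -> float:
    """Per-GPU memory (GiB) for parameters, gradients, and optimizer state."""
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")
    total_bytes = num_params * bytes_per_param
    if sharded:
        # Full sharding splits all three kinds of state evenly across GPUs.
        total_bytes /= num_gpus
    return total_bytes / GIB


def logits_gib(batch_size: int, seq_len: int, vocab_size: int,
               bytes_per_element: int = 4) -> float:
    """Memory (GiB) of a fully materialized logits tensor."""
    return batch_size * seq_len * vocab_size * bytes_per_element / GIB


params = 1.2e9  # assumption: rough size of a 1B-class model
for gpus in (1, 4, 8):
    print(f"{gpus} GPU(s): {model_state_gib(params, gpus):.2f} GiB of model state each")

# Vocabulary size is taken from the article; batch and sequence length are assumptions.
print(f"logits tensor: {logits_gib(8, 2048, 100_000):.2f} GiB")
```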

Stage 3: Supervised Fine-Tuning with Low-Rank Adaptation (LoRA)

The final stage fine-tuned the pre-trained model with LoRA to adapt it to conversational use. LoRA keeps the original weights frozen and trains only small low-rank decomposition matrices added alongside them, which makes the adaptation efficient. After this stage, the model produced coherent, context-aware Azerbaijani responses in a conversational format rather than raw text continuations.
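
The low-rank idea can be shown with plain Python lists. This is a minimal sketch of the general LoRA formulation, y = Wx + (alpha / r) · B(Ax), not the project's training code; the helper names, the tiny matrices, the rank, and the 2048 × 2048 layer size used in the parameter count are all illustrative assumptions.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def init_lora(d_in, d_out, r, seed=0):
    """A gets small random values, B starts at zero, so the update is initially zero."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(r)]
    B = [[0.0] * r for _ in range(d_out)]
    return A, B


def lora_forward(x, W, A, B, alpha):
    """Frozen base projection plus the scaled low-rank update."""
    r = len(A)
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / r
    return [b + scale * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)


W = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0]]  # frozen weights, d_out=2, d_in=3
x = [1.0, 1.0, 1.0]
A, B = init_lora(d_in=3, d_out=2, r=1)
y0 = lora_forward(x, W, A, B, alpha=2.0)
print("output at initialization:", y0)

# Illustrative layer size: the adapter trains a small fraction of the weights.
print("full matrix:", 2048 * 2048, "LoRA (r=8):", trainable_params(2048, 2048, 8))
```

Because B starts at zero, the adapted model initially behaves exactly like the pre-trained one, and fine-tuning moves it away from that starting point only through the two small matrices.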

Results and Achievements

The project delivered improvements in four areas:

  1. Encoding Efficiency: The custom tokenizer doubled the amount of Azerbaijani text that fits in a fixed context window and noticeably reduced token fragmentation.
  2. Memory and Throughput Optimization: Distributed training with FSDP and Liger Kernel integration allowed larger batch sizes and lower peak memory usage, in line with the 23% throughput gain and 58% peak GPU memory reduction reported above, which made training on Amazon SageMaker AI smoother.
  3. Production-Ready Infrastructure: With a scalable training framework in place, Azercell can continue to expand its applications and model capabilities.
  4. Enhanced Language Generation: The fine-tuned Llama model produced coherent and relevant Azerbaijani outputs, avoiding the repetitive errors common in models whose training data underrepresents the language.

Conclusion

The collaboration between Azercell and the AWS Generative AI Innovation Center shows how open-source tools can address practical challenges in language processing. The structured three-stage framework, built around a custom tokenizer, lays a foundation for AI applications in low-resource languages.

If you are interested in implementing similar techniques or have questions regarding low-resource language processing, we encourage you to connect with your AWS account team or reach out to the AWS Generative AI Innovation Center. We look forward to hearing your thoughts and experiences!


About the Authors

Explore the expertise of the team behind this innovative project:

  • Aleksei Iancheruk, Data Scientist at AWS GenAIIC, focuses on search and retrieval systems.
  • Debby Wehner, Machine Learning Engineer at AWS GenAIIC, specializes in LLM optimization.
  • Hanno Bever, Senior Machine Learning Engineer at AWS GenAIIC, excels in scaling model training.
  • Sabir Mardanov, leader of Azercell’s Data & AI organization, is transforming traditional telco operations into a tech-centered model.
  • Irada Bunyatova, Senior Data Scientist at Azercell, integrates speech tech and agentic AI systems into business applications.

Through collaboration, innovation, and a commitment to quality, this project exemplifies the future of AI language processing.
