Accelerating AI Inference with EAGLE in Amazon SageMaker
Generative AI models are evolving at an unprecedented pace, pushing the boundaries of what’s possible in artificial intelligence. As these models scale up, the demand for faster, more efficient inference has intensified: applications now require low latency, consistent performance, and high-quality outputs. To meet that demand, Amazon SageMaker AI has introduced significant enhancements to its inference optimization toolkit. The latest update brings EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency)-based adaptive speculative decoding to a wider range of model architectures, streamlining the process of accelerating decoding and optimizing performance.
What is EAGLE?
EAGLE is a technique designed to speed up large language model (LLM) decoding by predicting future tokens directly from the target model’s hidden states. This approach allows for a more tailored optimization process, aligning improvements with the data and patterns your applications actually serve. Rather than relying on generic benchmarks, EAGLE-driven optimization reflects your real workloads, leading to faster inference tailored precisely to your needs.
SageMaker AI currently uses either an EAGLE 3 or an EAGLE 2 head, depending on the model architecture, so each supported model gets the variant that fits it best.
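To make the mechanism concrete, the following is a deliberately simplified, hypothetical sketch of an EAGLE-style draft head in PyTorch. It is not SageMaker's implementation; it only illustrates the core idea of extrapolating the target model's next hidden feature from the current token embedding and hidden state, then projecting that feature into draft-token logits:

```python
import torch
import torch.nn as nn

class EagleStyleDraftHead(nn.Module):
    """Illustrative EAGLE-style draft head (simplified, hypothetical)."""

    def __init__(self, hidden_size: int, vocab_size: int, nhead: int = 8):
        super().__init__()
        # Fuse the current token's embedding with the target model's hidden feature.
        self.fuse = nn.Linear(2 * hidden_size, hidden_size)
        # One lightweight attention layer extrapolates the next hidden feature,
        # instead of running an entire separate draft model.
        self.layer = nn.TransformerEncoderLayer(
            d_model=hidden_size, nhead=nhead, batch_first=True
        )
        # In practice this projection is shared with the target model's LM head.
        self.lm_head = nn.Linear(hidden_size, vocab_size)

    def forward(self, token_emb: torch.Tensor, hidden_feat: torch.Tensor):
        # token_emb, hidden_feat: (batch, seq_len, hidden_size)
        x = self.fuse(torch.cat([token_emb, hidden_feat], dim=-1))
        x = self.layer(x)        # real EAGLE applies a causal attention mask here
        return self.lm_head(x)   # logits for drafted next tokens
```

Because the head reads the target model's own features, drafting costs roughly one extra layer per token rather than a full forward pass through a separate model.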
Continuous Optimization: Beyond One-Time Operations
One of the standout features of SageMaker AI’s EAGLE optimization is that it’s not just a one-time operation. You can initially leverage SageMaker’s curated datasets for training and then refine the model’s performance by incorporating your own curated data over time. A practical example includes using the Data Capture tool to accumulate a dataset from real-time requests hitting your hosted model. This iterative training process fosters continuous improvement, enhancing performance with every cycle.
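For example, you can enable Data Capture in the endpoint configuration so that live prompts and completions accumulate in S3 for later EAGLE training. A minimal Boto3 sketch, assuming illustrative resource names and an S3 destination:

```python
import boto3

sm = boto3.client("sagemaker")

sm.create_endpoint_config(
    EndpointConfigName="llama-eagle-config",        # illustrative name
    ProductionVariants=[{
        "VariantName": "AllTraffic",
        "ModelName": "my-optimized-model",          # model registered earlier
        "InstanceType": "ml.g5.12xlarge",
        "InitialInstanceCount": 1,
    }],
    DataCaptureConfig={
        "EnableCapture": True,
        "InitialSamplingPercentage": 100,           # capture every request
        "DestinationS3Uri": "s3://amzn-s3-demo-bucket/capture/",
        "CaptureOptions": [
            {"CaptureMode": "Input"},               # prompts
            {"CaptureMode": "Output"},              # completions
        ],
    },
)
```

The captured records can then be curated into the dataset you feed into subsequent optimization jobs.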
Solution Overview
Amazon SageMaker AI now natively supports both EAGLE 2 and EAGLE 3 speculative decoding, so each model architecture can apply the variant most compatible with its underlying design. Whether you use SageMaker JumpStart models or bring your own model artifacts from hubs such as Hugging Face, you’re covered.
Speculative Decoding Explained
Speculative decoding accelerates inference without sacrificing quality: a smaller draft model generates preliminary tokens, which the target LLM then validates. How much efficiency you gain hinges on how well the draft model matches the target. EAGLE improves on this by drafting directly from the target model’s own features, avoiding the overhead of running a separate draft model and enabling more efficient processing.
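The draft-then-verify loop is easiest to see in code. The following is a minimal sketch of the idea, with `draft_next` and `verify` as hypothetical stand-ins for the draft head and the target model (real serving engines fuse these steps into batched GPU passes):

```python
def speculative_decode(tokens, draft_next, verify, k=4, max_new=64):
    """tokens: prompt token list (a new list is returned).
    draft_next(ctx) -> one cheap draft token given a context.
    verify(ctx, proposal) -> (accepted, next_token): the target model checks
    all k proposals in one forward pass, keeps the longest correct prefix,
    and supplies the token that follows it."""
    tokens = list(tokens)
    start = len(tokens)
    while len(tokens) - start < max_new:
        # 1. The small draft model proposes k tokens autoregressively.
        ctx, proposal = list(tokens), []
        for _ in range(k):
            t = draft_next(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2. The target model validates the whole proposal at once.
        accepted, next_token = verify(tokens, proposal)
        tokens.extend(accepted)
        tokens.append(next_token)  # target's own token keeps output exact
    return tokens
```

Because the target model validates every drafted token and substitutes its own wherever it disagrees, the output matches standard decoding; the speedup comes from accepting several cheap draft tokens per expensive target pass.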
Flexibility in Model Training
SageMaker AI offers various workflows for building or refining an EAGLE model. You have options to:
- Train an EAGLE model from scratch using the available SageMaker curated datasets.
- Start with your own data to align the optimization process with your specific traffic patterns.
- Retrain an existing EAGLE base model for improved adaptability or fine-tune it using your curated dataset.
SageMaker also provides fully pre-trained EAGLE models, letting you initiate optimization without preparing any artifacts. The solution spans six supported architectures, with pre-trained EAGLE base models available to get your experiments started quickly.
Supported Models
SageMaker AI currently supports:
- EAGLE 3: LlamaForCausalLM, Qwen3ForCausalLM, Qwen3MoeForCausalLM, Qwen2ForCausalLM, and GptOssForCausalLM
- EAGLE 2: Qwen3NextForCausalLM
Benchmarking EAGLE’s Potential
To help customers gauge performance, SageMaker AI automates benchmarking during optimization jobs, providing insight into latency and throughput improvements. Results typically show roughly a 2.5x throughput improvement over standard decoding (for example, an endpoint generating 40 output tokens per second could reach about 100), with gains that naturally adapt to the nuances of your use case.
Running Optimization Jobs
You can interact with the optimization toolkit through the AWS SDK for Python (Boto3) or the AWS CLI. The workflow primarily involves registering your model with the create_model API call and then launching optimization jobs with create_optimization_job.
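A minimal Boto3 sketch of those two calls follows, with illustrative names, ARNs, and S3 paths; the exact OptimizationConfigs entry for EAGLE-based speculative decoding is an assumption here and should be taken from the SageMaker API reference:

```python
import boto3

sm = boto3.client("sagemaker")
ROLE = "arn:aws:iam::111122223333:role/SageMakerExecutionRole"  # illustrative

# 1. Register the model, pointing at your artifacts in S3.
sm.create_model(
    ModelName="my-llama-model",
    ExecutionRoleArn=ROLE,
    PrimaryContainer={
        "Image": "<inference-container-image-uri>",  # placeholder
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": "s3://amzn-s3-demo-bucket/model/",
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            }
        },
    },
)

# 2. Launch the optimization job against the model artifacts.
sm.create_optimization_job(
    OptimizationJobName="llama-eagle-opt",
    RoleArn=ROLE,
    DeploymentInstanceType="ml.g5.12xlarge",
    ModelSource={"S3": {"S3Uri": "s3://amzn-s3-demo-bucket/model/"}},
    # Assumed config key for the EAGLE option; verify the exact shape
    # in the CreateOptimizationJob API reference.
    OptimizationConfigs=[{"ModelSpeculativeDecodingConfig": {}}],
    OutputConfig={"S3OutputLocation": "s3://amzn-s3-demo-bucket/optimized/"},
    StoppingCondition={"MaxRuntimeInSeconds": 36000},
)
```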
Example Workflows
- Using your own model data with your own EAGLE curated dataset: Create a SageMaker model that points to your model artifacts in S3, then invoke the optimization job against it.
- Bringing your trained EAGLE for further training: Specify your EAGLE artifacts in the model registration step, then reference that model data when launching the optimization job (a registration sketch follows this list).
- Using SageMaker built-in datasets: Optionally leverage SageMaker’s provided datasets for optimization, giving you a quick start without extensive data preparation.
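As an illustration of the second workflow, a trained EAGLE head can be attached at model-registration time. A hedged Boto3 sketch, assuming a hypothetical channel name and illustrative paths; the exact field values for EAGLE artifacts should come from the SageMaker documentation:

```python
import boto3

sm = boto3.client("sagemaker")

# Register the target model together with previously trained EAGLE artifacts,
# so the optimization job can pick them up for further training.
sm.create_model(
    ModelName="my-llama-with-eagle",
    ExecutionRoleArn="arn:aws:iam::111122223333:role/SageMakerExecutionRole",
    PrimaryContainer={
        "Image": "<inference-container-image-uri>",  # placeholder
        "ModelDataSource": {
            "S3DataSource": {
                "S3Uri": "s3://amzn-s3-demo-bucket/model/",
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            }
        },
        "AdditionalModelDataSources": [{
            "ChannelName": "draft_model",  # hypothetical channel name
            "S3DataSource": {
                "S3Uri": "s3://amzn-s3-demo-bucket/eagle/",
                "S3DataType": "S3Prefix",
                "CompressionType": "None",
            },
        }],
    },
)
```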
Conclusion
EAGLE-based adaptive speculative decoding represents a fast, effective path for improving generative AI inference performance on Amazon SageMaker AI. This innovative approach integrates directly within the model to accelerate decoding and optimize throughput while preserving output quality. By optimizing using your own dataset, you ensure that the model reflects the unique dynamics of your applications, resulting in enhanced end-to-end performance. With built-in dataset support, automated benchmarking, and seamless deployment, the inference optimization toolkit allows you to deliver low-latency generative applications at scale.
About the Authors
Kareem Syed-Mohammed is a Product Manager at AWS focused on generative AI model development.
Xu Deng is a Software Engineer Manager in the SageMaker team, dedicated to optimizing AI/ML inference experiences.
Ram Vegiraju is an ML Architect with experience in building customer-centric AI/ML solutions on SageMaker.
Vinay Arora specializes in designing cutting-edge AI solutions leveraging AWS technologies.
Siddharth Shah is a Principal Engineer, concentrating on large-scale model hosting and optimization.
Andy Peng is an innovative builder involved in a wide array of AWS products and initiatives.
Johna Liu develops AI tools aimed at enhancing efficiency within Amazon SageMaker.
Anisha Kolla focuses on building scalable and efficient AI solutions in the SageMaker Inference team.
By utilizing the latest advancements in Amazon SageMaker AI, you can elevate your AI applications to new heights, paving the way for faster, more intelligent, and adaptable systems.