Announcing G7e Instances: Next-Generation GPU-Accelerated Inference on Amazon SageMaker AI
As the demand for generative AI skyrockets, developers and enterprises are in constant pursuit of flexible, cost-effective, and robust solutions to meet their diverse needs. Today, we are excited to announce a significant advancement in this arena: G7e instances powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Amazon SageMaker AI. This release marks a transformative leap in GPU-accelerated inference, paving the way for organizations to deploy powerful open-source foundation models (FMs) with enhanced efficiency and performance.
Tailored Performance: A Closer Look at G7e Instances
G7e instances are designed with flexibility and capability in mind. You can provision nodes with 1, 2, 4, or 8 RTX PRO 6000 GPUs, each with 96 GB of GDDR7 memory. This lets organizations host, on a single node, models that previously required multi-node systems, significantly reducing operational complexity while improving cost-effectiveness.
Key highlights of these instances include:
- Twice the GPU Memory: Compared to G6e instances, G7e enables the deployment of large language models (LLMs) at scale, including:
  - Up to 35B parameters on a single-GPU node (g7e.2xlarge)
  - Up to 150B parameters on a 4-GPU node (g7e.24xlarge)
  - Up to 300B parameters on an 8-GPU node (g7e.48xlarge)
- Exceptional Network Throughput: With up to 1,600 Gbps of networking throughput, G7e instances provide the high bandwidth necessary for demanding inference workloads.
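The memory figures above can be sanity-checked with a back-of-the-envelope calculation: at 16-bit precision, model weights need roughly 2 bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch, where the 2-bytes-per-parameter and 20% overhead figures are illustrative assumptions rather than published sizing guidance:

```python
# Rough check that 16-bit weights for the model sizes above fit in
# G7e GPU memory. bytes_per_param=2 assumes FP16/BF16 weights; the 20%
# overhead allowance for KV cache and activations is an illustrative
# assumption, not an AWS-published sizing rule.
def fits(params_billions: float, total_gpu_mem_gb: float,
         bytes_per_param: float = 2.0, overhead: float = 0.2) -> bool:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * (1 + overhead) <= total_gpu_mem_gb

print(fits(35, 96))       # single 96 GB GPU          → True
print(fits(150, 4 * 96))  # 4-GPU node, 384 GB total  → True
print(fits(300, 8 * 96))  # 8-GPU node, 768 GB total  → True
```

The same arithmetic shows why these models did not fit on the previous generation: 35B parameters at 16-bit precision already exceed a single 48 GB L40S.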
Generational Performance Boost
With G7e instances, AWS delivers a remarkable 2.3x inference performance boost over the previous G6e instances, revolutionizing the potential for GPU-accelerated inference in the cloud. Here’s how G7e compares generationally:
| Spec | G5 (g5.48xlarge) | G6e (g6e.48xlarge) | G7e (g7e.48xlarge) |
|---|---|---|---|
| GPU | 8x NVIDIA A10G | 8x NVIDIA L40S | 8x NVIDIA RTX PRO 6000 Blackwell |
| GPU Memory per GPU | 24 GB GDDR6 | 48 GB GDDR6 | 96 GB GDDR7 |
| Total GPU Memory | 192 GB | 384 GB | 768 GB |
| GPU Memory Bandwidth | 600 GB/s per GPU | 864 GB/s per GPU | 1,597 GB/s per GPU |
| Network Bandwidth | 100 Gbps | 400 Gbps | 1,600 Gbps (EFA) |
Use Cases Perfectly Suited for G7e
The unique combination of memory density, bandwidth, and networking capabilities makes G7e ideal for a wide range of generative AI workloads:
- Chatbots and Conversational AI: Maintain responsive interactive experiences, even under heavy load, with low time-to-first-token (TTFT) and high throughput.
- Agentic and Tool-Calling Workflows: Dramatically improved CPU-to-GPU bandwidth enhances Retrieval Augmented Generation (RAG) pipelines and agentic workflows.
- Text Generation and Summarization: G7e's large GPU memory accommodates extensive contextual information, enabling richer reasoning and reducing truncation.
- Image and Vision Models: Resolve previously encountered out-of-memory errors, allowing for larger and more complex multimodal models.
- Physical AI and Scientific Computing: Harness Blackwell-generation compute for applications such as digital twins and 3D simulations.
How to Start: Deployment Walkthrough
To get started with G7e instances on SageMaker AI, ensure you have the necessary prerequisites for deployment. You can clone the relevant repository and utilize the sample notebook to streamline your setup.
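As a rough sketch of what the deployment step looks like with the SageMaker Python SDK: the helper below is a hypothetical convenience for picking an instance size (only the three sizes named in this post are mapped; the 2-GPU size is omitted because its name is not listed here), and the container image URI, model ID, and role ARN in the comments are placeholders to be taken from the sample notebook.

```python
# GPU count -> G7e instance type, from the node configurations above.
# (The 2-GPU size exists but its instance-type name is not given in
# this post, so it is intentionally omitted.)
G7E_INSTANCES = {1: "ml.g7e.2xlarge", 4: "ml.g7e.24xlarge", 8: "ml.g7e.48xlarge"}

def g7e_instance_type(num_gpus: int) -> str:
    """Map a desired GPU count to the matching G7e instance type."""
    if num_gpus not in G7E_INSTANCES:
        raise ValueError(f"Mapped G7e sizes have {sorted(G7E_INSTANCES)} GPUs")
    return G7E_INSTANCES[num_gpus]

# The deployment itself uses the standard SageMaker Python SDK pattern
# (requires AWS credentials and a SageMaker execution role):
#
#   from sagemaker.model import Model
#   model = Model(image_uri="<inference-container-image-uri>",  # placeholder
#                 env={"HF_MODEL_ID": "<model-id>"},             # placeholder
#                 role="<sagemaker-execution-role-arn>")         # placeholder
#   model.deploy(initial_instance_count=1,
#                instance_type=g7e_instance_type(1),
#                endpoint_name="g7e-demo")

print(g7e_instance_type(1))   # → ml.g7e.2xlarge
```

Choosing the single-GPU `ml.g7e.2xlarge` is a sensible starting point for models up to roughly 35B parameters; step up to the 4- or 8-GPU sizes as model size grows.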
Performance Benchmarks: G7e vs. G6e
Benchmarking tests illustrate the generational improvements effectively:
G6e Baseline (ml.g6e.12xlarge):
- Cost: $13.12/hr
- Performance Metrics: Achieved a maximum of 21.5 tokens per second (tok/s) under heavy load (concurrency C=32).
G7e (ml.g7e.2xlarge):
- Cost: $4.20/hr
- Performance Metrics: Despite lower per-instance throughput, G7e achieves a significantly lower cost per token: $0.79 per million tokens under the same load, a 2.6x cost reduction relative to the G6e baseline.
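The cost-per-token figures follow directly from hourly instance price and sustained aggregate throughput (tokens per second summed across all concurrent requests). A small sketch of the arithmetic; the aggregate throughput value used here is back-derived from the quoted $0.79 figure and is not a published benchmark number:

```python
# Cost per million output tokens, given the hourly instance price and
# the sustained aggregate throughput (tok/s across all concurrent requests).
def usd_per_million_tokens(price_per_hour: float, agg_tok_per_s: float) -> float:
    tokens_per_hour = agg_tok_per_s * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Illustrative only: an aggregate ~1,477 tok/s on ml.g7e.2xlarge at
# $4.20/hr reproduces the ~$0.79 per million tokens quoted above.
print(round(usd_per_million_tokens(4.20, 1477), 2))   # → 0.79
```

The same formula applied to the G6e baseline's hourly price makes the 2.6x cost gap easy to verify for your own measured throughput.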
Combined Power: G7e with EAGLE Speculative Decoding
The synergy between G7e hardware and EAGLE speculative decoding yields compounded improvements in both throughput and cost efficiency. By predicting multiple future tokens in a single forward pass, EAGLE enhances the decoding speed while ensuring the output quality remains intact.
Combined benchmarks show that G7e with EAGLE can deliver up to a 2.4x throughput improvement and a 75% cost reduction, reaching an outstanding $0.41 per million output tokens.
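To make the speculative-decoding idea concrete, here is a toy draft-and-verify sketch. EAGLE itself drafts from the target model's hidden features; the stand-in "models" below are plain deterministic functions, chosen only to show why verified speculation preserves the target model's greedy output exactly:

```python
# Toy draft-and-verify speculative decoding. Both "models" are stand-in
# functions over integer token lists, not real LLMs.
def target_next(ctx):
    """Expensive target model: deterministic next token (stand-in)."""
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx):
    """Cheap draft model: agrees with the target most of the time (stand-in)."""
    t = target_next(ctx)
    return t if t % 5 else (t + 1) % 50   # deliberately wrong on some tokens

def speculative_step(ctx, k=4):
    """Draft k tokens, keep the longest prefix the target agrees with, then
    append the target's token at the first mismatch -- so the output always
    matches plain greedy decoding with the target model."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))
    accepted = []
    for tok in draft:
        want = target_next(ctx + accepted)   # verification (batchable in practice)
        if tok == want:
            accepted.append(tok)             # draft token accepted "for free"
        else:
            accepted.append(want)            # first mismatch: take target's token
            break
    return accepted

ctx = [1, 2, 3]
out = speculative_step(ctx)
ref = []                                     # reference: plain greedy target decode
for _ in range(len(out)):
    ref.append(target_next(ctx + ref))
print(out == ref)   # → True
```

The speedup comes from the verification step: all k drafted positions can be checked in one batched forward pass of the target model, so each accepted draft token amortizes the cost of that pass, while the accept-or-correct rule keeps output quality identical to standard decoding.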
Conclusion
The launch of G7e instances on Amazon SageMaker AI signals an exciting evolution in the landscape of generative AI. With a substantial leap in performance, memory, and cost-effectiveness, G7e enables organizations to efficiently deploy complex LLMs and multimodal workloads that were previously unfeasible on a single GPU.
A continuous hardware-software co-optimization path ensures G7e instances remain aligned with the evolving demands of AI applications, setting the stage for advanced generative AI solutions in the future.
For businesses looking to enhance their AI capabilities while keeping costs manageable, the G7e instances represent a remarkable opportunity in the vast world of generative AI.
We can’t wait to see the innovative applications that will emerge from this powerful new infrastructure!