Announcing G7e Instances: Next-Generation GPU-Accelerated Inference on Amazon SageMaker AI
As the demand for generative AI skyrockets, developers and enterprises are in constant pursuit of flexible, cost-effective, and robust solutions to meet their diverse needs. Today, we are excited to announce a significant advancement in this arena: G7e instances powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs on Amazon SageMaker AI. This release marks a transformative leap in GPU-accelerated inference, paving the way for organizations to deploy powerful open-source foundation models (FMs) with enhanced efficiency and performance.
Tailored Performance: A Closer Look at G7e Instances
G7e instances are designed with flexibility and capability in mind. You can provision nodes with 1, 2, 4, or 8 RTX PRO 6000 GPUs, each with 96 GB of GDDR7 memory. This lets organizations host, on a single node, models that previously required multi-node systems, significantly reducing operational complexity while improving cost-effectiveness.
Key highlights of these instances include:
- Twice the GPU Memory: Compared to G6e instances, G7e enables the deployment of large language models (LLMs) at scale, including:
  - Up to 35B parameters on a single-GPU node (g7e.2xlarge)
  - Up to 150B parameters on a 4-GPU node (g7e.24xlarge)
  - Up to 300B parameters on an 8-GPU node (g7e.48xlarge)
- Exceptional Network Throughput: With up to 1,600 Gbps of networking throughput, G7e instances provide the high bandwidth necessary for demanding inference workloads.
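The memory figures above can be sanity-checked with a back-of-the-envelope calculation: at 16-bit precision, model weights need roughly 2 bytes per parameter, plus headroom for the KV cache and activations. A minimal sketch, where the 2-bytes-per-parameter and 20% overhead figures are illustrative assumptions rather than published sizing guidance:

```python
# Rough check that 16-bit weights for the model sizes above fit in
# G7e GPU memory. bytes_per_param=2 assumes FP16/BF16 weights; the 20%
# overhead allowance for KV cache and activations is an illustrative
# assumption, not an AWS-published sizing rule.
def fits(params_billions: float, total_gpu_mem_gb: float,
         bytes_per_param: float = 2.0, overhead: float = 0.2) -> bool:
    weights_gb = params_billions * bytes_per_param  # 1e9 params * bytes / 1e9
    return weights_gb * (1 + overhead) <= total_gpu_mem_gb

print(fits(35, 96))       # single 96 GB GPU          → True
print(fits(150, 4 * 96))  # 4-GPU node, 384 GB total  → True
print(fits(300, 8 * 96))  # 8-GPU node, 768 GB total  → True
```

The same arithmetic shows why these models did not fit on the previous generation: 35B parameters at 16-bit precision already exceed a single 48 GB L40S.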
Generational Performance Boost
With G7e instances, AWS delivers a remarkable 2.3x inference performance boost over the previous G6e instances, revolutionizing the potential for GPU-accelerated inference in the cloud. Here’s how G7e compares generationally:
| Spec | G5 (g5.48xlarge) | G6e (g6e.48xlarge) | G7e (g7e.48xlarge) |
|---|---|---|---|
| GPU | 8x NVIDIA A10G | 8x NVIDIA L40S | 8x NVIDIA RTX PRO 6000 Blackwell |
| GPU Memory per GPU | 24 GB GDDR6 | 48 GB GDDR6 | 96 GB GDDR7 |
| Total GPU Memory | 192 GB | 384 GB | 768 GB |
| GPU Memory Bandwidth | 600 GB/s per GPU | 864 GB/s per GPU | 1,597 GB/s per GPU |
| Network Bandwidth | 100 Gbps | 400 Gbps | 1,600 Gbps (EFA) |
Use Cases Perfectly Suited for G7e
The unique combination of memory density, bandwidth, and networking capabilities makes G7e ideal for a wide range of generative AI workloads:
- Chatbots and Conversational AI: Maintain responsive interactive experiences, even under heavy load, with low time-to-first-token (TTFT) and high throughput.
- Agentic and Tool-Calling Workflows: Dramatically improved CPU-to-GPU bandwidth enhances Retrieval Augmented Generation (RAG) pipelines and agentic workflows.
- Text Generation and Summarization: G7e's large GPU memory accommodates extensive contextual information, enabling richer reasoning and reducing truncation.
- Image and Vision Models: Resolve previously encountered out-of-memory errors, allowing for larger and more complex multimodal models.
- Physical AI and Scientific Computing: Harness Blackwell-generation compute for applications such as digital twins and 3D simulations.
How to Start: Deployment Walkthrough
To get started with G7e instances on SageMaker AI, ensure you have the necessary prerequisites for deployment. You can clone the relevant repository and utilize the sample notebook to streamline your setup.
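As a rough sketch of what the deployment step looks like with the SageMaker Python SDK: the helper below is a hypothetical convenience for picking an instance size (only the three sizes named in this post are mapped; the 2-GPU size is omitted because its name is not listed here), and the container image URI, model ID, and role ARN in the comments are placeholders to be taken from the sample notebook.

```python
# GPU count -> G7e instance type, from the node configurations above.
# (The 2-GPU size exists but its instance-type name is not given in
# this post, so it is intentionally omitted.)
G7E_INSTANCES = {1: "ml.g7e.2xlarge", 4: "ml.g7e.24xlarge", 8: "ml.g7e.48xlarge"}

def g7e_instance_type(num_gpus: int) -> str:
    """Map a desired GPU count to the matching G7e instance type."""
    if num_gpus not in G7E_INSTANCES:
        raise ValueError(f"Mapped G7e sizes have {sorted(G7E_INSTANCES)} GPUs")
    return G7E_INSTANCES[num_gpus]

# The deployment itself uses the standard SageMaker Python SDK pattern
# (requires AWS credentials and a SageMaker execution role):
#
#   from sagemaker.model import Model
#   model = Model(image_uri="<inference-container-image-uri>",  # placeholder
#                 env={"HF_MODEL_ID": "<model-id>"},             # placeholder
#                 role="<sagemaker-execution-role-arn>")         # placeholder
#   model.deploy(initial_instance_count=1,
#                instance_type=g7e_instance_type(1),
#                endpoint_name="g7e-demo")

print(g7e_instance_type(1))   # → ml.g7e.2xlarge
```

Choosing the single-GPU `ml.g7e.2xlarge` is a sensible starting point for models up to roughly 35B parameters; step up to the 4- or 8-GPU sizes as model size grows.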
Performance Benchmarks: G7e vs. G6e
Benchmarking tests illustrate the generational improvements effectively:
G6e Baseline (ml.g6e.12xlarge):
- Cost: $13.12/hr
- Performance Metrics: Achieved a maximum of 21.5 tokens per second (tok/s) under heavy load (concurrency C=32).
G7e (ml.g7e.2xlarge):
- Cost: $4.20/hr
- Performance Metrics: Despite lower per-instance throughput, G7e achieves a significantly lower cost per token: $0.79 per million tokens under the same load, a 2.6x cost reduction relative to the G6e baseline.
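The cost-per-token figures follow directly from hourly instance price and sustained aggregate throughput (tokens per second summed across all concurrent requests). A small sketch of the arithmetic; the aggregate throughput value used here is back-derived from the quoted $0.79 figure and is not a published benchmark number:

```python
# Cost per million output tokens, given the hourly instance price and
# the sustained aggregate throughput (tok/s across all concurrent requests).
def usd_per_million_tokens(price_per_hour: float, agg_tok_per_s: float) -> float:
    tokens_per_hour = agg_tok_per_s * 3600
    return price_per_hour / tokens_per_hour * 1_000_000

# Illustrative only: an aggregate ~1,477 tok/s on ml.g7e.2xlarge at
# $4.20/hr reproduces the ~$0.79 per million tokens quoted above.
print(round(usd_per_million_tokens(4.20, 1477), 2))   # → 0.79
```

The same formula applied to the G6e baseline's hourly price makes the 2.6x cost gap easy to verify for your own measured throughput.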
Combined Power: G7e with EAGLE Speculative Decoding
The synergy between G7e hardware and EAGLE speculative decoding yields compounded improvements in both throughput and cost efficiency. By predicting multiple future tokens in a single forward pass, EAGLE enhances the decoding speed while ensuring the output quality remains intact.
Combined benchmarks show that G7e with EAGLE can deliver up to a 2.4x throughput improvement and a 75% cost reduction, reaching an outstanding $0.41 per million output tokens.
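To make the speculative-decoding idea concrete, here is a toy draft-and-verify sketch. EAGLE itself drafts from the target model's hidden features; the stand-in "models" below are plain deterministic functions, chosen only to show why verified speculation preserves the target model's greedy output exactly:

```python
# Toy draft-and-verify speculative decoding. Both "models" are stand-in
# functions over integer token lists, not real LLMs.
def target_next(ctx):
    """Expensive target model: deterministic next token (stand-in)."""
    return (sum(ctx) * 31 + 7) % 50

def draft_next(ctx):
    """Cheap draft model: agrees with the target most of the time (stand-in)."""
    t = target_next(ctx)
    return t if t % 5 else (t + 1) % 50   # deliberately wrong on some tokens

def speculative_step(ctx, k=4):
    """Draft k tokens, keep the longest prefix the target agrees with, then
    append the target's token at the first mismatch -- so the output always
    matches plain greedy decoding with the target model."""
    draft = []
    for _ in range(k):
        draft.append(draft_next(ctx + draft))
    accepted = []
    for tok in draft:
        want = target_next(ctx + accepted)   # verification (batchable in practice)
        if tok == want:
            accepted.append(tok)             # draft token accepted "for free"
        else:
            accepted.append(want)            # first mismatch: take target's token
            break
    return accepted

ctx = [1, 2, 3]
out = speculative_step(ctx)
ref = []                                     # reference: plain greedy target decode
for _ in range(len(out)):
    ref.append(target_next(ctx + ref))
print(out == ref)   # → True
```

The speedup comes from the verification step: all k drafted positions can be checked in one batched forward pass of the target model, so each accepted draft token amortizes the cost of that pass, while the accept-or-correct rule keeps output quality identical to standard decoding.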
Conclusion
The launch of G7e instances on Amazon SageMaker AI signals an exciting evolution in the landscape of generative AI. With a substantial leap in performance, memory, and cost-effectiveness, G7e enables organizations to efficiently deploy complex LLMs and multimodal workloads that were previously unfeasible on a single GPU.
A continuous hardware-software co-optimization path ensures G7e instances remain aligned with the evolving demands of AI applications, setting the stage for advanced generative AI solutions in the future.
For businesses looking to enhance their AI capabilities while keeping costs manageable, the G7e instances represent a remarkable opportunity in the vast world of generative AI.
We can’t wait to see the innovative applications that will emerge from this powerful new infrastructure!