Collaborative Innovations in Seismic Foundation Model Training: A Partnership Between TGS and AWS
Enhancing Energy Sector Workflows with Advanced Seismic Data Analysis
Coauthored by: Haotian An, Manoj Alwani, Debby Wehner, Altay Sansal, and Alejandro Valenciano
In the ever-evolving energy sector, the need for cutting-edge geoscience data analysis is paramount. TGS, a leader in geoscience data provision, has taken significant strides by modernizing its seismic foundation models (SFMs) through a strategic partnership with the AWS Generative AI Innovation Center (GenAIIC). This collaboration is paving the way for future-focused exploration and production workflows that leverage advanced seismic data analysis techniques.
The Challenge of Training Seismic Foundation Models
Training SFMs means handling very large volumes of proprietary 3D seismic data. Here’s a closer look at the challenges TGS faced in this endeavor:
1. Data Scale and Complexity
Large datasets in domain-specific formats necessitate efficient streaming tactics to prevent GPU idle time during training, ensuring high throughput.
2. Training Efficiency
The computational intensity of training large foundation models on 3D volumetric data means any acceleration can lead to faster iterations and improved deliverables for clients.
3. Expanded Analytical Capabilities
To maximize geological analysis, it’s essential to expand the model’s context windows, enabling the capture of both local details and broader geological patterns.
These challenges highlighted the need for a more streamlined and robust training approach, and led TGS to deepen its partnership with AWS GenAIIC.
The Innovative Solution
The collaboration refined TGS’s SFMs by focusing on three critical areas:
- Establishing an Efficient Data Pipeline
- Optimizing Distributed Training Across Multiple Nodes
- Expanding the Model’s Context Window for Larger Geological Volumes
Infrastructure Optimization
By deploying Amazon SageMaker HyperPod, TGS built a resilient and scalable training infrastructure. The setup featured:
- 16 Amazon EC2 P5 instances for worker nodes, equipped with cutting-edge NVIDIA H200 GPUs.
- Streaming capabilities allowing terabytes of training data to flow directly from Amazon S3, bypassing intermediate storage layers.
Training Data Pipeline Optimization
TGS utilized the MDIO format for training datasets and evaluated two main approaches for delivering data:
- Amazon FSx for Lustre
- Direct Streaming from Amazon S3
Opting for direct streaming provided significant advantages in throughput, scaling, and cost efficiency: it eliminated the need to provision an expensive intermediate file system and delivered consistent performance as nodes were added.
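The heart of a direct-streaming pipeline is deterministic sharding: each GPU rank (and each data-loader worker inside it) pulls a disjoint subset of chunk keys straight from S3, so no shared file system is needed. Below is a minimal sketch of that sharding logic only; the object-key layout is hypothetical, and a real pipeline would wrap this in a PyTorch IterableDataset and decode each MDIO chunk with an S3 client (neither of which is shown in the article).

```python
# Sketch of rank/worker sharding for direct S3 streaming.
# Key names are hypothetical; fetching and MDIO decoding are omitted.

def shard_keys(all_keys, rank, world_size, worker_id=0, num_workers=1):
    """Round-robin assignment: first across ranks, then across the
    data-loader workers inside each rank, so every chunk is read
    exactly once per epoch and no node waits on shared storage."""
    per_rank = all_keys[rank::world_size]    # this GPU's slice
    return per_rank[worker_id::num_workers]  # this worker's slice

# Example: 8 chunks, 2 ranks, 2 data-loader workers per rank.
keys = [f"s3://bucket/volume/chunk-{i:04d}.mdio" for i in range(8)]  # hypothetical
print(shard_keys(keys, rank=0, world_size=2, worker_id=1, num_workers=2))
```

Because each worker's slice is disjoint, prefetching can run fully in parallel across the cluster, which is what keeps GPUs from idling while data arrives.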
Distributed Training Framework Selection
Choosing the right distributed training framework was crucial. After rigorous testing, the team settled on the DeepSpeed ZeRO-2 optimizer, which partitions gradients and optimizer states across GPUs while keeping communication overhead low.
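In practice, enabling ZeRO stage 2 comes down to a few lines of DeepSpeed configuration. The sketch below uses illustrative placeholder values, not the tuned settings from the project:

```python
# Minimal DeepSpeed ZeRO-2 configuration (illustrative values only).
# Stage 2 shards optimizer states and gradients across data-parallel
# GPUs; the model parameters themselves stay replicated.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder
    "gradient_accumulation_steps": 8,     # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # overlap gradient reduction with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}
# Typically passed to deepspeed.initialize(model=..., config=ds_config).
```

Stage 2 is a common middle ground: it cuts per-GPU memory substantially versus plain data parallelism without the extra parameter-gather traffic that stage 3 introduces.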
Achievements: Dramatic Improvements in Training Performance
These enhancements led to groundbreaking results:
- Reduced Training Time: The collaboration cut the training duration from 6 months to just 5 days.
- Near-Linear Scaling: Demonstrated strong scalability across 128 GPUs, achieving 90–95% parallel efficiency.
- Expanded Analytical Capabilities: Context parallelism allowed SFMs to analyze larger geological 3D volumes, capturing intricate details and broad patterns simultaneously.
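The parallel-efficiency figure above follows from a simple ratio: measured throughput at N GPUs divided by N times the single-GPU throughput. The sample numbers below are invented for illustration, not measurements from the project:

```python
def parallel_efficiency(throughput_n, n_gpus, throughput_1):
    """Fraction of ideal linear speedup achieved when scaling to n_gpus."""
    return throughput_n / (n_gpus * throughput_1)

# Hypothetical numbers: 1 GPU at 10 samples/s, 128 GPUs at 1,178 samples/s.
eff = parallel_efficiency(1178.0, 128, 10.0)
print(f"{eff:.0%}")  # prints "92%", within the 90-95% range reported above
```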
Expanding Context and Analytical Capabilities
One of the standout accomplishments was the expansion of the model’s field of view. The new architecture enabled:
- Maximum Input Size: Increased from 640 × 640 × 1,024 to 1,536 × 1,536 × 2,048 voxels.
- Context Length: Extended from 102,400 tokens to 1,170,000 tokens—a significant leap enabling much richer geological insights.
With these capabilities, TGS’s models can now identify features that were previously undetectable, delivering enhanced analytics to clients.
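The token counts above are consistent with a 3D patch embedding: the token count is the product of the per-axis patch counts. Assuming a cubic patch of 16 voxels per axis (an assumption for illustration; the article does not state the patch size), the arithmetic works out as follows:

```python
from math import prod

def vit3d_tokens(volume_shape, patch=16):
    """Number of ViT tokens for a 3D volume, assuming non-overlapping
    cubic patches of `patch` voxels per axis (patch size is an
    assumption, not stated in the article)."""
    return prod(dim // patch for dim in volume_shape)

print(vit3d_tokens((640, 640, 1024)))    # 102,400 tokens, matching the figure above
print(vit3d_tokens((1536, 1536, 2048)))  # 1,179,648, i.e. the ~1.17M figure
```

Because token count grows with the cube of the volume edge (at fixed patch size), an ~11x jump in tokens is exactly why context parallelism, which splits the attention sequence across GPUs, was needed to fit these volumes.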
Key Lessons Learned
The collaboration revealed several best practices for organizations tackling similar challenges:
- Systematic Scaling Approach: Establish a baseline configuration on a single node, then scale out gradually.
- Data Pipeline Optimization is Critical: Direct streaming efficiently met data throughput needs.
- Batch Size Tuning Requires Nuance: Optimal throughput was reached through careful testing and adjustments, illustrating that larger batches don’t always equate to better performance.
- Framework Selection Matters: Choosing among training frameworks depends heavily on specific model characteristics and resource availability.
- Incremental Validation is Essential: Initial tests at smaller scales can preemptively identify optimal configurations, saving time and costs.
Conclusion: A Bright Future for TGS
The partnership with AWS GenAIIC has not only resulted in an optimized infrastructure for training SFMs but also set a strong foundation for TGS’s continued growth in AI-driven analytics. The collaborative innovations, particularly in adapting context parallelism for Vision Transformer architectures, exemplify the promise of applying advanced AI techniques to specialized scientific sectors.
As TGS looks towards the horizon, there is immense potential for future developments, including multi-modal integration and temporal analysis that will further enhance its offerings in the energy sector.
If you’re interested in supercharging your own foundation model training workflows, consider exploring SageMaker HyperPod for resilient distributed training infrastructure. Additionally, the AWS Generative AI Innovation Center invites organizations to leverage its expertise to propel their AI initiatives forward.
Acknowledgments: Special thanks to team members who contributed significantly to this project, advancing the boundaries of geoscience and AI.
For a deeper dive into how these innovations can transform your organization, stay connected with insights from TGS and AWS.