Collaborative Innovations in Seismic Foundation Model Training: A Partnership Between TGS and AWS
Enhancing Energy Sector Workflows with Advanced Seismic Data Analysis
Coauthored by: Haotian An, Manoj Alwani, Debby Wehner, Altay Sansal, and Alejandro Valenciano
In the ever-evolving energy sector, the need for cutting-edge geoscience data analysis is paramount. TGS, a leader in geoscience data provision, has taken significant strides by modernizing its seismic foundation models (SFMs) through a strategic partnership with the AWS Generative AI Innovation Center (GenAIIC). This collaboration is paving the way for future-focused exploration and production workflows that leverage advanced seismic data analysis techniques.
The Challenge of Training Seismic Foundation Models
Training SFMs means handling very large volumes of proprietary 3D seismic data. Here’s a closer look at the challenges TGS faced in this endeavor:
1. Data Scale and Complexity
Large datasets in domain-specific formats necessitate efficient streaming tactics to prevent GPU idle time during training, ensuring high throughput.
2. Training Efficiency
The computational intensity of training large foundation models on 3D volumetric data means any acceleration can lead to faster iterations and improved deliverables for clients.
3. Expanded Analytical Capabilities
To maximize geological analysis, it’s essential to expand the model’s context windows, enabling the capture of both local details and broader geological patterns.
These challenges highlighted the need for a more streamlined and robust training approach, and led TGS to deepen its partnership with AWS GenAIIC.
The Innovative Solution
The collaboration refined TGS’s SFMs by focusing on three critical areas:
- Establishing an Efficient Data Pipeline
- Optimizing Distributed Training Across Multiple Nodes
- Expanding the Model’s Context Window for Larger Geological Volumes
Infrastructure Optimization
By deploying Amazon SageMaker HyperPod, TGS built a resilient and scalable training infrastructure. The setup featured:
- 16 Amazon EC2 P5 instances for worker nodes, equipped with cutting-edge NVIDIA H200 GPUs.
- Streaming capabilities allowing terabytes of training data to flow directly from Amazon S3, bypassing intermediate storage layers.
Training Data Pipeline Optimization
TGS utilized the MDIO format for training datasets and evaluated two main approaches for delivering data:
- Amazon FSx for Lustre
- Direct Streaming from Amazon S3
Opting for direct streaming provided significant advantages in throughput, scaling, and cost efficiency: it eliminated the need to provision an expensive intermediate file system and delivered consistent performance as nodes were added.
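The heart of a direct-streaming pipeline is deterministic sharding: each GPU rank (and each data-loader worker inside it) pulls a disjoint subset of chunk keys straight from S3, so no shared file system is needed. Below is a minimal sketch of that sharding logic only; the object-key layout is hypothetical, and a real pipeline would wrap this in a PyTorch IterableDataset and decode each MDIO chunk with an S3 client (neither of which is shown in the article).

```python
# Sketch of rank/worker sharding for direct S3 streaming.
# Key names are hypothetical; fetching and MDIO decoding are omitted.

def shard_keys(all_keys, rank, world_size, worker_id=0, num_workers=1):
    """Round-robin assignment: first across ranks, then across the
    data-loader workers inside each rank, so every chunk is read
    exactly once per epoch and no node waits on shared storage."""
    per_rank = all_keys[rank::world_size]    # this GPU's slice
    return per_rank[worker_id::num_workers]  # this worker's slice

# Example: 8 chunks, 2 ranks, 2 data-loader workers per rank.
keys = [f"s3://bucket/volume/chunk-{i:04d}.mdio" for i in range(8)]  # hypothetical
print(shard_keys(keys, rank=0, world_size=2, worker_id=1, num_workers=2))
```

Because each worker's slice is disjoint, prefetching can run fully in parallel across the cluster, which is what keeps GPUs from idling while data arrives.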
Distributed Training Framework Selection
Choosing the right distributed training framework was crucial. After rigorous testing, the team settled on the DeepSpeed ZeRO-2 optimizer, which partitions gradients and optimizer states across GPUs while keeping communication overhead low.
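In practice, enabling ZeRO stage 2 comes down to a few lines of DeepSpeed configuration. The sketch below uses illustrative placeholder values, not the tuned settings from the project:

```python
# Minimal DeepSpeed ZeRO-2 configuration (illustrative values only).
# Stage 2 shards optimizer states and gradients across data-parallel
# GPUs; the model parameters themselves stay replicated.
ds_config = {
    "train_micro_batch_size_per_gpu": 4,  # placeholder
    "gradient_accumulation_steps": 8,     # placeholder
    "bf16": {"enabled": True},
    "zero_optimization": {
        "stage": 2,
        "overlap_comm": True,          # overlap gradient reduction with backward pass
        "contiguous_gradients": True,  # reduce memory fragmentation
    },
}
# Typically passed to deepspeed.initialize(model=..., config=ds_config).
```

Stage 2 is a common middle ground: it cuts per-GPU memory substantially versus plain data parallelism without the extra parameter-gather traffic that stage 3 introduces.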
Achievements: Dramatic Improvements in Training Performance
These enhancements led to groundbreaking results:
- Reduced Training Time: The collaboration cut the training duration from 6 months to just 5 days.
- Near-Linear Scaling: Demonstrated strong scalability across 128 GPUs, achieving 90–95% parallel efficiency.
- Expanded Analytical Capabilities: Context parallelism allowed SFMs to analyze larger geological 3D volumes, capturing intricate details and broad patterns simultaneously.
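The parallel-efficiency figure above follows from a simple ratio: measured throughput at N GPUs divided by N times the single-GPU throughput. The sample numbers below are invented for illustration, not measurements from the project:

```python
def parallel_efficiency(throughput_n, n_gpus, throughput_1):
    """Fraction of ideal linear speedup achieved when scaling to n_gpus."""
    return throughput_n / (n_gpus * throughput_1)

# Hypothetical numbers: 1 GPU at 10 samples/s, 128 GPUs at 1,178 samples/s.
eff = parallel_efficiency(1178.0, 128, 10.0)
print(f"{eff:.0%}")  # prints "92%", within the 90-95% range reported above
```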
Expanding Context and Analytical Capabilities
One of the standout accomplishments was the expansion of the model’s field of view. The new architecture enabled:
- Maximum Input Size: Increased from 640 × 640 × 1,024 to 1,536 × 1,536 × 2,048 voxels.
- Context Length: Extended from 102,400 tokens to 1,170,000 tokens—a significant leap enabling much richer geological insights.
With these capabilities, TGS’s models can now identify features that were previously undetectable, delivering enhanced analytics to clients.
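The token counts above are consistent with a 3D patch embedding: the token count is the product of the per-axis patch counts. Assuming a cubic patch of 16 voxels per axis (an assumption for illustration; the article does not state the patch size), the arithmetic works out as follows:

```python
from math import prod

def vit3d_tokens(volume_shape, patch=16):
    """Number of ViT tokens for a 3D volume, assuming non-overlapping
    cubic patches of `patch` voxels per axis (patch size is an
    assumption, not stated in the article)."""
    return prod(dim // patch for dim in volume_shape)

print(vit3d_tokens((640, 640, 1024)))    # 102,400 tokens, matching the figure above
print(vit3d_tokens((1536, 1536, 2048)))  # 1,179,648, i.e. the ~1.17M figure
```

Because token count grows with the cube of the volume edge (at fixed patch size), an ~11x jump in tokens is exactly why context parallelism, which splits the attention sequence across GPUs, was needed to fit these volumes.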
Key Lessons Learned
The collaboration revealed several best practices for organizations tackling similar challenges:
- Systematic Scaling Approach: Establish a baseline configuration on a single node, then scale out gradually.
- Data Pipeline Optimization is Critical: Direct streaming efficiently met data throughput needs.
- Batch Size Tuning Requires Nuance: Optimal throughput was reached through careful testing and adjustments, illustrating that larger batches don’t always equate to better performance.
- Framework Selection Matters: Choosing among training frameworks depends heavily on specific model characteristics and resource availability.
- Incremental Validation is Essential: Initial tests at smaller scales can preemptively identify optimal configurations, saving time and costs.
Conclusion: A Bright Future for TGS
The partnership with AWS GenAIIC has not only resulted in an optimized infrastructure for training SFMs but also set a strong foundation for TGS’s continued growth in AI-driven analytics. The collaborative innovations, particularly in adapting context parallelism for Vision Transformer architectures, exemplify the promise of applying advanced AI techniques to specialized scientific sectors.
As TGS looks towards the horizon, there is immense potential for future developments, including multi-modal integration and temporal analysis that will further enhance its offerings in the energy sector.
If you’re interested in supercharging your own foundation model training workflows, consider exploring SageMaker HyperPod for resilient distributed training infrastructure. Additionally, the AWS Generative AI Innovation Center invites organizations to leverage its expertise to propel their AI initiatives forward.
Acknowledgments: Special thanks to team members who contributed significantly to this project, advancing the boundaries of geoscience and AI.
For a deeper dive into how these innovations can transform your organization, stay connected with insights from TGS and AWS.