Optimizing Data Loading for Machine Learning Workloads with Amazon S3
Amazon Simple Storage Service (Amazon S3) has established itself as a cornerstone for data storage and retrieval, particularly for machine learning (ML) workloads, thanks to its highly elastic nature. In this blog post, we’ll dive deep into practical techniques for optimizing throughput when using Amazon S3 for ML training, especially computer vision workloads that involve large volumes of small files. We’ll explore real-world benchmarking results that validate our optimization strategies and their implications for training efficiency.
Why Amazon S3?
Amazon S3 is designed to scale with application demand, providing the essential throughput performance necessary for modern ML workloads. High-performance connectors, such as the Amazon S3 Connector for PyTorch and Mountpoint for Amazon S3, simplify the integration with training pipelines, allowing developers to bypass the complexities of interacting directly with S3’s REST APIs.
Performance Bottlenecks in ML Training Pipelines
While GPUs play a critical role in speeding up ML computations, the bottlenecks often lie in the data input pipeline, which consists of several interdependent stages:
1. Reading training samples from persistent storage into memory.
2. Pre-processing training samples in memory (decoding, transforming, augmenting).
3. Updating model parameters based on computed gradients.
4. Saving training checkpoints for fault tolerance.
The slowest step in this pipeline defines the overall throughput. Given the decoupling of compute and storage resources in the cloud, stages 1 and 2 often become the critical bottlenecks in cloud-based ML workflows.
The Data Loading Challenge
Access patterns (sequential vs. random) significantly impact the performance of data loading from Amazon S3. Sequential reads, where datasets are read in order, are generally more efficient than random reads, which involve multiple S3 requests for scattered files.
Sequential vs. Random Reads
When reading data from Amazon S3, random reads require multiple requests, increasing latency due to the time-to-first-byte (TTFB) overhead for each GET request. Conversely, sequential reads utilize larger file shards, allowing multiple training samples to be fetched in a single S3 GET request, maximizing throughput.
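The effect of TTFB can be seen with a back-of-the-envelope model: total fetch time is roughly the number of GET requests times the per-request latency, plus the total bytes divided by available bandwidth. The sketch below is illustrative only; `fetch_time_s`, the 30 ms TTFB, and the 1 GB/s bandwidth figure are assumed values for the sake of the example, not measured S3 characteristics.

```python
# Rough model: total time = (number of GETs) * TTFB + (total bytes) / bandwidth.
def fetch_time_s(num_requests: int, total_bytes: int,
                 ttfb_s: float = 0.03, bandwidth_bps: float = 1e9) -> float:
    return num_requests * ttfb_s + total_bytes / bandwidth_bps

num_samples = 10_000
sample_bytes = 150 * 1024  # ~150 KiB JPEG per sample

# Random access: one GET request per small object.
random_s = fetch_time_s(num_samples, num_samples * sample_bytes)

# Sequential access: samples packed into 256 MiB shards, one GET per shard.
shard_bytes = 256 * 1024 * 1024
num_shards = -(-num_samples * sample_bytes // shard_bytes)  # ceiling division
sequential_s = fetch_time_s(num_shards, num_samples * sample_bytes)

print(f"random: {random_s:.1f}s, sequential: {sequential_s:.1f}s")
```

Under these assumptions, per-request latency dominates the random-access case, while the sharded sequential case is limited almost entirely by bandwidth.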
Case Study: Computer Vision Task
To validate the effectiveness of these access patterns, we benchmarked an image classification workload using tens of thousands of small JPEG files. We implemented random reads from small files directly on S3 versus sequential reads from consolidated, larger file shards. The results were telling: moving from random to sequential reads led to considerable improvements in throughput.
Optimization Techniques for Data Loading from Amazon S3
Here are effective techniques to optimize your data ingestion pipeline when accessing data from Amazon S3:
1. Use High-Performance S3 Clients
Utilizing native open-source S3 clients like Mountpoint for Amazon S3 and the Amazon S3 Connector for PyTorch can vastly increase throughput. Both clients include optimizations such as request parallelization, retries, and connection reuse to reduce overhead.
2. Shard and Organize Datasets
Consolidate datasets into larger file shards, ideally between 100 MB and 1 GB. This allows for sequential reads, which drastically improves throughput compared to retrieving many small files.
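One way to build such shards is to pack the small files into tar archives, in the style of WebDataset. The sketch below is a minimal local illustration; `write_shards` and its 256 MB default are hypothetical names and values, not part of any AWS tooling.

```python
import os
import tarfile

def write_shards(file_paths, out_dir, shard_bytes=256 * 1024 * 1024):
    """Pack many small files into larger tar shards for sequential reads."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths, tar, current_size, shard_idx = [], None, 0, 0
    for path in file_paths:
        # Start a new shard once the current one reaches the target size.
        if tar is None or current_size >= shard_bytes:
            if tar is not None:
                tar.close()
            shard_path = os.path.join(out_dir, f"shard-{shard_idx:05d}.tar")
            tar = tarfile.open(shard_path, "w")
            shard_paths.append(shard_path)
            shard_idx += 1
            current_size = 0
        tar.add(path, arcname=os.path.basename(path))
        current_size += os.path.getsize(path)
    if tar is not None:
        tar.close()
    return shard_paths
```

Each resulting shard can then be uploaded to S3 and read back with a single GET request, turning thousands of small-object reads into a handful of large sequential ones.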
3. Optimize Parallelization and Prefetching
Parallelizing the data loading phase hides per-request I/O latency. For sequential access, align the number of worker threads with the number of CPU cores. For random access, using more threads than available CPU cores has been shown to yield better throughput, since workers spend most of their time waiting on network I/O rather than computing.
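The parallel-fetch idea can be sketched generically with a thread pool. In the sketch below, `fetch_sample` is a hypothetical placeholder standing in for an S3 GET; in a real pipeline it would call your S3 client for the given key.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_sample(key: str) -> bytes:
    # Placeholder for an S3 GET request; a real pipeline would issue
    # GetObject for this key through its S3 client.
    return key.encode()

def load_batch(keys, num_workers=16):
    """Fetch many samples concurrently to hide per-request latency."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map preserves input order, so results line up with their keys.
        return list(pool.map(fetch_sample, keys))
```

Because the work is I/O-bound, threads mostly sleep on the network, which is why oversubscribing the CPU can pay off for random access patterns.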
4. Leverage Caching
Implement caching for frequently accessed datasets, particularly in multi-epoch training. Tools like Mountpoint for Amazon S3 can cache frequently accessed objects locally, reducing the demand for repeated S3 GET requests.
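The caching idea can be illustrated with a minimal sketch, assuming a hypothetical `LocalObjectCache` wrapper. This is not how Mountpoint’s cache is implemented; it only shows how a local cache lets later epochs skip repeat GET requests.

```python
import hashlib
import os

class LocalObjectCache:
    """Cache fetched objects on local disk so later epochs skip the S3 GET."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, key: str, fetch) -> bytes:
        # Derive a safe local file name from the object key.
        path = os.path.join(self.cache_dir,
                            hashlib.sha256(key.encode()).hexdigest())
        if os.path.exists(path):          # cache hit: no S3 request needed
            with open(path, "rb") as f:
                return f.read()
        data = fetch(key)                 # cache miss: fetch, then persist
        with open(path, "wb") as f:
            f.write(data)
        return data
```

After the first epoch populates the cache, subsequent epochs read from local disk instead of issuing repeated GET requests to S3.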
Benchmark Results: Data Loading from Amazon S3 Standard
We conducted an extensive benchmarking exercise using an Amazon EC2 g5.8xlarge instance to simulate a realistic computer vision training workload. Several S3 clients, from general-purpose tools to ML-optimized connectors such as Mountpoint for Amazon S3 and the Amazon S3 Connector for PyTorch, were evaluated on data-loading throughput and GPU utilization.
Key Findings:
- Random access performance:
  - Clients without caching exhibited significant bottlenecks at low worker counts.
  - The Amazon S3 Connector for PyTorch reached near GPU saturation with strategic parallelization.
- Sequential access performance:
  - Steady GPU utilization and low CPU usage demonstrated strong throughput when reading from larger file shards.
Conclusion
To fully leverage the capabilities of cloud-based ML frameworks, it’s crucial to prioritize data ingestion optimization. Our findings demonstrate that thoughtful strategies—concerning data loading patterns, appropriate client usage, and intelligent dataset organization—can drastically reduce idle GPU time and enhance training throughput.
In a field rapidly evolving toward larger and more complex datasets, revisiting your data loading pipeline design will yield significant cost efficiency and accelerated time-to-results.
About the Authors
- Dr. Alexander Arzhanov: Senior AI/ML Specialist Solutions Architect at AWS, aiding customers in ML solution design across EMEA.
- Ilya Isaev: Software Engineer in Amazon S3, focused on efficient data management strategies for ML workloads.
- Roy Allela: Senior AI/ML Specialist Solutions Architect, dedicated to optimizing foundation models for various AWS customers.
As you continue to explore the intersections of machine learning and cloud technologies, consider these optimizations as vital tools for your evolving data strategies.