Optimizing Data Loading for Machine Learning Workloads with Amazon S3
Amazon Simple Storage Service (Amazon S3) has established itself as a cornerstone for data storage and retrieval, particularly for machine learning (ML) workloads, thanks to its highly elastic nature. In this blog post, we’ll dive deep into practical techniques for optimizing throughput when using Amazon S3 for ML training, especially computer vision workloads that involve large volumes of small files. We’ll explore real-world benchmarking results that validate our optimization strategies and their implications for training efficiency.
Why Amazon S3?
Amazon S3 is designed to scale with application demand, providing the essential throughput performance necessary for modern ML workloads. High-performance connectors, such as the Amazon S3 Connector for PyTorch and Mountpoint for Amazon S3, simplify the integration with training pipelines, allowing developers to bypass the complexities of interacting directly with S3’s REST APIs.
Performance Bottlenecks in ML Training Pipelines
While GPUs play a critical role in speeding up ML computations, the bottlenecks often lie in the data input pipeline, which consists of several interdependent stages:
1. Reading training samples from persistent storage into memory.
2. Pre-processing training samples in memory (decoding, transforming, augmenting).
3. Updating model parameters based on computed gradients.
4. Saving training checkpoints for fault tolerance.
The slowest step in this pipeline defines the overall throughput. Given the decoupling of compute and storage resources in the cloud, stages 1 and 2 often become the critical bottlenecks in cloud-based ML workflows.
The Data Loading Challenge
Access patterns (sequential vs. random) significantly impact the performance of data loading from Amazon S3. Sequential reads, where datasets are read in order, are generally more efficient than random reads, which involve multiple S3 requests for scattered files.
Sequential vs. Random Reads
When reading data from Amazon S3, random reads require multiple requests, increasing latency due to the time-to-first-byte (TTFB) overhead for each GET request. Conversely, sequential reads utilize larger file shards, allowing multiple training samples to be fetched in a single S3 GET request, maximizing throughput.
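The effect of TTFB can be seen with a back-of-the-envelope model: total fetch time is roughly the number of GET requests times the per-request latency, plus the total bytes divided by available bandwidth. The sketch below is illustrative only; `fetch_time_s`, the 30 ms TTFB, and the 1 GB/s bandwidth figure are assumed values for the sake of the example, not measured S3 characteristics.

```python
# Rough model: total time = (number of GETs) * TTFB + (total bytes) / bandwidth.
def fetch_time_s(num_requests: int, total_bytes: int,
                 ttfb_s: float = 0.03, bandwidth_bps: float = 1e9) -> float:
    return num_requests * ttfb_s + total_bytes / bandwidth_bps

num_samples = 10_000
sample_bytes = 150 * 1024  # ~150 KiB JPEG per sample

# Random access: one GET request per small object.
random_s = fetch_time_s(num_samples, num_samples * sample_bytes)

# Sequential access: samples packed into 256 MiB shards, one GET per shard.
shard_bytes = 256 * 1024 * 1024
num_shards = -(-num_samples * sample_bytes // shard_bytes)  # ceiling division
sequential_s = fetch_time_s(num_shards, num_samples * sample_bytes)

print(f"random: {random_s:.1f}s, sequential: {sequential_s:.1f}s")
```

Under these assumptions, per-request latency dominates the random-access case, while the sharded sequential case is limited almost entirely by bandwidth.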
Case Study: Computer Vision Task
To validate the effectiveness of these access patterns, we benchmarked an image classification workload using tens of thousands of small JPEG files. We implemented random reads from small files directly on S3 versus sequential reads from consolidated, larger file shards. The results were telling: moving from random to sequential reads led to considerable improvements in throughput.
Optimization Techniques for Data Loading from Amazon S3
Here are effective techniques to optimize your data ingestion pipeline when accessing data from Amazon S3:
1. Use High-Performance S3 Clients
Utilizing native open-source S3 clients like Mountpoint for Amazon S3 and the Amazon S3 Connector for PyTorch can vastly increase throughput. Both clients include optimizations such as request parallelization, retries, and connection reuse to reduce overhead.
2. Shard and Organize Datasets
Consolidate datasets into larger file shards, ideally between 100 MB and 1 GB. This allows for sequential reads, which drastically improves throughput compared to retrieving many small files.
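One way to build such shards is to pack the small files into tar archives, in the style of WebDataset. The sketch below is a minimal local illustration; `write_shards` and its 256 MB default are hypothetical names and values, not part of any AWS tooling.

```python
import os
import tarfile

def write_shards(file_paths, out_dir, shard_bytes=256 * 1024 * 1024):
    """Pack many small files into larger tar shards for sequential reads."""
    os.makedirs(out_dir, exist_ok=True)
    shard_paths, tar, current_size, shard_idx = [], None, 0, 0
    for path in file_paths:
        # Start a new shard once the current one reaches the target size.
        if tar is None or current_size >= shard_bytes:
            if tar is not None:
                tar.close()
            shard_path = os.path.join(out_dir, f"shard-{shard_idx:05d}.tar")
            tar = tarfile.open(shard_path, "w")
            shard_paths.append(shard_path)
            shard_idx += 1
            current_size = 0
        tar.add(path, arcname=os.path.basename(path))
        current_size += os.path.getsize(path)
    if tar is not None:
        tar.close()
    return shard_paths
```

Each resulting shard can then be uploaded to S3 and read back with a single GET request, turning thousands of small-object reads into a handful of large sequential ones.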
3. Optimize Parallelization and Prefetching
Parallelizing the data loading phase hides per-request I/O latency. For sequential access, align the number of worker threads with the number of CPU cores. For random access, using more threads than available CPU cores has been shown to yield better throughput, since workers spend most of their time waiting on network I/O rather than computing.
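The parallel-fetch idea can be sketched generically with a thread pool. In the sketch below, `fetch_sample` is a hypothetical placeholder standing in for an S3 GET; in a real pipeline it would call your S3 client for the given key.

```python
from concurrent.futures import ThreadPoolExecutor

def fetch_sample(key: str) -> bytes:
    # Placeholder for an S3 GET request; a real pipeline would issue
    # GetObject for this key through its S3 client.
    return key.encode()

def load_batch(keys, num_workers=16):
    """Fetch many samples concurrently to hide per-request latency."""
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map preserves input order, so results line up with their keys.
        return list(pool.map(fetch_sample, keys))
```

Because the work is I/O-bound, threads mostly sleep on the network, which is why oversubscribing the CPU can pay off for random access patterns.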
4. Leverage Caching
Implement caching for frequently accessed datasets, particularly in multi-epoch training. Tools like Mountpoint for Amazon S3 can cache frequently accessed objects locally, reducing the demand for repeated S3 GET requests.
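The caching idea can be illustrated with a minimal sketch, assuming a hypothetical `LocalObjectCache` wrapper. This is not how Mountpoint’s cache is implemented; it only shows how a local cache lets later epochs skip repeat GET requests.

```python
import hashlib
import os

class LocalObjectCache:
    """Cache fetched objects on local disk so later epochs skip the S3 GET."""

    def __init__(self, cache_dir: str):
        self.cache_dir = cache_dir
        os.makedirs(cache_dir, exist_ok=True)

    def get(self, key: str, fetch) -> bytes:
        # Derive a safe local file name from the object key.
        path = os.path.join(self.cache_dir,
                            hashlib.sha256(key.encode()).hexdigest())
        if os.path.exists(path):          # cache hit: no S3 request needed
            with open(path, "rb") as f:
                return f.read()
        data = fetch(key)                 # cache miss: fetch, then persist
        with open(path, "wb") as f:
            f.write(data)
        return data
```

After the first epoch populates the cache, subsequent epochs read from local disk instead of issuing repeated GET requests to S3.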
Benchmark Results: Data Loading from Amazon S3 Standard
We conducted an extensive benchmarking exercise using an Amazon EC2 g5.8xlarge instance to simulate a realistic computer vision training workload. Several S3 clients, from general-purpose tools to ML-optimized connectors such as Mountpoint for Amazon S3 and the Amazon S3 Connector for PyTorch, were evaluated on data-loading throughput and GPU utilization.
Key Findings:
- Random access performance:
  - Clients without caching exhibited significant bottlenecks at low worker counts.
  - The Amazon S3 Connector for PyTorch reached near GPU saturation with strategic parallelization.
- Sequential access performance:
  - Steady GPU utilization and low CPU usage demonstrated strong throughput when reading from larger file shards.
Conclusion
To fully leverage the capabilities of cloud-based ML frameworks, it’s crucial to prioritize data ingestion optimization. Our findings demonstrate that thoughtful strategies—concerning data loading patterns, appropriate client usage, and intelligent dataset organization—can drastically reduce idle GPU time and enhance training throughput.
In a field rapidly evolving toward larger and more complex datasets, revisiting your data loading pipeline design will yield significant cost efficiency and accelerated time-to-results.
About the Authors
- Dr. Alexander Arzhanov: Senior AI/ML Specialist Solutions Architect at AWS, aiding customers in ML solution design across EMEA.
- Ilya Isaev: Software Engineer in Amazon S3, focused on efficient data management strategies for ML workloads.
- Roy Allela: Senior AI/ML Specialist Solutions Architect, dedicated to optimizing foundation models for various AWS customers.
As you continue to explore the intersections of machine learning and cloud technologies, consider these optimizations as vital tools for your evolving data strategies.