Storage
Every ML workflow revolves around data, and choosing the right storage service affects training speed, cost, and architecture complexity. Amazon S3 is the center of gravity for ML on AWS, but specialized storage services like EFS and FSx for Lustre solve specific performance challenges.
Overview
| Service | What It Does | When to Use | Key Details |
|---|---|---|---|
| Amazon S3 | Object storage — the universal ML data lake with unlimited scale | Default for all ML data. SageMaker reads training data from S3 and writes model artifacts to S3 | Versioning is required for cross-region replication. Standard-IA has the same 11-9s durability as Standard. Supports 3,500 PUT/COPY/POST/DELETE requests per second per prefix |
| Amazon EBS | Block storage attached to EC2 instances | SageMaker notebook storage, EC2-attached storage | Single-instance, single-AZ. Not a data lake. Not for shared ML data |
| Amazon EFS | Managed NFS file system shared across instances | Shared file storage for training data across multiple training instances | Can be mounted to SageMaker training instances. Good for large shared datasets needing file system access |
| Amazon FSx for Lustre | High-performance parallel file system that integrates with S3 | High-throughput training data access, HPC workloads | Faster than S3 direct access. Links to an S3 bucket. Use when training I/O is the bottleneck |
Storage Comparison for ML Training
| | S3 | EBS | EFS | FSx for Lustre |
|---|---|---|---|---|
| Access pattern | Object (GET/PUT) | Block (mount to one instance) | File (NFS, mount to many) | File (POSIX, mount to many) |
| Shared access | Via API | No (single instance) | Yes | Yes |
| Performance | Good (Pipe/FastFile modes) | High IOPS per instance | Moderate throughput | Very high throughput |
| Best for | Data lake, model artifacts | Notebook local storage | Multi-instance shared data | I/O-bound distributed training |
| SageMaker integration | Native (all modes) | Notebook instances | Training instances | Training instances |
S3 as the ML Data Lake
S3 is the default storage for nearly every ML component:
- Training data: SageMaker Training Jobs read from S3 (File, Pipe, or FastFile mode)
- Model artifacts: Training Jobs write `model.tar.gz` to S3
- Feature Store offline: SageMaker Feature Store persists offline features to S3
- Data catalog: Glue Crawlers discover schemas from S3 data
- Streaming landing zone: Kinesis Firehose delivers to S3
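The training-data and model-artifact roles above can be sketched as the S3 wiring in a SageMaker `CreateTrainingJob` request. This is a minimal sketch built offline (no AWS call is made); the bucket name and prefixes are hypothetical.

```python
# Sketch of the S3 input/output config for a SageMaker training job.
# "my-ml-bucket" and the prefixes are hypothetical placeholders.

def training_channel(s3_uri, input_mode="File"):
    """One input channel; InputMode may be File, Pipe, or FastFile."""
    return {
        "ChannelName": "train",
        "InputMode": input_mode,  # FastFile streams objects from S3 on demand
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

job_config = {
    "InputDataConfig": [training_channel("s3://my-ml-bucket/train/", "FastFile")],
    # Training writes model.tar.gz under this prefix when the job completes
    "OutputDataConfig": {"S3OutputPath": "s3://my-ml-bucket/artifacts/"},
}
```

In real use this dict would be part of the request passed to boto3's `create_training_job`; it is shown here only to make the S3-centric data flow concrete.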
S3 Cost Optimization
| Strategy | How It Helps |
|---|---|
| Lifecycle rules | Automatically move older data to cheaper tiers (Standard → IA → Glacier) |
| Intelligent Tiering | Automatically moves objects between tiers based on access patterns |
| Parquet format | Columnar format reduces storage size and speeds up queries (Athena scans less data) |
| Partitioning | Organize data by date/category so queries scan fewer files |
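The lifecycle-rule strategy in the table can be expressed as an S3 lifecycle configuration. A minimal sketch, assuming a hypothetical `training-data/` prefix; in real use the dict would be passed to boto3's `put_bucket_lifecycle_configuration`.

```python
# Sketch of a lifecycle configuration implementing the
# Standard -> Standard-IA -> Glacier progression from the table.
# The prefix and day thresholds are hypothetical examples.

lifecycle = {
    "Rules": [
        {
            "ID": "age-out-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "training-data/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},   # infrequent access
                {"Days": 180, "StorageClass": "GLACIER"},      # archival
            ],
        }
    ]
}
```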
When to Use
Always start with S3 as your default ML data store. Add EFS when multiple training instances need shared file access. Use FSx for Lustre when training is I/O-bound and you need maximum throughput. EBS is primarily for notebook local storage.
Flashcards
Why is Amazon S3 the default storage for ML on AWS?
S3 provides unlimited scale, high durability (11-9s), native SageMaker integration (File/Pipe/FastFile modes), and serves as the hub connecting all ML services — training data, model artifacts, feature store, streaming landing zone.
When SageMaker training is slow due to data loading, consider these acceleration options in order: switch to Pipe mode or FastFile mode for streaming from S3, mount an EFS file system, or use FSx for Lustre as a high-performance cache in front of S3.
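The last rung of that acceleration ladder, FSx for Lustre, is configured as a file-system channel rather than an S3 channel. A minimal sketch of what that channel looks like; the file system ID and directory path are hypothetical placeholders.

```python
# Sketch of a SageMaker training channel backed by FSx for Lustre,
# used when S3 streaming modes are not fast enough.
# FileSystemId and DirectoryPath below are hypothetical.

fsx_channel = {
    "ChannelName": "train",
    "DataSource": {
        "FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",  # hypothetical ID
            "FileSystemType": "FSxLustre",           # "EFS" is the other option
            "FileSystemAccessMode": "ro",            # read-only for training data
            "DirectoryPath": "/fsx/train",           # mount-name-prefixed path
        }
    },
}
```

Because the FSx file system can be linked to an S3 bucket, this keeps S3 as the source of truth while the Lustre layer serves the high-throughput reads.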