Storage
Every ML workflow revolves around data, and choosing the right storage service affects training speed, cost, and architecture complexity. Amazon S3 is the center of gravity for ML on AWS, but specialized storage services like EFS and FSx for Lustre solve specific performance challenges.
Overview
| Service | What It Does | When to Use | Key Details |
|---|---|---|---|
| Amazon S3 | Object storage — the universal ML data lake with unlimited scale | Default for all ML data. SageMaker reads training data from S3 and writes model artifacts to S3 | Versioning is required for cross-region replication. Standard-IA has the same 11-9s durability as Standard. Supports 3,500 PUT/COPY/POST/DELETE requests per second per prefix |
| Amazon EBS | Block storage attached to EC2 instances | SageMaker notebook storage, EC2-attached storage | Single-instance, single-AZ. Not a data lake. Not for shared ML data |
| Amazon EFS | Managed NFS file system shared across instances | Shared file storage for training data across multiple training instances | Can be mounted to SageMaker training instances. Good for large shared datasets needing file system access |
| Amazon FSx for Lustre | High-performance parallel file system that integrates with S3 | High-throughput training data access, HPC workloads | Faster than S3 direct access. Links to an S3 bucket. Use when training I/O is the bottleneck |
Storage Comparison for ML Training
| | S3 | EBS | EFS | FSx for Lustre |
|---|---|---|---|---|
| Access pattern | Object (GET/PUT) | Block (mount to one instance) | File (NFS, mount to many) | File (POSIX, mount to many) |
| Shared access | Via API | No (single instance) | Yes | Yes |
| Performance | Good (Pipe/FastFile modes) | High IOPS per instance | Moderate throughput | Very high throughput |
| Best for | Data lake, model artifacts | Notebook local storage | Multi-instance shared data | I/O-bound distributed training |
| SageMaker integration | Native (all modes) | Notebook instances | Training instances | Training instances |
S3 as the ML Data Lake
S3 is the default storage for nearly every ML component:
- Training data: SageMaker Training Jobs read from S3 (File, Pipe, or FastFile mode)
- Model artifacts: Training Jobs write `model.tar.gz` to S3
- Feature Store offline: SageMaker Feature Store persists offline features to S3
- Data catalog: Glue Crawlers discover schemas from S3 data
- Streaming landing zone: Kinesis Firehose delivers to S3
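The training-data and model-artifact roles above can be sketched as the S3 wiring in a SageMaker `CreateTrainingJob` request. This is a minimal sketch built offline (no AWS call is made); the bucket name and prefixes are hypothetical.

```python
# Sketch of the S3 input/output config for a SageMaker training job.
# "my-ml-bucket" and the prefixes are hypothetical placeholders.

def training_channel(s3_uri, input_mode="File"):
    """One input channel; InputMode may be File, Pipe, or FastFile."""
    return {
        "ChannelName": "train",
        "InputMode": input_mode,  # FastFile streams objects from S3 on demand
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

job_config = {
    "InputDataConfig": [training_channel("s3://my-ml-bucket/train/", "FastFile")],
    # Training writes model.tar.gz under this prefix when the job completes
    "OutputDataConfig": {"S3OutputPath": "s3://my-ml-bucket/artifacts/"},
}
```

In real use this dict would be part of the request passed to boto3's `create_training_job`; it is shown here only to make the S3-centric data flow concrete.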
S3 Cost Optimization
| Strategy | How It Helps |
|---|---|
| Lifecycle rules | Automatically move older data to cheaper tiers (Standard → IA → Glacier) |
| Intelligent Tiering | Automatically moves objects between tiers based on access patterns |
| Parquet format | Columnar format reduces storage size and speeds up queries (Athena scans less data) |
| Partitioning | Organize data by date/category so queries scan fewer files |
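The lifecycle-rule strategy in the table can be expressed as an S3 lifecycle configuration. A minimal sketch, assuming a hypothetical `training-data/` prefix; in real use the dict would be passed to boto3's `put_bucket_lifecycle_configuration`.

```python
# Sketch of a lifecycle configuration implementing the
# Standard -> Standard-IA -> Glacier progression from the table.
# The prefix and day thresholds are hypothetical examples.

lifecycle = {
    "Rules": [
        {
            "ID": "age-out-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "training-data/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},   # infrequent access
                {"Days": 180, "StorageClass": "GLACIER"},      # archival
            ],
        }
    ]
}
```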
When to Use
Always start with S3 as your default ML data store. Add EFS when multiple training instances need shared file access. Use FSx for Lustre when training is I/O-bound and you need maximum throughput. EBS is primarily for notebook local storage.
Flashcards
Why is Amazon S3 the default storage for ML on AWS?
S3 provides unlimited scale, high durability (11-9s), native SageMaker integration (File/Pipe/FastFile modes), and serves as the hub connecting all ML services — training data, model artifacts, feature store, streaming landing zone.
When SageMaker training is slow due to data loading, consider these acceleration options in order: switch to Pipe mode or FastFile mode for streaming from S3, mount an EFS file system, or use FSx for Lustre as a high-performance cache in front of S3.
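The last rung of that acceleration ladder, FSx for Lustre, is configured as a file-system channel rather than an S3 channel. A minimal sketch of what that channel looks like; the file system ID and directory path are hypothetical placeholders.

```python
# Sketch of a SageMaker training channel backed by FSx for Lustre,
# used when S3 streaming modes are not fast enough.
# FileSystemId and DirectoryPath below are hypothetical.

fsx_channel = {
    "ChannelName": "train",
    "DataSource": {
        "FileSystemDataSource": {
            "FileSystemId": "fs-0123456789abcdef0",  # hypothetical ID
            "FileSystemType": "FSxLustre",           # "EFS" is the other option
            "FileSystemAccessMode": "ro",            # read-only for training data
            "DirectoryPath": "/fsx/train",           # mount-name-prefixed path
        }
    },
}
```

Because the FSx file system can be linked to an S3 bucket, this keeps S3 as the source of truth while the Lustre layer serves the high-throughput reads.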