Storage

Every ML workflow revolves around data, and choosing the right storage service affects training speed, cost, and architecture complexity. Amazon S3 is the center of gravity for ML on AWS, but specialized storage services like EFS and FSx for Lustre solve specific performance challenges.

Overview

| Service | What It Does | When to Use | Key Details |
|---|---|---|---|
| Amazon S3 | Object storage: the universal ML data lake with unlimited scale | Default for all ML data. SageMaker reads training data from S3 and writes model artifacts to S3 | Versioning is required for cross-region replication. Standard-IA has the same 11-nines durability as Standard. Per-prefix limit of 3,500 write (PUT/COPY/POST/DELETE) requests per second |
| Amazon EBS | Block storage attached to EC2 instances | SageMaker notebook storage, EC2-attached storage | Single-instance, single-AZ. Not a data lake; not for shared ML data |
| Amazon EFS | Managed NFS file system shared across instances | Shared file storage for training data across multiple training instances | Can be mounted on SageMaker training instances. Good for large shared datasets that need file-system access |
| Amazon FSx for Lustre | High-performance parallel file system that integrates with S3 | High-throughput training data access, HPC workloads | Faster than direct S3 access. Links to an S3 bucket. Use when training I/O is the bottleneck |

Storage Comparison for ML Training

| | S3 | EBS | EFS | FSx for Lustre |
|---|---|---|---|---|
| Access pattern | Object (GET/PUT) | Block (mount to one instance) | File (NFS, mount to many) | File (POSIX, mount to many) |
| Shared access | Via API | No (single instance) | Yes | Yes |
| Performance | Good (Pipe/FastFile modes) | High IOPS per instance | Moderate throughput | Very high throughput |
| Best for | Data lake, model artifacts | Notebook local storage | Multi-instance shared data | I/O-bound distributed training |
| SageMaker integration | Native (all modes) | Notebook instances | Training instances | Training instances |

S3 as the ML Data Lake

S3 is the default storage for nearly every ML component:

  • Training data: SageMaker Training Jobs read from S3 (File, Pipe, or FastFile mode)
  • Model artifacts: Training Jobs write model.tar.gz to S3
  • Feature Store offline: SageMaker Feature Store persists offline features to S3
  • Data catalog: Glue Crawlers discover schemas from S3 data
  • Streaming landing zone: Kinesis Firehose delivers to S3
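The hub-and-spoke role above mostly comes down to S3 URI conventions. A minimal sketch of the paths a single training job touches, assuming a hypothetical bucket and job name (the prefix layout is illustrative, not an AWS requirement):

```python
# Sketch of the S3 URIs a typical SageMaker training job reads and writes.
# Bucket name, job name, and prefix layout are hypothetical placeholders.
def training_job_s3_layout(bucket: str, job_name: str) -> dict:
    """Return the conventional S3 locations for one training job."""
    return {
        # Input channel the training container reads (File/Pipe/FastFile mode)
        "train_data": f"s3://{bucket}/datasets/{job_name}/train/",
        # SageMaker writes the packed model under the job's output path
        "model_artifact": f"s3://{bucket}/models/{job_name}/output/model.tar.gz",
        # Feature Store offline store and Firehose also land in the same bucket
        "offline_features": f"s3://{bucket}/feature-store/",
        "firehose_landing": f"s3://{bucket}/streaming/raw/",
    }

layout = training_job_s3_layout("ml-data-lake-example", "churn-xgb")
```

Keeping every component in one bucket like this also makes it easy for a single Glue Crawler to catalog everything.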

S3 Cost Optimization

| Strategy | How It Helps |
|---|---|
| Lifecycle rules | Automatically move older data to cheaper tiers (Standard → IA → Glacier) |
| Intelligent-Tiering | Automatically moves objects between tiers based on access patterns |
| Parquet format | Columnar format reduces storage size and speeds up queries (Athena scans less data) |
| Partitioning | Organize data by date/category so queries scan fewer files |
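The lifecycle-rule strategy maps directly onto the S3 API. A minimal sketch of a lifecycle configuration that tiers old training data down from Standard to Standard-IA to Glacier; the prefix and day thresholds are illustrative (note S3 requires at least 30 days before a Standard-IA transition):

```python
# Hypothetical lifecycle configuration: age out training data under a
# "datasets/" prefix. Day thresholds and prefix are illustrative choices.
lifecycle_config = {
    "Rules": [
        {
            "ID": "tier-down-old-training-data",
            "Status": "Enabled",
            "Filter": {"Prefix": "datasets/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 180, "StorageClass": "GLACIER"},
            ],
        }
    ]
}

# Applied with boto3 (needs AWS credentials, so not executed here):
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="ml-data-lake-example",
#     LifecycleConfiguration=lifecycle_config,
# )
```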

When to Use

Always start with S3 as your default ML data store. Add EFS when multiple training instances need shared file access. Use FSx for Lustre when training is I/O-bound and you need maximum throughput. EBS is primarily for notebook local storage.
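When you do move a channel off S3, EFS and FSx for Lustre plug into training jobs through the CreateTrainingJob API's `FileSystemDataSource`. A sketch, with a placeholder file system ID and mount path:

```python
# Sketch of a SageMaker training input channel backed by a file system
# instead of S3 (CreateTrainingJob FileSystemDataSource). The file system
# ID and directory path below are placeholders.
def file_system_channel(fs_type: str, fs_id: str, directory: str) -> dict:
    """Build an InputDataConfig channel for an EFS or FSx for Lustre mount."""
    assert fs_type in ("EFS", "FSxLustre")  # the two supported types
    return {
        "ChannelName": "train",
        "DataSource": {
            "FileSystemDataSource": {
                "FileSystemId": fs_id,
                "FileSystemType": fs_type,
                "FileSystemAccessMode": "ro",  # 'ro' or 'rw'
                "DirectoryPath": directory,
            }
        },
    }

channel = file_system_channel("FSxLustre", "fs-0123456789abcdef0", "/fsx/train")
```

Because FSx for Lustre can be linked to an S3 bucket, this lets you keep S3 as the source of truth while training reads from the faster file system.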

Flashcards

Question

Why is Amazon S3 the default storage for ML on AWS?

Answer

S3 provides unlimited scale, 99.999999999% (11 nines) durability, native SageMaker integration (File/Pipe/FastFile modes), and serves as the hub connecting all ML services: training data, model artifacts, feature store, streaming landing zone.

Key Insight

When SageMaker training is slow due to data loading, consider these acceleration options in order: switch to Pipe mode or FastFile mode for streaming from S3, mount an EFS file system, or use FSx for Lustre as a high-performance cache in front of S3.
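The first acceleration step, switching input modes, is a one-field change per channel in the CreateTrainingJob request. A sketch with a hypothetical bucket:

```python
# Sketch: an S3-backed training channel for SageMaker CreateTrainingJob,
# with the input mode set per channel. Bucket/prefix are hypothetical.
def make_training_channel(s3_uri: str, input_mode: str) -> dict:
    """Build an InputDataConfig channel; InputMode selects File/Pipe/FastFile."""
    assert input_mode in ("File", "Pipe", "FastFile")
    return {
        "ChannelName": "train",
        "InputMode": input_mode,  # per-channel override of TrainingInputMode
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }

channel = make_training_channel("s3://ml-data-lake-example/train/", "FastFile")
```

Pipe and FastFile both stream from S3 instead of copying the full dataset to local disk before training starts, which is why they are the cheapest first fix for slow data loading.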