Compute and Containers
While SageMaker abstracts most infrastructure decisions, understanding the underlying compute options is essential for optimizing cost and performance. This section covers EC2 instance types for ML, container workflows, and strategies for reducing training costs.
Overview
| Service | What It Does | When to Use |
|---|---|---|
| Amazon EC2 | Virtual servers with GPU options (P3, P4, G4) for ML training and inference | Custom ML environments needing full infrastructure control |
| Amazon ECR | Docker container registry | Store custom Docker images for SageMaker training/inference (BYOC pattern) |
| Amazon ECS / EKS | Container orchestration (ECS = AWS-native, EKS = Kubernetes) | Run containerized ML workloads outside SageMaker |
| AWS Fargate | Serverless compute for containers | Run containers without managing servers |
| AWS Batch | Managed batch computing with scheduling | Large-scale batch processing, HPC. Supports GPU and Spot Instances |
| Deep Learning AMIs (DLAMI) | Pre-configured EC2 AMIs with ML frameworks (TensorFlow, PyTorch, MXNet) | Quick-start ML development on EC2 with pre-installed CUDA, cuDNN, and frameworks |
EC2 Instance Types for ML
| Instance Family | GPU | Best For |
|---|---|---|
| P3 | NVIDIA V100 | Training (general deep learning) |
| P4 | NVIDIA A100 | Training (large-scale, latest generation) |
| G4 | NVIDIA T4 | Inference (cost-effective GPU inference) |
| Inf1 | AWS Inferentia | Inference (custom AWS ML chip, best price-performance) |
The BYOC Pattern (Bring Your Own Container)
When SageMaker's built-in algorithms or pre-built framework containers do not meet your needs, use the BYOC pattern:
- Build a Docker image with your custom algorithm or framework
- Push the image to Amazon ECR
- Reference the ECR image URI in your SageMaker Training Job or Endpoint configuration
This pattern gives you full control over the training and inference environment while still leveraging SageMaker's managed infrastructure.
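As a rough sketch of step 3, the ECR image URI ends up in the `AlgorithmSpecification` of a training job request. The snippet below builds the request as a plain dict in the shape `boto3`'s `create_training_job` expects; the account ID, region, repository name, role, and bucket are all hypothetical placeholders.

```python
# Sketch of a CreateTrainingJob request using a custom ECR image (BYOC).
# All identifiers below are placeholders, not real resources.

ACCOUNT = "123456789012"  # hypothetical AWS account ID
REGION = "us-east-1"
IMAGE_URI = f"{ACCOUNT}.dkr.ecr.{REGION}.amazonaws.com/my-algo:latest"

training_job = {
    "TrainingJobName": "byoc-demo",
    "AlgorithmSpecification": {
        "TrainingImage": IMAGE_URI,       # custom image pushed to ECR
        "TrainingInputMode": "File",
    },
    "RoleArn": f"arn:aws:iam::{ACCOUNT}:role/SageMakerExecutionRole",
    "ResourceConfig": {
        "InstanceType": "ml.p3.2xlarge",  # V100 GPU (see instance table above)
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    "StoppingCondition": {"MaxRuntimeInSeconds": 3600},
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
}

# With boto3, this dict would be passed as:
#   sagemaker_client.create_training_job(**training_job)
```

SageMaker pulls the image from ECR onto the managed training instance, so your container controls the environment while SageMaker still handles provisioning and cleanup.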
Training Cost Optimization
| Strategy | How It Helps |
|---|---|
| Spot Instances + Checkpointing | Save up to 90% on training. SageMaker Managed Spot Training handles interruptions; checkpointing saves progress so training resumes instead of restarting |
| Pipe/FastFile mode | Stream data from S3 during training instead of downloading — faster startup, lower storage needs |
| Training Compiler | Optimizes deep learning computation graphs for PyTorch and TensorFlow — up to 50% faster training without code changes |
| Elastic Inference | Attach fractional GPU acceleration to a CPU instance for inference — right-size GPU allocation when a full GPU is underutilized. Note: AWS has since deprecated Elastic Inference and recommends Inferentia (Inf1) instances instead |
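The first two strategies above combine naturally in one training job. The sketch below shows the request fields that enable Managed Spot Training, checkpointing, and FastFile input mode, again as a plain dict in the `create_training_job` shape; the job name, bucket, and image URI are placeholders.

```python
# Sketch: Managed Spot Training + checkpointing + FastFile streaming.
# Bucket, job name, and image URI are hypothetical placeholders.

spot_training_job = {
    "TrainingJobName": "spot-demo",
    "EnableManagedSpotTraining": True,            # bill at Spot rates (up to ~90% off)
    "CheckpointConfig": {
        "S3Uri": "s3://my-bucket/checkpoints/",   # progress survives interruption
        "LocalPath": "/opt/ml/checkpoints",       # where the container writes checkpoints
    },
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,   # cap on actual training time
        "MaxWaitTimeInSeconds": 7200,  # training time + time spent waiting for Spot capacity
    },
    "AlgorithmSpecification": {
        "TrainingImage": "<framework-or-custom-image-uri>",  # placeholder
        "TrainingInputMode": "FastFile",  # stream from S3 instead of downloading
    },
}
```

`MaxWaitTimeInSeconds` must be at least `MaxRuntimeInSeconds`; the difference is how long SageMaker may wait for Spot capacity before the job fails. When an interruption occurs, the job restarts and the container reloads the latest checkpoint from the S3 URI instead of training from scratch.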
When to Use
Use SageMaker's managed training infrastructure for most ML workloads — it handles provisioning, scaling, and cleanup automatically. Drop down to raw EC2 or AWS Batch when you need full control over the environment, existing Hadoop/Spark integration, or HPC-style batch processing. Use the BYOC pattern with ECR when you need custom frameworks or algorithms in SageMaker.
Flashcards
What is the BYOC pattern in SageMaker?
Bring Your Own Container: Build a Docker image → Push to ECR → Reference in SageMaker. Used when built-in algorithms or pre-built framework containers don't meet your needs.
For GPU-based inference, evaluate whether a full GPU is actually needed. If GPU utilization is low, Elastic Inference (fractional GPU on a CPU instance) or AWS Inferentia (Inf1 instances) can deliver significant cost savings compared to P3/P4 instances.