
Decision Trees — When to Use What

This page consolidates all the "when to use what" guidance into a single reference. Use these decision tables when you need to quickly identify the right AWS service for a given ML scenario.

Data Preparation

| Scenario | Recommended Service |
|---|---|
| No-code data prep (non-technical users) | Glue DataBrew |
| Visual data prep feeding into SageMaker | SageMaker Data Wrangler |
| Feature importance scores | Data Wrangler Quick Model |
| Batch ETL at scale (PySpark) | AWS Glue ETL |
| Schema discovery from S3 | Glue Crawlers → Glue Data Catalog |
| Fuzzy matching / deduplication | Glue FindMatches |
| Format conversion: CSV → Parquet (batch) | Glue ETL job |
| Format conversion: JSON → Parquet (streaming) | Kinesis Firehose (native conversion via Glue Catalog) |
| Format conversion: images → RecordIO | im2rec (MXNet utility) |
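The streaming JSON → Parquet row hinges on Firehose's record-format conversion, which reads the target schema from the Glue Data Catalog. Below is a minimal sketch of that configuration; the bucket, role ARNs, and Glue database/table names are placeholders, and the exact field shapes should be checked against the Firehose API reference:

```python
# Sketch of Firehose record-format conversion (JSON in, Parquet out).
# All ARNs and names below are placeholders.
firehose_destination = {
    "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
    "BucketARN": "arn:aws:s3:::my-data-lake",
    "DataFormatConversionConfiguration": {
        "Enabled": True,
        # Firehose looks up the target schema in the Glue Data Catalog.
        "SchemaConfiguration": {
            "DatabaseName": "analytics",  # placeholder Glue database
            "TableName": "events",        # placeholder Glue table
            "RoleARN": "arn:aws:iam::123456789012:role/firehose-delivery-role",
        },
        # Deserialize incoming JSON, serialize outgoing Parquet.
        "InputFormatConfiguration": {"Deserializer": {"OpenXJsonSerDe": {}}},
        "OutputFormatConfiguration": {"Serializer": {"ParquetSerDe": {}}},
    },
}

# This dict would be passed as the extended S3 destination:
# boto3.client("firehose").create_delivery_stream(
#     DeliveryStreamName="json-to-parquet",
#     DeliveryStreamType="DirectPut",
#     ExtendedS3DestinationConfiguration=firehose_destination,
# )
```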

Data Ingestion

| Scenario | Recommended Service |
|---|---|
| Stream to S3 with transforms (least effort) | Kinesis Data Firehose + Lambda |
| Real-time custom processing with replay | Kinesis Data Streams |
| SQL on streaming data | Kinesis Data Analytics |
| Real-time anomaly detection | KDA + Random Cut Forest |
| On-premises → S3 with scheduling | AWS DataSync |
| Petabytes + slow network | Snowball Edge |
| RDS/DynamoDB → S3 for ML | Glue / Data Pipeline / DMS |
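The Streams-vs-Firehose distinction above shows up directly in their put APIs: Data Streams requires a partition key (which gives per-shard ordering and replay), while Firehose simply buffers records for delivery. A sketch with placeholder stream names:

```python
import json

record = {"device_id": "sensor-42", "temp_c": 21.5}  # example payload
payload = json.dumps(record).encode("utf-8")

# Kinesis Data Streams: caller picks a partition key; consumers can replay.
streams_args = {
    "StreamName": "telemetry",              # placeholder stream name
    "Data": payload,
    "PartitionKey": record["device_id"],    # keeps one device's events ordered
}

# Kinesis Data Firehose: no partition key; Firehose buffers and delivers to S3.
firehose_args = {
    "DeliveryStreamName": "telemetry-to-s3",  # placeholder
    "Record": {"Data": payload},
}

# boto3.client("kinesis").put_record(**streams_args)
# boto3.client("firehose").put_record(**firehose_args)
```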

Model Training

| Scenario | Recommended Service |
|---|---|
| Standard ML training | SageMaker Training Jobs (data from S3) |
| AutoML with no expertise | SageMaker Autopilot or Canvas |
| Reduce training cost | Spot Instances + Checkpointing |
| Faster data loading | Pipe mode (RecordIO) or FastFile mode |
| Distributed training | Horovod (data parallel) or SageMaker distributed |
| Custom framework or algorithm | ECR Docker → SageMaker BYOC |
| Hyperparameter tuning | SageMaker Automatic Model Tuning |
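Pairing Spot instances with checkpointing, as the cost-reduction row suggests, comes down to a few extra fields on the training-job request. An illustrative request dict (image URI, role ARN, and bucket names are placeholders):

```python
# Sketch of a spot training job with checkpointing. ARNs/URIs are placeholders.
training_job_args = {
    "TrainingJobName": "xgboost-spot-demo",
    "AlgorithmSpecification": {
        # Built-in algorithm images are region/account specific; illustrative only.
        "TrainingImage": "123456789012.dkr.ecr.us-east-1.amazonaws.com/xgboost:latest",
        "TrainingInputMode": "File",
    },
    "RoleArn": "arn:aws:iam::123456789012:role/sagemaker-exec",
    "OutputDataConfig": {"S3OutputPath": "s3://my-bucket/output/"},
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
    },
    # The spot-specific pieces: opt in, give SageMaker a place to checkpoint,
    # and allow extra wall-clock time to wait for spot capacity.
    "EnableManagedSpotTraining": True,
    "CheckpointConfig": {"S3Uri": "s3://my-bucket/checkpoints/"},
    "StoppingCondition": {
        "MaxRuntimeInSeconds": 3600,
        "MaxWaitTimeInSeconds": 7200,  # must be >= MaxRuntimeInSeconds
    },
}

# boto3.client("sagemaker").create_training_job(**training_job_args)
```

If a spot instance is reclaimed, the job resumes from the latest checkpoint in the configured S3 prefix rather than restarting from scratch.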

Model Deployment

| Scenario | Recommended Service |
|---|---|
| Steady real-time traffic | SageMaker Real-time Endpoint |
| Intermittent / unpredictable traffic | SageMaker Serverless Inference |
| Periodic bulk predictions | SageMaker Batch Transform |
| Large payloads (up to 1 GB) | SageMaker Async Inference |
| A/B testing models | Production Variants on a single endpoint |
| Edge deployment (no internet) | Neo → IoT Greengrass |
| Choose best instance type | SageMaker Inference Recommender |
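The A/B-testing row works by attaching multiple production variants, each with a traffic weight, to one endpoint config. A sketch with placeholder model names:

```python
# Sketch of an A/B endpoint config with a 90/10 traffic split.
# Model names are placeholders and must already exist in SageMaker.
endpoint_config_args = {
    "EndpointConfigName": "churn-ab-test",
    "ProductionVariants": [
        {
            "VariantName": "model-a",
            "ModelName": "churn-model-v1",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.9,  # 90% of traffic
        },
        {
            "VariantName": "model-b",
            "ModelName": "churn-model-v2",
            "InstanceType": "ml.m5.large",
            "InitialInstanceCount": 1,
            "InitialVariantWeight": 0.1,  # 10% canary traffic
        },
    ],
}

# boto3.client("sagemaker").create_endpoint_config(**endpoint_config_args)
# Weights can later be shifted without downtime via
# update_endpoint_weights_and_capacities on the live endpoint.
```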

Monitoring and MLOps

| Scenario | Recommended Service |
|---|---|
| Data drift / model quality monitoring | SageMaker Model Monitor |
| Training-time debugging | SageMaker Debugger |
| Model explainability | SageMaker Clarify (SHAP values) |
| ML workflow automation | SageMaker Pipelines |
| Model versioning / approval | SageMaker Model Registry |
| Human review of predictions | Amazon A2I |
| Automated retraining on drift | Model Monitor → EventBridge → Pipelines |
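The drift-triggered retraining row chains Model Monitor to EventBridge. Since Model Monitor executions run as SageMaker Processing jobs, one way to catch them is an EventBridge rule on processing-job state changes; the detail-type and detail fields below are assumptions and worth verifying against the events your account actually emits:

```python
import json

# Assumed event pattern for catching Model Monitor execution results.
event_pattern = {
    "source": ["aws.sagemaker"],
    # Model Monitor executions surface as SageMaker Processing jobs (assumption).
    "detail-type": ["SageMaker Processing Job State Change"],
    "detail": {"ProcessingJobStatus": ["Completed", "Failed"]},
}

# boto3.client("events").put_rule(
#     Name="model-monitor-drift",
#     EventPattern=json.dumps(event_pattern),
# )
# A rule target (e.g., a Lambda function) would then read the monitoring
# report, check for constraint violations, and start a SageMaker Pipeline
# execution to retrain the model.
```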

Security

| Scenario | Recommended Service |
|---|---|
| SageMaker access control | IAM execution roles (never access keys) |
| No internet for SageMaker | VPC + S3 Gateway Endpoint + VPC Interface Endpoints |
| Encrypt at rest | KMS CMK |
| Data lake governance | Lake Formation (column-level access) |
| Audit API calls | CloudTrail |
| Monitor resource metrics | CloudWatch |
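Several of the security rows translate into fields on a single SageMaker training request: KMS keys for output and volume encryption, a VPC config for private networking, and network isolation. An illustrative fragment (all ARNs and IDs are placeholders):

```python
# Security-related fields for a SageMaker create_training_job request.
# Every ARN and ID below is a placeholder.
secure_overrides = {
    # Encrypt model artifacts written to S3 with a customer-managed KMS key.
    "OutputDataConfig": {
        "S3OutputPath": "s3://my-bucket/output/",
        "KmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/placeholder",
    },
    # Encrypt the attached training volume as well.
    "ResourceConfig": {
        "InstanceType": "ml.m5.xlarge",
        "InstanceCount": 1,
        "VolumeSizeInGB": 50,
        "VolumeKmsKeyId": "arn:aws:kms:us-east-1:123456789012:key/placeholder",
    },
    # Keep traffic inside the VPC; S3 access then flows through a gateway endpoint.
    "VpcConfig": {
        "SecurityGroupIds": ["sg-0123456789abcdef0"],
        "Subnets": ["subnet-0123456789abcdef0"],  # private subnet
    },
    # Block all internet access from the training container.
    "EnableNetworkIsolation": True,
}
```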

AI Service Selection (No ML Expertise Needed)

| Task | Recommended Service |
|---|---|
| Text classification | Comprehend Custom Classifier |
| Sentiment / entities / key phrases | Comprehend |
| Image classification (custom) | Rekognition Custom Labels |
| Face detection / matching | Rekognition |
| Extract text from documents | Textract |
| Speech to text | Transcribe |
| Text to speech | Polly |
| Language translation | Translate |
| Chatbots / voice bots | Lex |
| Time-series forecasting (managed) | Forecast |
| Product recommendations | Personalize |
| Fraud detection | Fraud Detector |
| Enterprise search | Kendra |
| Generative AI / foundation models | Bedrock |
| Equipment anomaly detection | Lookout for Equipment |
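As an example of how little code these managed services require, here is the shape of a Comprehend sentiment request (the sample text is made up):

```python
# A single-call sentiment analysis with Comprehend; no model to train or host.
comprehend_request = {
    "Text": "The checkout flow was fast and the support team was great.",
    "LanguageCode": "en",
}

# resp = boto3.client("comprehend").detect_sentiment(**comprehend_request)
# resp["Sentiment"] is one of "POSITIVE", "NEGATIVE", "NEUTRAL", "MIXED";
# resp["SentimentScore"] holds the per-class confidence scores.
```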

Flashcards

Question

You need to detect anomalies in a real-time data stream. What AWS services do you use?

Answer

Kinesis Data Analytics with its built-in Random Cut Forest (RCF) function. This is the standard pattern for streaming anomaly detection.

Pro Tip

When evaluating which service to use, consider the effort spectrum. Managed AI services (Comprehend, Rekognition, etc.) require the least effort. SageMaker built-in algorithms are next. Custom SageMaker training (BYOC) requires the most effort but gives the most control. Match the level of effort to the complexity of your problem.