# SageMaker Built-in Algorithms
SageMaker provides a library of built-in algorithms optimized for AWS infrastructure. These algorithms are pre-packaged in Docker containers, support distributed training, and are tuned for performance on SageMaker. Using a built-in algorithm eliminates the need to write training code from scratch while still giving you control over hyperparameters and data handling.
## Supervised Learning Algorithms

### Classification and Regression
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| XGBoost | Tabular data — the go-to for most structured problems | CSV, LibSVM, Parquet | Use scale_pos_weight for imbalanced classes. Supports both classification and regression |
| Linear Learner | Linear/logistic regression on high-dimensional tabular data | RecordIO, CSV | Normalizes data automatically. Handles binary classification, multiclass, and regression |
| k-NN | Classify based on nearest neighbors, find similar items | RecordIO, CSV | Supervised (unlike K-Means which is unsupervised clustering) |
| Factorization Machines | Recommendation systems, click-through prediction, sparse data | RecordIO (protobuf) | Excels with high-dimensional sparse data by capturing feature interactions |
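The `scale_pos_weight` note for XGBoost has a common heuristic: set it to the ratio of negative to positive examples so both classes contribute comparable weight during training. A minimal sketch (the label counts here are made up for illustration):

```python
# Heuristic for XGBoost's scale_pos_weight on an imbalanced binary problem:
# weight positives by the negative/positive ratio. Labels are a toy example.
labels = [0] * 95 + [1] * 5          # 95 negatives, 5 positives

negatives = labels.count(0)
positives = labels.count(1)
scale_pos_weight = negatives / positives

print(scale_pos_weight)              # pass this value as the hyperparameter
```

With a 95:5 split this yields 19.0, telling XGBoost to treat each positive example as roughly 19 negatives when computing the loss.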
### Time-Series Forecasting
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| DeepAR | Forecasting across multiple related time series | JSON Lines, Parquet | Handles cold-start (new products), missing values (NaN), and produces probabilistic forecasts (quantiles) |
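DeepAR's JSON Lines input is one JSON object per time series; per the AWS docs, missing target values can be encoded as the string `"NaN"`. A sketch of one training record (values and category index are made up):

```python
import json

# One record per related time series (JSON Lines: one object per line).
# "start" is the first timestamp, "target" the observed values; the string
# "NaN" marks a missing observation, and "cat" carries optional categorical
# features (e.g. a product-category index for cold-start generalization).
record = {
    "start": "2024-01-01 00:00:00",
    "target": [5.0, 7.0, "NaN", 6.0],   # "NaN" string encodes a missing value
    "cat": [0],
}
line = json.dumps(record)
print(line)
```

A training file is simply many of these lines, one per series, which is what lets DeepAR learn across related series rather than fitting each one independently.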
### Computer Vision
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| Image Classification | Classify images into categories (ResNet-based CNN) | RecordIO, Augmented Manifest | Supports transfer learning — full training or fine-tuning top layers only |
| Object Detection | Detect and locate objects with bounding boxes | RecordIO, Augmented Manifest | Returns bounding boxes, class labels, and confidence scores |
| Semantic Segmentation | Pixel-level labeling of images (e.g., road vs. car vs. sidewalk) | Augmented Manifest (Image + Annotation) | Pixel-level precision. Use when bounding boxes are not detailed enough |
### Natural Language Processing
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| BlazingText | Text classification (supervised) or word embeddings (unsupervised/Word2Vec) | Plain text with `__label__` prefixes or augmented manifest (supervised); plain text (unsupervised) | Unsupervised modes: CBOW, Skip-gram, Batch Skip-gram |
| Seq2Seq | Machine translation, text summarization, speech-to-text | RecordIO (protobuf) | Encoder-decoder architecture: input sequence maps to output sequence |
| Object2Vec | Create embeddings for pairs of objects — relationship modeling | JSON Lines | Generalizes Word2Vec to arbitrary objects (sentences, customers, products) |
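For BlazingText's supervised mode in file mode, the AWS docs describe one example per line with labels given as `__label__` prefixes before the space-tokenized sentence. A sketch of preparing such a file (the samples are invented):

```python
# BlazingText supervised "file mode": one example per line, each label
# written as a __label__<tag> prefix before the tokenized sentence.
samples = [
    ("positive", "great battery life and screen"),
    ("negative", "stopped working after a week"),
]

lines = [f"__label__{label} {text}" for label, text in samples]
training_file = "\n".join(lines)
print(training_file)
```

The resulting text file is uploaded to S3 and passed as the training channel.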
## Unsupervised Learning Algorithms
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| K-Means | Cluster similar data points — customer segmentation | RecordIO, CSV | Use the elbow method to find optimal k. Often paired with PCA |
| PCA | Dimensionality reduction — reduce features while preserving variance | RecordIO, CSV | Data must be scaled first. PCA is unsupervised — it does NOT provide feature importance relative to a target |
| Random Cut Forest (RCF) | Anomaly detection in datasets | RecordIO, CSV | Also available in Kinesis Data Analytics for real-time streaming anomaly detection |
| Neural Topic Model (NTM) | Discover topics in text document collections | RecordIO, CSV (bag of words) | Neural-network-based alternative to LDA |
| LDA | Discover topics in text documents | RecordIO, CSV (bag of words) | For text topic modeling only — not for structured/tabular data |
| IP Insights | Detect anomalous IP address usage patterns | CSV (user, IP pairs) | Learns normal user-IP associations and flags unusual access |
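The elbow method mentioned for K-Means plots within-cluster sum of squares (WCSS) against k and looks for the bend where adding clusters stops paying off. A stdlib-only 1-D sketch on toy data (the real clustering would run in SageMaker or scikit-learn; this just illustrates the idea):

```python
def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means: returns (centroids, wcss). Toy illustration only."""
    s = sorted(xs)
    # spread initial centroids across the data range at evenly spaced quantiles
    cents = [s[int(i * (len(s) - 1) / max(k - 1, 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - cents[c]))
            clusters[nearest].append(x)
        cents = [sum(c) / len(c) if c else cents[i] for i, c in enumerate(clusters)]
    wcss = sum(min((x - c) ** 2 for c in cents) for x in xs)
    return cents, wcss

# Three well-separated groups: WCSS drops sharply until k=3, then flattens.
# That bend ("elbow") suggests k=3 is the right cluster count.
data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
wcss_by_k = {k: kmeans_1d(data, k)[1] for k in (1, 2, 3, 4)}
print(wcss_by_k)
```

On this data the drop from k=2 to k=3 is large while k=3 to k=4 is marginal, which is exactly the elbow you would read off the plot.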
## Algorithm Selection Quick Reference
Use this guide to quickly match your problem to the right algorithm:
| Problem Type | First Choice |
|---|---|
| Tabular classification/regression | XGBoost |
| Time-series forecasting | DeepAR |
| Text classification | BlazingText (supervised mode) |
| Word embeddings | BlazingText (unsupervised mode) |
| Image classification | Image Classification |
| Object detection in images | Object Detection |
| Pixel-level image labeling | Semantic Segmentation |
| Clustering / grouping | K-Means |
| Dimensionality reduction | PCA |
| Anomaly detection | Random Cut Forest |
| Recommendations (sparse data) | Factorization Machines |
| Topic modeling in text | NTM or LDA |
| Translation / summarization | Seq2Seq |
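The table above can be codified as a small lookup for drilling (purely illustrative; the keys and values mirror the table, not any SageMaker API):

```python
# Problem-type → first-choice built-in algorithm, mirroring the table above.
FIRST_CHOICE = {
    "tabular classification/regression": "XGBoost",
    "time-series forecasting": "DeepAR",
    "text classification": "BlazingText (supervised mode)",
    "word embeddings": "BlazingText (unsupervised mode)",
    "image classification": "Image Classification",
    "object detection in images": "Object Detection",
    "pixel-level image labeling": "Semantic Segmentation",
    "clustering / grouping": "K-Means",
    "dimensionality reduction": "PCA",
    "anomaly detection": "Random Cut Forest",
    "recommendations (sparse data)": "Factorization Machines",
    "topic modeling in text": "NTM or LDA",
    "translation / summarization": "Seq2Seq",
}

print(FIRST_CHOICE["anomaly detection"])   # Random Cut Forest
```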
## When to Use
Choose built-in algorithms when your problem fits one of the supported categories and you want optimized, distributed training without writing custom code. If you need custom model code instead (e.g., a custom PyTorch architecture), use SageMaker's prebuilt framework containers in script mode, or the BYOC (Bring Your Own Container) pattern with ECR when no prebuilt container fits.
## Flashcards
Which SageMaker algorithm is the default choice for tabular/structured data problems?
XGBoost. It supports both classification and regression, handles CSV/LibSVM/Parquet, and offers scale_pos_weight for imbalanced classes.
Can LDA or NTM be used to find patterns in structured/tabular data?
No. LDA (Latent Dirichlet Allocation) and NTM are designed exclusively for topic modeling on text documents; they do not work on structured/tabular data. If you need to find patterns in tabular data, use K-Means for clustering or PCA for dimensionality reduction.