# SageMaker Built-in Algorithms
SageMaker provides a library of built-in algorithms optimized for AWS infrastructure. These algorithms are pre-packaged in Docker containers, support distributed training, and are tuned for performance on SageMaker. Using a built-in algorithm eliminates the need to write training code from scratch while still giving you control over hyperparameters and data handling.
## Supervised Learning Algorithms

### Classification and Regression
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| XGBoost | Tabular data — the go-to for most structured problems | CSV, LibSVM, Parquet | Use scale_pos_weight for imbalanced classes. Supports both classification and regression |
| Linear Learner | Linear/logistic regression on high-dimensional tabular data | RecordIO, CSV | Normalizes data automatically. Handles binary classification, multiclass, and regression |
| k-NN | Classify based on nearest neighbors, find similar items | RecordIO, CSV | Supervised (unlike K-Means which is unsupervised clustering) |
| Factorization Machines | Recommendation systems, click-through prediction, sparse data | RecordIO (protobuf) | Excels with high-dimensional sparse data by capturing feature interactions |
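The `scale_pos_weight` note for XGBoost has a common heuristic: set it to the ratio of negative to positive examples so both classes contribute comparable weight during training. A minimal sketch (the label counts here are made up for illustration):

```python
# Heuristic for XGBoost's scale_pos_weight on an imbalanced binary problem:
# weight positives by the negative/positive ratio. Labels are a toy example.
labels = [0] * 95 + [1] * 5          # 95 negatives, 5 positives

negatives = labels.count(0)
positives = labels.count(1)
scale_pos_weight = negatives / positives

print(scale_pos_weight)              # pass this value as the hyperparameter
```

With a 95:5 split this yields 19.0, telling XGBoost to treat each positive example as roughly 19 negatives when computing the loss.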
### Time-Series Forecasting
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| DeepAR | Forecasting across multiple related time series | JSON Lines, Parquet | Handles cold-start (new products), missing values (NaN), and produces probabilistic forecasts (quantiles) |
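DeepAR's JSON Lines input is one JSON object per time series; per the AWS docs, missing target values can be encoded as the string `"NaN"`. A sketch of one training record (values and category index are made up):

```python
import json

# One record per related time series (JSON Lines: one object per line).
# "start" is the first timestamp, "target" the observed values; the string
# "NaN" marks a missing observation, and "cat" carries optional categorical
# features (e.g. a product-category index for cold-start generalization).
record = {
    "start": "2024-01-01 00:00:00",
    "target": [5.0, 7.0, "NaN", 6.0],   # "NaN" string encodes a missing value
    "cat": [0],
}
line = json.dumps(record)
print(line)
```

A training file is simply many of these lines, one per series, which is what lets DeepAR learn across related series rather than fitting each one independently.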
### Computer Vision
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| Image Classification | Classify images into categories (ResNet-based CNN) | RecordIO, Augmented Manifest | Supports transfer learning — full training or fine-tuning top layers only |
| Object Detection | Detect and locate objects with bounding boxes | RecordIO, Augmented Manifest | Returns bounding boxes, class labels, and confidence scores |
| Semantic Segmentation | Pixel-level labeling of images (e.g., road vs. car vs. sidewalk) | Augmented Manifest (Image + Annotation) | Pixel-level precision. Use when bounding boxes are not detailed enough |
### Natural Language Processing
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| BlazingText | Text classification (supervised) or word embeddings (unsupervised/Word2Vec) | Plain text with `__label__` prefixes or augmented manifest (supervised); plain text (unsupervised) | Unsupervised modes: CBOW, Skip-gram, Batch Skip-gram |
| Seq2Seq | Machine translation, text summarization, speech-to-text | RecordIO (protobuf) | Encoder-decoder architecture: input sequence maps to output sequence |
| Object2Vec | Create embeddings for pairs of objects — relationship modeling | JSON Lines | Generalizes Word2Vec to arbitrary objects (sentences, customers, products) |
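For BlazingText's supervised mode in file mode, the AWS docs describe one example per line with labels given as `__label__` prefixes before the space-tokenized sentence. A sketch of preparing such a file (the samples are invented):

```python
# BlazingText supervised "file mode": one example per line, each label
# written as a __label__<tag> prefix before the tokenized sentence.
samples = [
    ("positive", "great battery life and screen"),
    ("negative", "stopped working after a week"),
]

lines = [f"__label__{label} {text}" for label, text in samples]
training_file = "\n".join(lines)
print(training_file)
```

The resulting text file is uploaded to S3 and passed as the training channel.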
## Unsupervised Learning Algorithms
| Algorithm | Use Case | Input Format | Key Details |
|---|---|---|---|
| K-Means | Cluster similar data points — customer segmentation | RecordIO, CSV | Use the elbow method to find optimal k. Often paired with PCA |
| PCA | Dimensionality reduction — reduce features while preserving variance | RecordIO, CSV | Data must be scaled first. PCA is unsupervised — it does NOT provide feature importance relative to a target |
| Random Cut Forest (RCF) | Anomaly detection in datasets | RecordIO, CSV | Also available in Kinesis Data Analytics for real-time streaming anomaly detection |
| Neural Topic Model (NTM) | Discover topics in text document collections | RecordIO, CSV (bag of words) | Neural-network-based alternative to LDA |
| LDA | Discover topics in text documents | RecordIO, CSV (bag of words) | For text topic modeling only — not for structured/tabular data |
| IP Insights | Detect anomalous IP address usage patterns | CSV (user, IP pairs) | Learns normal user-IP associations and flags unusual access |
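The elbow method mentioned for K-Means plots within-cluster sum of squares (WCSS) against k and looks for the bend where adding clusters stops paying off. A stdlib-only 1-D sketch on toy data (the real clustering would run in SageMaker or scikit-learn; this just illustrates the idea):

```python
def kmeans_1d(xs, k, iters=20):
    """Tiny 1-D k-means: returns (centroids, wcss). Toy illustration only."""
    s = sorted(xs)
    # spread initial centroids across the data range at evenly spaced quantiles
    cents = [s[int(i * (len(s) - 1) / max(k - 1, 1))] for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for x in xs:
            nearest = min(range(k), key=lambda c: abs(x - cents[c]))
            clusters[nearest].append(x)
        cents = [sum(c) / len(c) if c else cents[i] for i, c in enumerate(clusters)]
    wcss = sum(min((x - c) ** 2 for c in cents) for x in xs)
    return cents, wcss

# Three well-separated groups: WCSS drops sharply until k=3, then flattens.
# That bend ("elbow") suggests k=3 is the right cluster count.
data = [1, 2, 3, 10, 11, 12, 20, 21, 22]
wcss_by_k = {k: kmeans_1d(data, k)[1] for k in (1, 2, 3, 4)}
print(wcss_by_k)
```

On this data the drop from k=2 to k=3 is large while k=3 to k=4 is marginal, which is exactly the elbow you would read off the plot.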
## Algorithm Selection Quick Reference
Use this guide to quickly match your problem to the right algorithm:
| Problem Type | First Choice |
|---|---|
| Tabular classification/regression | XGBoost |
| Time-series forecasting | DeepAR |
| Text classification | BlazingText (supervised mode) |
| Word embeddings | BlazingText (unsupervised mode) |
| Image classification | Image Classification |
| Object detection in images | Object Detection |
| Pixel-level image labeling | Semantic Segmentation |
| Clustering / grouping | K-Means |
| Dimensionality reduction | PCA |
| Anomaly detection | Random Cut Forest |
| Recommendations (sparse data) | Factorization Machines |
| Topic modeling in text | NTM or LDA |
| Translation / summarization | Seq2Seq |
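The table above can be codified as a small lookup for drilling (purely illustrative; the keys and values mirror the table, not any SageMaker API):

```python
# Problem-type → first-choice built-in algorithm, mirroring the table above.
FIRST_CHOICE = {
    "tabular classification/regression": "XGBoost",
    "time-series forecasting": "DeepAR",
    "text classification": "BlazingText (supervised mode)",
    "word embeddings": "BlazingText (unsupervised mode)",
    "image classification": "Image Classification",
    "object detection in images": "Object Detection",
    "pixel-level image labeling": "Semantic Segmentation",
    "clustering / grouping": "K-Means",
    "dimensionality reduction": "PCA",
    "anomaly detection": "Random Cut Forest",
    "recommendations (sparse data)": "Factorization Machines",
    "topic modeling in text": "NTM or LDA",
    "translation / summarization": "Seq2Seq",
}

print(FIRST_CHOICE["anomaly detection"])   # Random Cut Forest
```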
## When to Use
Choose built-in algorithms when your problem fits one of the supported categories and you want optimized, distributed training without writing custom code. If you need custom model code instead (e.g., a custom PyTorch architecture), use SageMaker's prebuilt framework containers in script mode, or the BYOC (Bring Your Own Container) pattern with ECR when no prebuilt container fits.
## Flashcards
Which SageMaker algorithm is the default choice for tabular/structured data problems?
XGBoost. It supports both classification and regression, handles CSV/LibSVM/Parquet, and offers scale_pos_weight for imbalanced classes.
Can LDA or NTM be used to find patterns in structured/tabular data?
No. LDA (Latent Dirichlet Allocation) and NTM are designed exclusively for topic modeling on text documents; they do not work on structured/tabular data. If you need to find patterns in tabular data, use K-Means for clustering or PCA for dimensionality reduction.