
SageMaker Built-in Algorithms

SageMaker provides a library of built-in algorithms optimized for AWS infrastructure. These algorithms are pre-packaged in Docker containers, support distributed training, and are tuned for performance on SageMaker. Using a built-in algorithm eliminates the need to write training code from scratch while still giving you control over hyperparameters and data handling.
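As a configuration sketch of what "control over hyperparameters and data handling" looks like in practice, here is a training-job setup using the SageMaker Python SDK. The bucket names, file paths, and IAM role ARN are placeholders, and the XGBoost container version is an assumption; this is not runnable without an AWS account.

```python
# Sketch only — assumes the SageMaker Python SDK; the S3 paths, role ARN,
# and container version below are placeholders, not values from the source.
import sagemaker
from sagemaker import image_uris
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
region = session.boto_region_name

# Resolve the managed Docker image for the built-in XGBoost algorithm.
image = image_uris.retrieve("xgboost", region, version="1.7-1")

estimator = Estimator(
    image_uri=image,
    role="arn:aws:iam::123456789012:role/SageMakerRole",  # placeholder
    instance_count=1,                 # >1 enables distributed training
    instance_type="ml.m5.xlarge",
    output_path="s3://my-bucket/output",  # placeholder
)

# Hyperparameters are the main lever you control with built-in algorithms.
estimator.set_hyperparameters(objective="binary:logistic", num_round=100)

estimator.fit({
    "train": TrainingInput("s3://my-bucket/train.csv", content_type="text/csv"),
})
```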

Supervised Learning Algorithms

Classification and Regression

| Algorithm | Use Case | Input Format | Key Details |
| --- | --- | --- | --- |
| XGBoost | Tabular data — the go-to for most structured problems | CSV, LibSVM, Parquet | Use scale_pos_weight for imbalanced classes. Supports both classification and regression |
| Linear Learner | Linear/logistic regression on high-dimensional tabular data | RecordIO, CSV | Normalizes data automatically. Handles binary classification, multiclass, and regression |
| k-NN | Classify based on nearest neighbors, find similar items | RecordIO, CSV | Supervised (unlike K-Means, which is unsupervised clustering) |
| Factorization Machines | Recommendation systems, click-through prediction, sparse data | RecordIO (protobuf) | Excels with high-dimensional sparse data by capturing feature interactions |
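The XGBoost row mentions scale_pos_weight for imbalanced classes. A common heuristic (an assumption here, not stated in the source) is to set it to the ratio of negative to positive examples, which a few lines of pure Python can compute:

```python
def scale_pos_weight(labels):
    """Heuristic value for XGBoost's scale_pos_weight on imbalanced
    binary data: count of negatives divided by count of positives."""
    pos = sum(1 for y in labels if y == 1)
    neg = sum(1 for y in labels if y == 0)
    if pos == 0:
        raise ValueError("no positive examples in labels")
    return neg / pos

# 90 negatives and 10 positives -> weight each positive 9x.
labels = [0] * 90 + [1] * 10
print(scale_pos_weight(labels))  # → 9.0
```

This value is then passed as the scale_pos_weight hyperparameter when configuring the training job.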

Time-Series Forecasting

| Algorithm | Use Case | Input Format | Key Details |
| --- | --- | --- | --- |
| DeepAR | Forecasting across multiple related time series | JSON Lines, Parquet | Handles cold-start (new products), missing values (NaN), and produces probabilistic forecasts (quantiles) |
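A probabilistic forecast is typically consumed as quantiles (e.g., p10/p50/p90) computed across sampled future trajectories. The sketch below illustrates that idea in plain Python with a nearest-rank quantile; it is illustrative only and not the DeepAR API:

```python
def quantile(sorted_vals, q):
    """Nearest-rank quantile of a pre-sorted list."""
    idx = min(int(q * len(sorted_vals)), len(sorted_vals) - 1)
    return sorted_vals[idx]

def forecast_quantiles(sample_paths, qs=(0.1, 0.5, 0.9)):
    """sample_paths: list of sampled trajectories (one list per sample).
    Returns, per time step, the requested quantiles across samples."""
    horizon = len(sample_paths[0])
    out = []
    for t in range(horizon):
        vals = sorted(path[t] for path in sample_paths)
        out.append({q: quantile(vals, q) for q in qs})
    return out

# Ten sampled one-step trajectories with values 0..9.
samples = [[float(v)] for v in range(10)]
print(forecast_quantiles(samples))
```

The spread between the low and high quantiles is what makes the forecast "probabilistic": it quantifies uncertainty instead of emitting a single point estimate.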

Computer Vision

| Algorithm | Use Case | Input Format | Key Details |
| --- | --- | --- | --- |
| Image Classification | Classify images into categories (ResNet-based CNN) | RecordIO, Augmented Manifest | Supports transfer learning — full training or fine-tuning top layers only |
| Object Detection | Detect and locate objects with bounding boxes | RecordIO, Augmented Manifest | Returns bounding boxes, class labels, and confidence scores |
| Semantic Segmentation | Pixel-level labeling of images (e.g., road vs. car vs. sidewalk) | Augmented Manifest (image + annotation) | Pixel-level precision. Use when bounding boxes are not detailed enough |
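Predicted bounding boxes are usually scored against ground truth by intersection over union (IoU). This helper is an illustrative sketch of that metric, not part of any SageMaker API:

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap rectangle (zero width/height when the boxes are disjoint).
    iw = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    ih = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = iw * ih
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1) = 1/7.
print(iou((0, 0, 2, 2), (1, 1, 3, 3)))
```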

Natural Language Processing

| Algorithm | Use Case | Input Format | Key Details |
| --- | --- | --- | --- |
| BlazingText | Text classification (supervised) or word embeddings (unsupervised/Word2Vec) | Augmented Manifest (supervised), plain text (unsupervised) | Unsupervised modes: CBOW, Skip-gram, Batch Skip-gram |
| Seq2Seq | Machine translation, text summarization, speech-to-text | RecordIO (protobuf) | Encoder-decoder architecture: input sequence maps to output sequence |
| Object2Vec | Create embeddings for pairs of objects — relationship modeling | JSON Lines | Generalizes Word2Vec to arbitrary objects (sentences, customers, products) |
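To make the CBOW/Skip-gram distinction concrete: Skip-gram trains a model to predict each surrounding context word from the center word (CBOW does the reverse). This toy generator of skip-gram training pairs is illustrative only, not BlazingText's implementation:

```python
def skipgram_pairs(tokens, window=1):
    """(center, context) training pairs: skip-gram predicts each context
    word within `window` positions of the center word."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

print(skipgram_pairs(["the", "cat", "sat"]))
# → [('the', 'cat'), ('cat', 'the'), ('cat', 'sat'), ('sat', 'cat')]
```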

Unsupervised Learning Algorithms

| Algorithm | Use Case | Input Format | Key Details |
| --- | --- | --- | --- |
| K-Means | Cluster similar data points — customer segmentation | RecordIO, CSV | Use the elbow method to find optimal k. Often paired with PCA |
| PCA | Dimensionality reduction — reduce features while preserving variance | RecordIO, CSV | Data must be scaled first. PCA is unsupervised — it does NOT provide feature importance relative to a target |
| Random Cut Forest (RCF) | Anomaly detection in datasets | RecordIO, CSV | Also available in Kinesis Data Analytics for real-time streaming anomaly detection |
| Neural Topic Model (NTM) | Discover topics in text document collections | RecordIO, CSV (bag of words) | Neural-network-based alternative to LDA |
| LDA | Discover topics in text documents | RecordIO, CSV (bag of words) | For text topic modeling only — not for structured/tabular data |
| IP Insights | Detect anomalous IP address usage patterns | CSV (user, IP pairs) | Learns normal user-IP associations and flags unusual access |
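The elbow method mentioned for K-Means plots within-cluster sum of squares (inertia) against k and looks for the bend where adding clusters stops paying off. A small pure-Python sketch (deterministic init on the first k points, 1-D data for brevity — real usage would rely on a library):

```python
def kmeans_inertia(points, k, iters=20):
    """Run 1-D k-means (init: first k points) and return the inertia,
    i.e. the sum of squared distances to the nearest centroid."""
    centroids = points[:k]
    for _ in range(iters):
        # Assignment step: each point joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: (p - centroids[c]) ** 2)
            clusters[nearest].append(p)
        # Update step: centroids move to their cluster means.
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sum(min((p - m) ** 2 for m in centroids) for p in points)

# Two well-separated groups: inertia collapses from k=1 to k=2, then
# barely improves at k=3 — the "elbow" is at k=2.
data = [1.0, 1.1, 0.9, 10.0, 10.2, 9.8]
for k in (1, 2, 3):
    print(k, round(kmeans_inertia(data, k), 3))
```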

Algorithm Selection Quick Reference

Use this guide to quickly match your problem to the right algorithm:

| Problem Type | First Choice |
| --- | --- |
| Tabular classification/regression | XGBoost |
| Time-series forecasting | DeepAR |
| Text classification | BlazingText (supervised mode) |
| Word embeddings | BlazingText (unsupervised mode) |
| Image classification | Image Classification |
| Object detection in images | Object Detection |
| Pixel-level image labeling | Semantic Segmentation |
| Clustering / grouping | K-Means |
| Dimensionality reduction | PCA |
| Anomaly detection | Random Cut Forest |
| Recommendations (sparse data) | Factorization Machines |
| Topic modeling in text | NTM or LDA |
| Translation / summarization | Seq2Seq |
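For exam drilling, the quick-reference table can be transcribed into a simple lookup. The dict below mirrors the table verbatim; the structure itself is just an illustration:

```python
# Direct transcription of the quick-reference table above.
FIRST_CHOICE = {
    "tabular classification/regression": "XGBoost",
    "time-series forecasting": "DeepAR",
    "text classification": "BlazingText (supervised mode)",
    "word embeddings": "BlazingText (unsupervised mode)",
    "image classification": "Image Classification",
    "object detection in images": "Object Detection",
    "pixel-level image labeling": "Semantic Segmentation",
    "clustering / grouping": "K-Means",
    "dimensionality reduction": "PCA",
    "anomaly detection": "Random Cut Forest",
    "recommendations (sparse data)": "Factorization Machines",
    "topic modeling in text": "NTM or LDA",
    "translation / summarization": "Seq2Seq",
}

print(FIRST_CHOICE["time-series forecasting"])  # → DeepAR
```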

When to Use

Choose built-in algorithms when your problem fits one of the supported categories and you want optimized, distributed training without writing custom code. If you need custom model code in a framework SageMaker supports (e.g., a custom PyTorch architecture), use SageMaker's prebuilt framework containers in script mode; for frameworks with no prebuilt container, use the BYOC (Bring Your Own Container) pattern with ECR.

Flashcards

Question

Which SageMaker algorithm is the default choice for tabular/structured data problems?

Answer

XGBoost. It supports both classification and regression, handles CSV/LibSVM/Parquet, and offers scale_pos_weight for imbalanced classes.

Common Misconception

LDA (Latent Dirichlet Allocation) and NTM are designed exclusively for topic modeling on text documents. They do not work on structured/tabular data. If you need to find patterns in tabular data, use K-Means for clustering or PCA for dimensionality reduction.