Algorithm Selection
Selecting the right algorithm is where theory meets practice. The best algorithm depends on your data type (tabular, text, images, time series), whether you have labels, and the specific problem you are solving. This guide covers the most important algorithms and when to reach for each one.
Quick Reference
Supervised — Tabular Data
| Algorithm | Best For | Strengths | Limitations |
|---|---|---|---|
| XGBoost | Tabular classification and regression (#1 default for structured data) | Handles missing values, built-in feature importance, regularization | Not for images, text sequences, or very high-dimensional sparse data |
| Random Forest | Classification and regression with less tuning | Robust, handles non-linear relationships, Gini importance | Slower than XGBoost, larger model size |
| Logistic Regression / Linear Learner | Binary/multi-class classification with linear decision boundary | Fast, interpretable coefficients | Cannot capture complex non-linear patterns |
| k-Nearest Neighbors (k-NN) | Classification + "find similar items" | Simple, no training phase | Slow at inference for large datasets, sensitive to dimensionality |
| Factorization Machines | Recommendation systems, click-through prediction, sparse data | Handles high-dimensional sparse data, captures feature interactions | Limited to pairwise feature interactions |
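To make the k-NN row concrete — "no training phase" means every prediction scans the full training set, which is exactly why inference slows down on large datasets — here is a toy pure-Python sketch (the function name and dataset are ours for illustration, not from any library):

```python
import math
from collections import Counter

def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest neighbors.

    No model is fit up front: all work happens at inference time,
    scanning every training point to find the k closest.
    """
    dists = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in dists[:k])
    return votes.most_common(1)[0][0]

# Toy 2-D dataset: two well-separated clusters.
X = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1), (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(X, y, (0.15, 0.1)))  # near the first cluster → "a"
print(knn_predict(X, y, (5.05, 5.0)))  # near the second cluster → "b"
```

The same nearest-neighbor scan also answers "find similar items": return `dists[:k]` instead of the majority label.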
Supervised — Time Series
| Algorithm | Best For | Key Feature |
|---|---|---|
| DeepAR | Multiple related time series, cold-start | Learns patterns across related series, handles NaN, probabilistic forecasts |
| ARIMA / SARIMA | Single time series, statistical approach | Good for stationary data with clear trend/seasonality |
| CNN-QR | Forecasting with related time series + metadata | Supports related data, holidays, promotions |
| Exponential Smoothing (ETS) | Simple time series with trend/seasonality | Cannot use related time series or metadata |
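The ETS row can be illustrated with Holt's linear-trend method, a simple double-exponential-smoothing variant. This is a minimal sketch in plain Python (parameter values and function name are our choices); note it sees only the series itself — no related series or metadata, which per the table is where DeepAR and CNN-QR come in:

```python
def holt_forecast(series, alpha=0.5, beta=0.3, horizon=3):
    """Holt's linear-trend exponential smoothing.

    Maintains a smoothed level and a smoothed trend; each forecast
    step h extrapolates the final level plus h steps of trend.
    """
    level, trend = series[0], series[1] - series[0]
    for x in series[1:]:
        prev_level = level
        level = alpha * x + (1 - alpha) * (level + trend)
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return [level + (h + 1) * trend for h in range(horizon)]

# A perfectly linear series: the forecast continues the line,
# giving approximately [20.0, 22.0, 24.0].
series = [10.0, 12.0, 14.0, 16.0, 18.0]
print(holt_forecast(series, horizon=3))
```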
Unsupervised
| Algorithm | Best For | Key Detail |
|---|---|---|
| K-Means | Clustering — group similar data points | Elbow method for optimal k. Often paired with PCA |
| PCA | Dimensionality reduction — compress features while preserving variance | Must scale data first. Unsupervised. Does NOT give feature importance for target |
| Random Cut Forest (RCF) | Anomaly detection — find outliers | Higher anomaly score = more anomalous |
| LDA / NTM | Topic modeling — discover topics in TEXT documents | For text only, not structured tabular data |
| t-SNE | Visualization of high-dimensional data in 2D/3D | For visualization ONLY, not for feature reduction in production |
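The K-Means row mentions the elbow method: plot inertia (the sum of squared distances from each point to its nearest center) against k and look for the bend. A naive pure-Python sketch makes both the algorithm and the elbow quantity visible (the helper names and toy data are ours, not a library API):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Naive k-means on 2-D points. Returns (centers, inertia), where
    inertia — the sum of squared distances to the nearest center — is
    the quantity the elbow method plots against k.
    """
    def sqdist(p, q):
        return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2

    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Assign each point to its nearest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sqdist(p, centers[j]))
            clusters[nearest].append(p)
        # Move each center to the mean of its cluster.
        centers = [
            (sum(p[0] for p in c) / len(c), sum(p[1] for p in c) / len(c))
            if c else centers[j]  # keep an empty cluster's old center
            for j, c in enumerate(clusters)
        ]
    inertia = sum(min(sqdist(p, c) for c in centers) for p in points)
    return centers, inertia

# Two well-separated blobs: inertia drops sharply from k=1 to k=2
# (the "elbow"), then flattens for larger k.
pts = [(0.0, 0.1), (0.2, 0.0), (0.1, 0.3), (5.0, 5.1), (5.2, 4.9), (4.9, 5.0)]
for k in (1, 2, 3):
    _, inertia = kmeans(pts, k)
    print(k, round(inertia, 3))
```

The same scaling caveat from the PCA row applies here: K-Means uses raw distances, so features on larger scales dominate unless you standardize first.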
Flashcards
What is the default go-to algorithm for structured/tabular data?
XGBoost — it handles missing values natively, provides built-in feature importance, includes regularization, and works well out-of-the-box for both classification and regression on tabular data.
How do you choose a time series algorithm?
Single series, simple = ARIMA/SARIMA. Multiple related series OR new products = DeepAR (cold-start capability). Need related features + promotions = CNN-QR. "Predict demand for a NEW product" = DeepAR.
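The time series flashcard's rules can be written down as a tiny decision function — a study mnemonic we sketched ourselves, not an official selection procedure, and the argument names are our invention:

```python
def pick_time_series_algorithm(n_series, cold_start=False, has_related_features=False):
    """Mnemonic for the flashcard's decision rules:
    related features/promotions -> CNN-QR; multiple related series or
    cold-start (new products) -> DeepAR; otherwise a single simple
    series -> ARIMA/SARIMA.
    """
    if has_related_features:
        return "CNN-QR"
    if n_series > 1 or cold_start:
        return "DeepAR"
    return "ARIMA/SARIMA"

print(pick_time_series_algorithm(1))                              # single simple series
print(pick_time_series_algorithm(50, cold_start=True))            # new-product demand
print(pick_time_series_algorithm(50, has_related_features=True))  # promos/holidays
```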