
Algorithm Selection

Selecting the right algorithm is where theory meets practice. The best algorithm depends on your data type (tabular, text, images, time series), whether you have labels, and the specific problem you are solving. This guide covers the most important algorithms and when to reach for each one.

Quick Reference

Supervised: Tabular Data

| Algorithm | Best For | Strengths | Limitations |
| --- | --- | --- | --- |
| XGBoost | Tabular classification and regression (#1 default for structured data) | Handles missing values, built-in feature importance, regularization | Not for images, text sequences, or very high-dimensional sparse data |
| Random Forest | Classification and regression with less tuning | Robust, handles non-linear relationships, Gini importance | Slower than XGBoost, larger model size |
| Logistic Regression / Linear Learner | Binary/multi-class classification with linear decision boundary | Fast, interpretable coefficients | Cannot capture complex non-linear patterns |
| k-Nearest Neighbors (k-NN) | Classification + "find similar items" | Simple, no training phase | Slow at inference for large datasets, sensitive to dimensionality |
| Factorization Machines | Recommendation systems, click-through prediction, sparse data | Handles high-dimensional sparse data, captures feature interactions | Limited to pairwise feature interactions |
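
The k-NN trade-off in the table (no training phase, slow inference) is easiest to see in code. The following is a minimal pure-Python sketch, assuming Euclidean distance and a simple majority vote. The function name `knn_predict` and the toy data are made up for illustration and do not come from any library.

```python
from collections import Counter
import math


def knn_predict(train_X, train_y, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # "Training" is just storing the data. All the work happens here, at
    # query time, and costs one distance computation per stored point.
    distances = sorted(
        (math.dist(x, query), label) for x, label in zip(train_X, train_y)
    )
    votes = Counter(label for _, label in distances[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical): two well-separated groups.
train_X = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
train_y = ["a", "a", "a", "b", "b", "b"]

print(knn_predict(train_X, train_y, (1.1, 1.0)))
print(knn_predict(train_X, train_y, (5.1, 5.0)))
```

Because every prediction scans the whole training set, inference cost grows linearly with the number of stored points, which is the limitation the table refers to.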

Supervised: Time Series

| Algorithm | Best For | Key Feature |
| --- | --- | --- |
| DeepAR | Multiple related time series, cold-start | Learns patterns across related series, handles NaN, probabilistic forecasts |
| ARIMA / SARIMA | Single time series, statistical approach | Good for data that becomes stationary after differencing (which removes trend); SARIMA adds seasonal terms |
| CNN-QR | Forecasting with related time series + metadata | Supports related data, holidays, promotions |
| Exponential Smoothing (ETS) | Simple time series with trend/seasonality | Cannot use related time series or metadata |
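
To make the ETS row concrete, here is a minimal sketch of simple exponential smoothing, the level-only member of the family (no trend or seasonality terms). The function name, the `alpha` value, and the demand figures are assumptions chosen for illustration.

```python
def simple_exponential_smoothing(series, alpha=0.5):
    """Return the smoothed level after each observation.

    level_t = alpha * x_t + (1 - alpha) * level_{t-1}, seeded with the first value.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    if not series:
        raise ValueError("series must not be empty")
    levels = [float(series[0])]
    for x in series[1:]:
        # Blend the new observation with the previous level.
        levels.append(alpha * x + (1 - alpha) * levels[-1])
    return levels


demand = [10, 12, 14, 16]  # hypothetical demand history
levels = simple_exponential_smoothing(demand, alpha=0.5)
forecast = levels[-1]  # level-only smoothing forecasts a flat line at the last level
print(levels, forecast)
```

The update uses only the series' own history, which is why this family has no place to plug in related series or metadata.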

Unsupervised

| Algorithm | Best For | Key Detail |
| --- | --- | --- |
| K-Means | Clustering: group similar data points | Elbow method for optimal k. Often paired with PCA |
| PCA | Dimensionality reduction: compress features while preserving variance | Must scale data first. Unsupervised. Does NOT give feature importance for target |
| Random Cut Forest (RCF) | Anomaly detection: find outliers | Higher anomaly score = more anomalous |
| LDA / NTM | Topic modeling: discover topics in TEXT documents | For text only, not structured tabular data |
| t-SNE | Visualization of high-dimensional data in 2D/3D | For visualization ONLY, not for feature reduction in production |
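
The elbow method mentioned for K-Means can be sketched as follows: run the clustering for several values of k, record the within-cluster sum of squares (inertia), and look for the k after which the improvement flattens out. This is a pure-Python illustration, not a library implementation. The helper `kmeans_inertia`, the deterministic farthest-point initialisation, and the toy points are all assumptions made to keep the example small and reproducible.

```python
import math


def kmeans_inertia(points, k, iterations=20):
    """Run Lloyd's algorithm and return the within-cluster sum of squares."""
    # Deterministic farthest-point initialisation (an assumption for this
    # sketch; libraries typically use randomised schemes such as k-means++).
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            tuple(sum(dim) / len(c) for dim in zip(*c)) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return sum(min(math.dist(p, c) for c in centroids) ** 2 for p in points)


# Toy data (hypothetical): three well-separated blobs of four points each.
points = [
    (0, 0), (0, 1), (1, 0), (1, 1),
    (10, 0), (10, 1), (11, 0), (11, 1),
    (5, 9), (5, 10), (6, 9), (6, 10),
]

inertias = {k: kmeans_inertia(points, k) for k in range(1, 5)}
for k, inertia in inertias.items():
    print(k, round(inertia, 2))
```

On this data the inertia falls sharply up to k = 3 and barely changes afterwards, so the "elbow" sits at three clusters. On real data the bend is usually less clean and the choice involves some judgment.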

Flashcards

Question: What is the default go-to algorithm for structured/tabular data?

Answer: XGBoost. It handles missing values natively, provides built-in feature importance, includes regularization, and works well out of the box for both classification and regression on tabular data.

Key Insight

Time Series Algorithm Selection: Single series, simple = ARIMA. Multiple related series OR new products = DeepAR. Need related features + promotions = CNN-QR. "Predict demand for NEW product" = DeepAR (cold-start capability).
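
The rule of thumb above can be written down as a small decision helper. This is only a sketch: the function name `pick_forecaster` and its parameters are hypothetical, and the order in which the checks run (cold start first, then related features) is an assumption, since the rule itself does not say which condition wins when several apply.

```python
def pick_forecaster(n_series, cold_start=False, has_related_features=False):
    """Return the suggested algorithm name for a forecasting problem."""
    if cold_start:
        # Forecasting for a new product with no history of its own.
        return "DeepAR"
    if has_related_features:
        # Related series plus metadata such as holidays or promotions.
        return "CNN-QR"
    if n_series > 1:
        # Several related series that can share learned patterns.
        return "DeepAR"
    # A single, simple series.
    return "ARIMA"


print(pick_forecaster(1))
print(pick_forecaster(50))
print(pick_forecaster(1, cold_start=True))
print(pick_forecaster(50, has_related_features=True))
```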