
Statistics & Probability

A solid grasp of statistics underpins effective machine learning. From choosing the right data transforms to diagnosing model problems, statistical concepts help you make informed decisions at every stage of the ML pipeline.

Distributions

Understanding data distributions helps you select appropriate models and transformations.

| Distribution | Shape / Type | When It Applies | Example |
| --- | --- | --- | --- |
| Normal (Gaussian) | Continuous, symmetric, bell-shaped | Most common assumption in statistics | Heights, test scores, measurement errors |
| Poisson | Discrete, count data | Count of events in a fixed interval | Number of calls per hour, defects per batch |
| Binomial | Discrete, number of successes | Fixed number of independent trials | Coin flips, pass/fail in n tests |
| Uniform | All values equally likely | No prior knowledge about distribution | Random number generators, initial random weights |
```python
import numpy as np

# Generate samples from common distributions
normal_data = np.random.normal(loc=0, scale=1, size=1000)
poisson_data = np.random.poisson(lam=5, size=1000)
binomial_data = np.random.binomial(n=10, p=0.5, size=1000)
uniform_data = np.random.uniform(low=0, high=1, size=1000)
```

Skewness and Transformations

| Distribution Shape | Relationship | Transform |
| --- | --- | --- |
| Right-skewed (long tail to the right) | Mode < Median < Mean | Log transform to normalize |
| Left-skewed (long tail to the left) | Mean < Median < Mode | Square or exponential transform may help |
| Symmetric (Normal) | Mean ≈ Median ≈ Mode | No transform needed |
Key Insight

When you see "mode < median < mean" or "long tail to the right," the data is right-skewed. Apply a log transform to make the distribution more normal, which improves the performance of linear models and models that assume normality.

```python
import numpy as np

# Right-skewed data → log transform
original_data = [1, 2, 3, 5, 8, 15, 50, 200, 1000]
transformed = np.log1p(original_data)  # log(1 + x) handles zeros
```
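To confirm that a transform actually reduced skewness, compare the skewness statistic before and after. This sketch uses `scipy.stats.skew` on the same hypothetical sample:

```python
import numpy as np
from scipy import stats

# Same hypothetical right-skewed sample as above
data = np.array([1, 2, 3, 5, 8, 15, 50, 200, 1000], dtype=float)

skew_before = stats.skew(data)           # strongly positive: long right tail
skew_after = stats.skew(np.log1p(data))  # much closer to zero after log

print(f"before: {skew_before:.2f}, after: {skew_after:.2f}")
```

A value near 0 indicates symmetry; large positive values indicate a right skew.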

Correlation and Multicollinearity

Correlation measures the linear relationship between two variables. When features are highly correlated with one another, multicollinearity arises: coefficient estimates become unstable and hard to interpret, which is a problem for many models, especially linear ones.

| Method | What It Tells You | Threshold |
| --- | --- | --- |
| Correlation Matrix / Heatmap | Pairwise correlation between features | \|r\| > 0.8 = highly correlated |
| VIF (Variance Inflation Factor) | How much a feature's variance is inflated by correlation with others | VIF > 5-10 = problematic |
| PCA eigenvalues | Near-zero eigenvalue = features are linearly dependent | Eigenvalue near 0 = collinear |
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Detect multicollinearity (df / X: pandas DataFrames of numeric features)
correlation_matrix = df.corr()

# Check VIF for each feature
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data[vif_data["VIF"] > 5])  # Flag problematic features
```
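The PCA eigenvalue check from the table can be sketched directly with NumPy. Here `x3` is a deliberately constructed (hypothetical) linear combination of the other two features, so the correlation matrix has a near-zero eigenvalue:

```python
import numpy as np

rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + x2  # exact linear dependency → collinear features

X = np.column_stack([x1, x2, x3])

# Eigenvalues of the correlation matrix; a near-zero eigenvalue means
# some feature is (almost) a linear combination of the others
eigenvalues = np.linalg.eigvalsh(np.corrcoef(X, rowvar=False))
print(eigenvalues)
```

With real data the smallest eigenvalue won't be exactly zero; very small values relative to the largest still signal collinearity.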

Outlier Detection

| Method | How It Works | When to Use |
| --- | --- | --- |
| IQR Rule | Outlier if value < Q1 - 1.5 × IQR or > Q3 + 1.5 × IQR | General-purpose, robust |
| Z-Score | Outlier if \|z-score\| > 3 | Data is approximately normal |
| Box Plots | Visual identification of outliers | Quick exploratory analysis |

What to do with outliers: Remove them, cap/winsorize them, or apply a robust transformation (log, square root).
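The IQR rule from the table takes only a few lines of NumPy; the sample values here are hypothetical:

```python
import numpy as np

# Hypothetical sample with one obvious outlier
data = np.array([10, 12, 12, 13, 12, 11, 14, 13, 15, 102], dtype=float)

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # → [102.]
```

Because quartiles are insensitive to extreme values, the fences themselves are not dragged around by the outlier, which is why the rule is considered robust.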

Residual Plot Analysis

Residual plots (predicted vs residual) help diagnose model problems after training.

| Pattern in Residual Plot | What It Means | Action |
| --- | --- | --- |
| Random scatter around zero | Good model fit | No action needed |
| Curved pattern (U-shape) | Non-linear relationship not captured | Add polynomial features or use a non-linear model |
| Fan shape (wider spread on one side) | Heteroscedasticity (non-constant variance) | Log transform the target variable or use robust regression |
| Systematic over/underestimation | Model has systematic bias | Add missing features or change model type |
Note

A good residual plot shows random scatter around zero with no visible patterns. Any systematic pattern indicates your model is missing something.
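As a sketch of the U-shape diagnosis, the code below fits a straight line to deliberately quadratic (synthetic) data. The residuals come out positive at both ends and negative in the middle, the signature of an uncaptured non-linear relationship:

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(0, 10, 100)
y = x ** 2 + rng.normal(scale=2, size=100)  # true relationship is quadratic

# Fit a straight line, then inspect residuals = actual - predicted
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# Positive at the ends, negative in the middle: a U-shape
print(residuals[:10].mean(), residuals[45:55].mean(), residuals[-10:].mean())
```

Plotting `residuals` against the predictions would make the same pattern visible at a glance.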

p-Values and Statistical Significance

A p-value is the probability of observing results at least as extreme as those in the data, assuming the null hypothesis is true.

  • p < 0.05 → typically considered statistically significant
  • Used in feature selection (chi-squared tests, ANOVA)
  • Lower p-value = stronger evidence against the null hypothesis
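A minimal sketch using `scipy.stats.ttest_ind`, with synthetic samples whose means genuinely differ, so the test should reject the null hypothesis of equal means:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Two synthetic samples: one centered at 0, one shifted to 1
baseline = rng.normal(loc=0, scale=1, size=100)
shifted = rng.normal(loc=1, scale=1, size=100)

# Two-sample t-test; null hypothesis: the two means are equal
t_stat, p_value = stats.ttest_ind(baseline, shifted)
print(p_value)  # far below 0.05 → reject the null
```

The same pattern applies to feature selection: `sklearn.feature_selection` uses chi-squared and ANOVA F-tests to rank features by p-value.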

Flashcards

Q: How do you identify a right-skewed distribution from summary statistics?

A: Mode < Median < Mean. The long tail extends to the right, pulling the mean higher than the median. Apply a log transform to normalize it.