# Statistics & Probability
A solid grasp of statistics underpins effective machine learning. From choosing the right data transforms to diagnosing model problems, statistical concepts help you make informed decisions at every stage of the ML pipeline.
## Distributions
Understanding data distributions helps you select appropriate models and transformations.
| Distribution | Shape / Type | When It Applies | Example |
|---|---|---|---|
| Normal (Gaussian) | Continuous, symmetric, bell-shaped | Most common assumption in statistics | Heights, test scores, measurement errors |
| Poisson | Discrete, count data | Count of events in a fixed interval | Number of calls per hour, defects per batch |
| Binomial | Discrete, number of successes | Fixed number of independent trials | Coin flips, pass/fail in n tests |
| Uniform | All values equally likely | No prior knowledge about distribution | Random number generators, initial random weights |
```python
import numpy as np

# Generate 1,000 samples from each common distribution
normal_data = np.random.normal(loc=0, scale=1, size=1000)    # mean 0, std dev 1
poisson_data = np.random.poisson(lam=5, size=1000)           # rate λ = 5
binomial_data = np.random.binomial(n=10, p=0.5, size=1000)   # 10 trials, p = 0.5
uniform_data = np.random.uniform(low=0, high=1, size=1000)
```
## Skewness and Transformations
| Distribution Shape | Relationship | Transform |
|---|---|---|
| Right-skewed (long tail to the right) | Mode < Median < Mean | Log transform to normalize |
| Left-skewed (long tail to the left) | Mean < Median < Mode | Square or exponential transform may help |
| Symmetric (Normal) | Mean ≈ Median ≈ Mode | No transform needed |
When you see "mode < median < mean" or "long tail to the right," the data is right-skewed. Apply a log transform to make the distribution more normal, which improves the performance of linear models and models that assume normality.
```python
import numpy as np

# Right-skewed data → log transform
original_data = [1, 2, 3, 5, 8, 15, 50, 200, 1000]
transformed = np.log1p(original_data)  # log(1 + x) handles zeros
```
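To confirm the transform actually normalized the data, compare skewness before and after. A quick check using `scipy.stats.skew` (the sample values here are illustrative):

```python
import numpy as np
from scipy.stats import skew

# Illustrative right-skewed sample (long tail to the right)
original_data = np.array([1, 2, 3, 5, 8, 15, 50, 200, 1000], dtype=float)
transformed = np.log1p(original_data)  # log(1 + x) handles zeros

print(f"skew before: {skew(original_data):.2f}")  # strongly positive
print(f"skew after:  {skew(transformed):.2f}")    # much closer to 0
```

A skewness near 0 indicates an approximately symmetric distribution; values above roughly 1 indicate strong right skew.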
## Correlation and Multicollinearity
Correlation measures the linear relationship between two variables. When features are highly correlated, it creates multicollinearity problems for many models.
| Method | What It Tells You | Threshold |
|---|---|---|
| Correlation Matrix / Heatmap | Pairwise correlation between features | \|r\| > 0.8 = highly correlated |
| VIF (Variance Inflation Factor) | How much a feature's variance is inflated by correlation with others | VIF > 5-10 = problematic |
| PCA eigenvalues | Near-zero eigenvalue = features are linearly dependent | Eigenvalue near 0 = collinear |
```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

# df / X: your feature DataFrame (assumed to be defined already)
correlation_matrix = df.corr()  # inspect pairs with |r| > 0.8

# Compute the VIF for every column of the feature matrix X
vif_data = pd.DataFrame()
vif_data["feature"] = X.columns
vif_data["VIF"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
print(vif_data[vif_data["VIF"] > 5])  # flag problematic features
```
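The table's third method, near-zero eigenvalues, takes only a few lines. This sketch builds a hypothetical feature matrix where one column is an exact linear combination of the others, so the correlation matrix has an eigenvalue at (essentially) zero:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical feature matrix: x3 is perfectly collinear with x1 and x2
X = pd.DataFrame({"x1": rng.normal(size=200), "x2": rng.normal(size=200)})
X["x3"] = 2 * X["x1"] - X["x2"]

eigenvalues = np.linalg.eigvalsh(X.corr().values)
print(eigenvalues)  # smallest eigenvalue ≈ 0 → features are linearly dependent
```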
## Outlier Detection
| Method | How It Works | When to Use |
|---|---|---|
| IQR Rule | Outlier if value < Q1 − 1.5 × IQR or > Q3 + 1.5 × IQR | General-purpose, robust |
| Z-Score | Outlier if \|z-score\| > 3 | Data is approximately normal |
| Box Plots | Visual identification of outliers | Quick exploratory analysis |
What to do with outliers: Remove them, cap/winsorize them, or apply a robust transformation (log, square root).
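A minimal sketch of the IQR rule from the table (the sample array is made up):

```python
import numpy as np

data = np.array([10, 12, 11, 13, 12, 14, 11, 95])  # 95 is the obvious outlier

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = data[(data < lower) | (data > upper)]
print(outliers)  # → [95]
```

Because the rule is based on quartiles rather than the mean, the extreme value itself barely shifts the fences, which is what makes the method robust.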
## Residual Plot Analysis
Residual plots (predicted values on the x-axis, residuals on the y-axis) help diagnose model problems after training.
| Pattern in Residual Plot | What It Means | Action |
|---|---|---|
| Random scatter around zero | Good model fit | No action needed |
| Curved pattern (U-shape) | Non-linear relationship not captured | Add polynomial features or use a non-linear model |
| Fan shape (wider spread on one side) | Heteroscedasticity (non-constant variance) | Log transform the target variable or use robust regression |
| Systematic over/underestimation | Model has systematic bias | Add missing features or change model type |
A good residual plot shows random scatter around zero with no visible patterns. Any systematic pattern indicates your model is missing something.
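To see the U-shaped pattern in numbers rather than a plot, this sketch fits a straight line to deliberately quadratic synthetic data and inspects the residuals (`numpy.polyfit` stands in for any linear model):

```python
import numpy as np

rng = np.random.default_rng(1)
x = np.linspace(-3, 3, 200)
y = x**2 + rng.normal(scale=0.1, size=x.size)  # true relationship is quadratic

# Fit the wrong (linear) model and compute residuals
slope, intercept = np.polyfit(x, y, deg=1)
residuals = y - (slope * x + intercept)

# U-shape: residuals positive at both ends, negative in the middle
print(residuals[:20].mean(), residuals[90:110].mean(), residuals[-20:].mean())
```

The sign flip from the edges to the middle is exactly the curved pattern described in the table, and it signals that a non-linear term is missing.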
## p-Values and Statistical Significance
A p-value is the probability of observing results at least as extreme as those actually observed, assuming the null hypothesis is true.
- p < 0.05 → typically considered statistically significant
- Used in feature selection (chi-squared tests, ANOVA)
- Lower p-value = stronger evidence against the null hypothesis
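As a concrete example of the chi-squared test mentioned above, this sketch tests whether a hypothetical categorical feature is associated with a class label (the counts are made up):

```python
import numpy as np
from scipy.stats import chi2_contingency

# Contingency table of feature category vs. class label (hypothetical counts)
#                       class 0  class 1
observed = np.array([[50, 10],   # feature = A
                     [20, 40]])  # feature = B

chi2, p_value, dof, expected = chi2_contingency(observed)
print(f"p-value: {p_value:.2e}")  # p < 0.05 → reject independence; keep the feature
```

A small p-value means the feature's categories are distributed very differently across the classes, which is why such tests are used for feature selection.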
## Flashcards
How do you identify a right-skewed distribution from summary statistics?
Mode < Median < Mean. The long tail extends to the right, pulling the mean higher than the median. Apply a log transform to normalize it.