What is the Central Limit Theorem (CLT)?
– Definition: The Central Limit Theorem states that when you take many independent random samples of size n from the same population, the distribution of the sample means tends to become approximately normal (a bell curve) as n grows, regardless of the shape of the original population distribution—provided the population has a finite variance.
Why the CLT matters (brief)
– It underpins many common inferential tools: confidence intervals, hypothesis tests, control charts, and many machine-learning estimators all rely on the approximate normality of averages or sums.
– It allows practitioners to use Normal-distribution procedures even when the original data are non‑Normal, provided sample sizes are adequate and assumptions hold.
Mathematical statement (simple form)
– If X1, X2, …, Xn are independent, identically distributed (i.i.d.) random variables with mean μ and finite variance σ^2, then the sampling distribution of the sample mean X̄n = (1/n) Σ Xi satisfies
X̄n ≈ Normal(mean = μ, variance = σ^2/n)
as n → ∞.
– Equivalently, the standardized sum (Σ Xi − nμ) / (σ√n) converges in distribution to the standard normal N(0,1).
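As a quick numerical illustration, here is a minimal simulation sketch (Python with NumPy; the exponential population and the sample sizes are illustrative assumptions, not from the text):

    import numpy as np

    rng = np.random.default_rng(0)
    n, reps = 50, 10_000                      # sample size and number of repeated samples (illustrative)

    # Population: exponential with mean 1 (skewed, but with finite variance)
    samples = rng.exponential(scale=1.0, size=(reps, n))
    means = samples.mean(axis=1)              # one sample mean per repetition

    # CLT prediction: mean of the sample means ~ 1, SD of the sample means ~ 1/sqrt(n)
    print(means.mean(), means.std(ddof=1), 1 / np.sqrt(n))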
Standard error and practical formulae
– Standard error (SE) of the sample mean when population σ is known:
SE = σ / √n
– If σ is unknown (common in practice), substitute the sample standard deviation s:
estimated SE = s / √n
For small n, use Student’s t distribution with df = n − 1 for inference rather than the Normal.
– When sampling without replacement from a finite population of size N, apply the finite population correction (FPC):
SE_corrected = (σ/√n) × √[(N − n)/(N − 1)]
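A short sketch of these formulae (Python; the numeric values are illustrative assumptions):

    import numpy as np

    sigma, n = 10.0, 36                       # population SD and sample size (illustrative)
    se = sigma / np.sqrt(n)                   # SE = sigma / sqrt(n)

    s = 9.4                                   # sample SD when sigma is unknown (illustrative)
    se_est = s / np.sqrt(n)                   # estimated SE = s / sqrt(n)

    N = 500                                   # finite population size (illustrative)
    se_fpc = se * np.sqrt((N - n) / (N - 1))  # finite population correction

    print(se, se_est, se_fpc)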
Rule-of-thumb for sample size
– n ≥ 30 is a common heuristic for approximate normality of X̄ when the underlying distribution is not heavily skewed or heavy‑tailed. This is only a rule of thumb; required n grows with skewness and kurtosis of the original distribution.
– For roughly symmetric distributions, smaller n may suffice; for strongly skewed or heavy-tailed distributions, much larger n is needed.
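One way to probe the rule of thumb numerically is to simulate and watch the skewness of the sample-mean distribution shrink as n grows (a sketch; the lognormal population is an arbitrary skewed example):

    import numpy as np
    from scipy.stats import skew

    rng = np.random.default_rng(1)
    reps = 20_000
    for n in (10, 30, 200):
        # Sample means from a strongly skewed (lognormal) population
        means = rng.lognormal(mean=0.0, sigma=1.0, size=(reps, n)).mean(axis=1)
        print(n, round(skew(means), 3))       # skewness near 0 indicates a roughly Normal shape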
Worked numeric example
– Suppose individual returns have mean μ = 50 and σ = 10. You draw n = 36 independent observations and compute the sample mean X̄.
SE = σ/√n = 10/√36 = 10/6 ≈ 1.6667.
What is P(48 ≤ X̄ ≤ 52)?
Compute z-scores: z1 = (48 − 50)/1.6667 = −1.2; z2 = (52 − 50)/1.6667 = +1.2.
Using the standard normal CDF Φ, P = Φ(1.2) − Φ(−1.2) = 2Φ(1.2) − 1 ≈ 2(0.8849) − 1 = 0.7698.
So ≈77.0% probability the sample mean falls between 48 and 52.
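The same computation in Python (using SciPy's Normal CDF; the numbers are those of the example above):

    from scipy.stats import norm

    mu, sigma, n = 50, 10, 36
    se = sigma / n**0.5                       # 10/6 = 1.6667
    z1, z2 = (48 - mu) / se, (52 - mu) / se   # -1.2, +1.2
    print(se, norm.cdf(z2) - norm.cdf(z1))    # ~0.77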
Applications (practical)
– Confidence intervals for the mean: X̄ ± z* × SE (z* from the Normal, or t* from Student's t if σ is unknown and n is small); a small sketch follows this list.
– Hypothesis tests about means: use standardized test statistics based on X̄ and SE.
– Aggregation in risk models and portfolio theory: sums of many independent shocks often behave approximately normal.
– Simulation and bootstrapping: CLT guides sample size choices and interpretation of simulated means.
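For the confidence-interval item above, a minimal sketch (Python with SciPy; the data are simulated purely for illustration):

    import numpy as np
    from scipy.stats import t

    rng = np.random.default_rng(2)
    x = rng.normal(50, 10, size=20)           # illustrative data; replace with your sample
    n = x.size
    xbar, s = x.mean(), x.std(ddof=1)
    se = s / np.sqrt(n)                       # estimated SE = s / sqrt(n)
    tstar = t.ppf(0.975, df=n - 1)            # t* for a 95% interval (sigma unknown, small n)
    print(xbar - tstar * se, xbar + tstar * se)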
Limitations and caveats
– Independence: CLT requires independent (or weakly dependent under extensions) observations. Correlated data break the simple form and change the effective variance.
– Identical distribution: Variations of the theorem relax this, but mixing different distributions requires checking conditions (Lindeberg or Lyapunov conditions).
– Heavy tails / infinite variance: If the population variance is infinite (e.g., some Pareto distributions), the classical CLT does not apply; sums may converge to stable (non‑Normal) laws.
– Convergence speed: The CLT guarantees convergence in distribution as n→∞ but does not tell how large n must be for a good approximation. For skewed or heavy‑tailed data, n may need to be large.
– Small-sample inference: For small n, prefer exact or distribution-free methods, or use bootstrap methods that empirically approximate the sampling distribution.
Variants and formal conditions (brief)
– Lindeberg–Feller CLT: allows non‑identical independent variables under a Lindeberg condition.
– Lyapunov CLT: provides a sufficient condition based on moments.
– Multivariate CLT: extends to vectors—sample mean vector tends to multivariate normal.
– Functional CLT (Donsker’s theorem): concerns convergence of stochastic processes to Brownian motion.
Practical checklist before applying Normal-based inference
1. Are observations independent (or is dependence modeled)? If not, adjust for autocorrelation or clustering.
2. Is there a reasonable finite variance? Watch for heavy tails.
3. Is sample size large enough for the underlying skewness/kurtosis? Consider n ≥ 30 as a starting point; increase if data are skewed.
4. Is σ unknown? Use s and t-distribution for small n.
5. Are you sampling without replacement from a small finite population? Apply FPC.
6. If in doubt, run a bootstrap to estimate the sampling distribution empirically.
When to use the bootstrap instead
– Use the bootstrap when the sample size is small, the underlying distribution is unknown and possibly skewed, or analytic variance formulas are unreliable. The bootstrap resamples the observed data to approximate the sampling distribution of X̄ (or other estimators) without relying directly on CLT asymptotics.
Quick how-to: a basic bootstrap for the sample mean
1) Setup. You have one observed sample of size n and want an empirical sampling distribution for the sample mean X̄ (or any estimator θ̂). The bootstrap treats the observed sample as an approximation to the population and resamples from it with replacement.
2) Steps (practical checklist).
– Choose number of resamples B (common choices: 1,000–10,000; use more for tighter percentile estimates).
– For b = 1 to B:
a) Draw a bootstrap sample of size n from the observed sample with replacement.
b) Compute the statistic of interest θ̂*b (here: the mean of that bootstrap sample).
– Collect {θ̂*1, …, θ̂*B}. This is the bootstrap sampling distribution.
– Estimate standard error: se_boot = sd({θ̂*b}).
– Construct a 95% bootstrap percentile interval by taking the 2.5th and 97.5th percentiles of the bootstrap distribution (θ̂* values). That gives a nonparametric 95% interval.
– Bias estimate and correction.
– Estimate bias: bias_boot = mean({θ̂*b}) − θ̂ (original sample statistic).
– Bias-corrected estimate: θ̂_corrected = θ̂ − bias_boot.
– Many applications ignore bias if it is small relative to se_boot; for nonnegligible bias consider a bias-corrected interval (see BCa below).
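A compact sketch of the resampling steps, the percentile interval, and the bias estimate (Python; the observed sample and B are illustrative assumptions):

    import numpy as np

    rng = np.random.default_rng(3)
    x = rng.exponential(scale=1.0, size=40)          # observed sample (illustrative)
    theta_hat = x.mean()                             # original statistic

    B = 5000
    boot = np.empty(B)
    for b in range(B):
        resample = rng.choice(x, size=x.size, replace=True)   # draw n values with replacement
        boot[b] = resample.mean()                             # bootstrap replicate of the mean

    se_boot = boot.std(ddof=1)                       # bootstrap standard error
    ci_percentile = np.percentile(boot, [2.5, 97.5]) # 95% percentile interval
    bias_boot = boot.mean() - theta_hat              # bootstrap bias estimate
    theta_corrected = theta_hat - bias_boot          # bias-corrected estimate
    print(se_boot, ci_percentile, bias_boot, theta_corrected)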
– Better interval methods (when percentile is inadequate).
– Basic (the “reverse percentile” or basic bootstrap interval). This corrects the percentile interval for bias by reflecting around the original estimate θ̂.
– Steps:
1. Obtain θ̂ from the original sample.
2. Generate B bootstrap replicates {θ̂*1,…,θ̂*B}.
3. Find the α/2 and 1−α/2 quantiles of the bootstrap distribution, call them qL and qU.
4. The basic 100(1−α)% interval is [2θ̂ − qU, 2θ̂ − qL].
– Intuition: if bootstrap replicates are centered above θ̂, the basic interval shifts the percentile interval accordingly.
– Pros/cons: simple and often better than the raw percentile when the bootstrap distribution is shifted; still can be inaccurate for skewed or heteroskedastic problems.
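A sketch of the basic interval (Python, self-contained; the data are simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(4)
    x = rng.exponential(scale=1.0, size=40)          # observed sample (illustrative)
    theta_hat = x.mean()

    B, alpha = 5000, 0.05
    boot = np.array([rng.choice(x, size=x.size, replace=True).mean() for _ in range(B)])

    qL, qU = np.percentile(boot, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    print(2 * theta_hat - qU, 2 * theta_hat - qL)    # basic 95% interval: reflect percentiles around theta_hat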
– Studentized (percentile-t) bootstrap. This method uses a standardized (studentized) statistic to account for varying standard error across samples.
– Define the studentized statistic for a sample (or for each bootstrap sample) t̂ = (θ̂ − θ)/ŝ, where ŝ is an estimate of the standard error of θ̂.
– Steps:
1. For each bootstrap draw b, compute θ̂*b and its estimated standard error ŝ*b (this may itself require an inner bootstrap or analytic formula).
2. Compute t*b = (θ̂*b − θ̂)/ŝ*b for b = 1…B.
3. Take the α/2 and 1−α/2 quantiles of {t*b}, call them tL and tU.
4. The 100(1−α)% CI for θ is [θ̂ − tU·ŝ, θ̂ − tL·ŝ].
– Pros/cons: often more accurate than percentile/basic for many problems because it accounts for changing variability; more computationally intensive because ŝ*b must be estimated for each bootstrap sample.
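A sketch of the percentile-t method for the mean, where the analytic SE s/√n avoids an inner bootstrap (Python; the data are simulated for illustration):

    import numpy as np

    rng = np.random.default_rng(5)
    x = rng.exponential(scale=1.0, size=40)          # observed sample (illustrative)
    n = x.size
    theta_hat = x.mean()
    s_hat = x.std(ddof=1) / np.sqrt(n)               # analytic SE of the mean

    B, alpha = 5000, 0.05
    t_star = np.empty(B)
    for b in range(B):
        xb = rng.choice(x, size=n, replace=True)
        s_b = xb.std(ddof=1) / np.sqrt(n)            # SE estimate for this bootstrap sample
        t_star[b] = (xb.mean() - theta_hat) / s_b    # studentized replicate

    tL, tU = np.percentile(t_star, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    print(theta_hat - tU * s_hat, theta_hat - tL * s_hat)   # 95% percentile-t interval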
– BCa (bias‑corrected and accelerated) bootstrap. This is widely recommended when bias and skewness matter. It adjusts percentiles using two parameters: z0 (bias-correction) and a (acceleration).
– Key quantities:
– z0 = Φ−1(proportion of θ̂*b < θ̂), where Φ−1 is the standard normal inverse CDF.
– a (acceleration) is estimated from the jackknife. Let θ̂(−i) be the estimate leaving out observation i, and θ̄(.) = mean of the θ̂(−i). Then
a = [∑(θ̄(.) − θ̂(−i))^3] / [6 · (∑(θ̄(.) − θ̂(−i))^2)^(3/2)].
– Adjusted percentile levels:
– For a desired tail probability α, the BCa adjusted lower and upper cumulative probabilities (α1, α2) are
– α1 = Φ( z0 + (z0 + zα) / (1 − a (z0 + zα)) )
– α2 = Φ( z0 + (z0 + z1−α) / (1 − a (z0 + z1−α)) )
where zα = Φ−1(α) and z1−α = Φ−1(1 − α). The BCa 100(1 − 2α)% confidence interval for θ is the [α1, α2] percentile interval of the bootstrap distribution {θ̂*b}.
Step-by-step BCa algorithm (nonparametric bootstrap)
1. Compute the original estimate θ̂ from the observed sample.
2. Generate B bootstrap samples (resampling with replacement) and compute θ̂*b for each sample; form the empirical bootstrap distribution.
3. Compute z0:
– z0 = Φ−1( proportion of θ̂*b < θ̂ ).
– If many bootstrap estimates equal θ̂, use the conservative count convention (strictly less than).
4. Compute the acceleration a via the jackknife:
– For each i = 1..n compute θ̂(−i) (estimate leaving out observation i).
– Let θ̄(.) = (1/n) Σ θ̂(−i).
– a = [ Σ (θ̄(.) − θ̂(−i))^3 ] / [ 6 · ( Σ (θ̄(.) − θ̂(−i))^2 )^(3/2) ].
5. For the desired two-sided level (1 − 2α), compute zα = Φ−1(α) and z1−α = −zα.
6. Compute α1 and α2 using the adjusted-percentile formulas above, and take the [α1, α2] percentile interval of the bootstrap distribution {θ̂*b} as the BCa confidence interval.
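A sketch of the full BCa recipe for the sample mean (Python with NumPy/SciPy; the data are simulated for illustration):

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(6)
    x = rng.exponential(scale=1.0, size=40)            # observed sample (illustrative)
    n = x.size
    theta_hat = x.mean()                               # step 1: original estimate

    B = 5000                                           # step 2: bootstrap replicates
    boot = np.array([rng.choice(x, size=n, replace=True).mean() for _ in range(B)])

    z0 = norm.ppf(np.mean(boot < theta_hat))           # step 3: bias-correction (strictly-less-than convention)

    jack = np.array([np.delete(x, i).mean() for i in range(n)])   # step 4: jackknife estimates
    d = jack.mean() - jack
    a = np.sum(d**3) / (6.0 * np.sum(d**2) ** 1.5)     # acceleration

    alpha = 0.025                                      # step 5: per-tail level for a 95% interval
    z_lo, z_hi = norm.ppf(alpha), norm.ppf(1 - alpha)

    a1 = norm.cdf(z0 + (z0 + z_lo) / (1 - a * (z0 + z_lo)))       # step 6: adjusted percentile levels
    a2 = norm.cdf(z0 + (z0 + z_hi) / (1 - a * (z0 + z_hi)))
    print(np.percentile(boot, [100 * a1, 100 * a2]))   # BCa 95% interval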