Sampling Distribution

A sampling distribution is the probability distribution of a statistic (for example, a sample mean, sample proportion, or sample variance) computed from many repeated random samples of the same size drawn from the same population. It shows the range of values that statistic can take and how often each value occurs. Sampling distributions are the foundation for estimation (confidence intervals) and hypothesis testing because they tell us how much sample statistics typically vary from sample to sample.

Key takeaways
– A sampling distribution describes how a statistic (mean, proportion, variance, etc.) varies across repeated samples from the same population. (Investopedia)
– The mean of the sampling distribution of the sample mean equals the population mean; the spread is measured by the standard error. (Investopedia)
– The Central Limit Theorem (CLT) implies that the sampling distribution of the mean approaches a normal distribution as sample size increases, whatever the population's shape, provided the population variance is finite. (Penn State)
– When population standard deviation is unknown, use the sample standard deviation and the t-distribution for inference. (NJIT/Penn State)

Why sampling is used (short)
Researchers sample because measuring every member of a population is often impractical, expensive, or impossible. A well-designed sample provides information about the population with manageable cost and effort; sampling distributions let researchers quantify the uncertainty in their estimates. (OECD; Investopedia)

How sampling distributions work
– Draw many independent random samples of the same size n from a population.
– Compute the statistic of interest (e.g., sample mean x̄) for each sample.
– The distribution of those statistics — their frequencies across samples — is the sampling distribution.
– Important properties for the sample mean x̄:
• Expected value: E[x̄] = μ (population mean) — the sample mean is unbiased.
• Standard error (SE): SE(x̄) = σ / sqrt(n) when population standard deviation σ is known. If σ is unknown, estimate SE with s / sqrt(n) and use the t-distribution for small n.
• Shape: by the CLT, x̄ is approximately normally distributed for sufficiently large n, even if the population is not normal.
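The steps and properties above can be checked with a short simulation. The following is a minimal sketch using only the Python standard library; the exponential population, its mean of 2.0, the sample size of 50, and the helper name simulate_sample_means are illustrative assumptions, not values from the sources cited here.

```python
import random
import statistics

def simulate_sample_means(population_draw, n, reps, seed=0):
    """Return `reps` sample means, each from an independent sample of size n."""
    rng = random.Random(seed)
    return [
        statistics.fmean(population_draw(rng) for _ in range(n))
        for _ in range(reps)
    ]

# Hypothetical skewed population: exponential with mean 2.0.
# For an exponential distribution the standard deviation equals the mean.
MU = 2.0
SIGMA = 2.0

def draw_exponential(rng):
    return rng.expovariate(1.0 / MU)

n = 50
means = simulate_sample_means(draw_exponential, n=n, reps=5000, seed=42)

center = statistics.fmean(means)    # should be close to MU
spread = statistics.stdev(means)    # should be close to SIGMA / sqrt(n)
theoretical_se = SIGMA / n ** 0.5

print(f"mean of sample means: {center:.3f} (population mean {MU})")
print(f"SD of sample means:   {spread:.3f} (sigma/sqrt(n) = {theoretical_se:.3f})")
```

Even though the population here is strongly right-skewed, the simulated sample means should center on μ with a spread close to σ / sqrt(n).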

Types of sampling distributions (common examples)
– Sampling distribution of the mean (x̄)
– Sampling distribution of a proportion (p̂)
– Sampling distribution of the variance (s^2): for a normal population, (n − 1)s^2 / σ^2 follows a chi-square distribution with n − 1 degrees of freedom
– Sampling distribution of the difference between two means or two proportions
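– When σ is unknown and the population is approximately normal, the standardized sample mean (x̄ − μ)/(s / sqrt(n)) follows a t-distribution with n − 1 degrees of freedom rather than a standard normal; the difference matters most when n is small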

Important considerations and caveats
– Sample size matters: larger n reduces the standard error and makes the sampling distribution more concentrated about the population value.
– Independence: the standard formulas assume independent observations; clustering or other dependence among observations invalidates them unless the analysis accounts for the design.
– Representativeness: sampling method must avoid bias (use proper randomization, stratification, or clustering as appropriate).
– Finite population correction: when sampling without replacement from a finite population and n is a non-negligible fraction of the population N, multiply SE by sqrt((N − n) / (N − 1)).
– Small samples from highly skewed populations: the normal approximation from the CLT may be poor; consider transformations, nonparametric methods, or resampling (bootstrap).
– Measurement error, nonresponse, and selection bias change the sampling distribution in ways that cannot be corrected by larger n alone.
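To illustrate the finite population correction mentioned in the list above, here is a minimal sketch. The function name standard_error_mean and the numbers (σ = 10, n = 100, N = 500) are hypothetical and chosen only to make the effect visible.

```python
import math

def standard_error_mean(sigma, n, population_size=None):
    """SE of the sample mean, optionally with the finite population correction."""
    se = sigma / math.sqrt(n)
    if population_size is not None:
        if population_size < 2 or not 0 < n <= population_size:
            raise ValueError("need 0 < n <= N and N >= 2")
        # FPC for sampling without replacement from N units.
        se *= math.sqrt((population_size - n) / (population_size - 1))
    return se

# Hypothetical values: sigma = 10, n = 100, N = 500 (sampling fraction 20%).
se_infinite = standard_error_mean(sigma=10.0, n=100)
se_finite = standard_error_mean(sigma=10.0, n=100, population_size=500)

print(f"SE without FPC: {se_infinite:.4f}")
print(f"SE with FPC:    {se_finite:.4f}")
```

With a 20% sampling fraction the corrected SE is noticeably smaller; when n equals N the correction drives the SE to zero, as expected for a full census.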

Determining a sampling distribution — practical steps
Below are two parallel approaches often used in practice: analytic (formula-based) and simulation/resampling.

A. Analytic (using formulas and CLT)
1. Define the parameter/statistic you need (mean, proportion, variance).
2. Choose the sampling method (simple random, stratified, cluster, systematic).
3. Determine or estimate population parameters:
• For mean: estimate or obtain population σ; if unknown use pilot data or sample s.
• For proportion: use p (or p̂ if unknown) in formulas.
4. Compute the standard error:
• Mean: SE = σ / sqrt(n) (or s / sqrt(n) if σ unknown).
• Proportion: SE = sqrt[p(1 − p) / n].
5. Use the CLT: approximate sampling distribution by a normal distribution centered at the population value with variance SE^2 when n is large enough (rule of thumb: n ≥ 30 for many situations, but depends on population shape; for proportions, require np ≥ 10 and n(1 − p) ≥ 10).
6. Construct confidence intervals or perform hypothesis tests using normal or t critical values as appropriate.
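As a sketch of the analytic steps for a proportion, the function below computes p̂, its standard error, the rule-of-thumb check, and a normal-approximation (Wald) interval. The function name proportion_interval and the counts (120 successes out of 400) are made up for illustration; p̂ stands in for the unknown p in both the check and the SE.

```python
import math
from statistics import NormalDist

def proportion_interval(successes, n, confidence=0.95):
    """Normal-approximation (Wald) interval for a proportion."""
    p_hat = successes / n
    # Rule-of-thumb check from the text, with p_hat replacing the unknown p.
    normal_ok = n * p_hat >= 10 and n * (1 - p_hat) >= 10
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    z_star = NormalDist().inv_cdf(0.5 + confidence / 2)
    interval = (p_hat - z_star * se, p_hat + z_star * se)
    return p_hat, se, interval, normal_ok

# Hypothetical data: 120 successes in a sample of 400.
p_hat, se, (low, high), normal_ok = proportion_interval(successes=120, n=400)

print(f"p_hat = {p_hat:.3f}, SE = {se:.4f}")
print(f"95% CI: ({low:.4f}, {high:.4f}), normal approximation OK: {normal_ok}")
```

If normal_ok comes back False, the simulation or bootstrap route described next is usually the safer choice.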

B. Simulation / Resampling (empirical)
1. If analytic assumptions are doubtful or population distribution unknown, use simulation:
• If you can simulate the population: repeatedly draw many samples of size n, compute the statistic for each, and plot the distribution of the statistic.
• If you only have one observed sample: use bootstrap resampling (sample with replacement from the observed sample many times) to approximate the sampling distribution of the statistic.
2. From the empirical sampling distribution, compute SE (standard deviation of the bootstrap/statistic values), bias, confidence intervals (percentile or bias-corrected), and p-values.
3. Plot the empirical sampling distribution with histogram or density curve and compare with theoretical shapes (normal, t).
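The bootstrap route can be sketched as follows for a statistic without a simple SE formula, the sample median. The observed sample is synthetic (60 lognormal draws) and exists only to make the example runnable; the helper names and the simple index-based percentile interval are illustrative assumptions, not a bias-corrected method.

```python
import random
import statistics

def bootstrap_distribution(sample, statistic, reps=2000, seed=0):
    """Resample `sample` with replacement `reps` times and apply `statistic`."""
    rng = random.Random(seed)
    n = len(sample)
    return [statistic(rng.choices(sample, k=n)) for _ in range(reps)]

def percentile_interval(values, confidence=0.95):
    """Simple percentile interval taken from the sorted bootstrap values."""
    ordered = sorted(values)
    alpha = 1 - confidence
    lo_idx = int((alpha / 2) * (len(ordered) - 1))
    hi_idx = int((1 - alpha / 2) * (len(ordered) - 1))
    return ordered[lo_idx], ordered[hi_idx]

# Synthetic "observed" sample, for illustration only.
data_rng = random.Random(1)
observed = [data_rng.lognormvariate(0, 0.75) for _ in range(60)]

boot_medians = bootstrap_distribution(observed, statistics.median, reps=2000, seed=7)
boot_se = statistics.stdev(boot_medians)   # bootstrap estimate of the SE
bias = statistics.fmean(boot_medians) - statistics.median(observed)
low, high = percentile_interval(boot_medians)

print(f"sample median: {statistics.median(observed):.3f}")
print(f"bootstrap SE:  {boot_se:.3f}, estimated bias: {bias:.3f}")
print(f"95% percentile interval: ({low:.3f}, {high:.3f})")
```

The list boot_medians is the empirical sampling distribution; it can be passed directly to a histogram or Q-Q plotting routine.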

Practical example — sample mean (baby weights)
Scenario: we want to estimate the mean birth weight (μ) using samples of n = 100 babies.
– Suppose population σ = 1.2 lb (or estimate s ≈ 1.2 from pilot data).
– SE = σ / sqrt(n) = 1.2 / 10 = 0.12 lb: sample means typically fall within about 0.12 lb of μ (roughly 68% of them, under the normal approximation).
– If you draw many samples of 100 babies and compute x̄ for each, those x̄ values will form a sampling distribution approximately normal with mean μ and SD = 0.12.
– A 95% confidence interval for μ based on one sample mean x̄ is x̄ ± 1.96 × 0.12 = x̄ ± 0.235 lb (assuming σ is known). If σ is unknown, use s in its place and a t critical value with n − 1 degrees of freedom instead of 1.96.
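The arithmetic in this example can be reproduced in a few lines. This is only a sketch: σ = 1.2 and n = 100 come from the scenario above, while the sample mean of 7.5 lb is a hypothetical value used to show a concrete interval.

```python
import math
from statistics import NormalDist

sigma = 1.2   # population SD from the scenario (lb)
n = 100       # sample size from the scenario

se = sigma / math.sqrt(n)               # 1.2 / 10 = 0.12
z_star = NormalDist().inv_cdf(0.975)    # about 1.96 for 95% confidence
margin = z_star * se                    # about 0.235

# Share of sample means within one SE of mu under the normal approximation.
within_one_se = NormalDist(0, se).cdf(se) - NormalDist(0, se).cdf(-se)

x_bar = 7.5   # hypothetical observed sample mean (lb), for illustration only
ci = (x_bar - margin, x_bar + margin)

print(f"SE = {se:.2f} lb, margin of error = {margin:.3f} lb")
print(f"share of sample means within 1 SE of mu: {within_one_se:.3f}")
print(f"95% CI for mu: ({ci[0]:.3f}, {ci[1]:.3f})")
```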

Calculating sample size (practical formulas)
– For estimating a mean with margin of error M, where z* is the critical value for the chosen confidence level (for example, z* ≈ 1.96 for 95%):
n = (z* × σ / M)^2
Use a σ estimate from pilot data, or a conservative (larger) value.
– For estimating a proportion with margin of error M:
n = (z*^2 × p(1 − p)) / M^2
If p unknown, use p = 0.5 for maximum variance (conservative).
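Both sample-size formulas can be wrapped in small helper functions. The sketch below rounds up to the next whole unit, which is the usual convention; the function names and the example inputs (σ = 1.2, M = 0.2 for the mean; M = 0.03 for the proportion) are illustrative assumptions.

```python
import math
from statistics import NormalDist

def z_critical(confidence):
    """Two-sided standard normal critical value z* for a confidence level."""
    return NormalDist().inv_cdf(0.5 + confidence / 2)

def sample_size_mean(sigma, margin, confidence=0.95):
    """n = (z* * sigma / M)^2, rounded up."""
    return math.ceil((z_critical(confidence) * sigma / margin) ** 2)

def sample_size_proportion(margin, p=0.5, confidence=0.95):
    """n = z*^2 * p(1 - p) / M^2, rounded up; p = 0.5 is the conservative default."""
    z = z_critical(confidence)
    return math.ceil(z ** 2 * p * (1 - p) / margin ** 2)

n_mean = sample_size_mean(sigma=1.2, margin=0.2)   # hypothetical inputs
n_prop = sample_size_proportion(margin=0.03)       # hypothetical inputs

print(f"n for mean (sigma=1.2, M=0.2, 95%): {n_mean}")
print(f"n for proportion (M=0.03, p=0.5, 95%): {n_prop}")
```

These formulas assume simple random sampling from a large population; a complex design would need the result scaled by a design effect.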

Plotting sampling distributions — practical tips
– For analytic approach: plot a normal density curve with mean = population parameter and SD = SE; optionally overlay histogram of simulated sample statistics.
– For empirical approach: plot histogram or kernel density of simulated or bootstrap statistics; overlay normal curve to assess normality; use Q–Q plots to check deviations from normal.
– Annotate with mean, SE, and confidence interval boundaries to aid interpretation.

Special techniques and extensions
– t-distribution: when σ is unknown and the population is approximately normal, the statistic (x̄ − μ)/(s / sqrt(n)) follows a t-distribution with n − 1 degrees of freedom; the t and normal distributions differ most when n is small.
– Bootstrap: powerful for complex statistics or when theoretical sampling distributions are difficult to derive. Resample the observed sample many times (typically ≥1,000) and compute the statistic each time.
– Stratified sampling: reduces variance by sampling within homogeneous strata; sampling distribution of combined estimator accounts for stratum weights.
– Cluster sampling: useful for cost-effective data collection, but increases variance relative to simple random sampling; account for design effect when computing SE.
– Finite population correction (FPC): if sampling fraction n/N > ~5%, adjust SE by sqrt((N − n)/(N − 1)).
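For the stratified case, a minimal sketch of the combined estimator and its standard error is shown below. It uses the standard weighted form, estimate = Σ W_h x̄_h with variance Σ W_h^2 s_h^2 / n_h, and ignores any finite population correction within strata. The three strata and all their numbers are invented for illustration.

```python
import math

def stratified_mean_and_se(strata):
    """
    strata: list of (weight, sample_mean, sample_sd, sample_size) tuples,
    where the weights are population shares W_h that sum to 1.
    Returns the weighted mean estimate and its standard error.
    """
    total_weight = sum(w for w, _, _, _ in strata)
    if not math.isclose(total_weight, 1.0):
        raise ValueError("stratum weights must sum to 1")
    estimate = sum(w * mean for w, mean, _, _ in strata)
    variance = sum(w ** 2 * sd ** 2 / n for w, _, sd, n in strata)
    return estimate, math.sqrt(variance)

# Hypothetical strata: (W_h, mean, SD, n_h).
strata = [
    (0.6, 50.0, 8.0, 120),
    (0.3, 65.0, 10.0, 60),
    (0.1, 90.0, 15.0, 20),
]
estimate, se = stratified_mean_and_se(strata)

print(f"stratified estimate: {estimate:.2f}, SE: {se:.4f}")
```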

Why sampling distributions matter — practical uses
– Quantify uncertainty: SE and sampling distributions show how much sample results can differ from true population values.
– Construct confidence intervals: sampling distribution defines how wide CIs must be to achieve desired coverage.
– Hypothesis testing: p-values come from comparing observed statistics to their sampling distributions under the null hypothesis.
– Decision-making: businesses, governments, and researchers rely on sampling distributions to decide whether observed effects are likely real or due to sampling variability. (Investopedia)
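To make the hypothesis-testing use concrete, here is a sketch of a two-sided z-test for a mean with σ treated as known. The p-value is the probability, under the null sampling distribution of x̄, of a result at least as far from μ0 as the one observed. The function name z_test_mean and the inputs (x̄ = 7.8, μ0 = 7.5, σ = 1.2, n = 100) are hypothetical.

```python
import math
from statistics import NormalDist

def z_test_mean(x_bar, mu_0, sigma, n):
    """Two-sided z-test for a mean, assuming sigma is known."""
    se = sigma / math.sqrt(n)
    z = (x_bar - mu_0) / se
    # Tail area on both sides of the null sampling distribution.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

# Hypothetical inputs for illustration.
z, p_value = z_test_mean(x_bar=7.8, mu_0=7.5, sigma=1.2, n=100)

print(f"z = {z:.2f}, two-sided p-value = {p_value:.4f}")
```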

Common pitfalls and how to avoid them
– Ignoring bias: a large sample from a biased procedure does not fix bias. Ensure randomization or proper weighting.
– Assuming normality too soon: for small n or highly skewed populations, normal approximations may be poor; use bootstrap or transform the data.
– Underestimating SE in complex designs: account for clustering, stratification, and weighting in variance estimation.
– Confusing sampling variability with measurement error or other sources of uncertainty: address each source explicitly.

Fast facts (summary)
– Mean of sampling distribution of x̄ = population mean μ.
– SE(x̄) = σ / sqrt(n) (or s / sqrt(n) if σ unknown); decreases as n increases.
– By the CLT, the sampling distribution of x̄ is approximately normal for large n, whatever the population's shape, provided the population variance is finite.
– Use bootstrap when theoretical sampling distributions are unavailable or assumptions fail.

References and further reading
– Investopedia, “Sampling Distribution” (Ryan Oakley) — overview and examples.
– Penn State University, Eberly College of Science, “4.1 — Sampling Distributions” — Central Limit Theorem and properties of sampling distributions.
– New Jersey Institute of Technology, “Sampling Distributions” — distributions for sample mean, proportion, variance.
– Organisation for Economic Co-operation and Development (OECD), “Population” — definitions related to populations and samples.

Practical checklist for conducting a sampling-distribution–based analysis
1. Define target population and parameter of interest.
2. Choose sampling design that yields representative data (SRS, stratified, cluster, etc.).
3. Determine sample size using desired margin of error and confidence level; account for design effect.
4. Collect data ensuring independence and minimizing measurement error/nonresponse.
5. Compute sample statistic(s) and estimate SE (analytic formula or bootstrap).
6. Check assumptions: normality (CLT), independence, homoscedasticity as relevant.
7. Use the sampling distribution to form confidence intervals, run hypothesis tests, and make decisions.
8. Report results with uncertainty measures and document sampling method and limitations.
