Hypothesis Testing - DominionFX

Hypothesis testing (also called significance testing) is a formal statistical method for deciding whether an observed pattern in sample data provides enough evidence to support a particular claim about a population. It translates an empirical question into two opposing statements (the null and the alternative hypotheses), computes how likely data at least as extreme as those observed would be if the null were true, and then decides whether that probability is small enough to reject the null hypothesis.

Key ideas (short)
– Null hypothesis (H0): a baseline claim—often “no effect,” “no difference,” or a specified value (e.g., μ = 0).
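– Alternative hypothesis (H1 or Ha): what you want to show (e.g., μ ≠ 0, μ > 0, or p < p0).
– Decision rule: compare the p-value with the significance level α chosen in advance. If p-value ≤ α, reject H0; if p-value > α, fail to reject H0.
– Report the result with effect sizes and confidence intervals so readers can judge practical as well as statistical significance.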

Four-step process (practical)
1. State hypotheses
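• Example formats: two-tailed (H0: p = 0.5 vs H1: p ≠ 0.5), right-tailed (H1: μ > μ0), left-tailed (H1: μ < μ0).
2. Choose the significance level α and a suitable test
3. Compute the test statistic and p-value
4. Decide and interpret
• p ≤ α: reject H0.
• p > α: fail to reject H0 (do not claim H0 is true; consider power/sensitivity).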
• State possible errors and limitations.
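
The process above can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: it assumes a one-sample z-test for a mean with known σ, the function name z_test_mean and its tail labels are hypothetical, and the sample numbers are made up.

```python
from math import sqrt
from statistics import NormalDist


def z_test_mean(sample_mean, mu0, sigma, n, alpha=0.05, tail="two-sided"):
    """One-sample z-test for a mean with known sigma (illustrative helper)."""
    # State hypotheses: H0 is mu = mu0; `tail` selects the alternative.
    # alpha and the test itself are fixed before looking at the data.
    z = (sample_mean - mu0) / (sigma / sqrt(n))  # test statistic under H0
    std_normal = NormalDist()
    if tail == "two-sided":    # H1: mu != mu0
        p_value = 2 * std_normal.cdf(-abs(z))
    elif tail == "greater":    # H1: mu > mu0
        p_value = 1 - std_normal.cdf(z)
    elif tail == "less":       # H1: mu < mu0
        p_value = std_normal.cdf(z)
    else:
        raise ValueError("tail must be 'two-sided', 'greater', or 'less'")
    # Decide by comparing the p-value with alpha.
    decision = "reject H0" if p_value <= alpha else "fail to reject H0"
    return z, p_value, decision


# Made-up numbers: sample mean 103 from n = 50, testing mu0 = 100 with sigma = 15.
z, p, decision = z_test_mean(103.0, 100.0, 15.0, 50, tail="greater")
print(f"z = {z:.3f}, p = {p:.4f}, {decision}")
```

The tail argument maps directly onto the two-tailed, right-tailed, and left-tailed formats in step 1; choosing it after seeing the data would undermine the test.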

Concrete example — coin fairness (practice)
– Question: Is a penny fair? H0: p = 0.5 (probability of heads = 50%). H1: p ≠ 0.5. Two-tailed test.
– Sample: 100 flips.
• Case A: 40 heads. p̂ = 0.40. Under H0, SE = sqrt(0.5·0.5/100) = 0.05. z = (0.40 − 0.50)/0.05 = −2.00. Two-tailed p = 2·Φ(−2) ≈ 0.0455. With α = 0.05, p < α, so reject H0: the data give evidence that the coin is biased.
• Case B: 48 heads. p̂ = 0.48. z = (0.48 − 0.50)/0.05 = −0.40. Two-tailed p = 2·Φ(−0.40) ≈ 0.69. Fail to reject H0: the result is plausibly due to sampling variation.
Note: “fail to reject” does not prove fairness; it means the data do not provide strong evidence of bias.
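
The two cases can be checked with a short script. This is a sketch under the same normal approximation used above; the helper name proportion_z_test is an assumption for illustration, and only the Python standard library is used.

```python
from math import sqrt
from statistics import NormalDist


def proportion_z_test(successes, n, p0=0.5):
    """Two-tailed z-test for a proportion using the normal approximation."""
    p_hat = successes / n
    se = sqrt(p0 * (1 - p0) / n)             # standard error under H0
    z = (p_hat - p0) / se
    p_value = 2 * NormalDist().cdf(-abs(z))  # both tails
    return z, p_value


for label, heads in [("Case A", 40), ("Case B", 48)]:
    z, p = proportion_z_test(heads, 100)
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"{label}: z = {z:.2f}, p = {p:.4f} -> {verdict}")
```

Case A lands only just below α under this approximation, so a more exact method could change the verdict; borderline results deserve cautious wording.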

Types of tests and when to use them (practical)
– z-test for a mean or proportion (large samples or known σ).
– t-test for a population mean (small samples, unknown σ). Paired t-test for before-after designs.
– Chi-square test for association or goodness-of-fit (categorical data).
– Fisher’s exact test for small-sample contingency tables.
– ANOVA for comparing means across 3+ groups.
– Regression t-tests/Wald tests for coefficients in regression models.
Choose the test that matches the data type (continuous vs categorical), sample size, and assumptions.
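
As one illustration from the list above, Fisher's exact test for a 2×2 table can be written with the standard library alone. This is a sketch, not a substitute for a statistics package: the function name is hypothetical, the table is made up, and the two-sided rule used here (summing every table no more probable than the observed one) is one common convention among several.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    n = row1 + row2

    def table_prob(x):
        # Hypergeometric probability of x in the top-left cell, margins fixed.
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = table_prob(a)
    # Sum the probabilities of all tables no more likely than the observed one
    # (the small tolerance guards against floating-point ties).
    return sum(
        prob
        for prob in map(table_prob, range(lowest, highest + 1))
        if prob <= p_observed * (1 + 1e-9)
    )


# Made-up small table: 3 of 4 successes in group 1, 1 of 4 in group 2.
p_value = fisher_exact_two_sided(3, 1, 1, 3)
print(f"two-sided p = {p_value:.4f}")
```

With counts this small the chi-square approximation would be unreliable, which is why an exact test is the usual choice here.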

Errors and statistical power
– Type I error (false positive): rejecting H0 when it is true. Controlled by α (e.g., 5%).
– Type II error (false negative): failing to reject H0 when H1 is true. Its probability is denoted β.
– Power = 1 − β: probability of detecting a true effect of a specified size. Power depends on effect size, α, sample size, and variability. Plan sample size to achieve adequate power (commonly 80% or 90%) for the smallest effect that matters.
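
The relationship between power, effect size, α, sample size, and variability can be made concrete for the simplest case. The sketch below assumes a two-sided one-sample z-test with known σ; the function names and the example effect size (half a standard deviation) are assumptions for illustration, and real studies typically need a formula matched to their own design.

```python
from math import ceil, sqrt
from statistics import NormalDist

_norm = NormalDist()


def z_test_power(delta, sigma, n, alpha=0.05):
    """Power of a two-sided one-sample z-test when the true shift is delta."""
    z_crit = _norm.inv_cdf(1 - alpha / 2)
    shift = delta * sqrt(n) / sigma  # true effect in standard-error units
    # Probability that the statistic falls in either rejection region.
    return _norm.cdf(shift - z_crit) + _norm.cdf(-shift - z_crit)


def required_n(delta, sigma, alpha=0.05, power=0.80):
    """Approximate sample size; ignores the negligible far-tail term."""
    z_alpha = _norm.inv_cdf(1 - alpha / 2)
    z_beta = _norm.inv_cdf(power)
    return ceil(((z_alpha + z_beta) * sigma / delta) ** 2)


n = required_n(delta=0.5, sigma=1.0)
achieved = z_test_power(0.5, 1.0, n)
print(f"n = {n}, achieved power = {achieved:.3f}")
```

Halving delta roughly quadruples the required n, which is why the smallest effect that matters should be fixed before data collection.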

Assumptions and validity
– Valid inference requires that key assumptions hold: random sampling, independence of observations, appropriate distributional assumptions (or a sufficiently large sample for asymptotic approximations), and correct model specification.
– Violations (e.g., dependent observations, heavy-tailed distributions, heteroskedasticity) can lead to invalid p-values and wrong conclusions. Use robust methods or nonparametric tests when assumptions fail.
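
One nonparametric option is a permutation test, sketched below for a difference in means between two small groups. The function name and data are made up for illustration. The test drops the normality assumption but still assumes the observations are exchangeable under H0, so it does not rescue dependent data; exhaustive enumeration is also only practical for small samples.

```python
from itertools import combinations
from statistics import mean


def permutation_test(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    pooled = list(group_a) + list(group_b)
    n_a, n_b = len(group_a), len(group_b)
    observed = abs(mean(group_a) - mean(group_b))
    total = sum(pooled)
    extreme = count = 0
    # Enumerate every way of relabelling n_a of the pooled values as group A.
    for idx in combinations(range(len(pooled)), n_a):
        sum_a = sum(pooled[i] for i in idx)
        diff = abs(sum_a / n_a - (total - sum_a) / n_b)
        extreme += diff >= observed - 1e-12  # tolerance for float ties
        count += 1
    return extreme / count


# Made-up measurements for two small groups.
p_value = permutation_test([1, 2, 3], [4, 5, 6])
print(f"permutation p = {p_value:.2f}")
```

With only three observations per group there are 20 possible relabellings, so the smallest attainable two-sided p-value is 0.10; tiny samples limit what any test can show.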

Multiple comparisons and corrections
– Testing many hypotheses increases false positive risk. Common corrections: Bonferroni, Holm, Benjamini–Hochberg (controls false discovery rate). Pre-register hypotheses when possible to avoid “data dredging” or p-hacking.
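
The three corrections named above can be sketched directly. These are minimal illustrative versions with hypothetical function names and made-up p-values; each returns, for every hypothesis in the original order, whether it would be rejected at level alpha.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject when p <= alpha / m (controls the family-wise error rate)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure (family-wise error rate, less conservative)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first non-rejection
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls false discovery rate)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank * alpha / m:
            cutoff = rank  # largest rank meeting its threshold
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


p_values = [0.001, 0.012, 0.02, 0.039, 0.3]  # made-up example
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
print("BH:        ", benjamini_hochberg(p_values))
```

On these values the three procedures reject one, two, and four hypotheses respectively, which shows the trade-off between strict family-wise control and the more permissive false discovery rate.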

Benefits of hypothesis testing
– Provides a structured, reproducible procedure to evaluate claims using data.
– Quantifies strength of evidence (p-values, confidence intervals).
– Helps avoid conclusions driven by intuition or anecdote; supports data-driven decision-making.
– Widely applicable across science, finance, business, medicine, and policy.

Limitations and common pitfalls
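– p-values are often misinterpreted: a p-value is not the probability that H0 is true. It is the probability, computed assuming H0 is true, of data at least as extreme as those observed.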
– “Statistical significance” is not the same as practical significance. Report effect sizes and CIs.
– Results depend on sample quality; biased or non-random samples undermine conclusions.
– Multiple testing, selective reporting, and p-hacking inflate false-positive rates.
– Hypothesis tests do not establish causality unless the design justifies causal inference (randomization, controlled experiments).
– “Fail to reject” is weaker than demonstrating equivalence; an equivalence test or power analysis is needed to argue similarity.
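
Reporting an effect size with a confidence interval can be as simple as the sketch below, which reuses the 40-heads-in-100-flips case from the coin example. It assumes a Wald interval, which estimates the standard error from p̂ rather than from the null value and is only one of several interval methods; the function name wald_ci is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(successes, n, confidence=0.95):
    """Wald confidence interval for a proportion (normal approximation)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    margin = z * sqrt(p_hat * (1 - p_hat) / n)  # SE estimated from p_hat
    return p_hat - margin, p_hat + margin


heads, flips, p0 = 40, 100, 0.5
low, high = wald_ci(heads, flips)
effect = heads / flips - p0  # estimated departure from a fair coin
print(f"effect = {effect:+.2f}, 95% CI for p = ({low:.3f}, {high:.3f})")
```

The interval shows both the estimated size of the bias and how uncertain it is, which a bare p-value does not convey.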

When did hypothesis testing begin?
– Early roots: John Arbuthnot (1710) is often credited with an early significance-style argument using birth counts to argue that more males are born than females, reasoning the outcome was unlikely by chance [Elder Research].
– Modern development: formal frameworks were developed in the early–mid 20th century (Ronald Fisher’s significance testing; Neyman and Pearson’s hypothesis testing framework of Type I/II errors, power, and decision rules).

Practical checklist for conducting and reporting a test
1. Define the research question clearly.
2. State H0 and H1 explicitly, and choose one- or two-tailed.
3. Choose α and justify it (context matters).
4. Select the appropriate statistical test and check assumptions.
5. Pre-calculate required sample size/power for the minimum effect of interest (if possible).
6. Compute test statistic, p-value, and confidence interval.
7. Report effect size with uncertainty, not just p-values.
8. Mention potential biases, assumption violations, and multiple-testing corrections if applicable.
9. Interpret in context: practical importance, not just statistical significance.
10. If possible, release data and code to improve reproducibility.

Explain like I’m five (ELI5)
Imagine you want to know if a new cookie recipe makes bigger cookies than the old one. You bake a few with each recipe and measure them. Hypothesis testing is a way to check if the difference you see could just be luck from sampling a few cookies, or whether it’s likely the new recipe really makes bigger cookies. You decide in advance how much chance of being tricked by luck you’ll accept (that’s α). Then you use math to see if the measured difference is big enough to be convincing.

The bottom line
Hypothesis testing is a foundational statistical tool for deciding whether observed data are compatible with a prespecified null hypothesis. When used properly—choosing appropriate tests, checking assumptions, planning for power, correcting for multiple comparisons, and reporting effect sizes and uncertainty—it supports rigorous, reproducible conclusions. Misuse, misinterpretation, or reliance solely on p-values can produce misleading results; combine hypothesis tests with good study design and transparent reporting.

Sources and further reading
– Investopedia, “Hypothesis Testing,” Jessica Olah.
– Sage, “Introduction to Hypothesis Testing.”
– Elder Research, “Who Invented the Null Hypothesis?”
– Formplus, “Hypothesis Testing: Definition, Uses, Limitations and Examples.”
