Statistical Significance

KEY TAKEAWAYS
– Statistical significance is a determination about whether observed data are unlikely to have occurred by chance alone under a specified null hypothesis.
– It is typically decided by hypothesis testing and a p-value; a conventional cutoff is α = 0.05 (5%), but the cutoff should depend on context.
– Statistical significance does not equal practical, clinical, or economic significance — always check effect sizes and confidence intervals.
– Important complementary concepts: null and alternative hypotheses, p-values, Type I/II errors, statistical power, assumptions of the test, and corrections for multiple comparisons.
– Widely used in medicine (drug trials, vaccines) and finance (event studies, product announcements), but misuse and misinterpretation are common.

UNDERSTANDING STATISTICAL SIGNIFICANCE
– Definition: Statistical significance is the conclusion that the observed relationship or effect in a sample is unlikely to be due to random sampling variation alone, given a null hypothesis that typically states “no effect” or “no difference.”
– Null hypothesis (H0): the default claim (e.g., treatment has no effect; returns before and after an event are the same).
– Alternative hypothesis (H1): the research claim (e.g., treatment reduces disease; returns change after an event).
– Decision rule: after calculating a test statistic and p-value, compare the p-value to a pre-specified significance level (α). If p ≤ α, reject H0 (result called “statistically significant”); if p > α, fail to reject H0.
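
The decision rule above can be written as a few lines of Python. This is only an illustrative sketch: the function name `decide` and its return strings are assumptions made for this example, not part of any library.

```python
def decide(p_value: float, alpha: float = 0.05) -> str:
    """Apply the decision rule: reject H0 when p <= alpha."""
    if not 0.0 <= p_value <= 1.0:
        raise ValueError("p_value must lie in [0, 1]")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in (0, 1)")
    return "reject H0" if p_value <= alpha else "fail to reject H0"


print(decide(0.04))              # reject H0
print(decide(0.04, alpha=0.01))  # fail to reject H0
```

The same p-value leads to different decisions under different α, which is why α should be fixed before the data are examined.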

WHAT IS A P-VALUE?
– The p-value is the probability, computed under the assumption that H0 is true, of observing a test statistic as extreme as (or more extreme than) the one observed.
– Interpretation pitfalls: a p-value is not the probability that H0 is true; it does not measure effect size or importance.
– Common convention: α = 0.05. If p ≤ 0.05, results are often labeled “statistically significant.” Some fields use stricter thresholds (e.g., 0.01), others more lenient, depending on consequences of errors.
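
To make the definition concrete, the sketch below converts a z statistic into a p-value using only Python's standard library. It assumes the test statistic follows a standard normal distribution under H0; the helper name `p_value_from_z` is hypothetical.

```python
from statistics import NormalDist


def p_value_from_z(z: float, two_sided: bool = True) -> float:
    """Tail probability of a z statistic under a standard normal null."""
    std_normal = NormalDist()  # mean 0, standard deviation 1
    if two_sided:
        # Probability of a statistic at least this far from 0 in either direction.
        return 2.0 * (1.0 - std_normal.cdf(abs(z)))
    # One-sided (upper tail): probability of a statistic at least this large.
    return 1.0 - std_normal.cdf(z)


print(round(p_value_from_z(1.96), 3))                   # 0.05
print(round(p_value_from_z(1.96, two_sided=False), 3))  # 0.025
```

A z statistic of about 1.96 sits right at the conventional two-sided 0.05 threshold, which is where that familiar critical value comes from.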

HOW STATISTICAL SIGNIFICANCE IS DETERMINED — STEP-BY-STEP
1. Define the question and hypotheses
• Formulate H0 and H1 clearly (one-sided vs. two-sided).
2. Choose significance level (α)
• Typical default: 0.05. Use smaller α for high-cost Type I errors.
3. Select appropriate statistical test
• Examples: t-test, chi-square, ANOVA, regression, nonparametric tests. Choice depends on variable types, distribution, sample design.
4. Check test assumptions
• Independence, normality (for some tests), equal variances, random sampling or assignment. If assumptions fail, use robust or nonparametric methods.
5. Compute test statistic and p-value
• Using analytic formulae or software (R, Python/scipy.stats, Stata, SPSS).
6. Make decision and report results
• Compare p-value to α; report p-value, test statistic, degrees of freedom (if relevant), sample size, effect size, and confidence interval.
7. Evaluate robustness and practical relevance
• Check effect size, confidence intervals, sensitivity to assumptions, and whether the observed effect is meaningful in context.
8. Consider multiple testing correction (if multiple hypotheses)
• Methods: Bonferroni, Holm, Benjamini–Hochberg (FDR).
9. Assess power and sample-size adequacy
• Low sample size can produce non-significant results even for meaningful effects (Type II error). Pre-study power calculations are recommended.
10. Pre-register study and avoid p-hacking
• Pre-specify analyses and report all comparisons to avoid selective reporting.
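
Steps 3 to 7 can be sketched with a permutation test, one of the nonparametric options mentioned in step 4, together with Cohen's d as the effect size. The two samples below are made-up numbers for illustration only, and the function names are hypothetical; in practice a library routine (for example in R or scipy.stats) would usually be preferred.

```python
import random
from math import sqrt
from statistics import mean, variance


def permutation_p_value(a, b, n_resamples=5_000, seed=0):
    """Two-sided Monte Carlo permutation test for a difference in means."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = abs(mean(b) - mean(a))
    pooled = list(a) + list(b)
    n_a = len(a)
    hits = 0
    for _ in range(n_resamples):
        rng.shuffle(pooled)  # relabel observations as if H0 were true
        diff = abs(sum(pooled[n_a:]) / len(b) - sum(pooled[:n_a]) / n_a)
        if diff >= observed - 1e-12:  # tolerance guards against float round-off
            hits += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    return (hits + 1) / (n_resamples + 1)


def cohens_d(a, b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * variance(a) + (n_b - 1) * variance(b)) / (n_a + n_b - 2)
    return (mean(b) - mean(a)) / sqrt(pooled_var)


# Hypothetical measurements for a control and a treated group.
control = [5.1, 4.9, 5.6, 5.0, 5.3, 4.8, 5.2, 5.4]
treated = [5.9, 6.1, 5.7, 6.3, 5.8, 6.0, 6.2, 5.6]

p = permutation_p_value(control, treated)
print("p-value:", p)
print("Cohen's d:", round(cohens_d(control, treated), 2))
print("decision at alpha = 0.05:", "reject H0" if p <= 0.05 else "fail to reject H0")
```

Reporting the effect size next to the p-value follows step 6: the p-value says the difference is hard to explain by chance, while Cohen's d says how large it is.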

PRACTICAL EXAMPLE — FINANCE
– Scenario: To test whether some traders had advance knowledge, analyst Alex compares average daily returns before and after a company’s sudden failure.
– Result A: p = 0.28 (> 0.05) — difference is consistent with chance; do not reject H0.
– Result B: p = 0.0001 (< 0.05) — observed difference is very unlikely under H0; reject H0 and investigate further.
(Example adapted from Investopedia and event-study literature) [Investopedia; Hwang 2013; Rothenstein et al. 2011].

PRACTICAL EXAMPLE — MEDICAL TRIALS
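– A pharmaceutical trial reports p = 0.04 for a new insulin’s efficacy. Because p < 0.05, H0 is rejected at the conventional α = 0.05 level.

COMMON MISINTERPRETATIONS AND PITFALLS
– Non-significance (p > α) does not prove H0; it indicates insufficient evidence to reject H0.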
– P-hacking / data-dredging: selective analysis or multiple unreported tests inflate the chance of false positives.
– Overreliance on 0.05 cut-off: thresholds are conventions, not laws. Consider context, costs of errors, and prior evidence.
– Ignoring assumptions and model fit: violation of test assumptions can invalidate inference.
– Not reporting effect sizes or confidence intervals: p-values alone are incomplete.
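
The inflation of false positives from running many tests can be quantified with a one-line formula. The sketch below assumes the tests are independent and that every null hypothesis is true; the function name is hypothetical.

```python
def familywise_error_rate(n_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true: 1 - (1 - alpha) ** n_tests."""
    return 1.0 - (1.0 - alpha) ** n_tests


for m in (1, 5, 20):
    print(m, round(familywise_error_rate(m), 3))
# 1 0.05
# 5 0.226
# 20 0.642
```

With 20 uncorrected tests at α = 0.05, at least one "significant" result is more likely than not even when no real effect exists.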

BEST PRACTICES & PRACTICAL STEPS CHECKLIST (for analysts and researchers)
1. Pre-specify hypothesis, endpoint, and analysis plan where possible (pre-registration).
2. Choose α based on context (0.05 default; use smaller for high-cost false positives).
3. Select test and verify assumptions; if violated, use robust methods or nonparametric tests.
4. Compute p-value and test statistic using validated software.
5. Always report: p-value, effect size (e.g., mean difference, odds ratio), 95% confidence interval, sample sizes, and any data exclusions.
6. For multiple comparisons, correct for familywise error or control FDR.
7. Perform power/sample size calculations before study to ensure adequate ability to detect meaningful effects.
8. Perform sensitivity analyses and robustness checks (alternative models, outlier handling).
9. Interpret results in context: biological, economic, clinical relevance; prior evidence; plausibility.
10. Share data and code when possible for reproducibility.
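
Item 6 of the checklist can be illustrated with plain-Python versions of three common corrections. This is a minimal sketch with hypothetical function names and made-up p-values; each function returns one reject/keep flag per input p-value, in the original order.

```python
def bonferroni(p_values, alpha=0.05):
    """Reject H0_i when p_i <= alpha / m (controls familywise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def holm(p_values, alpha=0.05):
    """Holm step-down procedure (controls familywise error)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, i in enumerate(order):  # rank 0 is the smallest p-value
        if p_values[i] <= alpha / (m - rank):
            reject[i] = True
        else:
            break  # stop at the first p-value that fails its threshold
    return reject


def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (controls the FDR)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank  # remember the largest rank that passes
    reject = [False] * m
    for i in order[:cutoff]:
        reject[i] = True
    return reject


# Hypothetical p-values from five comparisons.
p_values = [0.001, 0.012, 0.015, 0.03, 0.20]
print("Bonferroni:", bonferroni(p_values))
print("Holm:      ", holm(p_values))
print("BH (FDR):  ", benjamini_hochberg(p_values))
```

On these inputs Bonferroni rejects one hypothesis, Holm three, and Benjamini-Hochberg four, reflecting the trade-off between strict familywise control and the more permissive false discovery rate.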

ALTERNATIVES AND COMPLEMENTARY APPROACHES
– Confidence intervals: show range of plausible effect sizes and help judge practical significance.
– Effect-size metrics: Cohen’s d, risk ratios, differences in means — quantify magnitude.
– Bayesian methods: compute posterior probabilities and credible intervals; useful when prior information is available.
– Equivalence and non-inferiority testing: useful when proving “no important difference” is required.
– Replication: independent replication studies help confirm findings.
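
Two of these ideas, confidence intervals and equivalence testing, can be sketched together. The code assumes large samples so that a normal approximation is reasonable, and it uses the standard result that a two one-sided tests (TOST) procedure at level α matches checking whether the (1 − 2α) confidence interval lies inside the equivalence margin. The summary statistics and function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def mean_diff_ci(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Large-sample (normal-approximation) CI for mean_b - mean_a."""
    diff = mean_b - mean_a
    se = sqrt(sd_a ** 2 / n_a + sd_b ** 2 / n_b)
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    return diff - z * se, diff + z * se


def is_equivalent(ci_low, ci_high, margin):
    """CI-inclusion check: the whole interval lies inside (-margin, +margin)."""
    return -margin < ci_low and ci_high < margin


# Hypothetical summary statistics for two large groups.
ci95 = mean_diff_ci(120.0, 15.0, 400, 121.0, 15.0, 400, level=0.95)
ci90 = mean_diff_ci(120.0, 15.0, 400, 121.0, 15.0, 400, level=0.90)
print("95% CI:", tuple(round(x, 2) for x in ci95))
print("equivalent within +/-3 (alpha = 0.05):", is_equivalent(*ci90, margin=3.0))
print("equivalent within +/-2 (alpha = 0.05):", is_equivalent(*ci90, margin=2.0))
```

Here the 95% interval includes zero, so the difference is not statistically significant, yet equivalence is supported only for the wider margin. Failing to reject H0 and demonstrating "no important difference" are separate questions.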

REPORTING TEMPLATE (brief)
– Research question/hypothesis
– Design and sample (N, randomization)
– Test used and assumptions checked
– Significance level (α)
– Test statistic, degrees of freedom, and p-value
– Effect size and 95% CI
– Power (post-hoc or planned) and multiple-testing corrections
– Limitations, practical relevance, and robustness checks
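
The planned-power entry in the template is usually backed by a sample-size calculation. The sketch below uses the common normal-approximation formula for a two-sided, two-sample comparison of means, n per group ≈ 2 × ((z_{1−α/2} + z_{power}) / d)², where d is the standardized effect size. The function name is hypothetical, and exact t-based calculations give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided two-sample test."""
    if effect_size_d <= 0:
        raise ValueError("effect_size_d must be positive")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return ceil(2.0 * ((z_alpha + z_power) / effect_size_d) ** 2)


print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # 393
```

Halving the effect size roughly quadruples the required sample, which is why small but meaningful effects often go undetected in underpowered studies.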

HOW STATISTICAL SIGNIFICANCE IS USED IN PRACTICE
– Medicine: to decide whether a drug or vaccine produces a measurable effect beyond chance; regulators require statistically robust evidence combined with safety data [StatPearls; ADA Onset 7].
– Finance: event studies measure abnormal returns around announcements; statistical significance indicates whether price moves are likely tied to the event versus chance [Hwang 2013; Rothenstein et al. 2011].
– Business/marketing: A/B testing to decide which website variant or campaign performs better; practical significance and conversion lift matter as much as p-values.
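
For the A/B testing case, a two-proportion z-test is one common choice. The sketch below assumes large samples and independent visitors; the conversion counts and the function name are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (absolute lift, z statistic, two-sided p-value) for B versus A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # common rate under H0
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n_a + 1.0 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return rate_b - rate_a, z, p_value


# Hypothetical experiment: 200 of 5,000 visitors convert on A, 260 of 5,000 on B.
lift, z, p_value = two_proportion_z_test(200, 5000, 260, 5000)
print(f"absolute lift: {lift:.3f}, z = {z:.2f}, p = {p_value:.4f}")
```

Whether a lift of about 1.2 percentage points justifies shipping variant B is a business judgment about practical significance; the p-value only says the lift is unlikely to be noise.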

THE BOTTOM LINE
Statistical significance is a formal way to judge whether observed effects are unlikely to be explained by chance under a prespecified null hypothesis. It is central to hypothesis testing but must be interpreted alongside effect sizes, confidence intervals, study design, assumptions, and real-world importance. Use careful planning (including power calculations and pre-registration), transparent reporting, corrections for multiple tests, and robustness checks to reduce false positives and improve the credibility of findings.

SOURCES & FURTHER READING
– Investopedia — “Statistically Significant” (Paige McLaughlin).
– Tenny S, Abdelgawad I. “Statistical Significance.” StatPearls Publishing, 2023.
– Hwang TJ. “Stock Market Returns and Clinical Trial Results of Investigational Compounds: An Event Study Analysis of Large Biopharmaceutical Companies.” PLoS One, 2013.
– Rothenstein J, et al. “Company Stock Prices Before and After Public Announcements Related to Oncology Drugs.” Journal of the National Cancer Institute, 2011.
– American Diabetes Association. “Efficacy and Safety of Fast-Acting Aspart Compared With Insulin Aspart… Onset 7 Trial.”
