Confidence interval - DominionFX

What is a confidence interval (CI)?
– A confidence interval is a range of values, built from a sample, that is likely to contain an unknown population parameter (for example, a population mean). The confidence level (commonly 95% or 99%) is the long-run proportion of such intervals that would contain the true parameter if you repeated the same sampling procedure many times.

Key definitions (first use)
– Population parameter: the true number you want to estimate (e.g., population mean µ).
– Sample statistic: the value computed from your sample (e.g., sample mean x̄).
– Confidence level: the chosen probability (e.g., 95%) that the interval procedure captures the true parameter in repeated sampling.
– Significance level (α): 1 − confidence level (e.g., α = 0.05 for a 95% CI).
– Margin of error (MoE): the half-width of the interval, i.e., how far above and below the sample statistic the interval extends.

How to read a CI in plain terms
– Saying “a 95% confidence interval for the mean is 88 to 92” means: using this method, 95% of similarly constructed intervals from repeated random samples would include the true population mean. It does not mean there is a 95% probability that this particular interval contains the mean; a probability statement of that kind belongs to a Bayesian analysis and requires additional assumptions (a prior distribution).
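The repeated-sampling reading can be checked with a small simulation. The sketch below is illustrative only: the population (normal, mean 90, standard deviation 10), the sample size, and the function name `coverage_rate` are assumptions chosen for the demonstration, and σ is treated as known so that a z-based interval applies.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage_rate(mu=90.0, sigma=10.0, n=30, level=0.95, trials=4000, seed=1):
    """Fraction of z-based intervals (sigma known) that contain the true mean."""
    rng = random.Random(seed)
    # Two-sided critical value, about 1.96 for a 95% level.
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_star * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - half_width <= mu <= x_bar + half_width:
            hits += 1
    return hits / trials


# The observed fraction should land close to the nominal 0.95.
print(f"Observed coverage: {coverage_rate():.3f}")
```

Each simulated interval either contains the true mean or it does not; the 95% describes how often the procedure succeeds across the repetitions.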

When CIs are used
– To quantify uncertainty around sample estimates.
– To judge whether results exclude a null value (for example, zero effect). If a CI for a mean difference includes zero, the data do not provide strong evidence of a nonzero difference.
– To report precision in forecasting, polling, experiments, and many applied analyses.

Basic formulas (for a population mean)
– If the population standard deviation σ is known (rare in practice), use:
CI = x̄ ± z* × (σ / √n)
where z* is the z-score for the chosen confidence level (e.g., 1.96 for 95%).
– If σ is unknown (typical), replace z* with t* from the Student’s t-distribution with df = n − 1:
CI = x̄ ± t* × (s / √n)

t* is the critical value from Student’s t-distribution with df = n − 1 and depends on the chosen confidence level (for example, t* ≈ 2.064 for 95% confidence with df = 24). Use the t-based formula when σ (population standard deviation) is unknown and sample size is not large enough to safely rely on the normal approximation.
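As a minimal sketch of the two formulas above, the helper below computes an interval for a mean from summary statistics. The function names `z_critical` and `mean_ci` are illustrative, not from any library. The t critical value is passed in by hand (for example from a t-table), because the Python standard library has no t-distribution; that is an assumption of this sketch.

```python
from math import sqrt
from statistics import NormalDist


def z_critical(level=0.95):
    """Two-sided z* for the given confidence level (about 1.96 for 95%)."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def mean_ci(x_bar, sd, n, critical):
    """Interval x_bar ± critical × sd/√n.

    Pass sigma and z* when sigma is known, or s and t* (df = n - 1) otherwise.
    """
    margin = critical * sd / sqrt(n)
    return x_bar - margin, x_bar + margin


# t-based interval with t* looked up in a table: 2.064 for 95%, df = 24.
low, high = mean_ci(x_bar=100, sd=15, n=25, critical=2.064)
print(f"t-based 95% CI: [{low:.3f}, {high:.3f}]")

# z-based interval for the same summary numbers, treating 15 as a known sigma.
low_z, high_z = mean_ci(x_bar=100, sd=15, n=25, critical=z_critical(0.95))
print(f"z-based 95% CI: [{low_z:.3f}, {high_z:.3f}]")
```

The z-based interval is slightly narrower because z* is smaller than t* at the same confidence level; the gap shrinks as n grows.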

Confidence interval for a proportion
– For a sample proportion p̂ (p-hat), the commonly used approximate CI (Wald interval) is:
CI = p̂ ± z* × √[p̂(1 − p̂) / n]
where z* is the z-score for the chosen confidence level (e.g., 1.96 for 95%).
– Assumptions: observations independent, and np̂ and n(1 − p̂) are both reasonably large (rule of thumb ≥ 5 or ≥ 10). For small samples or extreme p̂, use an exact (Clopper–Pearson) or Wilson interval instead.
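A short sketch of the Wald interval and its rule-of-thumb check follows. The function names `wald_ci` and `wald_conditions_ok` and the default threshold of 10 are illustrative choices, not a standard API.

```python
from math import sqrt
from statistics import NormalDist


def wald_ci(successes, n, level=0.95):
    """Wald interval p_hat ± z* × sqrt(p_hat(1 - p_hat)/n)."""
    p_hat = successes / n
    z_star = NormalDist().inv_cdf(1 - (1 - level) / 2)
    se = sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z_star * se, p_hat + z_star * se


def wald_conditions_ok(successes, n, threshold=10):
    """Rule of thumb: n*p_hat and n*(1 - p_hat) both at least `threshold`."""
    return successes >= threshold and (n - successes) >= threshold


# 220 successes out of 400 gives p_hat = 0.55.
low, high = wald_ci(220, 400)
print(f"95% Wald CI: [{low:.4f}, {high:.4f}]")
print("Rule of thumb satisfied:", wald_conditions_ok(220, 400))
```

When `wald_conditions_ok` returns False, the Wilson or exact interval mentioned above is usually the safer choice.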

Margin of error (MoE)
– The margin of error is the half-width of the interval:
MoE = critical value × standard error
– For a mean (σ unknown): MoE = t* × (s / √n)
– For a proportion: MoE = z* × √[p̂(1 − p̂) / n]
– MoE lets you plan precision and communicate uncertainty succinctly.

Sample-size calculations (planning for a desired MoE)
– For estimating a mean (approximate, using σ estimate):
n ≈ (z* × σ / E)^2
where E is the desired margin of error and σ is a planning estimate taken from pilot data, domain knowledge, or a conservative guess.
– For estimating a proportion:
n ≈ (z*^2 × p*(1 − p*)) / E^2
where p* is a prior estimate. If no prior, use p* = 0.5 (maximizes variance; conservative).
– Round n up to the next whole number. If sampling without replacement from a finite population N and n is a non-negligible fraction of N (often >5%), apply the finite population correction factor:
effective SE = √[(p(1−p)/n) × (N − n)/(N − 1)] (similarly for means).

Worked numeric examples

1) Mean with unknown σ (t-based)
– Data: sample mean x̄ = 100, sample standard deviation s = 15, n = 25. Desired: 95% CI.
– df = 24 → t* ≈ 2.064.
– Standard error SE = s / √n = 15 / 5 = 3.
– Margin of error MoE = t* × SE ≈ 2.064 × 3 = 6.192.
– 95% CI = 100 ± 6.192 = [93.808, 106.192].
– Interpretation: With repeated samples of this size, about 95% of constructed intervals would contain the true population mean (frequentist interpretation; see caveats below).

2) Proportion (z-based)
– Data: n = 400, p̂ = 0.55. Desired: 95% CI.
– z* = 1.96.
– Standard error SE = √[0.55 × 0.45 / 400] = √[0.00061875] ≈ 0.024875.
– Margin of error MoE = z* × SE = 1.96 × 0.024875 ≈ 0.0488.
– 95% CI = 0.55 ± 0.0488 = [0.5012, 0.5988].
– Interpretation: the 95% CI for the population proportion is 0.5012 to 0.5988. In frequentist terms: if you repeated the same sampling process many times and constructed a 95% z-based CI from each sample, about 95% of those intervals would contain the true population proportion. See the caveats below about the z-approximation and small samples.

Conditions, limitations and alternatives
– When the z-based proportion CI is OK: n large enough that np̂ ≥ 10 and n(1−p̂) ≥ 10 (rule-of-thumb). This ensures the sampling distribution of p̂ is approximately normal.
– When not OK (small n or p̂ near 0 or 1): the normal approximation can be poor. Alternatives include:
  – Wilson (score) interval: better coverage for moderate samples.
  – Agresti–Coull (adjusted) interval: simple adjustment that improves accuracy.
  – Exact (Clopper–Pearson) interval: conservative but valid for any n.
  – Bootstrap percentile or bias‑corrected intervals: resampling-based, useful when analytic formulas fail.
– For means: use t-based CI (t-distribution) when the population standard deviation σ is unknown and sample size is modest; check for approximate normality of the underlying distribution or rely on the central limit theorem (CLT) for sufficiently large n.
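To make two of the alternatives concrete, here is a hedged sketch of the Wilson and Agresti–Coull intervals using their standard closed forms. The function names are illustrative, and the 1-success-in-10 data is a made-up small-sample case, not taken from this article.

```python
from math import sqrt
from statistics import NormalDist


def _z(level):
    """Two-sided normal critical value for the confidence level."""
    return NormalDist().inv_cdf(1 - (1 - level) / 2)


def wilson_ci(successes, n, level=0.95):
    """Wilson (score) interval for a binomial proportion."""
    z = _z(level)
    p_hat = successes / n
    denom = 1 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half


def agresti_coull_ci(successes, n, level=0.95):
    """Agresti-Coull interval: a Wald-style interval around an adjusted estimate.

    Its bounds can fall slightly outside [0, 1]; this sketch does not clip them.
    """
    z = _z(level)
    n_adj = n + z * z
    p_adj = (successes + z * z / 2) / n_adj
    half = z * sqrt(p_adj * (1 - p_adj) / n_adj)
    return p_adj - half, p_adj + half


# Hypothetical small sample: 1 success in 10 trials.
print("Wilson:        [%.4f, %.4f]" % wilson_ci(1, 10))
print("Agresti-Coull: [%.4f, %.4f]" % agresti_coull_ci(1, 10))
```

Both intervals are centered on the same adjusted estimate rather than on p̂, which is what pulls them away from 0 and 1 when the sample is small.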

Quick reference formulas
– Mean (σ known or for conceptual use):
  – SE = σ/√n
  – CI = x̄ ± z* × SE
– Mean (σ unknown, use sample s and t* with df = n−1):
  – SE = s/√n
  – CI = x̄ ± t* × SE
– Proportion:
  – SE = √[p̂(1−p̂)/n]
  – z-based CI = p̂ ± z* × SE
– Sample size for a desired margin of error E:
  – For a mean (σ known or estimated): n = (z* σ / E)²
  – For a proportion: n = z*² p(1−p) / E² (use p = 0.5 if unknown, to be conservative)

Worked numeric sample-size examples
1) Mean: want E = 2 units, estimated σ = 15, 95% confidence (z* = 1.96)
– n = (1.96 × 15 / 2)² = (14.7)² ≈ 216.1 → round up to n = 217.
2) Proportion: want E = 0.03 (3 percentage points), 95% confidence, conservative p = 0.5
– n = (1.96² × 0.25) / 0.03² ≈ 0.9604 / 0.0009 ≈ 1067.1 → round up to n = 1,068.
3) Finite population correction (when the population size N is not large):
– Adjusted n = n0 / [1 + (n0 − 1)/N], where n0 is the sample size computed without the correction.
– Example: if N = 5,000 and n0 = 1,068 then n_adj = 1,068 / (1 + 1,067/5,000) = 1,068 / 1.2134 ≈ 880.2 → round up to n = 881.
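The three calculations above can be reproduced with a few lines of Python. This is a sketch with illustrative function names; it uses the rounded z* = 1.96 from the examples and always rounds the result up to the next whole number.

```python
from math import ceil


def n_for_mean(sigma, margin, z_star=1.96):
    """Sample size so that z* × sigma/√n does not exceed the margin."""
    return ceil((z_star * sigma / margin) ** 2)


def n_for_proportion(margin, z_star=1.96, p=0.5):
    """Sample size for a proportion; p = 0.5 is the conservative default."""
    return ceil(z_star ** 2 * p * (1 - p) / margin ** 2)


def fpc_adjust(n0, population):
    """Finite population correction: n0 / (1 + (n0 - 1)/N), rounded up."""
    return ceil(n0 / (1 + (n0 - 1) / population))


n_mean = n_for_mean(sigma=15, margin=2)      # (14.7)^2 = 216.09 before rounding
n_prop = n_for_proportion(margin=0.03)       # about 1067.1 before rounding
n_adj = fpc_adjust(n_prop, population=5000)  # about 880.2 before rounding
print(n_mean, n_prop, n_adj)
```

Rounding up rather than to the nearest integer keeps the achieved margin of error at or below the target.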

Checklist: constructing and reporting a CI
1. State the parameter and confidence level (e.g., 95% CI for mean μ).
2. Choose the appropriate formula (z, t, proportion, bootstrap, etc.).
3. Verify assumptions (normality, sample size, independence, randomness).
4. Compute point estimate and SE; find critical value (z* or t*).
5. Compute CI and round sensibly; report units and sample size.
6. Interpret in frequentist terms; state limitations and assumptions.

Common misinterpretations (short)
– A 95% CI does NOT mean there’s a 95% probability the true value lies inside this one interval (frequentist view). It means the method produces intervals that contain the true value 95% of the time in repeated sampling.
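– Two overlapping CIs do not automatically imply that the difference is not statistically significant. Conversely, requiring non-overlap is a conservative test: it declares a difference less often than a direct test of the difference would.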
– Narrower CI = more precision, but does not imply accuracy if bias is present.
– Using the wrong formula (e.g., z when t is required, or normal approximation for tiny n) gives misleading intervals.
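The point about overlapping intervals is easy to see with numbers. The sketch below uses hypothetical group means and standard errors (invented for illustration), assumes the two samples are independent, and uses z-based intervals for simplicity.

```python
from math import sqrt

Z_95 = 1.96  # two-sided 95% critical value


def z_ci(estimate, se, z_star=Z_95):
    """Interval estimate ± z* × SE."""
    return estimate - z_star * se, estimate + z_star * se


# Hypothetical summary statistics for two independent groups.
mean_a, se_a = 10.0, 0.6
mean_b, se_b = 12.0, 0.6

ci_a = z_ci(mean_a, se_a)
ci_b = z_ci(mean_b, se_b)
intervals_overlap = ci_a[1] >= ci_b[0] and ci_b[1] >= ci_a[0]

# Direct interval for the difference: the SEs combine in quadrature.
se_diff = sqrt(se_a ** 2 + se_b ** 2)
ci_diff = z_ci(mean_b - mean_a, se_diff)

print("Group CIs overlap:", intervals_overlap)
print("95%% CI for the difference: [%.3f, %.3f]" % ci_diff)
```

Here the two group intervals overlap, yet the interval for the difference excludes zero, because the standard error of the difference is smaller than the sum of the two individual standard errors.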

Bootstrap percentile CI — quick steps
1. From your original sample of size n, draw B bootstrap samples (sample with replacement), each of size n (B = 1,000–10,000 typical).
2. For each bootstrap sample compute the statistic (mean, median, proportion, etc.).
3. Order the B bootstrap statistics and take the (α/2) and (1−α/2) percentiles for a (1−α)×100% CI (e.g., 2.5th and 97.5th percentiles for 95%).
4. Advantages: works with complex statistics and without strong parametric assumptions. Watch for bias and skew, and use a sufficiently large number of bootstrap replicates (B) to get stable percentile estimates.
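The steps above can be sketched in a few lines of standard-library Python. The function name `bootstrap_percentile_ci`, the fixed seed, and the simple nearest-rank percentile convention are assumptions of this sketch (other percentile conventions exist and give slightly different endpoints). The data is the ten-value sample from the worked example in this article.

```python
import random
from statistics import mean


def bootstrap_percentile_ci(data, stat=mean, level=0.95, n_boot=5000, seed=0):
    """Percentile bootstrap CI for `stat` using a nearest-rank convention."""
    rng = random.Random(seed)
    n = len(data)
    # Steps 1-2: resample with replacement and compute the statistic each time.
    boot_stats = sorted(stat(rng.choices(data, k=n)) for _ in range(n_boot))
    # Step 3: read off the alpha/2 and 1 - alpha/2 percentiles.
    alpha = 1 - level
    lo_idx = min(n_boot - 1, max(0, round(alpha / 2 * n_boot)))
    hi_idx = min(n_boot - 1, max(0, round((1 - alpha / 2) * n_boot) - 1))
    return boot_stats[lo_idx], boot_stats[hi_idx]


data = [5, 7, 9, 4, 6, 8, 10, 7, 6, 5]
low, high = bootstrap_percentile_ci(data)
print(f"Bootstrap percentile 95% CI for the mean: [{low:.2f}, {high:.2f}]")
```

Passing a different `stat` (for example `statistics.median`) reuses the same procedure for other statistics; the exact endpoints depend on the seed and on B.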

Disadvantages and common refinements
– Disadvantages: bootstrap percentiles can be biased (the bootstrap distribution may be shifted relative to the true sampling distribution) and can perform poorly with very small samples, extreme skew, or highly discrete data. The percentile method also ignores uncertainty in the standard error estimate.
– Bias‑corrected and accelerated (BCa) interval: adjusts for both median bias (shift) and skewness in the bootstrap distribution. It uses two correction constants (z0 for bias and a for acceleration) to shift the percentile cutoffs. The BCa interval often gives better coverage than the simple percentile method.
– Studentized (or bootstrap‑t) interval: computes a t‑like statistic for each bootstrap sample (statistic minus estimate, divided by estimated standard error), then uses percentiles of that distribution. This typically improves performance because it accounts for varying standard errors across samples.
– Choice among methods depends on sample size, skewness, and computation time. For many practical problems, BCa or studentized bootstrap is preferred over the plain percentile method.
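As one illustration of these refinements, here is a minimal sketch of the studentized (bootstrap-t) interval for a mean. The function name, the seed, the nearest-rank percentile convention, and the choice to skip degenerate resamples with zero standard deviation are all assumptions of this sketch. The data is the ten-value sample from the worked example in this article.

```python
import random
from math import sqrt
from statistics import mean, stdev


def bootstrap_t_ci(data, level=0.95, n_boot=2000, seed=0):
    """Studentized bootstrap CI for the mean."""
    rng = random.Random(seed)
    n = len(data)
    x_bar = mean(data)
    se_hat = stdev(data) / sqrt(n)
    t_stats = []
    for _ in range(n_boot):
        resample = rng.choices(data, k=n)
        s_b = stdev(resample)
        if s_b == 0:
            continue  # all values identical: the t-like statistic is undefined
        t_stats.append((mean(resample) - x_bar) / (s_b / sqrt(n)))
    t_stats.sort()
    m = len(t_stats)
    alpha = 1 - level
    q_lo = t_stats[min(m - 1, max(0, round(alpha / 2 * m)))]
    q_hi = t_stats[min(m - 1, max(0, round((1 - alpha / 2) * m) - 1))]
    # The quantiles swap sides: the upper quantile sets the lower endpoint.
    return x_bar - q_hi * se_hat, x_bar - q_lo * se_hat


data = [5, 7, 9, 4, 6, 8, 10, 7, 6, 5]
low, high = bootstrap_t_ci(data)
print(f"Bootstrap-t 95% CI for the mean: [{low:.2f}, {high:.2f}]")
```

Unlike the percentile method, the endpoints are built from the original estimate and its standard error, with the bootstrap supplying only the critical values.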

Worked numeric example (step‑by‑step)
Data (n = 10): 5, 7, 9, 4, 6, 8, 10, 7, 6, 5
1. Compute sample mean and sample standard deviation.
– Sum = 67, sample mean x̄ = 67 / 10 = 6.7.
– Compute squared deviations, sum = 32.10, sample variance s^2 = 32.10 / (n−1) = 32.10 / 9 ≈ 3.567, so s ≈ 1.889.
2. t‑based 95% CI for the mean (assume approximate normality):
– t_(0.025,9) ≈ 2.262.
– Margin = t × s / √n = 2.262 × 1.889 / √10 ≈ 1.35.
– CI: 6.7 ± 1.35 = [5.35, 8.05].
3. Bootstrap percentile CI (concept and numeric example)

Concept — The bootstrap percentile method constructs a confidence interval (CI) by resampling the data with replacement many times, computing the statistic of interest (here, the sample mean) on each resample, and taking empirical percentiles of the resulting bootstrap distribution. For a two‑sided 95% CI you take the 2.5th and 97.5th percentiles of the bootstrap sampling distribution of the mean.

Step‑by‑step (how you would do it in practice)
– Draw B bootstrap resamples from the original sample (each resample is n observations drawn with replacement). Typical choices: B = 2,000 to 10,000 (more for higher precision).
– For each resample b = 1,…,B compute the bootstrap mean x̄*b.
– Sort the B bootstrap means.
– The percentile 95% CI is [x̄*_(0.025