Strata Academy

P-values and confidence intervals

What they mean, what they do not mean, and how to report them

Quick answer

Report effect sizes with 95% confidence intervals, not p-values alone. A p-value describes compatibility with the null under a model – not the probability the finding is true or clinically important.

1. Start with the estimand, not the p-value

Every analysis estimates something: a difference in means, an odds ratio, a hazard ratio, or a risk difference. The estimand is the precise quantity the authors intend to measure – for example, the effect of assignment to treatment at 12 weeks in all randomised participants (intention-to-treat).

The effect size answers 'how big?'; the confidence interval answers 'how precisely?'; the p-value answers 'how compatible are these data with a null effect under the chosen model?'
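
As a minimal sketch of how the three quantities relate, the following computes a difference in means, its 95% confidence interval, and a two-sided p-value from the same standard error. The function name `summarise_difference` and the outcome scores are made up for illustration, and the normal (Wald) approximation is an assumption chosen to keep the example short; with samples this small a t-based interval would normally be preferred.

```python
import math
from statistics import NormalDist, mean, stdev

def summarise_difference(treated, control, level=0.95):
    """Return (difference in means, CI lower, CI upper, two-sided p) using a
    large-sample normal (Wald) approximation with unpooled variances."""
    diff = mean(treated) - mean(control)                       # how big?
    se = math.sqrt(stdev(treated) ** 2 / len(treated)
                   + stdev(control) ** 2 / len(control))
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)             # about 1.96 for 95%
    lower, upper = diff - z_crit * se, diff + z_crit * se      # how precisely?
    p = 2 * (1 - NormalDist().cdf(abs(diff / se)))             # compatibility with a null effect
    return diff, lower, upper, p

# Made-up outcome scores, for illustration only
treated = [12.1, 9.8, 11.4, 13.0, 10.7, 12.5, 11.9, 10.2]
control = [10.3, 9.1, 10.8, 9.9, 11.2, 8.7, 10.1, 9.6]
diff, lower, upper, p = summarise_difference(treated, control)
print(f"difference = {diff:.2f}, 95% CI {lower:.2f} to {upper:.2f}, p = {p:.3f}")
```

All three numbers come from the same estimate and standard error, which is why the p-value adds little once the estimate and interval are reported.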

Leading journals, the American Statistical Association, and Cochrane discourage basing decisions solely on whether p < 0.05; they ask for effect sizes, intervals, and clinical context as well.

In critical appraisal, misaligned estimands – per-protocol analysis presented as primary when ITT was registered – are a more serious flaw than any single p-value.

2. What a p-value is (and is not)

Under a specified statistical model and null hypothesis, the p-value is the probability of obtaining results at least as extreme as those observed, if the null hypothesis were true and assumptions held.
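
The definition can be made concrete with an exact calculation. The sketch below assumes a hypothetical crossover study in which 60 of 100 patients prefer treatment A, and a null hypothesis that preference is 50:50. The function name `binomial_two_sided_p` is illustrative; it simply adds up the null probabilities of every outcome at least as far from 50 as the one observed.

```python
from math import comb

def binomial_two_sided_p(successes, n):
    """Exact two-sided p-value for H0: success probability = 0.5.
    Sums the null probabilities of every count at least as far from n/2
    as the observed count."""
    observed_distance = abs(successes - n / 2)
    total = 0
    for k in range(n + 1):
        if abs(k - n / 2) >= observed_distance:
            total += comb(n, k)          # number of equally likely sequences with k successes
    return total / 2 ** n

p = binomial_two_sided_p(60, 100)
print(f"p = {p:.4f}")
```

For 60 of 100 this gives a p-value of about 0.057. Note what was conditioned on: the calculation assumes the null is true throughout, so its result cannot be the probability that the null is true.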

It is not the probability that the null hypothesis is true, nor that your finding is 'real', reproducible, or clinically important. Those require design quality, external evidence, and magnitude of effect.

A small p-value can arise from a tiny effect in a huge sample or from model misspecification. A large p-value can reflect low power rather than absence of effect.
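
A short sketch of the sample-size point, under simplifying assumptions (equal arm sizes, a common known standard deviation, normal approximation). The function name and all numbers are hypothetical.

```python
import math
from statistics import NormalDist

def two_sample_p(mean_diff, sd, n_per_arm):
    """Two-sided p-value for a difference in means, assuming a common known SD
    and equal arm sizes (normal approximation)."""
    se = sd * math.sqrt(2 / n_per_arm)
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Trivial effect (0.02 SD) in a very large trial
p_tiny_effect = two_sample_p(mean_diff=0.02, sd=1.0, n_per_arm=100_000)
# Substantial effect (0.5 SD) in a very small trial
p_large_effect = two_sample_p(mean_diff=0.50, sd=1.0, n_per_arm=10)

print(f"0.02 SD, n = 100000 per arm: p = {p_tiny_effect:.2g}")
print(f"0.50 SD, n = 10 per arm:     p = {p_large_effect:.2g}")
```

The trivial effect comes out far below 0.05 and the substantial one well above it, so the p-value on its own says little about magnitude.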

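The common misinterpretation 'p = 0.04 means a 4% chance this is a fluke' is wrong: the p-value is calculated assuming the null hypothesis is true, so it cannot also be the probability that chance alone produced the result. This phrasing should not appear in coursework or presentations.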

3. Confidence intervals

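A 95% confidence interval is a range of effect sizes compatible with the data and model. The '95%' describes the procedure under repeated sampling: if the study were repeated many times and an interval computed the same way each time, about 95% of those intervals would contain the true value. It is not a 95% probability that this particular interval contains the truth. An interval should be reported with every main estimate.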
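
The repeated-sampling idea can be checked with a small simulation. This is a sketch under stated assumptions: normally distributed data, a known standard deviation (so the interval is a simple z-interval), and arbitrary illustrative values for the true mean, sample size, and random seed.

```python
import math
import random
from statistics import NormalDist, mean

def coverage(true_mean=5.0, sd=2.0, n=30, repeats=2000, level=0.95, seed=1):
    """Share of simulated studies whose confidence interval contains the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    half_width = z_crit * sd / math.sqrt(n)    # known-SD interval keeps the sketch simple
    hits = 0
    for _ in range(repeats):
        sample = [rng.gauss(true_mean, sd) for _ in range(n)]
        m = mean(sample)
        if m - half_width <= true_mean <= m + half_width:
            hits += 1
    return hits / repeats

print(f"share of intervals containing the true mean: {coverage():.3f}")
```

The share should land close to 0.95. Any single interval either contains the true mean or does not; the 95% belongs to the long-run behaviour of the method.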

For a difference, if the 95% CI excludes zero, it typically aligns with a two-sided p < 0.05 under standard Wald tests. For ratios (OR, HR), exclusion of 1 is the analogous rule.
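
For a ratio measure the same correspondence holds on the log scale. The sketch below computes an odds ratio, its Wald 95% interval, and the matching Wald p-value from a 2x2 table. The function name `odds_ratio_wald` and the trial counts are hypothetical.

```python
import math
from statistics import NormalDist

def odds_ratio_wald(a, b, c, d, level=0.95):
    """a, b = events / non-events on treatment; c, d = events / non-events on control.
    Returns (OR, CI lower, CI upper, two-sided Wald p). All cells must be > 0."""
    log_or = math.log((a * d) / (b * c))
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)    # standard error of the log odds ratio
    z_crit = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(log_or - z_crit * se)
    upper = math.exp(log_or + z_crit * se)
    p = 2 * (1 - NormalDist().cdf(abs(log_or) / se))
    return math.exp(log_or), lower, upper, p

# Hypothetical trial: 30/100 events on treatment, 45/100 on control
or_, lower, upper, p = odds_ratio_wald(30, 70, 45, 55)
excludes_null = lower > 1 or upper < 1
print(f"OR = {or_:.2f}, 95% CI {lower:.2f} to {upper:.2f}, p = {p:.3f}")
print("interval excludes 1:", excludes_null, "| p < 0.05:", p < 0.05)
```

Because the interval and the test here share one standard error, the two checks always agree. With other methods (exact or likelihood-based tests, for instance) small disagreements near the boundary are possible, which is why the text says 'typically'.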

Ask whether the entire interval lies within a clinically acceptable range, not only whether it excludes the null. An OR of 0.95 (95% CI 0.91–0.99) may be 'significant' in a huge trial but trivial for practice.
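
One way to make this habit explicit is to compare both ends of the interval with a pre-stated threshold for clinical importance. The sketch below is illustrative only: the function name, the example intervals, and the threshold of 0.80 are assumptions, and in practice the threshold has to come from clinical judgement rather than from the data.

```python
def appraise_ratio_interval(lower, upper, important_ratio, null=1.0):
    """Classify a CI for a ratio measure where values below 1 indicate benefit.
    important_ratio is the smallest benefit judged clinically worthwhile
    (hypothetical here, e.g. 0.80)."""
    significant = upper < null or lower > null
    if upper <= important_ratio:
        relevance = "whole interval shows an important benefit"
    elif lower > important_ratio:
        relevance = "even the most favourable bound falls short of an important benefit"
    else:
        relevance = "interval includes both important and unimportant effects"
    return significant, relevance

# Hypothetical intervals, all judged against an assumed threshold of OR 0.80
print(appraise_ratio_interval(0.92, 0.98, important_ratio=0.80))
print(appraise_ratio_interval(0.55, 0.75, important_ratio=0.80))
print(appraise_ratio_interval(0.70, 1.10, important_ratio=0.80))
```

The first interval is statistically significant yet clinically unimportant under the assumed threshold, and the third is inconclusive rather than negative.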

Confidence intervals communicate imprecision – a key GRADE downgrade domain. Wide intervals around subgroup estimates often mean 'do not overinterpret'.

4. When appraising a paper

Check that primary outcomes were pre-specified in the protocol or registry. Post-hoc promotion of a secondary outcome undermines the p-value's interpretability.

Look for multiplicity – many secondary outcomes, subgroup analyses, or interim looks – without adjustment, hierarchical testing, or clear labelling as exploratory.
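
To see why unadjusted multiplicity matters, the sketch below computes the chance of at least one false positive across several independent tests, then applies the Holm step-down adjustment as one example of a correction. Holm is chosen here only for illustration; a paper may legitimately use Bonferroni, hierarchical testing, or another pre-specified approach. The p-values are made up.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent tests
    when every null hypothesis is true."""
    return 1 - (1 - alpha) ** n_tests

def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # indices from smallest p to largest
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[i])
        running_max = max(running_max, candidate)         # keep adjusted values monotone
        adjusted[i] = running_max
    return adjusted

print(f"10 unadjusted tests: {familywise_error_rate(10):.3f} chance of a false positive")
raw = [0.01, 0.04, 0.03, 0.20]                            # made-up secondary-outcome p-values
print([round(p, 3) for p in holm_adjust(raw)])
```

With ten unadjusted tests the chance of at least one spurious 'significant' result is about 40%. In the made-up example, only the smallest p-value stays below 0.05 after adjustment.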

Prefer papers that report estimates and intervals in the abstract, not only 'p < 0.05' or 'significant'. AMA and CONSORT-aligned abstracts include numeric results with CIs.

Bayesian or alternative analyses should be presented transparently alongside frequentist results, with priors and sensitivity analyses stated if used.

5. Common student errors

Equating non-significant with 'proves no difference' – absence of evidence is not evidence of absence.

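Dismissing baseline imbalance in a small RCT because the baseline table shows p > 0.05. Significance tests of baseline characteristics in a randomised trial are uninformative, and an imbalance in a prognostic factor can still distort the estimate.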

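Citing within-subgroup p-values ('significant in one subgroup, not in the other') without a test of interaction. Only the interaction p-value addresses whether the effect differs between subgroups.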

Accepting post-hoc per-protocol analysis as primary because it has a smaller p-value.
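
The subgroup error above can be illustrated with a simple interaction test: compare the two subgroup estimates directly instead of comparing their separate p-values. This is a sketch assuming two independent subgroups and approximately normal estimates; the function name, the log odds ratios, and the standard errors are hypothetical.

```python
import math
from statistics import NormalDist

def two_sided_p(z):
    """Two-sided p-value for a standard normal test statistic."""
    return 2 * (1 - NormalDist().cdf(abs(z)))

def interaction_p(estimate_a, se_a, estimate_b, se_b):
    """Two-sided p-value for the difference between two independent subgroup
    estimates (use log OR or log HR for ratio measures)."""
    diff = estimate_a - estimate_b
    se_diff = math.sqrt(se_a ** 2 + se_b ** 2)
    return two_sided_p(diff / se_diff)

# Hypothetical log odds ratios and standard errors in two subgroups
p_women = two_sided_p(-0.50 / 0.22)      # 'significant' on its own
p_men = two_sided_p(-0.25 / 0.24)        # 'not significant' on its own
p_int = interaction_p(-0.50, 0.22, -0.25, 0.24)

print(f"women: p = {p_women:.3f}, men: p = {p_men:.3f}, interaction: p = {p_int:.3f}")
```

One subgroup crosses the 0.05 line and the other does not, yet the interaction test gives no support for a genuine difference in effect between them.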
