Strata Academy
P-values and confidence intervals
What they mean, what they do not mean, and how to report them
Quick answer
Report effect sizes with 95% confidence intervals, not p-values alone. A p-value describes compatibility with the null under a model – not the probability the finding is true or clinically important.
1. Start with the estimand, not the p-value
Every analysis estimates something: a difference in means, an odds ratio, a hazard ratio, or a risk difference. The estimand is the precise quantity the authors intend to measure – for example, the effect of assignment to treatment at 12 weeks in all randomised participants (intention-to-treat).
The effect size answers 'how big?'; the confidence interval answers 'how precisely?'; the p-value answers 'how compatible are these data with a null effect under the chosen model?'
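As a minimal sketch of how these three quantities relate, the Python below computes a difference in means, its 95% confidence interval, and a two-sided p-value from summary statistics. It assumes a large-sample normal (Wald) approximation. The function name and the trial numbers are hypothetical, chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist


def summarise_difference(mean_a, sd_a, n_a, mean_b, sd_b, n_b, level=0.95):
    """Return (effect, ci_low, ci_high, p_value) for mean_a - mean_b.

    Hypothetical helper; assumes a large-sample normal approximation.
    """
    std_normal = NormalDist()
    effect = mean_a - mean_b                      # how big?
    se = sqrt(sd_a**2 / n_a + sd_b**2 / n_b)      # standard error of the difference
    z_crit = std_normal.inv_cdf(0.5 + level / 2)  # about 1.96 for a 95% interval
    ci_low = effect - z_crit * se                 # how precisely?
    ci_high = effect + z_crit * se
    z = effect / se
    p_value = 2 * (1 - std_normal.cdf(abs(z)))    # compatibility with a null of zero
    return effect, ci_low, ci_high, p_value


# Hypothetical trial: mean change in systolic blood pressure (mmHg) at 12 weeks.
effect, ci_low, ci_high, p_value = summarise_difference(-8.0, 12.0, 100, -4.0, 12.0, 100)
print(f"difference = {effect:.1f} mmHg, "
      f"95% CI {ci_low:.1f} to {ci_high:.1f}, p = {p_value:.3f}")
```

Reporting all three values together lets a reader judge magnitude and precision rather than only which side of 0.05 the p-value fell on.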
Leading journals, the American Statistical Association, and Cochrane discourage decisions that rest solely on whether p < 0.05; they ask for effect sizes, intervals, and clinical context alongside any p-value.
In critical appraisal, a misaligned estimand – for example, a per-protocol analysis presented as primary when ITT was registered – is a more serious flaw than anything a single p-value can show.
2. What a p-value is (and is not)
Under a specified statistical model and null hypothesis, the p-value is the probability of obtaining results at least as extreme as those observed, if the null hypothesis were true and assumptions held.
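The definition can be made concrete with an exact calculation. The sketch below assumes a simple binomial model with a null success probability of 0.5, and sums the null probabilities of every outcome no more likely than the one observed. The helper name and the data are hypothetical.

```python
from math import comb


def exact_binomial_p(successes, n, null_prob=0.5):
    """Two-sided exact binomial p-value (hypothetical helper).

    Adds up the null probabilities of all outcomes that are no more likely
    than the observed one, i.e. "at least as extreme" under the null.
    """
    def pmf(k):
        return comb(n, k) * null_prob**k * (1 - null_prob)**(n - k)

    observed = pmf(successes)
    # The small relative tolerance guards against floating-point ties.
    return sum(pmf(k) for k in range(n + 1) if pmf(k) <= observed * (1 + 1e-9))


# Hypothetical data: 15 of 20 patients prefer treatment A; the null is no preference.
p_value = exact_binomial_p(15, 20)
print(f"p = {p_value:.4f}")  # about 0.041
```

Every probability in the sum is computed assuming the null is true, which is why the result cannot be read as the probability that the null itself is true.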
It is not the probability that the null hypothesis is true, nor that your finding is 'real', reproducible, or clinically important. Those require design quality, external evidence, and magnitude of effect.
A small p-value can arise from a tiny effect in a huge sample or from model misspecification. A large p-value can reflect low power rather than absence of effect.
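To see how sample size drives both situations, the sketch below approximates the power of a two-sided, two-sample z-test. It assumes equal arm sizes, a common known standard deviation, and normally distributed outcomes. The function name and both scenarios are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_power(true_diff, sd, n_per_arm, alpha=0.05):
    """Approximate power of a two-sided, two-sample z-test (hypothetical helper).

    Assumes equal arm sizes, a common known standard deviation, and normality.
    """
    std_normal = NormalDist()
    se = sd * sqrt(2 / n_per_arm)
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    shift = abs(true_diff) / se
    # Probability that the test statistic lands in either rejection region.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)


# Hypothetical scenarios sharing an outcome standard deviation of 10 units.
print(f"small trial, moderate effect: power = {approx_power(5.0, 10.0, 20):.2f}")
print(f"huge trial, tiny effect:      power = {approx_power(0.5, 10.0, 20000):.2f}")
```

Under these assumptions the small trial would miss a moderate effect most of the time, while the huge trial would almost always flag an effect one tenth the size.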
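The common misinterpretation 'p = 0.04 means 4% chance this is a fluke' is wrong and should not appear in coursework or presentations. The p-value is calculated on the assumption that the null is true, so it cannot also be the probability that chance alone produced the result.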
- p = 0.04 does not mean 96% probability the intervention works
- Non-significant does not mean 'no effect' – it often reflects wide intervals and insufficient precision
- Significant does not mean clinically meaningful – check absolute risk differences
- Multiple comparisons inflate false positives without adjustment or pre-specification
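The last point can be quantified with a short calculation. The sketch below assumes every test is independent and every null hypothesis is true, which is a simplification. Bonferroni is shown as one simple correction among several, and the function names are illustrative.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Chance of at least one false positive across independent true-null tests."""
    return 1 - (1 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold that keeps the family-wise rate at or below alpha."""
    return alpha / n_tests


for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: family-wise error rate = {familywise_error_rate(k):.2f}, "
          f"Bonferroni threshold = {bonferroni_threshold(k):.4f}")
```

With ten unadjusted tests at the 0.05 level, the chance of at least one false positive is roughly 40% under these assumptions.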
3. Confidence intervals
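A 95% confidence interval is the range of effect sizes most compatible with the data under the chosen model. The '95%' describes the procedure, not a single interval: if the study were repeated many times and the assumptions held, about 95% of intervals built this way would contain the true value. An interval should be reported with every main estimate.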
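The repeated-sampling reading can be checked by simulation. The sketch below draws many hypothetical studies from a normal population with a known standard deviation, builds a 95% interval for the mean in each, and counts how often the interval contains the true mean. All parameter values and the seed are arbitrary assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean


def simulate_coverage(true_mean=10.0, sd=2.0, n=30, n_studies=2000, seed=1):
    """Fraction of simulated 95% intervals that contain the true mean.

    Hypothetical setup: normal data with a known standard deviation.
    """
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(0.975)
    half_width = z_crit * sd / sqrt(n)
    covered = 0
    for _ in range(n_studies):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            covered += 1
    return covered / n_studies


coverage = simulate_coverage()
print(f"coverage = {coverage:.3f}")  # expected to be close to 0.95
```

Any one interval either contains the true value or does not; the 95% is a property of the method across repetitions.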
For a difference, if the 95% CI excludes zero, it typically aligns with a two-sided p < 0.05 under standard Wald tests. For ratios (OR, HR), exclusion of 1 is the analogous rule.
Ask whether the entire interval lies within a clinically meaningful range – not only whether it excludes the null. An OR of 0.95 (95% CI 0.90–1.00) may reach borderline 'significance' in a huge trial yet be trivial for practice.
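When a paper gives only a ratio and its interval, an approximate p-value can be recovered from them. The sketch below assumes a Wald interval that is symmetric on the log scale; because published limits are rounded, the result is approximate. The helper name is hypothetical.

```python
from math import log
from statistics import NormalDist


def p_from_ratio_ci(estimate, ci_low, ci_high, level=0.95):
    """Approximate two-sided p-value implied by a ratio estimate and its CI.

    Hypothetical helper; assumes a Wald interval symmetric on the log scale.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(0.5 + level / 2)
    se_log = (log(ci_high) - log(ci_low)) / (2 * z_crit)  # back out the standard error
    z = log(estimate) / se_log
    return 2 * (1 - std_normal.cdf(abs(z)))


# The odds ratio quoted above: 0.95 with limits 0.90 and 1.00.
p_value = p_from_ratio_ci(0.95, 0.90, 1.00)
print(f"p = {p_value:.3f}")
```

With these rounded figures the implied p-value sits close to 0.05, so an interval that just touches 1.00 is a borderline result, and the judgement about clinical relevance still rests on the size of the effect.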
Confidence intervals communicate imprecision, one of the domains for which GRADE downgrades certainty of evidence. Wide intervals around subgroup estimates often mean 'do not overinterpret'.
4. When appraising a paper
Check that primary outcomes were pre-specified in the protocol or registry. Post-hoc promotion of a secondary outcome undermines the p-value's interpretability.
Look for multiplicity – many secondary outcomes, subgroup analyses, or interim looks – without adjustment, hierarchical testing, or clear labelling as exploratory.
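To illustrate what an adjustment does to a set of secondary outcomes, the sketch below applies Holm's step-down procedure. Holm is used here only as one common choice (a given paper may use a different method), and the p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


# Hypothetical raw p-values for four secondary outcomes.
raw = [0.010, 0.040, 0.030, 0.200]
adjusted = holm_adjust(raw)
for p, adj in zip(raw, adjusted):
    print(f"raw p = {p:.3f}   Holm-adjusted p = {adj:.3f}")
```

In this made-up example three outcomes fall below 0.05 before adjustment and only one does afterwards, which is the kind of shift to look for when a paper reports many unadjusted comparisons.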
Prefer papers that report estimates and intervals in the abstract, not only 'p < 0.05' or 'significant'. AMA and CONSORT-aligned abstracts include numeric results with CIs.
Bayesian or alternative analyses should be presented transparently alongside frequentist results, with priors and sensitivity analyses stated if used.
- Are confidence intervals given for main comparisons?
- Is the primary outcome clearly identified and unchanged from registry?
- Are secondary analyses labelled as secondary or exploratory?
- Is missing data handled without inflating apparent precision?
- Do authors distinguish statistical from clinical significance?
5. Common student errors
- Equating non-significant with 'proves no difference' – absence of evidence is not evidence of absence.
- Ignoring baseline imbalance in small RCTs because 'p > 0.05 for baseline table'.
- Citing p-values from subgroup tests without interaction p-values.
- Accepting post-hoc per-protocol analysis as primary because it has a smaller p-value.
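On the subgroup point, an interaction test compares the subgroup estimates with each other rather than testing each against the null. The sketch below assumes two independent subgroups with approximately normal log hazard ratios. The helper name and all numbers are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist


def interaction_p(ratio_1, se_log_1, ratio_2, se_log_2):
    """Two-sided p-value for the difference between two subgroup log-ratios.

    Hypothetical helper; assumes independent subgroups and approximately
    normal log estimates.
    """
    diff = log(ratio_1) - log(ratio_2)
    se_diff = sqrt(se_log_1**2 + se_log_2**2)
    z = diff / se_diff
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical subgroups: HR 0.70 (SE of log HR 0.15) versus HR 0.95 (SE 0.16).
# Taken alone, the first subgroup is "significant" (z about -2.4); the second is not.
p_interaction = interaction_p(0.70, 0.15, 0.95, 0.16)
print(f"interaction p = {p_interaction:.2f}")
```

In this made-up case one subgroup crosses p < 0.05 and the other does not, yet the interaction test gives little support for a real difference between them.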