Strata Academy

P-Values vs Confidence Intervals — What to Report

What they mean, what they do not mean, and how to report them

Quick answer

Report effect sizes with 95% confidence intervals, not p-values alone. A p-value describes how compatible the data are with the null hypothesis under a model – it is not the probability that the finding is true, and it says nothing about clinical importance.

Estimand first — know whether ITT or per-protocol analysis matches the question.
Confidence intervals show magnitude and precision; p-values do not.
Multiplicity and post-hoc outcomes undermine p-value interpretation.
Non-significant results can still be compatible with clinically important effects when CIs are wide.

1. Start with the estimand, not the p-value

Every analysis estimates something: a difference in means, an odds ratio, a hazard ratio, or a risk difference. The estimand is the precise quantity the authors intend to measure – for example, the effect of assignment to treatment at 12 weeks in all randomised participants (intention-to-treat).

The effect size answers 'how big?'; the confidence interval answers 'how precisely?'; the p-value answers 'how compatible are these data with a null effect under the chosen model?'
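As a minimal sketch of how the three quantities sit side by side, the Python below computes a risk difference, its Wald 95% confidence interval, and a two-sided p-value from a two-arm trial. The function name and the event counts are hypothetical, chosen only for illustration, and the normal approximation is an assumption that suits reasonably large samples.

```python
from math import sqrt
from statistics import NormalDist

def risk_difference(events_t, n_t, events_c, n_c, level=0.95):
    """Risk difference (treatment minus control), Wald CI, two-sided p-value."""
    p_t = events_t / n_t
    p_c = events_c / n_c
    rd = p_t - p_c  # effect size: how big?
    # Unpooled standard error, used for both the interval and the test
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (rd - z_crit * se, rd + z_crit * se)  # interval: how precisely?
    z = rd / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))  # compatibility with a null effect
    return rd, ci, p_value

# Hypothetical counts: 30/200 events on treatment, 45/200 on control
rd, (lo, hi), p = risk_difference(30, 200, 45, 200)
print(f"RD = {rd:.3f}, 95% CI {lo:.3f} to {hi:.3f}, p = {p:.3f}")
```

Reporting all three numbers together lets a reader judge magnitude and precision rather than only which side of 0.05 the p-value falls on.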

Leading journals, the American Statistical Association, and Cochrane discourage decisions based solely on whether p < 0.05 without effect sizes, intervals, and clinical context.

In critical appraisal, misaligned estimands – per-protocol analysis presented as primary when ITT was registered – are a more serious flaw than any single p-value.

2. What a p-value is (and is not)

Under a specified statistical model and null hypothesis, the p-value is the probability of obtaining results at least as extreme as those observed, if the null hypothesis were true and assumptions held.

It is not the probability that the null hypothesis is true, nor that your finding is 'real', reproducible, or clinically important. Those require design quality, external evidence, and magnitude of effect.

A small p-value can arise from a tiny effect in a huge sample or from model misspecification. A large p-value can reflect low power rather than absence of effect.
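The dependence on sample size can be illustrated with a short sketch. It assumes a two-arm comparison of means with a known common standard deviation (a simplification) and uses hypothetical numbers: a trivial 0.5-point difference in a very large trial, and a 5-point difference in a very small one.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p(mean_diff, sd, n_per_arm):
    """Two-sided z-test p-value for a difference in means, SD assumed known."""
    se = sd * sqrt(2 / n_per_arm)
    z = mean_diff / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Tiny effect, huge sample: 0.5 points, SD 10, 20,000 per arm
p_tiny_effect = two_sided_p(0.5, 10, 20_000)
# Larger effect, small sample: 5 points, SD 10, 20 per arm
p_large_effect = two_sided_p(5, 10, 20)

print(f"tiny effect, huge trial:   p = {p_tiny_effect:.2g}")
print(f"large effect, small trial: p = {p_large_effect:.2g}")
```

The smaller p-value belongs to the smaller effect, which is why the p-value cannot stand in for magnitude.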

The common misinterpretation 'p = 0.04 means 4% chance this is a fluke' is wrong and should not appear in coursework or presentations.

p = 0.04 does not mean 96% probability the intervention works
Non-significant does not mean 'no effect' – it often reflects wide intervals and insufficient precision
Significant does not mean clinically meaningful – check absolute risk differences
Multiple comparisons inflate false positives without adjustment or pre-specification
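To put a number on the multiplicity point, the sketch below computes the chance of at least one false positive across several tests when every null hypothesis is true. It assumes the tests are independent, which real outcomes rarely are, so treat the figures as a rough guide rather than an exact rate.

```python
def familywise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive across independent tests."""
    return 1 - (1 - alpha) ** n_tests

for k in (1, 5, 10, 20):
    print(f"{k:>2} tests: {familywise_error_rate(k):.2f}")
```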

Note: Do not treat 0.049 and 0.051 as fundamentally different categories without context.

3. Confidence intervals

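A 95% confidence interval is a range of effect sizes compatible with the data under the model. The '95%' describes the procedure: across repeated samples, 95% of intervals built this way would contain the true effect. It does not mean there is a 95% probability that this particular interval contains it. Report an interval with every main estimate.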
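The repeated-sampling idea can be checked by simulation. The sketch below draws many samples from a normal distribution with a known mean, builds a 95% interval from each, and counts how often the interval contains the true mean. Treating the standard deviation as known and fixing the random seed are simplifying assumptions; all parameter values are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(true_mean=0.0, sd=1.0, n=30, trials=4000, level=0.95, seed=1):
    """Share of simulated intervals that contain the true mean."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    half_width = z_crit * sd / sqrt(n)  # SD treated as known to keep the sketch simple
    hits = 0
    for _ in range(trials):
        sample_mean = fmean(rng.gauss(true_mean, sd) for _ in range(n))
        if sample_mean - half_width <= true_mean <= sample_mean + half_width:
            hits += 1
    return hits / trials

print(f"Coverage over repeated samples: {coverage():.3f}")
```

The share should land close to 0.95. That figure is a property of the procedure over many samples, not of any single interval.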

For a difference, if the 95% CI excludes zero, it typically aligns with a two-sided p < 0.05 under standard Wald tests. For ratios (OR, HR), exclusion of 1 is the analogous rule.
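For ratios the interval is usually built on the log scale and then exponentiated. A minimal sketch for an odds ratio from a 2x2 table, using the standard Wald formula and two hypothetical tables, shows the interval and the p-value agreeing about the null value of 1:

```python
from math import exp, log, sqrt
from statistics import NormalDist

def odds_ratio_wald(a, b, c, d, level=0.95):
    """a, b = events and non-events on treatment; c, d = the same for control."""
    log_or = log((a * d) / (b * c))
    se = sqrt(1 / a + 1 / b + 1 / c + 1 / d)  # standard error of the log odds ratio
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    ci = (exp(log_or - z_crit * se), exp(log_or + z_crit * se))
    p_value = 2 * (1 - NormalDist().cdf(abs(log_or / se)))
    return exp(log_or), ci, p_value

# Two hypothetical trials with the same control arm
for table in [(30, 170, 45, 155), (20, 180, 45, 155)]:
    or_, (lo, hi), p = odds_ratio_wald(*table)
    excludes_one = hi < 1 or lo > 1
    print(f"OR {or_:.2f} ({lo:.2f} to {hi:.2f}), p = {p:.3f}, excludes 1: {excludes_one}")
```

The agreement is exact here because the interval and the test share one standard error. Intervals and tests built by different methods can disagree near the boundary.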

Ask whether the entire interval lies within a clinically acceptable range – not only whether it excludes the null. An OR of 0.95 (95% CI 0.91–0.99) may be 'significant' in a huge trial but trivial for practice.
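One way to make this check explicit is to compare the interval against a threshold for the smallest effect that would change practice. The sketch below does this for a ratio measure where values below 1 favour treatment. The function name, the labels, and the threshold of 0.80 are hypothetical; the threshold is a clinical judgement, not a statistical output.

```python
def appraise_ratio_interval(lower, upper, important_at_or_below, null_value=1.0):
    """Appraise a CI for a ratio measure where values below 1 favour treatment.

    important_at_or_below is a reader-chosen threshold: the weakest effect
    that would still change practice (for example 0.80).
    """
    excludes_null = upper < null_value or lower > null_value
    if upper <= important_at_or_below:
        clinical = "whole interval is clinically important"
    elif lower > important_at_or_below:
        clinical = "whole interval is smaller than clinically important"
    else:
        clinical = "interval spans the threshold: inconclusive"
    return excludes_null, clinical

# Hypothetical intervals, each judged against a threshold of 0.80
for lower, upper in [(0.91, 0.99), (0.55, 0.78), (0.60, 1.10)]:
    print((lower, upper), appraise_ratio_interval(lower, upper, 0.80))
```

The first interval excludes 1 yet sits entirely above the threshold, so it is statistically significant but too small to matter. The third does not exclude 1 but is still compatible with an important benefit.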

Confidence intervals communicate imprecision – a key GRADE downgrade domain. Wide intervals around subgroup estimates often mean 'do not overinterpret'.

4. When appraising a paper

Check that primary outcomes were pre-specified in the protocol or registry. Post-hoc promotion of a secondary outcome undermines the p-value's interpretability.

Look for multiplicity – many secondary outcomes, subgroup analyses, or interim looks – without adjustment, hierarchical testing, or clear labelling as exploratory.
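When appraising whether an adjustment was applied, it can help to see what one looks like. The sketch below implements Holm's step-down method, one common choice among several; it is not a method the guidance above prescribes, and the p-values are hypothetical.

```python
def holm_adjust(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])  # indices, smallest p first
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        # Multiply by the number of hypotheses not yet tested, cap at 1
        candidate = min(1.0, (m - rank) * p_values[idx])
        # Enforce monotonicity: an adjusted p never falls below an earlier one
        running_max = max(running_max, candidate)
        adjusted[idx] = running_max
    return adjusted

raw = [0.01, 0.04, 0.03, 0.20]  # hypothetical secondary-outcome p-values
print([round(p, 2) for p in holm_adjust(raw)])
```

Three of the four raw p-values are below 0.05, but only the smallest remains below 0.05 after adjustment.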

Prefer papers that report estimates and intervals in the abstract, not only 'p < 0.05' or 'significant'. AMA and CONSORT-aligned abstracts include numeric results with CIs.

Bayesian or alternative analyses should be presented transparently alongside frequentist results, with priors and sensitivity analyses stated if used.

Are confidence intervals given for main comparisons?
Is the primary outcome clearly identified and unchanged from registry?
Are secondary analyses labelled as secondary or exploratory?
Is missing data handled without inflating apparent precision?
Do authors distinguish statistical from clinical significance?

5. Common student errors

Equating non-significant with 'proves no difference' – absence of evidence is not evidence of absence.

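Dismissing baseline imbalance in small RCTs because the baseline table shows 'p > 0.05' – small trials have little power to detect imbalance, so judge the size of the difference instead.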

Citing p-values from subgroup tests without interaction p-values.

Accepting post-hoc per-protocol analysis as primary because it has a smaller p-value.
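On the subgroup error above: the relevant question is whether the effect differs between subgroups, not whether each subgroup is separately significant. A minimal sketch of an interaction test for two ratio estimates follows. It recovers each standard error from the reported 95% CI and assumes the subgroups contain different participants, so the estimates are independent. The hazard ratios are hypothetical.

```python
from math import log, sqrt
from statistics import NormalDist

def se_from_ci(lower, upper, level=0.95):
    """Standard error of a log ratio, recovered from its confidence interval."""
    z_crit = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return (log(upper) - log(lower)) / (2 * z_crit)

def interaction_p(ratio_a, ci_a, ratio_b, ci_b):
    """Two-sided p-value for a difference between two independent log ratios."""
    diff = log(ratio_a) - log(ratio_b)
    se = sqrt(se_from_ci(*ci_a) ** 2 + se_from_ci(*ci_b) ** 2)
    return 2 * (1 - NormalDist().cdf(abs(diff / se)))

# Hypothetical subgroups: HR 0.70 (0.50 to 0.98) and HR 0.95 (0.70 to 1.29)
p_int = interaction_p(0.70, (0.50, 0.98), 0.95, (0.70, 1.29))
print(f"interaction p = {p_int:.2f}")
```

Here one subgroup interval excludes 1 and the other does not, yet the interaction test gives little evidence that the two effects differ.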

Frequently asked questions

Should I report p-values in my dissertation?

Report effect estimates with 95% confidence intervals as the primary result. P-values may accompany intervals but should not replace them. Many supervisors and journals follow AMA and Cochrane guidance favouring estimation over dichotomous significance.

What does p = 0.06 mean?

Under the null hypothesis and model assumptions, results at least this extreme would occur 6% of the time. It does not mean there is a 94% probability the effect is real, nor that the effect is clinically trivial.

Why do confidence intervals matter more than p-values?

Intervals show effect size and precision — whether the finding is clinically meaningful and how much uncertainty remains. P-values alone hide magnitude and are sensitive to sample size.

How does StrataResearch use p-values?

StrataResearch emphasises effect sizes, confidence intervals, and pre-specified outcomes in statistical feedback. A significant p-value with wide intervals or post-hoc outcomes still triggers warnings in appraisal output.
