Strata Academy
Sample size and statistical power
Why size matters for interpreting negative and positive findings
Quick answer
Power is the chance of detecting a real effect of a pre-specified size. Underpowered studies produce inconclusive 'negative' results: wide confidence intervals (CIs), not proof of no difference.
1. Core concepts
Power is the probability of detecting an effect of a given size if it truly exists, usually set at 80% or 90%. Underpowered studies are common in clinical research.
Pre-specified sample size calculations should state the primary outcome, expected effect size, alpha, power, and assumed control event rate or standard deviation.
Effect size in the calculation should be clinically meaningful – detecting a trivial difference with huge n is statistically possible but wasteful.
Type I error (α) is the false-positive rate (typically 0.05). Type II error (β) is the false-negative rate; power = 1 − β.
- Type I error (α): false positive – usually set at 0.05
- Type II error (β): false negative – power = 1 − β
- Effect size should be clinically meaningful, not only statistically detectable
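As a rough illustration of how the inputs listed above combine, the sketch below uses the standard normal-approximation formulas for a two-sided comparison of two equal-sized groups. The function names and the example numbers (a 5-point difference with SD 10, event rates of 30% versus 20%) are illustrative assumptions, not values from any particular trial; exact methods based on the t distribution or continuity corrections give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def n_per_group_means(delta, sd, alpha=0.05, power=0.80):
    """Approximate n per group to detect a mean difference `delta` (two-sided test)."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)  # critical value for the type I error rate
    z_beta = _Z.inv_cdf(power)           # quantile matching power = 1 - beta
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


def n_per_group_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Approximate n per group to detect a difference between two event rates."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)


print(n_per_group_means(delta=5, sd=10))              # 63
print(n_per_group_means(delta=5, sd=10, power=0.90))  # 85
print(n_per_group_means(delta=1, sd=10))              # 1570
print(n_per_group_proportions(0.30, 0.20))            # 291
```

The third call shows why the target effect should be clinically meaningful: shrinking the target difference from 5 to 1 multiplies the required sample size by about 25.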
2. What to check in a manuscript
Was sample size justified before recruitment began? Post-hoc power calculations after a non-significant result are generally discouraged: observed power is a direct function of the p-value, so it adds no new information and cannot rescue an inconclusive study.
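One way to see why post-hoc power is uninformative: if the observed effect is plugged back in as the "true" effect, the resulting power depends only on the p-value. The sketch below assumes a two-sided z-test; the function name is hypothetical.

```python
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def observed_power_from_p(p_value, alpha=0.05):
    """'Observed' power of a two-sided z-test, computed from the p-value alone."""
    z_obs = _Z.inv_cdf(1 - p_value / 2)   # |z| statistic implied by the p-value
    z_crit = _Z.inv_cdf(1 - alpha / 2)    # critical value at the chosen alpha
    # Chance of crossing either critical boundary if the true effect equalled the observed one
    return _Z.cdf(z_obs - z_crit) + _Z.cdf(-z_obs - z_crit)


for p in (0.05, 0.20, 0.50):
    print(p, round(observed_power_from_p(p), 2))
```

A result that sits exactly at p = 0.05 always has observed power of about 50%, and any non-significant result has less, so reporting low post-hoc power after a null finding only restates the p-value.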
For cluster trials, crossover designs, or non-inferiority trials, specialised methods apply. A generic 'n per group' statement may not reveal whether the sample size was inflated for clustering.
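For cluster trials, the usual inflation factor is the design effect, 1 + (m − 1) × ICC, where m is the average cluster size and ICC is the intracluster correlation coefficient. The sketch below applies it to an individually randomised sample size; the function names and the example values (63 per group, clusters of 20, ICC 0.05) are illustrative assumptions, and the formula assumes equal cluster sizes.

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation from randomising clusters rather than individuals."""
    return 1 + (cluster_size - 1) * icc


def inflate_for_clustering(n_individual, cluster_size, icc):
    """Sample size per arm after applying the design effect."""
    return ceil(n_individual * design_effect(cluster_size, icc))


print(design_effect(cluster_size=20, icc=0.05))
print(inflate_for_clustering(n_individual=63, cluster_size=20, icc=0.05))  # 123
```

Even a small ICC nearly doubles the requirement here, which is why a manuscript that reports only an unadjusted 'n per group' for a cluster design deserves a closer look.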
If the authors claim a multiplicity adjustment for several primary outcomes, the sample size calculation should use the adjusted alpha, not 0.05.
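A minimal sketch of the effect of a multiplicity adjustment on sample size, assuming a Bonferroni correction (the simplest option; a given trial may use a different method) and the normal-approximation formula for comparing two means. Function names and example values are hypothetical.

```python
from math import ceil
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def bonferroni_alpha(alpha, n_outcomes):
    """Per-outcome alpha when the overall alpha is split equally across outcomes."""
    return alpha / n_outcomes


def n_per_group_means(delta, sd, alpha, power=0.80):
    """Approximate n per group for a two-sided comparison of two means."""
    z_alpha = _Z.inv_cdf(1 - alpha / 2)
    z_beta = _Z.inv_cdf(power)
    return ceil(2 * ((z_alpha + z_beta) * sd / delta) ** 2)


n_single = n_per_group_means(delta=5, sd=10, alpha=0.05)
n_adjusted = n_per_group_means(delta=5, sd=10, alpha=bonferroni_alpha(0.05, 2))
print(n_single, n_adjusted)
```

With two primary outcomes, the adjusted calculation needs roughly a fifth more participants per group, so a sample size computed at alpha = 0.05 does not support an analysis that tests each outcome at 0.025.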
Early stopping rules, if used, should be pre-specified with statistical monitoring plans.
- Recruitment stopped early for benefit? (can inflate false positives if not planned)
- Intention-to-treat with high dropout – effective sample size may be smaller
- Subgroup analyses often underpowered by default
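The dropout and subgroup points in the checklist above can be made concrete with a short sketch. It assumes a two-sided z-test comparing two means and the common rule of thumb of dividing the required sample size by (1 − dropout rate); function names and numbers are illustrative only.

```python
from math import ceil, sqrt
from statistics import NormalDist

_Z = NormalDist()  # standard normal distribution


def power_two_means(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided z-test for a difference in means."""
    z_crit = _Z.inv_cdf(1 - alpha / 2)
    shift = delta / (sd * sqrt(2 / n_per_group))  # true effect in standard-error units
    return _Z.cdf(shift - z_crit) + _Z.cdf(-shift - z_crit)


def recruitment_target(n_required, dropout_rate):
    """Number to recruit per group so that n_required are expected to remain."""
    return ceil(n_required / (1 - dropout_rate))


print(round(power_two_means(63, delta=5, sd=10), 2))  # whole trial
print(round(power_two_means(21, delta=5, sd=10), 2))  # subgroup with a third of participants
print(recruitment_target(63, dropout_rate=0.15))      # 75
```

A trial planned for 80% power on the full sample has well under 50% power for the same effect in a subgroup one third its size, which is why null subgroup findings carry little weight.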
3. Interpreting 'negative' trials
A non-significant p-value with wide confidence intervals is inconclusive, not proof of no difference. Ask whether the interval excludes clinically important harm or benefit.
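The question of whether an interval excludes a clinically important difference can be sketched as follows. The example uses a Wald confidence interval for a risk difference; the trial counts and the threshold of 10 percentage points for a "clinically important" difference are invented for illustration, and the function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_t, n_t, events_c, n_c, level=0.95):
    """Wald CI for the risk difference (treatment minus control)."""
    p_t, p_c = events_t / n_t, events_c / n_c
    diff = p_t - p_c
    se = sqrt(p_t * (1 - p_t) / n_t + p_c * (1 - p_c) / n_c)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff, diff - z * se, diff + z * se


def interpret(lower, upper, important_diff):
    """Compare a CI with the smallest difference judged clinically important."""
    if -important_diff < lower and upper < important_diff:
        return "important difference excluded"
    if lower >= important_diff or upper <= -important_diff:
        return "important difference shown"
    return "inconclusive"


# Two hypothetical 'negative' trials, both with p > 0.05
_, lo_small, hi_small = risk_difference_ci(12, 50, 15, 50)
_, lo_large, hi_large = risk_difference_ci(1200, 5000, 1250, 5000)
print(interpret(lo_small, hi_small, important_diff=0.10))  # inconclusive
print(interpret(lo_large, hi_large, important_diff=0.10))  # important difference excluded
```

Both trials are non-significant, but only the larger one supports a claim that the treatments do not differ to a clinically important degree.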
Non-inferiority and equivalence trials use different framing – they aim to show an effect is not worse than a margin, not that two treatments are identical.
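The non-inferiority framing can be expressed as a simple decision rule on the confidence interval. This sketch assumes the difference is defined as new minus standard on an outcome where higher is better, so that the lower CI bound is the one that matters; the function name and the margin of 0.05 are hypothetical.

```python
def noninferiority_verdict(ci_lower, margin):
    """Classify a trial from the lower CI bound of (new - standard).

    `margin` is the pre-specified non-inferiority margin, given as a positive number.
    """
    if ci_lower > 0:
        return "superior"
    if ci_lower > -margin:
        return "non-inferior"
    return "non-inferiority not shown"


print(noninferiority_verdict(ci_lower=-0.03, margin=0.05))  # non-inferior
print(noninferiority_verdict(ci_lower=-0.08, margin=0.05))  # non-inferiority not shown
print(noninferiority_verdict(ci_lower=0.01, margin=0.05))   # superior
```

The conclusion depends entirely on the margin, so a reviewer should check that it was pre-specified and clinically justified.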
Observational studies rarely have formal power calculations; imprecision should be discussed via CI width and clinical context.
CASP and RoB 2 both intersect here: was the study large enough to answer the question the authors later claim?
4. Common mistakes
Treating p > 0.05 as 'no effect' without inspecting CIs.
Accepting post-hoc power as reassurance after a null result.
Ignoring dropout when interpreting ITT results.
Assuming a significant result from an underpowered study is robust – winner's curse can inflate effect sizes.
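The winner's curse can be illustrated with a small simulation. The sketch below assumes a true mean difference of 2 with SD 10 and only 20 participants per group (all invented values), draws each trial's estimate directly from its sampling distribution, and keeps only the trials that reach significance under a known-variance z-test.

```python
import random
from math import sqrt
from statistics import NormalDist, mean


def simulate_winners_curse(true_effect=2.0, sd=10.0, n_per_group=20,
                           n_trials=5000, alpha=0.05, seed=1):
    """Return (share of significant trials, mean estimate among significant trials)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    se = sd * sqrt(2 / n_per_group)  # standard error of the difference in means
    significant = []
    for _ in range(n_trials):
        estimate = rng.gauss(true_effect, se)  # one simulated trial result
        if abs(estimate / se) > z_crit:        # two-sided z-test
            significant.append(estimate)
    return len(significant) / n_trials, mean(significant)


share_significant, mean_significant = simulate_winners_curse()
print(f"share of trials reaching significance: {share_significant:.2f}")
print(f"mean estimate among significant trials: {mean_significant:.1f} (true effect 2.0)")
```

Because an underpowered trial can only reach significance when its estimate happens to be far from zero, the significant results systematically overstate the true effect.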