Strata Academy

Sample Size & Statistical Power for Students

Why size matters for interpreting negative and positive findings

Quick answer

Power is the chance of detecting a real effect of a prespecified size. Underpowered studies produce inconclusive 'negative' results – wide CIs, not proof of no difference.

Power = 1 − β: the probability of detecting a clinically meaningful effect if it truly exists — typically set at 80% or 90%.
Sample size must be justified before recruitment; post-hoc power after a null result does not rescue a failed study.
A non-significant p-value with wide confidence intervals is inconclusive — absence of evidence is not evidence of absence.
Effect size in the calculation should be clinically meaningful, not merely the smallest difference a very large sample could detect.
Subgroup analyses, cluster trials, and non-inferiority designs need specialised sample size reasoning — generic n-per-group statements may mislead.

1. Core concepts

Sample size and statistical power sit at the heart of trial design — and at the heart of many appraisal mistakes. Before reading results, ask whether the study was large enough to answer the question the authors later claim.

Power is the probability of detecting an effect of a given size if it truly exists, usually set at 80% or 90%. A study with 50% power is essentially a coin flip for finding a real effect — yet underpowered studies remain common in clinical research, particularly in rare diseases, subgroup analyses, and early-phase exploratory trials.

Pre-specified sample size calculations should state the primary outcome, expected effect size, alpha (significance level), power, and assumed control event rate or standard deviation. These assumptions belong in the protocol and trial registry before the first patient is enrolled.

Effect size in the calculation should be clinically meaningful (the smallest difference that would change practice), not merely the smallest difference a huge sample could detect. With a large enough sample, even a 0.1 mmHg blood pressure difference becomes statistically detectable, but such a trial would be wasteful and the difference clinically irrelevant.

Type I error (α) is the false-positive rate — typically 0.05 for two-sided tests. Type II error (β) is a false negative — failing to detect a real effect. Power equals 1 − β. Confusing these errors is a frequent exam trap: α governs how often you declare an effect when none exists; β governs how often you miss one that does.

Type I error (α): false positive — usually set at 0.05 (two-sided)
Type II error (β): false negative — power = 1 − β
Effect size should be clinically meaningful, not only statistically detectable
Power 80% means 20% chance of missing a real effect of the target size
Precision (CI width) is the appraisal-facing consequence of sample size

Term	Definition	Typical value
α (alpha)	Probability of false positive	0.05
β (beta)	Probability of false negative	0.10–0.20
Power (1 − β)	Probability of detecting true effect	0.80–0.90
Effect size	Clinically important difference	Context-specific
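
The relationship between α, β and power can be sketched numerically. The example below is a minimal illustration using a normal approximation for a two-sided comparison of two means; the function name and the inputs (a 5-unit difference with SD 10) are hypothetical, chosen only for demonstration.

```python
from math import sqrt
from statistics import NormalDist


def power_two_means(n_per_group, delta, sd, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two group means."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)      # critical value for the chosen alpha
    se = sd * sqrt(2 / n_per_group)        # standard error of the mean difference
    shift = abs(delta) / se                # true effect in standard-error units
    # Probability that the test statistic lands in either rejection region
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Hypothetical inputs: true difference 5 units, SD 10
for n in (20, 64, 86):
    power = power_two_means(n, delta=5, sd=10)
    print(f"n per group = {n}: power = {power:.2f}, beta = {1 - power:.2f}")

# With no true effect, the rejection probability equals alpha (the Type I error rate)
print(f"delta = 0: rejection probability = {power_two_means(64, 0, 10):.3f}")
```

Setting delta to zero shows that the same calculation returns α when there is nothing to find, which is one way to keep the two error types apart.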

2. What goes into a sample size calculation

Sample size formulae differ by outcome type and design, but the logic is consistent: you specify how much certainty you want, how big an effect matters, and how much variability to expect — then solve for n.

For binary outcomes (event vs no event), calculations need control-group event rate, anticipated treatment effect (absolute or relative), α, and power. A rare outcome requires larger n than a common one for the same relative risk reduction.

For continuous outcomes (mean difference), you need an estimate of standard deviation, the minimum clinically important difference, α, and power. SD is often taken from pilot data or prior trials — if underestimated, the trial will be underpowered.
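
As a rough sketch of how these inputs combine, the commonly taught normal-approximation formula for two equal groups is n per group = 2 × (z for α + z for power)² × SD² / MCID². The function name and example values below are hypothetical, and exact t-based software will give slightly larger numbers.

```python
from math import ceil
from statistics import NormalDist


def n_per_group_means(mcid, sd, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a difference in means."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance level
    z_beta = z.inv_cdf(power)            # power = 1 - beta
    return ceil(2 * (z_alpha + z_beta) ** 2 * sd ** 2 / mcid ** 2)


# Hypothetical trial: MCID of 5 units, SD assumed to be 10
print(n_per_group_means(mcid=5, sd=10))
# Same target, but the true SD turns out to be 12
print(n_per_group_means(mcid=5, sd=12))
```

Because SD enters as a square, a modest underestimate of variability translates into a disproportionately large shortfall in participants.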

For time-to-event outcomes, the number of events (not just participants) drives power. Accrual period, follow-up duration, and dropout all affect how many events you observe.
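
One widely used approximation for this (Schoenfeld's formula, assumed here with 1:1 allocation) gives the required number of events as 4 × (z for α + z for power)² / (ln HR)². The sketch below uses hypothetical function names and a hypothetical target hazard ratio of 0.75; it is an illustration, not a substitute for a trial statistician's calculation.

```python
from math import ceil, log
from statistics import NormalDist


def required_events(hazard_ratio, alpha=0.05, power=0.80):
    """Total events needed across both arms (Schoenfeld approximation, 1:1 allocation)."""
    if hazard_ratio <= 0 or hazard_ratio == 1:
        raise ValueError("hazard_ratio must be positive and different from 1")
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return ceil(4 * z_sum ** 2 / log(hazard_ratio) ** 2)


def required_participants(events, event_probability):
    """Participants needed if a given proportion are expected to have an event."""
    return ceil(events / event_probability)


events = required_events(0.75)
print(events)                                  # events, not participants
print(required_participants(events, 0.30))     # if 30% have an event during follow-up
print(required_participants(events, 0.15))     # if only 15% do
```

The second function makes the point in the text concrete: halving the proportion who experience an event roughly doubles the number of participants needed for the same power.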

Adjustments for anticipated dropout inflate n so that the analysable sample still meets the power target. If 20% of primary outcome data are missing, an intention-to-treat analysis loses power unless pre-planned imputation preserves the integrity of the estimand.
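
The dropout adjustment is simple arithmetic: divide the analysable n by the expected retention proportion. A minimal sketch, with a hypothetical function name and example numbers:

```python
from math import ceil


def inflate_for_dropout(n_analysable, dropout_rate):
    """Number to recruit so that n_analysable remain after the expected dropout."""
    if not 0 <= dropout_rate < 1:
        raise ValueError("dropout_rate must be in [0, 1)")
    return ceil(n_analysable / (1 - dropout_rate))


# Hypothetical: 1,000 analysable participants needed, 20% dropout expected
recruit = inflate_for_dropout(1000, 0.20)
print(recruit)
print(f"expected analysable: {recruit * 0.80:.0f}")
```

Note that multiplying by (1 + dropout rate) instead would recruit 1,200 in this example, leaving only 960 analysable participants, which falls short of the target.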

Binary outcomes: need baseline event rate and target effect size
Continuous outcomes: need SD and minimum clinically important difference
Survival outcomes: power depends on number of events
Dropout inflation: recruit more than the analysable n required
Cluster trials: inflate for intracluster correlation coefficient (ICC)

Tip: If authors report sample size but not the assumed effect size, you cannot judge whether the trial was designed to detect a clinically important difference.

3. Worked example: binary outcome trial

Consider a hypothetical RCT of a new oral anticoagulant versus standard care for preventing stroke in atrial fibrillation. The primary outcome is stroke within 12 months. This mirrors the kind of calculation you might reconstruct in coursework or explain in a viva.

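Suppose prior data suggest a 6% stroke rate on standard care. Clinicians deem a reduction to 4% (absolute risk reduction 2%, relative reduction ~33%) clinically worthwhile. With α = 0.05 (two-sided) and power 80%, the standard normal-approximation formula yields roughly 1,900 participants per arm; raising power to 90% requires about 2,500 per arm. A trial of about 4,800 in total (2,400 per arm) therefore has close to 90% power for this target, before allowing for dropout.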
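
To reconstruct a calculation like this yourself, the sketch below uses the simple unpooled normal-approximation formula for two proportions. The function name is hypothetical, and other formulas (pooled variance, continuity correction, exact methods) give somewhat different figures, so treat the output as approximate.

```python
from math import ceil
from statistics import NormalDist


def n_per_arm_proportions(p_control, p_treatment, alpha=0.05, power=0.80):
    """Normal-approximation sample size per arm for comparing two proportions."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)
    z_beta = z.inv_cdf(power)
    # Sum of the binomial variances in the two arms
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p_control - p_treatment) ** 2)


# Inputs from the worked example: 6% on standard care, 4% on the new drug
for target_power in (0.80, 0.90):
    n = n_per_arm_proportions(0.06, 0.04, power=target_power)
    print(f"power {target_power:.0%}: about {n} per arm, {2 * n} in total")
```

With these inputs the approximation gives on the order of 1,900 per arm at 80% power and about 2,500 per arm at 90% power.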

If the trial enrols only 500 participants and finds stroke rates of 5.5% vs 6.5% (non-significant p = 0.72), what can you conclude? Not that the drugs are equivalent. The confidence interval for the absolute difference might span from a large benefit to a large harm — inconclusive. The study was underpowered for the 2% ARR target.
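
The width of that interval can be checked with a simple Wald confidence interval for a risk difference. The event counts below (14 of 250 versus 16 of 250) are hypothetical whole-number approximations to the 5.5% and 6.5% rates in the example, and the function name is illustrative.

```python
from math import sqrt
from statistics import NormalDist


def risk_difference_ci(events_control, n_control, events_treat, n_treat, level=0.95):
    """Wald CI for the absolute risk reduction (control risk minus treatment risk)."""
    p_c = events_control / n_control
    p_t = events_treat / n_treat
    diff = p_c - p_t                                   # positive = fewer events on treatment
    se = sqrt(p_c * (1 - p_c) / n_control + p_t * (1 - p_t) / n_treat)
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    return diff, diff - z * se, diff + z * se


# Hypothetical counts approximating 6.5% (control) vs 5.5% (treatment), 250 per arm
arr, lower, upper = risk_difference_ci(16, 250, 14, 250)
print(f"ARR = {arr:.1%}, 95% CI {lower:.1%} to {upper:.1%}")
```

With these counts the interval runs from a harm of a few percentage points to a benefit larger than the 2% target, so it is compatible with no effect, with the worthwhile effect, and with harm.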

Now suppose the same 500-participant trial finds 2% vs 6% (p = 0.01). Significant — but scrutinise whether the observed effect is implausibly large compared with the pre-specified target (winner's curse). Small underpowered trials that happen to be significant often overestimate true effects.

This example illustrates the core appraisal move: connect the reported result to the pre-specified target effect and the precision of the estimate — not only to the p-value.

Scenario	Result	Correct interpretation
n = 4,800; 4% vs 6%	p < 0.05	Consistent with pre-specified worthwhile effect
n = 500; 5.5% vs 6.5%	p = 0.72	Inconclusive — underpowered for 2% ARR
n = 500; 2% vs 6%	p = 0.01	Significant but may overestimate true effect
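
How underpowered is the 500-participant trial? The sketch below estimates power for the pre-specified 6% versus 4% target (not post-hoc power from the observed effect), using the same normal approximation as a standard sample size formula. The function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(n_per_arm, p_control, p_treatment, alpha=0.05):
    """Approximate power of a two-sided z-test comparing two proportions."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    variance = p_control * (1 - p_control) + p_treatment * (1 - p_treatment)
    se = sqrt(variance / n_per_arm)
    shift = abs(p_control - p_treatment) / se
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Design-stage power for the 2% ARR target at two trial sizes
for n_per_arm in (250, 2400):
    power = power_two_proportions(n_per_arm, 0.06, 0.04)
    print(f"{2 * n_per_arm} participants: power = {power:.0%}")
```

Under these assumptions the small trial has less than a one-in-five chance of detecting the target effect, which is why its non-significant result says little about whether the effect exists.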

4. What to check in a manuscript

Was sample size justified before recruitment began? The calculation should appear in the protocol, registry entry, or methods section with enough detail to reproduce. Post-hoc power calculations after a non-significant result are generally discouraged — they do not recover a failed study and are mathematically redundant given the observed confidence interval.

For cluster randomised trials, crossover designs, or non-inferiority trials, specialised methods apply. Generic 'n per group' statements may hide inflation factors for clustering, within-subject correlation, or non-inferiority margins. CONSORT extensions exist for each design.

Multiplicity adjustments for several primary outcomes or interim analyses should be reflected in the sample size if claimed. A trial powered for one primary outcome may be underpowered for co-primary outcomes promoted in the abstract.
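
To see why multiplicity should feed into the sample size, the sketch below assumes a Bonferroni correction (one of several possible methods, chosen here only for simplicity) and computes how much larger n must be to keep each outcome at the same power under the stricter α. The function name is hypothetical.

```python
from statistics import NormalDist


def bonferroni_inflation(n_outcomes, alpha=0.05, power=0.80):
    """Ratio of required n with a Bonferroni-adjusted alpha to n with unadjusted alpha."""
    z = NormalDist()
    z_beta = z.inv_cdf(power)
    unadjusted = (z.inv_cdf(1 - alpha / 2) + z_beta) ** 2
    adjusted = (z.inv_cdf(1 - alpha / (2 * n_outcomes)) + z_beta) ** 2
    return adjusted / unadjusted


for k in (1, 2, 3):
    print(f"{k} primary outcome(s): n multiplied by {bonferroni_inflation(k):.2f}")
```

This keeps each outcome individually at the stated power; requiring all co-primary outcomes to succeed together needs a separate, usually larger, calculation.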

Early stopping rules, if used, should be pre-specified with statistical monitoring plans. Stopping early for benefit can inflate false positives if boundaries were not planned; stopping for futility is different and often appropriate.

Recruitment feasibility matters in appraisal: if a trial needed 10,000 participants but enrolled 800 over five years, the effective question may have shifted from the registered primary outcome to whatever was achievable.

Recruitment stopped early for benefit? (can inflate false positives if not planned)
Intention-to-treat with high dropout — effective sample size may be smaller
Subgroup analyses often underpowered by default
Adaptive designs require pre-specified adaptation rules
Compare registered primary outcome and target effect to final report

Note: Post-hoc power is not a substitute for interpreting the confidence interval. If the 95% CI includes both no effect and a clinically important benefit, the trial is inconclusive for that question, regardless of what a post-hoc power calculation claims.

5. Special designs and common traps

Non-inferiority and equivalence trials frame the question differently from superiority trials. Non-inferiority aims to show that a new treatment is not worse than standard care by more than a pre-specified margin. Equivalence (uncommon) aims to show that the effect lies within a pre-specified band around zero. 'Non-significant' in a superiority trial does not mean 'non-inferior'.
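
The difference between the two framings can be shown with a small decision rule. The sketch below assumes the effect is expressed as a risk difference (new minus standard) for a harmful outcome, so positive values mean the new treatment is worse; the function names, margin and intervals are hypothetical.

```python
def is_non_inferior(ci_upper, margin):
    """Non-inferiority is shown if the whole CI lies below the pre-specified margin."""
    if margin <= 0:
        raise ValueError("margin must be positive")
    return ci_upper < margin


def is_significant(ci_lower, ci_upper):
    """A superiority-style test: the CI for the difference excludes zero."""
    return not (ci_lower <= 0 <= ci_upper)


margin = 0.02  # hypothetical margin: 2 percentage points more events is acceptable

# Wide interval: not significant, and not non-inferior either
print(is_significant(-0.03, 0.05), is_non_inferior(0.05, margin))

# Narrow interval: still not significant, but non-inferiority is shown
print(is_significant(-0.010, 0.015), is_non_inferior(0.015, margin))
```

Both hypothetical trials are 'non-significant', yet only the precise one supports a non-inferiority claim.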

Cluster randomised trials randomise groups (GP practices, wards) rather than individuals. Effective sample size depends on the intracluster correlation coefficient (ICC): patients within a cluster resemble one another, so each contributes less independent information. With clusters of 20 to 40 participants, an ICC of 0.05 roughly doubles or triples the required sample size.
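
The usual way to quantify this is the design effect, 1 + (m − 1) × ICC, where m is the average cluster size. A minimal sketch, assuming equal cluster sizes and using hypothetical function names and numbers:

```python
from math import ceil


def design_effect(cluster_size, icc):
    """Variance inflation from clustering, assuming equal cluster sizes."""
    return 1 + (cluster_size - 1) * icc


def clusters_needed(n_individual, cluster_size, icc):
    """Total clusters needed, given the n an individually randomised trial would need."""
    n_total = n_individual * design_effect(cluster_size, icc)
    return ceil(n_total / cluster_size)


for m in (21, 41):
    print(f"cluster size {m}, ICC 0.05: design effect = {design_effect(m, 0.05):.1f}")

# Hypothetical: 1,000 participants needed under individual randomisation
print(clusters_needed(1000, 21, 0.05))
```

Larger clusters inflate the design effect further, so adding participants within existing clusters buys less power than adding clusters.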

Crossover trials use each participant as their own control, reducing required n for continuous outcomes — but require no carryover effect, minimal dropout between periods, and conditions that return to baseline. Appraisal should check whether these assumptions hold.

Observational studies rarely have formal a priori power calculations. Instead, discuss precision: how wide are the confidence intervals, and could residual confounding of plausible magnitude account for the estimated effect? Large cohorts can still be imprecise for rare outcomes.

Pilot and feasibility studies are not powered for definitive effect estimates unless explicitly designed as such. Promoting a pilot's secondary outcome as significant without correction is a common reporting error.

Design	Sample size nuance	Appraisal question
Superiority RCT	Standard α, power, effect size	Was target effect clinically meaningful?
Non-inferiority	Non-inferiority margin pre-specified	Is margin clinically acceptable?
Cluster RCT	ICC inflates required clusters	Was clustering accounted for in analysis?
Crossover	Within-person correlation reduces n	Carryover and dropout plausible?

6. Interpreting 'negative' trials

A non-significant p-value with wide confidence intervals is inconclusive, not proof of no difference. This distinction is central to critical appraisal and to safe clinical reasoning. Absence of evidence is not evidence of absence — a phrase examiners expect you to use correctly.

Ask whether the interval excludes clinically important harm or benefit. A hazard ratio of 1.05 (95% CI 0.85–1.30) for mortality is 'non-significant' but excludes neither a 15% relative decrease nor a 30% relative increase, either of which could be decisive for a serious outcome.
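
This reading of an interval can be written as a small checklist. The sketch below assumes a ratio measure where values below 1 favour treatment; the function name and the thresholds for 'important' benefit (0.85) and harm (1.15) are hypothetical and would need clinical justification in practice.

```python
def interpret_ratio_ci(lower, upper, important_benefit, important_harm):
    """List which conclusions a CI for a ratio measure (HR, RR) is compatible with."""
    findings = []
    if lower <= 1 <= upper:
        findings.append("no effect")
    if lower <= important_benefit:
        findings.append("important benefit")
    if upper >= important_harm:
        findings.append("important harm")
    return findings


# The interval from the text: HR 1.05 (95% CI 0.85 to 1.30)
print(interpret_ratio_ci(0.85, 1.30, important_benefit=0.85, important_harm=1.15))

# A hypothetical narrower interval from a larger trial
print(interpret_ratio_ci(0.95, 1.08, important_benefit=0.85, important_harm=1.15))
```

The first interval is compatible with all three conclusions and is therefore inconclusive; the second is compatible only with no effect, which is a far more informative 'negative' result.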

Non-inferiority and equivalence trials use different framing — they aim to show an effect lies within a margin, not that two treatments are identical. Appraising them with superiority-trial logic leads to systematic errors.

Observational studies rarely have formal power calculations; imprecision should be discussed via CI width and clinical context. A propensity-score matched cohort of 200 may be precise for common outcomes and useless for rare ones.

CASP and ROB 2 both intersect here: was the study large enough to answer the question the authors later claim? Imprecision is an explicit GRADE downgrade domain in systematic reviews — link sample size appraisal to certainty ratings when synthesising evidence.

Wide CI crossing clinically important thresholds → inconclusive
NNT and absolute risk reduction clarify clinical importance better than p alone
Futility stopping is not the same as proving equivalence
Bayesian analyses may quantify probability of benefit — still need sensible priors

7. Common mistakes

Treating p > 0.05 as 'no effect' without inspecting confidence intervals remains the most common student error in journal clubs and written coursework. Always report effect size and interval alongside significance.

Accepting post-hoc power as reassurance after a null result is methodologically discouraged by statisticians and major trial guidelines. The observed CI already tells you what effects are compatible with the data.

Ignoring dropout when interpreting ITT results is a mistake: dropout reduces the effective sample size and can bias estimates if missingness differs between arms (see our missing data guide and ROB 2 Domain 3).

Assuming a significant result from an underpowered study is robust invites overconfidence. Winner's curse and publication bias mean small significant trials often overestimate effects that shrink on replication.

Powering a trial for a surrogate outcome while claiming definitive evidence on a hard clinical endpoint is a design–claim mismatch — common in early-phase research and worth flagging in appraisal.

Did authors conflate non-significant with 'no difference'?
Is post-hoc power presented as justification?
Does the CI exclude effects you would act on?
Was the registered primary outcome adequately powered?
Are subgroup claims supported by interaction tests and plausible n?

Frequently asked questions

What is a 'good' statistical power?

Most trials target 80% or 90% power for the primary outcome. Eighty per cent means a one-in-five chance of missing a real effect of the pre-specified size. Higher power requires larger n. There is no universal rule for observational studies, where precision is judged from confidence interval width.

Can I calculate power after the study finishes?

Post-hoc power using the observed effect size is discouraged — it is largely determined by the observed p-value and adds little beyond the confidence interval. Pre-specified power belongs in the design phase. Exception: blinded sample size re-estimation during recruitment is pre-planned, not post-hoc.
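
The dependence on the p-value can be demonstrated directly. The sketch below assumes a two-sided z-test and computes 'observed power' by treating the observed effect as if it were the true effect; the function name is hypothetical.

```python
from statistics import NormalDist


def observed_power(p_value, alpha=0.05):
    """Post-hoc power of a two-sided z-test, computed from the p-value alone."""
    if not 0 < p_value <= 1:
        raise ValueError("p_value must be in (0, 1]")
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)     # |z| statistic implied by the p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)


for p in (0.01, 0.05, 0.20, 0.72):
    print(f"p = {p}: observed power = {observed_power(p):.0%}")
```

No sample size, effect size or variance appears in the function: under this approximation a p-value of exactly 0.05 always corresponds to observed power of about 50%, and any non-significant result corresponds to less. Reporting it adds nothing the p-value has not already said.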

Does a non-significant result prove treatments are equal?

No. It means the data are compatible with no effect (and with a range of other effects) at the chosen α level. Unless the trial was designed and powered as a non-inferiority or equivalence study with an appropriate margin, 'no difference' is not established.

Why do subgroup analyses often mislead?

Splitting an adequately powered trial into subgroups divides sample size and power. Many subgroup findings are false positives or exaggerated effects. Require a pre-specified interaction test, plausible biological rationale, and replication before acting on subgroup results.
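
The loss of power is easy to quantify under simple assumptions. The sketch below assumes the true effect and variability in the subgroup are the same as in the whole trial, and uses a normal approximation; the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def subgroup_power(fraction, overall_power=0.80, alpha=0.05):
    """Approximate power to detect the overall effect within a subgroup.

    fraction: share of the trial's participants who fall in the subgroup.
    """
    if not 0 < fraction <= 1:
        raise ValueError("fraction must be in (0, 1]")
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    full_shift = z_crit + z.inv_cdf(overall_power)   # effect in SE units for the full trial
    shift = full_shift * sqrt(fraction)              # SE grows as the sample shrinks
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


for fraction in (1.0, 0.5, 0.25):
    print(f"subgroup = {fraction:.0%} of trial: power = {subgroup_power(fraction):.0%}")
```

A trial designed with 80% power overall has roughly coin-flip power within a subgroup containing half its participants, and tests for interaction between subgroups are less powerful still.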

How does sample size relate to GRADE imprecision?

GRADE downgrades for imprecision when confidence intervals are wide enough that reasonable clinicians would disagree about the direction or magnitude of effect, or when optimal information size is not met in meta-analysis. Link your sample size appraisal to whether the estimate is precise enough for decision-making.
