Strata Academy

Sensitivity, Specificity & Likelihood Ratios Explained

Core metrics, pre-test probability, and pairing with QUADAS-2 appraisal

Quick answer

Sensitivity and specificity describe test performance; likelihood ratios update pre-test probability to post-test probability. Always pair statistical interpretation with QUADAS-2 bias assessment and STARD reporting.

Sensitivity and specificity are intrinsic test properties at a given threshold — clinical usefulness depends on pre-test probability.
Likelihood ratios update probability: LR+ = sensitivity / (1 − specificity); LR− = (1 − sensitivity) / specificity.
Predictive values (PPV/NPV) depend on prevalence — do not generalise PPV from a specialist clinic to primary care.
Partial verification and case–control designs distort accuracy estimates; use QUADAS-2 before trusting headline sensitivity.
ROC AUC summarises discrimination but hides the chosen threshold — always ask which cut-off was used clinically.

1. Core metrics: sensitivity, specificity, and predictive values

Diagnostic accuracy studies compare an index test against a reference standard to estimate how well the test classifies patients with and without the target condition. The statistics look simple but are among the most frequently misapplied in clinical practice and student coursework.

Sensitivity is the proportion of people with the condition who test positive on the index test: true positives / (true positives + false negatives). A highly sensitive test misses few cases — useful for 'ruling out' disease when negative (SnNout: Sensitive test, Negative rules out).

Specificity is the proportion without the condition who test negative: true negatives / (true negatives + false positives). A highly specific test generates few false positives — useful for 'ruling in' disease when positive (SpPin: Specific test, Positive rules in).

These properties are intrinsic to the test at a given threshold, but their clinical usefulness depends on pre-test probability — how common the condition is in your population and clinical context. The same test behaves differently in a GP surgery versus a tertiary referral clinic.

Positive predictive value (PPV) and negative predictive value (NPV) depend directly on prevalence. PPV = true positives / all positive tests; NPV = true negatives / all negative tests. High sensitivity does not guarantee high PPV in low-prevalence screening — most positive screens may be false positives.
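To make the prevalence dependence concrete, here is a minimal Python sketch that derives PPV and NPV from sensitivity, specificity, and prevalence using Bayes' theorem. The function name and the 90% / 95% test characteristics are illustrative assumptions, not figures from any study.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given prevalence."""
    tp = sensitivity * prevalence                # true-positive fraction
    fn = (1 - sensitivity) * prevalence          # false-negative fraction
    fp = (1 - specificity) * (1 - prevalence)    # false-positive fraction
    tn = specificity * (1 - prevalence)          # true-negative fraction
    return tp / (tp + fp), tn / (tn + fn)


# Hypothetical test: 90% sensitivity, 95% specificity (assumed values)
for prevalence in (0.01, 0.10, 0.50):
    ppv, npv = predictive_values(0.90, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.1%}, NPV {npv:.1%}")
```

With these assumed values, PPV climbs from roughly 15% at 1% prevalence to roughly 95% at 50% prevalence, even though the test itself has not changed.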

Likelihood ratios combine sensitivity and specificity into measures that update pre-test probability: LR+ = sensitivity / (1 − specificity); LR− = (1 − sensitivity) / specificity. They are the preferred metrics for clinical reasoning because they transport across settings better than PPV/NPV alone.

Predictive values (PPV/NPV) depend on prevalence — do not generalise across settings
LR+ > 10 or LR− < 0.1 produces a large shift in probability (rough anchors)
Pre-specify the threshold that defines a 'positive' test before results are known
Indeterminate results should be reported — not silently excluded
Confidence intervals for sensitivity/specificity matter in small studies
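One way to see why confidence intervals matter in small studies is to compute one. The sketch below uses the Wilson score interval, which is one common choice for a proportion; the source does not specify a method, and the counts are hypothetical.

```python
import math


def wilson_interval(successes, total, z=1.96):
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Hypothetical studies: the same 90% sensitivity from 20 vs 200 diseased patients
for tp, diseased in ((18, 20), (180, 200)):
    low, high = wilson_interval(tp, diseased)
    print(f"{tp}/{diseased}: sensitivity 90%, 95% CI {low:.1%} to {high:.1%}")
```

The point estimate is identical in both cases, but the interval from 20 patients is several times wider than the one from 200.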

Metric	Formula (concept)	Depends on prevalence?
Sensitivity	TP / (TP + FN)	No
Specificity	TN / (TN + FP)	No
PPV	TP / (TP + FP)	Yes
NPV	TN / (TN + FN)	Yes
LR+	Sens / (1 − Spec)	No
LR−	(1 − Sens) / Spec	No
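The formulas in the table can be sketched as a single Python function over the four cells of a 2×2 table. The function name and the example counts are illustrative assumptions.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Compute the table's metrics from 2x2 counts.

    Note: LR+ is undefined (division by zero) when specificity is exactly 1,
    and LR- is undefined when specificity is exactly 0.
    """
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "lr_pos": sensitivity / (1 - specificity),
        "lr_neg": (1 - sensitivity) / specificity,
    }


# Hypothetical cohort of 1,000 patients, 100 of whom have the condition
metrics = diagnostic_metrics(tp=90, fp=45, fn=10, tn=855)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Here sensitivity is 0.90 and specificity 0.95, giving LR+ of about 18 and LR− of about 0.11, while PPV is only about 0.67 because prevalence in this made-up cohort is 10%.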

2. Pre-test and post-test probability

Clinical diagnosis is sequential updating. You start with a pre-test probability based on prevalence, presenting features, and referral pathway. The test result — via likelihood ratio — moves you to a post-test probability at which you decide whether to treat, refer, or reassure.

Apply the likelihood ratio in odds form: pre-test odds × LR = post-test odds. Convert probability to odds with odds = p / (1 − p), and back again with p = odds / (1 + odds). The Fagan nomogram and online calculators perform this conversion, but examiners often test LR interpretation without providing PPV/NPV and expect you to update probability by hand.

An LR+ of 10 multiplies the odds of disease by 10; an LR− of 0.1 divides them by 10. Anchors from evidence-based medicine: LR+ > 10 or LR− < 0.1 produces a large shift; LR+ 5–10 or LR− 0.1–0.2 a moderate shift; values closer to 1 shift probability minimally.

In low-prevalence screening (e.g. coeliac serology in low-risk adults), even a highly specific test can yield poor PPV — many false positives among the well. Confirmatory testing and Bayesian updating across sequential tests become essential.
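Sequential updating can be sketched as repeated multiplication of the odds by each test's likelihood ratio. This simple chaining assumes the tests are conditionally independent given disease status, which is often only approximately true. The pre-test probability and both LR values below are invented for illustration and are not estimates for coeliac serology or any real test.

```python
def to_odds(probability):
    return probability / (1 - probability)


def to_probability(odds):
    return odds / (1 + odds)


def update_sequentially(pre_test_probability, likelihood_ratios):
    """Return the probability after each test, starting with the pre-test value.

    Assumes conditional independence between tests (a simplifying assumption).
    """
    odds = to_odds(pre_test_probability)
    trajectory = [pre_test_probability]
    for lr in likelihood_ratios:
        odds *= lr
        trajectory.append(to_probability(odds))
    return trajectory


# Hypothetical: 1% pre-test probability, positive screen (LR+ 20),
# then positive confirmatory test (LR+ 30)
for step, p in enumerate(update_sequentially(0.01, [20, 30])):
    print(f"after {step} test(s): {p:.1%}")
```

Under these assumptions a single positive screen lifts probability only to about 17%, and it takes the confirmatory result to reach about 86%.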

In high-prevalence specialist clinics (e.g. cardiology chest pain clinic), sensitivity dominates — missed cases are costly. A test with 95% sensitivity still misses one in twenty cases; clinical follow-up may still be required.

Students should practise one worked example per coursework module. Below is a structured approach you can replicate in OSCE reasoning stations.

Tip: See our QUADAS-2 framework guide for bias domains that distort sensitivity and specificity estimates.

3. Worked example: applying a likelihood ratio

Suppose a 55-year-old smoker presents to their GP with atypical chest pain. Your clinical assessment suggests a 20% pre-test probability of coronary artery disease before exercise ECG.

Exercise ECG shows ST depression — reported LR+ ≈ 5 for CAD in this context. Pre-test odds = 0.20 / 0.80 = 0.25. Post-test odds = 0.25 × 5 = 1.25. Post-test probability = 1.25 / (1 + 1.25) ≈ 56%.

The test shifted probability from 20% to roughly 56% — a meaningful but not definitive increase. Further testing (imaging, cardiology referral) may still be appropriate depending on local pathways and patient preferences.

If the ECG had been normal (LR− ≈ 0.3), post-test odds = 0.25 × 0.3 = 0.075; post-test probability ≈ 7%. Probability dropped substantially but did not reach zero — 'ruled out' in clinical language still requires context.

This example shows why LRs are preferred over quoting sensitivity alone: the same test result changes probability differently depending on where you start. Examination questions frequently test this arithmetic without providing PPV.

Step	Value	Calculation
Pre-test probability	20%	From clinical assessment
Pre-test odds	0.25	0.20 / 0.80
LR+ (positive ECG)	5	From validation study
Post-test odds	1.25	0.25 × 5
Post-test probability	~56%	1.25 / (1 + 1.25)
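The arithmetic in this worked example can be checked with a few lines of Python. The function name is an illustrative choice; the 20% pre-test probability and the LR values of 5 and 0.3 are the ones used in the example above, while the 5% and 50% starting points are added purely to show the effect of a different pre-test probability.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Probability -> odds, multiply by the LR, then odds -> probability."""
    pre_test_odds = pre_test_probability / (1 - pre_test_probability)
    post_test_odds = pre_test_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


positive = post_test_probability(0.20, 5)     # ST depression, LR+ of about 5
negative = post_test_probability(0.20, 0.3)   # normal ECG, LR- of about 0.3
print(f"positive ECG: {positive:.0%}")  # 56%
print(f"normal ECG:   {negative:.0%}")  # 7%

# Same positive result, different (assumed) starting points
for pre in (0.05, 0.20, 0.50):
    print(f"pre-test {pre:.0%} -> post-test {post_test_probability(pre, 5):.0%}")
```

The last loop shows the same LR+ of 5 taking a 5% pre-test probability to about 21% but a 50% pre-test probability to about 83%.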

4. Design features to appraise

Diagnostic accuracy is only as trustworthy as the study design. A test with excellent sensitivity on paper may perform poorly when verification bias, spectrum bias, or partial reference standard application distort the estimates.

Patients should represent the intended clinical spectrum — not only severe cases from tertiary centres if the test will be used in primary care. Spectrum bias inflates or deflates accuracy when the study sample does not match the target population.

All enrolled patients should receive the same reference standard; any partial or differential verification must be explained and adjusted for. Partial verification, where the gold standard is applied mainly to those with positive index tests, is a classic flaw that typically inflates sensitivity and deflates specificity.
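The effect of partial verification can be illustrated with expected counts. The sketch below is a deliberately simple model, not a published correction method: it assumes a cohort with known true accuracy and verifies a chosen fraction of index-positive and index-negative patients. All numbers are hypothetical.

```python
def apparent_accuracy(n, prevalence, sensitivity, specificity,
                      verify_positive=1.0, verify_negative=1.0):
    """Apparent (sensitivity, specificity) among verified patients only.

    verify_positive / verify_negative are the fractions of index-positive and
    index-negative patients who receive the reference standard.
    """
    diseased = n * prevalence
    healthy = n - diseased
    tp = diseased * sensitivity * verify_positive
    fn = diseased * (1 - sensitivity) * verify_negative
    fp = healthy * (1 - specificity) * verify_positive
    tn = healthy * specificity * verify_negative
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical test with true sensitivity 80% and specificity 90%
full = apparent_accuracy(1000, 0.20, 0.80, 0.90)
partial = apparent_accuracy(1000, 0.20, 0.80, 0.90, verify_negative=0.10)
print(f"full verification:    sens {full[0]:.1%}, spec {full[1]:.1%}")
print(f"partial verification: sens {partial[0]:.1%}, spec {partial[1]:.1%}")
```

In this scenario, verifying all index-positives but only 10% of index-negatives pushes apparent sensitivity from 80% to about 98%, while apparent specificity falls from 90% to about 47%.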

Index test and reference standard should be interpreted blindly where feasible. Knowledge of one result can bias interpretation of the other (review bias). In practice, blinding is difficult for some paired tests — authors should state what blinding was achieved.

Case–control diagnostic designs (known disease vs known healthy controls) often inflate sensitivity and specificity because they compare clear-cut cases with healthy controls and leave out the borderline patients in whom the test is hardest to interpret. Prospective cohort or cross-sectional designs in consecutive patients are stronger.

Index tests with indeterminate results should report how these were handled. Excluding indeterminate results from analysis inflates apparent accuracy — check the participant flow diagram.

Consecutive enrolment reduces selection bias
Same reference standard for all participants — or justified differential verification
Blinding of index and reference test interpreters
Timing between tests short enough that disease status is stable
Appropriate spectrum of disease severity and comorbidity

Note: Verification bias occurs when only patients with positive index tests receive the gold standard — a classic flaw in retrospective diagnostic studies. QUADAS-2 Domain 4 (flow and timing) targets this.

5. ROC curves, AUC, and threshold choice

ROC curves plot sensitivity against 1 − specificity across all possible cut-offs for a continuous or ordinal index test. They visualise the trade-off: lowering the threshold increases sensitivity but decreases specificity, and vice versa.
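The trade-off can be sketched by computing one ROC point per candidate cut-off. The code assumes higher scores are more abnormal and that a result counts as positive when the score is at or above the threshold; the scores are made up.

```python
def roc_points(diseased_scores, healthy_scores):
    """Return (threshold, sensitivity, 1 - specificity) for each cut-off.

    Assumes a result is 'positive' when score >= threshold.
    """
    thresholds = sorted(set(diseased_scores) | set(healthy_scores), reverse=True)
    points = []
    for threshold in thresholds:
        sensitivity = sum(s >= threshold for s in diseased_scores) / len(diseased_scores)
        false_positive_rate = sum(s >= threshold for s in healthy_scores) / len(healthy_scores)
        points.append((threshold, sensitivity, false_positive_rate))
    return points


# Hypothetical biomarker values
diseased = [7, 8, 5, 9, 6]
healthy = [3, 4, 6, 2, 5]
for threshold, sens, fpr in roc_points(diseased, healthy):
    print(f"cut-off >= {threshold}: sensitivity {sens:.0%}, 1 - specificity {fpr:.0%}")
```

Reading down the output, each lower cut-off keeps or raises sensitivity while keeping or raising the false-positive rate, which is the curve's trade-off in tabular form.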

The area under the curve (AUC) summarises overall discrimination: the probability that a randomly chosen diseased patient has a higher test value than a randomly chosen non-diseased patient (assuming higher values indicate disease). As rough conventions, AUC = 0.5 is chance, 0.7–0.8 is acceptable, and 0.8–0.9 is excellent; values above 0.9 should prompt a check for overfitting or spectrum issues before being taken at face value.
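The probabilistic reading of AUC can be computed directly by comparing every diseased patient with every non-diseased patient. This brute-force sketch counts ties as half a win, which is the usual Mann–Whitney convention and an assumption beyond what the text states; the scores are hypothetical.

```python
def auc_pairwise(diseased_scores, healthy_scores):
    """Fraction of diseased/healthy pairs in which the diseased score is higher.

    Ties count as 0.5. Assumes higher scores indicate disease.
    """
    wins = 0.0
    for d in diseased_scores:
        for h in healthy_scores:
            if d > h:
                wins += 1.0
            elif d == h:
                wins += 0.5
    return wins / (len(diseased_scores) * len(healthy_scores))


# Hypothetical biomarker values
diseased = [7, 8, 5, 9, 6]
healthy = [3, 4, 6, 2, 5]
print(f"AUC = {auc_pairwise(diseased, healthy):.2f}")  # AUC = 0.92
```

Nothing in this calculation refers to a cut-off, which is why a single AUC cannot tell you how the test performs at the threshold used in clinic.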

AUC hides the clinically chosen operating point. Ask which threshold was used for the reported sensitivity and specificity, and whether it was pre-specified. A high AUC with a suboptimal clinical threshold may still misclassify patients at the cut-off you would actually use.

For continuous biomarkers, report how the cut-off was derived — data-driven optimisation on the same dataset without external validation overfits and inflates accuracy. Validation in an independent cohort is the gold standard.

Compare AUC only across studies with similar patient spectrums and reference standards. AUC from a case–control biomarker study is not comparable to AUC from a consecutive cohort in primary care.

Decision curve analysis (when reported) evaluates net benefit across threshold probabilities — increasingly seen in machine learning diagnostic papers. It asks whether using the test improves decisions compared with treat-all or treat-none strategies.
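For readers who want to see what net benefit means numerically, here is a minimal sketch using the standard decision-curve formula: true positives per patient minus false positives per patient, weighted by the odds of the threshold probability. The function names and counts are illustrative assumptions.

```python
def net_benefit(tp, fp, n, threshold_probability):
    """Net benefit of acting on positive test results at a threshold probability."""
    weight = threshold_probability / (1 - threshold_probability)
    return tp / n - (fp / n) * weight


def net_benefit_treat_all(prevalence, threshold_probability):
    """Net benefit of treating everyone regardless of the test."""
    weight = threshold_probability / (1 - threshold_probability)
    return prevalence - (1 - prevalence) * weight


# Hypothetical cohort: 1,000 patients, 100 diseased, test gives 90 TP and 45 FP
for pt in (0.05, 0.10, 0.20):
    test = net_benefit(tp=90, fp=45, n=1000, threshold_probability=pt)
    treat_all = net_benefit_treat_all(prevalence=0.10, threshold_probability=pt)
    print(f"threshold {pt:.0%}: test {test:.3f}, treat-all {treat_all:.3f}, treat-none 0.000")
```

A test is worth using at a given threshold probability only if its net benefit exceeds both the treat-all value and zero (treat-none).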

ROC curve: sensitivity vs 1 − specificity at all thresholds
AUC: overall discrimination — not a substitute for threshold analysis
Pre-specify clinical cut-off before unblinding results
External validation required for data-driven thresholds

6. QUADAS-2 and STARD

Use QUADAS-2 domains for risk of bias and applicability when appraising diagnostic accuracy studies. Four domains — patient selection, index test, reference standard, flow and timing — are rated separately for risk of bias and for applicability concerns.

Risk of bias and applicability are distinct judgments. A study may be at low risk of bias in the index test domain but raise applicability concerns if patients differ from your NHS population.

STARD (Standards for Reporting Diagnostic Accuracy Studies) supports complete reporting: title, abstract, methods (study design, participants, test methods), results (flow, cross-tabulation, estimates with CIs), and discussion. STARD 2015 is the current reference.

A well-reported STARD paper can still have high QUADAS-2 bias if verification was partial or the spectrum was wrong. Reporting completeness and methodological quality are different lenses — use both.

For AI imaging classifiers, pair QUADAS-2 with CLAIM (Checklist for AI in Medical Imaging) reporting expectations. External validation on different scanners and sites is essential — internal validation AUCs are often optimistic.

Do not apply RoB 2 (the risk-of-bias tool for randomised trials) to diagnostic accuracy studies; it is the wrong framework. QUADAS-2 or QUADAS-C (for comparative studies) is appropriate.

QUADAS-2 domain	Key question	Common flaw
Patient selection	Consecutive, representative spectrum?	Case–control selection bias
Index test	Blinded, pre-specified threshold?	Post-hoc optimal cut-off
Reference standard	Accurate, interpreted blind to index test?	Imperfect or unblinded reference standard
Flow and timing	All verified with the same reference? All analysed? Appropriate interval?	Partial verification; dropouts excluded from analysis

7. Common student mistakes

Quoting sensitivity without prevalence context is the most common error. 'Sensitivity 99%' sounds excellent until you calculate PPV in a population where prevalence is 0.1%: even with 99% specificity, only about 9% of positives are true positives.

Using PPV from a tertiary cohort to counsel primary-care patients ignores prevalence differences. Always reconstruct post-test probability for your setting using LRs and local pre-test estimates.

Treating AUC as sufficient without threshold analysis ignores the clinically relevant operating point. Report sensitivity and specificity at the cut-off you would use — not only AUC.

Applying RoB 2 to diagnostic accuracy studies is a category error; use QUADAS-2 instead.

Ignoring indeterminate test results in the analysis population inflates accuracy. Check the flow diagram for exclusions.

Confusing screening and diagnostic accuracy questions: a test validated for diagnosis in symptomatic patients may perform differently when used for population screening — applicability matters.

Did I estimate pre-test probability for my clinical setting?
Did I use LRs rather than PPV from a different prevalence?
Is the reference standard acceptable and applied to all?
Was the threshold pre-specified and clinically relevant?
Did I complete QUADAS-2 before trusting the headline metrics?

Frequently asked questions

What is a good sensitivity or specificity?

It depends on clinical context. Rule-out tests need high sensitivity (few false negatives); rule-in tests need high specificity (few false positives). There is no universal threshold — balance depends on disease severity, test harms, and pre-test probability.

Why are likelihood ratios better than predictive values?

LRs describe test performance independent of prevalence and can update pre-test probability in any setting. PPV and NPV change when prevalence changes — a PPV from a specialist clinic cannot be applied directly in primary care.

What LR values produce clinically useful shifts?

Rough anchors: LR+ > 10 or LR− < 0.1 produce large probability shifts; LR+ 5–10 or LR− 0.1–0.2 moderate shifts; values near 1 change probability minimally. Exact post-test probability still depends on pre-test probability.

When is AUC misleading?

When the underlying model or cut-off was tuned on the same data without external validation, when the estimate comes from a case–control design, or when the clinically used cut-off differs from the optimal ROC point. Always inspect sensitivity and specificity at the relevant threshold.

Should I use QUADAS-2 for AI diagnostic studies?

Yes — QUADAS-2 applies to index tests including AI classifiers. Additionally use CLAIM for AI-specific reporting. Applicability concerns often arise when training data differ from your local population, scanners, or disease spectrum.
