Strata Academy

QUADAS-2 Checklist: How to Appraise Diagnostic Studies

Four QUADAS-2 domains, applicability concerns, patient flow / verification bias, and STARD reporting — step-by-step for journal club

Quick answer

Appraising diagnostic accuracy studies means applying the QUADAS-2 checklist for risk of bias and applicability across four domains (patient selection, index test, reference standard, flow/timing), paired with STARD for reporting — start with patient flow before trusting sensitivity and specificity.

Start with patient flow — partial verification is a common fatal flaw.
QUADAS-2 rates risk of bias for all four domains and applicability separately for the first three; flow and timing has no applicability rating.
STARD reporting gaps often correspond to QUADAS-2 bias signals.
Translate accuracy into likelihood ratios and apply them to pre-test probability for clinical use.

1. Define the diagnostic question

Before opening the PDF, specify the index test (the test under evaluation), the reference standard (gold standard comparator), the target condition, and the intended use setting. A troponin study in ED chest pain differs materially from a research-laboratory assay in asymptomatic screening.

Diagnostic accuracy studies estimate sensitivity, specificity, predictive values, likelihood ratios, or AUC. They are not intervention trials, so RoB 2 does not apply. QUADAS-2 is the Cochrane-recommended risk-of-bias tool; STARD 2015 is the reporting guideline.

Clarify whether you need to know if the test works (accuracy) or whether it changes management (impact). This guide focuses on accuracy appraisal; test-treatment pathways require different evidence frameworks.

Index test — what is being evaluated?
Reference standard — is it credible and applied consistently?
Target condition — how was disease defined?
Setting — primary care, secondary care, or case-enriched specialist centre?
Threshold — pre-specified cut-off or data-driven optimisation?

2. STARD reporting scan (first pass)

STARD provides 30 items covering title, abstract, methods, results, and discussion for diagnostic accuracy studies. Use it as a structured reading guide before deep appraisal — missing items often flag bias domains.

Priority STARD 2015 items for students: participant eligibility, recruitment and sampling (Items 6–9), index test and reference standard descriptions (Items 10–13), participant flow with numbers at each stage (Item 19), and cross-tabulation of results (Item 23).

STARD completeness does not guarantee low risk of bias. A well-reported case–control diagnostic design can still carry high risk of bias in patient selection and high applicability concern. Reporting and bias are complementary assessments.

Read abstract — do sensitivity/specificity match the clinical question?
Locate patient flow diagram — count enrolled, tested, verified, analysed
Identify reference standard — same for all patients?
Check index test blinding — could knowledge of reference standard influence interpretation?
Find 2×2 table — raw true/false positive and negative counts
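Once the 2×2 table is located, the headline accuracy figures can be recomputed by hand as a check on the abstract. The sketch below is a minimal illustration; the class name TwoByTwo and the counts are hypothetical and not taken from any study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoByTwo:
    """Counts from a diagnostic 2x2 table (index test vs reference standard)."""
    tp: int  # index positive, reference positive
    fp: int  # index positive, reference negative
    fn: int  # index negative, reference positive
    tn: int  # index negative, reference negative

    def sensitivity(self) -> float:
        # Proportion of reference-positive patients who test positive.
        return self.tp / (self.tp + self.fn)

    def specificity(self) -> float:
        # Proportion of reference-negative patients who test negative.
        return self.tn / (self.tn + self.fp)

    def prevalence(self) -> float:
        # Proportion of analysed patients who are reference-positive.
        return (self.tp + self.fn) / (self.tp + self.fp + self.fn + self.tn)


# Hypothetical counts, for illustration only.
table = TwoByTwo(tp=90, fp=30, fn=10, tn=170)
print(f"sensitivity = {table.sensitivity():.2f}")
print(f"specificity = {table.specificity():.2f}")
print(f"prevalence  = {table.prevalence():.2f}")
```

If the recomputed values disagree with the abstract, check whether uninterpretable or unverified results were dropped from the table.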

3. QUADAS-2 domain appraisal

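QUADAS-2 evaluates four domains. Each is rated for risk of bias (low / high / unclear); the first three (patient selection, index test, reference standard) are also rated separately for applicability concerns (low / high / unclear). Flow and timing has no applicability rating. Work through the signalling questions in order, and do not assign a domain judgement until you have answered them.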

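Patient selection: was a consecutive or random sample enrolled, and were inappropriate exclusions avoided? Was the patient spectrum representative of the intended use population? Case–control designs that assemble known cases and healthy controls often inflate accuracy; they warrant high risk of bias for patient selection and usually high applicability concern as well.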

Index test: was interpretation blind to reference standard results? Were thresholds pre-specified? Data-driven cut-offs optimised on the same dataset inflate apparent performance.

Reference standard: is it the best available comparator? Was it applied regardless of index test result? Differential verification — different standards for different patients — threatens validity.

Flow and timing: adequate interval between tests? Did all enrolled patients receive both tests? Partial verification (only positives get reference standard) is classic verification bias.

Note: Do not collapse applicability and risk of bias into a single 'quality score'. A study can be internally low bias yet not applicable to your NHS pathway.
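One way to keep the two axes apart in journal club notes is to record them as separate fields. The sketch below is only an illustration: the names Judgement, DomainRating and summarise are hypothetical, and the ratings describe an invented study. It assumes the QUADAS-2 convention that flow and timing carries no applicability rating.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Judgement(Enum):
    LOW = "low"
    HIGH = "high"
    UNCLEAR = "unclear"


@dataclass(frozen=True)
class DomainRating:
    domain: str
    risk_of_bias: Judgement
    # None where no applicability rating is made (flow and timing).
    applicability: Optional[Judgement]


def summarise(ratings: list[DomainRating]) -> dict[str, list[str]]:
    """List domains judged high or unclear, without merging the two axes."""
    return {
        "bias_flags": [
            r.domain for r in ratings if r.risk_of_bias is not Judgement.LOW
        ],
        "applicability_flags": [
            r.domain for r in ratings
            if r.applicability is not None
            and r.applicability is not Judgement.LOW
        ],
    }


# Hypothetical appraisal of an invented study.
ratings = [
    DomainRating("patient selection", Judgement.LOW, Judgement.HIGH),
    DomainRating("index test", Judgement.HIGH, Judgement.LOW),
    DomainRating("reference standard", Judgement.LOW, Judgement.LOW),
    DomainRating("flow and timing", Judgement.UNCLEAR, None),
]
summary = summarise(ratings)
print(summary)
```

The output lists flagged domains per axis rather than a total, so a low-bias study with poor applicability stays visible as exactly that.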

4. Patient flow and verification bias

Draw or reconstruct the flow diagram before interpreting sensitivity and specificity. Ask: of all patients who had the index test, how many had reference standard verification?

Partial verification bias occurs when only some patients receive the reference standard, typically those with a positive index test. Because unverified index-negative patients drop out of the 2×2 table, false negatives and true negatives are undercounted: sensitivity is usually overestimated and specificity underestimated.
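The direction of partial verification bias can be checked with simple arithmetic. The sketch below uses hypothetical counts and a simplifying assumption: every index-positive patient is verified, and a fixed fraction of index-negative patients is verified at random. The function name naive_accuracy is illustrative.

```python
def naive_accuracy(tp, fp, fn, tn, verified_neg_fraction):
    """Sensitivity and specificity computed from verified patients only.

    Assumes all index-positive patients are verified and that a fixed
    fraction of index-negative patients is verified at random.
    """
    fn_seen = fn * verified_neg_fraction  # verified false negatives
    tn_seen = tn * verified_neg_fraction  # verified true negatives
    sensitivity = tp / (tp + fn_seen)
    specificity = tn_seen / (tn_seen + fp)
    return sensitivity, specificity


# Hypothetical full-cohort counts: true sensitivity 0.90, specificity 0.85.
for fraction in (1.0, 0.5, 0.1):
    sens, spec = naive_accuracy(90, 30, 10, 170, fraction)
    print(f"verified negatives {fraction:.0%}: "
          f"sensitivity {sens:.2f}, specificity {spec:.2f}")
```

With only 10% of index-negative patients verified, apparent sensitivity rises from 0.90 to about 0.99 while apparent specificity falls from 0.85 to about 0.36. Real studies rarely verify at random, so treat this as a direction-of-bias illustration rather than a correction method.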

Differential verification applies different reference standards to different subgroups — for example, CT for high-risk patients and clinical follow-up for low-risk patients. Rate high risk of bias in flow and timing unless adequately corrected.

Incorporation bias occurs when the index test forms part of the reference standard definition — common when comparing imaging modalities against composite clinical diagnosis that already incorporated the index result.

Count patients at each stage — enrolled, index tested, reference verified, analysed
Flag exclusions after index test — were they differential?
Check interval between tests — disease may evolve
Note withdrawals and uninterpretable results
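The counting steps above can be sketched as a small helper that checks the reconstructed flow is internally consistent and reports what proportion of index-tested patients were verified. The function name flow_summary and the counts are hypothetical.

```python
def flow_summary(enrolled, index_tested, verified, analysed):
    """Check that counts never increase along the flow and report losses."""
    if index_tested <= 0:
        raise ValueError("index_tested must be positive")
    stages = [
        ("enrolled", enrolled),
        ("index tested", index_tested),
        ("reference verified", verified),
        ("analysed", analysed),
    ]
    # Each stage should be no larger than the stage before it.
    for (prev_name, prev_n), (name, n) in zip(stages, stages[1:]):
        if n > prev_n:
            raise ValueError(f"{name} ({n}) exceeds {prev_name} ({prev_n})")
    return {
        "lost_before_index_test": enrolled - index_tested,
        "unverified": index_tested - verified,
        "excluded_from_analysis": verified - analysed,
        "verified_fraction": verified / index_tested,
    }


# Hypothetical counts read off a flow diagram.
summary = flow_summary(enrolled=500, index_tested=480, verified=360, analysed=350)
print(summary)
```

A verified fraction well below 1 is a prompt to ask who was left unverified and why; the number alone does not say whether the losses were differential.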

5. Interpreting results clinically

Sensitivity and specificity alone are insufficient for bedside decisions. Convert results to likelihood ratios (LR+ and LR−) and apply them to pre-test probability using Fagan nomogram logic or Bayesian framing.
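The conversion works on odds: post-test odds equal pre-test odds multiplied by the likelihood ratio. A minimal sketch follows, with illustrative function names and hypothetical input values.

```python
def likelihood_ratios(sensitivity, specificity):
    """Return (LR+, LR-) from sensitivity and specificity."""
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    return lr_pos, lr_neg


def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert probability to odds, apply the LR, convert back."""
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


# Hypothetical test: sensitivity 0.90, specificity 0.85, pre-test probability 0.20.
lr_pos, lr_neg = likelihood_ratios(0.90, 0.85)
after_positive = post_test_probability(0.20, lr_pos)
after_negative = post_test_probability(0.20, lr_neg)
print(f"LR+ = {lr_pos:.2f}, LR- = {lr_neg:.2f}")
print(f"post-test probability after positive: {after_positive:.2f}")
print(f"post-test probability after negative: {after_negative:.2f}")
```

In this example LR+ is about 6 and LR− about 0.12, moving a 20% pre-test probability to roughly 60% after a positive result and roughly 3% after a negative one.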

Predictive values depend on prevalence — high specificity in a low-prevalence population still yields many false positives if the test is applied indiscriminately. Always ask: what was the prevalence in this study, and does it match my patients?
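The prevalence dependence is easy to see numerically. The sketch below holds sensitivity and specificity fixed and varies prevalence; all values are hypothetical and the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a given prevalence, using expected proportions."""
    tp = sensitivity * prevalence
    fn = (1 - sensitivity) * prevalence
    fp = (1 - specificity) * (1 - prevalence)
    tn = specificity * (1 - prevalence)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    return ppv, npv


# Hypothetical test with sensitivity 0.90 and specificity 0.95.
for prevalence in (0.01, 0.10, 0.30):
    ppv, npv = predictive_values(0.90, 0.95, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {ppv:.2f}, NPV {npv:.2f}")
```

At 1% prevalence the positive predictive value is only about 15%, against about 89% at 30% prevalence, even though the test itself is unchanged.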

Confidence intervals around sensitivity and specificity widen rapidly in small studies. A reported sensitivity of 96% (48 of 50 diseased patients; 95% CI roughly 86–99%) carries more uncertainty than the point estimate suggests.
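To see how interval width depends on the number of diseased patients, the sketch below computes a Wilson score interval for a proportion. The Wilson method is chosen here as an assumption for illustration; papers may report exact (Clopper–Pearson) or other intervals, which give slightly different limits. The counts are hypothetical.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a proportion (z=1.96 for about 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


# Same observed sensitivity (96%) from a small and a large diseased group.
small = wilson_interval(48, 50)
large = wilson_interval(480, 500)
print(f"48/50:   {small[0]:.3f} to {small[1]:.3f}")
print(f"480/500: {large[0]:.3f} to {large[1]:.3f}")
```

With 50 diseased patients the interval runs from roughly 87% to 99%; with 500 it narrows to about 94% to 97%.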

BMJ Statistics Notes and CEBM resources provide accessible introductions to likelihood ratios and diagnostic reasoning — pair statistical interpretation with QUADAS-2 judgements.

Tip: See our diagnostic statistics guide for likelihood ratio calculations and post-test probability worked examples.

6. Diagnostic systematic reviews

When synthesising multiple diagnostic accuracy studies, complete QUADAS-2 per included study and present summary tables. Cochrane recommends bivariate or hierarchical models for pooled sensitivity and specificity — not simple pooling of proportions without accounting for correlation.

Investigate heterogeneity in patient spectrum and index test version. A pooled estimate mixing primary care and tertiary referral populations may be clinically meaningless even if statistically estimable.

GRADE for diagnostic test accuracy follows a structured approach considering study limitations (QUADAS-2), inconsistency, indirectness, imprecision, and publication bias. Report certainty alongside pooled estimates.

7. Common diagnostic appraisal mistakes

These errors show up repeatedly in journal club and coursework — fix them before trusting sensitivity or specificity numbers.

Applying RoB 2 to a diagnostic accuracy study instead of QUADAS-2.
Collapsing risk of bias and applicability into one 'quality score'.
Trusting accuracy from case–control designs without flagging spectrum bias.
Ignoring partial verification (only positives get the reference standard).
Reporting sensitivity/specificity without prevalence context or likelihood ratios.
Treating STARD completeness as proof of low bias.

Frequently asked questions

What is the QUADAS-2 checklist?

QUADAS-2 is the Cochrane-recommended checklist for risk of bias and applicability in diagnostic accuracy studies. It covers patient selection, index test, reference standard, and flow/timing. All four domains are rated for risk of bias; the first three are also rated separately for applicability concerns.

How do you score QUADAS-2?

Work through signalling questions domain by domain, then assign low, high, or unclear risk of bias and separately rate applicability. Key threats include partial verification, differential reference standards, lack of blinding, and data-driven thresholds.

What is the difference between sensitivity and specificity?

Sensitivity is the proportion of people with the disease who test positive. Specificity is the proportion without the disease who test negative. Both depend on the chosen threshold and patient spectrum — report them with confidence intervals, not in isolation.

How do you judge applicability in QUADAS-2?

Applicability asks whether the study population, index test, and reference standard match your clinical question. Enriched specialist populations or outdated comparators often warrant high applicability concern even when internal bias is low. Case–control designs typically raise applicability concern too, in addition to risk of bias in patient selection.
