Strata Academy

AI for Systematic Reviews: What Works (and What Fails)

AI screening, extraction, and chat tools vs framework-aligned critical appraisal (ROB 2, AMSTAR 2, GRADE)

Quick answer

AI can help with search brainstorming, deduplication, and screening prioritisation at scale, but generic chat tools cannot reproducibly apply official appraisal frameworks (ROB 2, AMSTAR 2, GRADE). High-stakes inclusion, risk-of-bias, and certainty judgements still require accountable human reviewers using citable methods.

Screening AI: useful as prioritisation inside dual human review – not as sole inclusion engine.
Extraction AI: first-draft only – verify every number against the source PDF.
Appraisal AI: framework routing and signalling questions need structured tools, not open chat.
Disclose AI use in methods; supervisors expect reproducible audit trails.

1. Where AI appears in research workflows

Artificial intelligence now touches almost every stage of evidence synthesis and critical appraisal. Tools appear in search term generation, reference deduplication, title/abstract prioritisation, full-text retrieval assistance, structured data extraction, plain-language summarisation, and manuscript scoring.

Capabilities differ sharply by task. Low-risk uses – suggesting MeSH synonyms or formatting citations – tolerate occasional errors. High-stakes steps – deciding which studies enter a meta-analysis, judging ROB 2 domains, or assigning GRADE certainty – require accountable human reviewers who can defend decisions in a viva, journal club, or guideline panel.

Regulators and publishers increasingly expect transparency. PRISMA 2020 extensions for searching, Cochrane methods guidance on machine learning classifiers, and university misconduct policies all push toward explicit reporting of when AI was used and how errors were controlled.

Students should separate 'productivity assistance' from 'methodological judgement'. The former can save time; the latter determines whether your review or appraisal is scientifically defensible.

Search & discovery – synonym expansion, database translation, grey-literature suggestions
Screening – active learning, semantic similarity, duplicate detection
Extraction – table parsing, outcome identification, effect-size drafting
Appraisal – study-type routing, domain scoring, narrative synthesis drafts
Writing – grammar, structure, plain-language summaries after your analysis is complete

2. AI for literature screening

Systematic review searches in clinical medicine often return 5,000–50,000 records after deduplication. Manual dual screening of every record is the rigorous default but slow. Machine learning classifiers trained on a subset of human decisions can rank abstracts by predicted relevance, letting reviewers focus first on likely includes.

Cochrane and Campbell collaborations have published methods for integrating active learning into review workflows. The classifier is retrained as reviewers label more records, improving prioritisation over time. This reduces workload; it does not eliminate the need for two independent humans to make final inclusion decisions in rigorous reviews.

Validation matters. Report the tool (e.g. Rayyan AI, ASReview, custom model), training set size, performance metrics (sensitivity, specificity), and how many records were screened manually versus deprioritised. A classifier trained on one clinical topic may fail on another without retraining.

False negatives are the critical risk: a relevant study buried in the 'unlikely' pile may never be seen. Conservative workflows keep human review of borderline and high-sensitivity slices, or use AI only to order records rather than exclude them unseen.
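The metrics named above can be computed from a validation sample in which humans screened every record regardless of the classifier's ranking. The following is a minimal sketch, not any tool's built-in report: the class name `ScreeningValidation` and all counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ScreeningValidation:
    """Counts from a validation sample where humans screened every record."""
    true_pos: int   # relevant, ranked by the classifier as a likely include
    false_neg: int  # relevant, but deprioritised by the classifier
    true_neg: int   # irrelevant, deprioritised
    false_pos: int  # irrelevant, ranked as a likely include

    def sensitivity(self) -> float:
        # Share of truly relevant records the classifier ranked as likely includes.
        relevant = self.true_pos + self.false_neg
        return self.true_pos / relevant if relevant else float("nan")

    def specificity(self) -> float:
        # Share of truly irrelevant records the classifier deprioritised.
        irrelevant = self.true_neg + self.false_pos
        return self.true_neg / irrelevant if irrelevant else float("nan")


# Hypothetical validation sample of 500 fully human-screened records.
sample = ScreeningValidation(true_pos=38, false_neg=2, true_neg=400, false_pos=60)
print(f"sensitivity = {sample.sensitivity():.3f}")
print(f"specificity = {sample.specificity():.3f}")
print(f"relevant records the classifier would have buried: {sample.false_neg}")
```

In this made-up sample the two false negatives are the number to watch: they are the studies that would never have been seen had the 'unlikely' pile been excluded unread.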

Pilot 50–100 records to calibrate eligibility before scaling AI-assisted screening
Blind reviewers to authors and journals where feasible
Document disagreements and resolution – same standard as non-AI screening
Export PRISMA counts from your tool; numbers must reconcile with the manuscript
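One way to check that exported counts reconcile is a short arithmetic test. The sketch below assumes a deliberately simplified flow (identified, duplicates removed, screened, full text assessed, included); the key names and numbers are hypothetical, and a full PRISMA 2020 diagram has more boxes than this.

```python
def reconcile_prisma(counts: dict[str, int]) -> list[str]:
    """Return a list of arithmetic mismatches in a simplified PRISMA flow."""
    # Each tuple reads: start - removed should equal remaining.
    checks = [
        ("identified", "duplicates_removed", "screened"),
        ("screened", "excluded_title_abstract", "full_text_assessed"),
        ("full_text_assessed", "excluded_full_text", "included"),
    ]
    problems = []
    for start, removed, remaining in checks:
        expected = counts[start] - counts[removed]
        if expected != counts[remaining]:
            problems.append(
                f"{start} - {removed} = {expected}, but {remaining} = {counts[remaining]}"
            )
    return problems


# Hypothetical counts exported from a screening tool.
tool_export = {
    "identified": 6200,
    "duplicates_removed": 1200,
    "screened": 5000,
    "excluded_title_abstract": 4850,
    "full_text_assessed": 150,
    "excluded_full_text": 120,
    "included": 30,
}
# Hypothetical manuscript draft with one mistyped figure.
manuscript = dict(tool_export, included=31)

print(reconcile_prisma(tool_export))
print(reconcile_prisma(manuscript))
```

An empty list means the simplified flow adds up. Any message returned points to the box that needs checking against the screening log.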

Tip: See our literature screening guide for dual-review workflow and PRISMA flow integration.

3. AI for data extraction

Large language models can draft extraction tables from PDFs: sample sizes, effect estimates, confidence intervals, and subgroup labels. This is seductive because extraction is tedious and error-prone even for experienced reviewers.

Documented failure modes are serious. Models hallucinate numbers that do not appear in the paper, swap baseline and follow-up values, misread multi-panel table footnotes, and confuse standard errors with standard deviations. A single wrong hazard ratio propagates directly into a meta-analysis pooled estimate.

Best practice treats AI extraction as a first draft only. Two independent human extractors should still verify against the source PDF, with a third resolver for discrepancies – the same standard Cochrane expects without AI. Some teams use AI to highlight candidate table cells, not to populate final fields unsupervised.
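The dual-extraction check can be sketched as a field-by-field comparison that hands disagreements to the third resolver. This is an illustration only: the function name `find_discrepancies`, the field names, and the values are hypothetical and do not come from any real trial or tool.

```python
import math


def find_discrepancies(extractor_a: dict, extractor_b: dict, tolerance: float = 1e-9) -> list[str]:
    """Return the field names where two independent extractions differ."""
    flagged = []
    for field in sorted(set(extractor_a) | set(extractor_b)):
        a, b = extractor_a.get(field), extractor_b.get(field)
        # A field missing from either sheet is flagged as well as a numeric mismatch.
        if a is None or b is None or not math.isclose(a, b, abs_tol=tolerance):
            flagged.append(field)
    return flagged


# Hypothetical values typed by two human extractors from the same PDF.
reviewer_1 = {"n_total": 240, "hazard_ratio": 0.78, "ci_lower": 0.61, "ci_upper": 0.99}
reviewer_2 = {"n_total": 240, "hazard_ratio": 0.87, "ci_lower": 0.61, "ci_upper": 0.99}

print(find_discrepancies(reviewer_1, reviewer_2))
```

Here the transposed digits in the hazard ratio are flagged for the resolver. An AI draft could be compared against the verified sheet in the same way, but agreement between a draft and one reviewer is not a substitute for checking the source PDF.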

Report extraction AI in your methods: model name, version date, prompts or pipeline, validation sample, and error rate on a held-out set of papers. Supervisors and journals increasingly ask for this detail.

Note: Never cite an effect size from an AI extraction without manually confirming it in the paper.

4. Why generic chat fails at ROB 2 and AMSTAR 2

Critical appraisal requires three steps that generic chat tools skip: (1) identify the true study design, (2) map it to the correct official framework, and (3) work through signalling questions to produce auditable domain judgements. Chat models optimise for fluent prose, not reproducible algorithm-guided ratings.

Common errors include applying ROB 2 to observational cohort studies, conflating CONSORT reporting with ROB 2 risk of bias, inventing signalling questions that sound official but are not, and giving an overall 'moderate quality' gut score without domain-level reasoning.

Session variability is a reproducibility problem. The same PDF uploaded twice may yield different ROB domains flagged or different AMSTAR 2 item ratings. Supervisors, examiners, and journal clubs cannot verify a chat transcript the way they can verify a completed ROB 2 Excel tool or a StrataResearch structured export.
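Session variability can be made concrete by recording the domain judgements from two runs on the same paper and counting how many match. The sketch below is hypothetical throughout: the judgements are invented for illustration and `domain_agreement` is not part of any tool.

```python
def domain_agreement(run_1: dict[str, str], run_2: dict[str, str]) -> tuple[float, list[str]]:
    """Return the share of matching domains and the list of domains that differ."""
    domains = sorted(set(run_1) | set(run_2))
    differing = [d for d in domains if run_1.get(d) != run_2.get(d)]
    agreement = 1 - len(differing) / len(domains) if domains else float("nan")
    return agreement, differing


# Invented judgements from two chat sessions on the same PDF.
session_a = {
    "randomisation process": "low",
    "deviations from intended interventions": "some concerns",
    "missing outcome data": "low",
    "measurement of the outcome": "low",
    "selection of the reported result": "some concerns",
}
session_b = dict(
    session_a,
    **{"missing outcome data": "high", "selection of the reported result": "low"},
)

agreement, differing = domain_agreement(session_a, session_b)
print(f"agreement = {agreement:.0%}")
print("domains that changed between sessions:", differing)
```

Raw percentage agreement is a crude measure, but it is enough to show an examiner that two sessions did not produce the same appraisal.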

Framework-aligned appraisal tools encode routing rules explicitly: RCT → ROB 2, non-randomised intervention → ROBINS-I, systematic review → AMSTAR 2 + PRISMA + ROBIS, diagnostic accuracy → QUADAS-2. Outputs should be stable, exportable, and tied to named domains – not a single narrative paragraph.
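Explicit routing of this kind can be as simple as a lookup table that refuses to guess. The following sketch encodes only the four routes listed above. The key names and the function `route_study` are illustrative assumptions, not StrataResearch's actual implementation.

```python
# Routes copied from the paragraph above; the dictionary keys are hypothetical labels.
FRAMEWORK_ROUTES = {
    "rct": ["ROB 2"],
    "non_randomised_intervention": ["ROBINS-I"],
    "systematic_review": ["AMSTAR 2", "PRISMA", "ROBIS"],
    "diagnostic_accuracy": ["QUADAS-2"],
}


def route_study(study_type: str) -> list[str]:
    """Map a study design label to its appraisal framework(s), or fail loudly."""
    key = study_type.strip().lower().replace("-", "_").replace(" ", "_")
    if key not in FRAMEWORK_ROUTES:
        # An unknown design raises an error instead of defaulting to a framework.
        raise ValueError(f"No framework route defined for study type: {study_type!r}")
    return FRAMEWORK_ROUTES[key]


print(route_study("RCT"))
print(route_study("systematic review"))
```

The design choice that matters is the error branch: a design with no defined route stops the process, whereas a chat model will usually pick a framework and carry on.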

No stable study-type routing → wrong framework from the first step
Invented signalling questions → plausible language, non-defensible judgements
No saved audit trail → cannot defend in viva or journal club
Conflates summarisation ('what the authors did') with appraisal ('how much to trust it')
Ignores selective reporting checks against trial registries

Tip: See our detailed StrataResearch vs ChatGPT comparison for a feature table.

5. Hallucination, bias, and training-data limits

Generative models predict likely text, not verified facts. They may cite papers that do not exist, misattribute findings to the wrong trial, or smooth over contradictions between abstract and methods. In appraisal, this creates false confidence – a well-written but wrong bias assessment.

Training data lag means recent trials, corrections, and retractions may be mishandled. Models also reflect biases in the literature they were trained on: English-language dominance, over-representation of positive results, and under-representation of global health evidence.

Privacy and confidentiality matter for unpublished manuscripts, theses, and hospital quality-improvement reports. Pasting identifiable patient series or embargoed results into public chat tools may breach ethics or journal policy even when the tool feels 'private'.

Universities are updating academic integrity guidance: undisclosed AI authorship of appraisal sections may constitute misconduct. When AI assists your workflow, disclose it in methods or acknowledgements and keep your own verified worksheets.

6. Appropriate uses of AI in student projects

Appropriate: brainstorming PICO synonyms before librarian review; deduplicating EndNote libraries; prioritising screening queues inside dual human review; drafting plain-language summaries after you have completed manual appraisal; checking grammar and structure.

Appropriate with caution: structured extraction drafts verified by two humans; comparing your manual ROB 2 sheet against a framework-aligned tool to catch missed domains.

Inappropriate: submitting AI-generated ROB 2 or AMSTAR 2 tables without reading the paper; using chat output as your only appraisal document; letting AI decide final inclusion in a systematic review; citing AI-fabricated references.

Coursework markers reward transparent methodology. A short methods note – 'We used [tool] to prioritise screening; all includes were confirmed by two reviewers' – is stronger than hiding AI use or overclaiming automation.

Disclose AI assistance in dissertation methods
Keep primary PDFs and manual worksheets as source of truth
Never upload patient-identifiable full texts to public models
Prefer tools with versioned outputs over ephemeral chat sessions

7. What journals and guidelines expect

PRISMA 2020 and extensions emphasise reproducible search and selection. If AI changes who sees which records, that must be reported with enough detail for replication.

GRADE working group guidance stresses that certainty ratings must follow documented downgrades for risk of bias, inconsistency, and imprecision – not model-generated adjectives.
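The idea of documented downgrades can be illustrated with a small bookkeeping function. This is a teaching sketch under strong simplifying assumptions: it covers only the three domains named above, ignores upgrading, and treats the rating as arithmetic, whereas real GRADE ratings are structured judgements. The function name and example values are hypothetical.

```python
LEVELS = ["very low", "low", "moderate", "high"]


def grade_certainty(start: str, downgrades: dict[str, int]) -> str:
    """Apply recorded downgrades (0, 1 or 2 levels per domain) to a starting level."""
    for domain, steps in downgrades.items():
        if steps not in (0, 1, 2):
            raise ValueError(f"{domain}: downgrade must be 0, 1 or 2 levels, got {steps}")
    index = LEVELS.index(start) - sum(downgrades.values())
    # Certainty cannot fall below 'very low'.
    return LEVELS[max(0, index)]


# Hypothetical body of evidence from randomised trials, starting at 'high'.
documented = {"risk of bias": 1, "inconsistency": 0, "imprecision": 1}
print(grade_certainty("high", documented))
```

Each non-zero entry in `documented` should correspond to a written justification in the evidence profile. The function only makes visible which downgrades produced the final label.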

Medical journal editors (e.g. JAMA, BMJ, Lancet family) increasingly publish editorial policies on generative AI: typically allowing assisted writing with disclosure, but requiring human accountability for scientific content and data.

For student work, follow your university's research integrity code. When in doubt, ask your supervisor before using AI on inclusion, extraction, or appraisal judgements.

8. Framework-aligned AI appraisal

StrataResearch is purpose-built for manuscript appraisal rather than open-ended chat. Study-type detection routes PDFs to ROB 2, ROBINS-I, AMSTAR 2, PRISMA, QUADAS-2, or GRADE-aligned pathways. Outputs include per-domain judgements, limitations, and structured scores you can export for journal club slides or coursework appendices.

The design goal is reproducibility within a defined framework – not a generic essay about strengths and weaknesses. You still read the paper; the tool forces explicit domain coverage you might skip in a rushed manual worksheet.

Strata Lite offers a free signed-in abstract check (study type and domain signal) before committing tokens to full manuscript analysis. Quick analysis supports pay-per-use full PDF scoring without subscription.

Compare StrataResearch output to your manual ROB 2 or AMSTAR 2 worksheet on the same paper. Discrepancies are learning opportunities – they show where reporting was ambiguous or where you missed a domain.

Study-type routing reduces framework mismatch
Domain-level output supports supervision and group teaching
Exportable reports – not ephemeral chat threads
Complements official Cochrane tools; does not replace reading signalling questions

9. What supervisors and examiners expect

UK medical school supervisors increasingly ask where AI was used in student reviews and appraisals. Expect to show screening logs, extraction verification sheets, or framework worksheets – not chat transcripts.

Marks reward transparent disclosure: 'Rayyan AI prioritised records; two reviewers confirmed all includes' beats undocumented screening. Marks penalise AI-generated ROB 2 paragraphs with no domain evidence.

If your faculty provides AI policy, cite it in your methods. When policy is silent, default to disclosure and human verification for any step that affects conclusions.

Viva questions may include: 'Show me where you verified this hazard ratio' or 'Which ROB 2 domain drove your overall judgement?' AI cannot answer these for you; your worksheet must.

Keep PDFs and manual worksheets as source of truth
Disclose tool name, version, and role in methods
Never submit chat output as appendix without verification
Compare AI draft to framework tool for learning, not shortcutting

10. Practical checklist before using AI on a review

Before adopting any AI step, ask: Could an error here change which studies enter the synthesis or how bias is rated? If yes, keep humans in the loop with verification and conflict resolution.

Document the tool name, version, date, and exact role (prioritisation vs final decision). Run a small validation sample and record error types.
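A small structured record keeps these details together and can generate the methods sentence directly. The sketch below is one possible layout: the class `AIUseRecord`, its fields, and the tool name 'ExampleScreener' are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class AIUseRecord:
    """One entry in an AI-use log for a review (hypothetical structure)."""
    tool: str
    version: str
    used_on: date
    role: str  # e.g. "prioritisation only" vs "final decision"
    validation_sample_size: int
    error_types: tuple[str, ...] = ()

    def methods_sentence(self) -> str:
        errors = ", ".join(self.error_types) if self.error_types else "none recorded"
        return (
            f"{self.tool} (version {self.version}, used {self.used_on.isoformat()}) "
            f"was used for {self.role}. A validation sample of "
            f"{self.validation_sample_size} records was checked by hand; "
            f"error types found: {errors}."
        )


# Hypothetical entry; replace every value with what you actually did.
record = AIUseRecord(
    tool="ExampleScreener",
    version="1.2.0",
    used_on=date(2025, 1, 15),
    role="prioritisation only",
    validation_sample_size=100,
    error_types=("missed eligible abstract",),
)
print(record.methods_sentence())
```

Freezing the dataclass stops the record from being edited after the fact, which suits an audit trail.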

Ensure PRISMA counts still reconcile after AI-assisted screening. Ensure extracted numbers match PDFs before meta-analysis.

For appraisal, prefer framework-structured tools over chat for marked coursework. Keep evidence (PDFs, worksheets, exports) in case your marker asks how you reached each judgement.

11. Task-by-task: AI fit vs human requirement

Not every workflow step benefits equally from AI. Use this framing in dissertation methods and journal club when colleagues overclaim 'AI did our review'.

Task	AI role	Human requirement
Search synonyms	Brainstorm MeSH candidates	Librarian validates strategy
Deduplication	Automated matching	Spot-check false merges
Title/abstract screening	Prioritisation / active learning	Dual independent final decisions
Data extraction	Draft table cells	Dual verification against PDF
ROB 2 / AMSTAR 2	Framework tool draft signal	Domain judgements with audit trail
Manuscript write-up	Grammar and structure	Scientific content accountable to authors

12. Worked example: ChatGPT vs structured appraisal

This exercise takes 45 minutes and is ideal for SSC evidence synthesis coursework. The learning outcome is not 'AI is useless' – it is 'high-stakes appraisal steps need citable, stable frameworks'.

Frequently asked questions

Can I use ChatGPT for ROB 2 in my dissertation?

Only as an unofficial draft you verify against the official ROB 2 tool and the paper itself. Submit your completed ROB 2 worksheet or framework-aligned export – not raw chat text. Disclose AI use in methods if faculty policy requires it.

Is Rayyan AI the same as letting ChatGPT screen papers?

No. Rayyan AI uses active learning within a screening platform with audit logs. You still make inclusion decisions. Abstracts pasted into ChatGPT leave no reproducible screening record and cannot be reconciled with PRISMA counts.

Do journals accept AI-assisted systematic reviews?

Many allow AI for prioritisation and drafting with disclosure. Final inclusion, extraction verification, and risk-of-bias judgements remain author-accountable. Check your target journal's AI policy.

Can AI replace dual screening?

No, not for rigorous systematic reviews. AMSTAR 2 asks whether study selection was performed in duplicate (item 5), and a review that skips it is marked down on that item. AI may prioritise order, not replace independent reviewers.

How does StrataResearch differ from ChatGPT for appraisal?

StrataResearch routes study type to official frameworks and exports domain-level judgements. ChatGPT produces variable narrative without stable routing or audit trails. You still read the paper – the tool structures coverage.
