Strata Academy

AI and research appraisal – what works, what does not

AI screening, extraction, and chat tools vs framework-aligned critical appraisal (ROB 2, AMSTAR 2, GRADE)

Quick answer

AI can help with search brainstorming, deduplication, and screening prioritisation at scale, but generic chat tools cannot reproducibly apply official appraisal frameworks (ROB 2, AMSTAR 2, GRADE). High-stakes inclusion, risk-of-bias, and certainty judgements still require accountable human reviewers using citable methods.

1. Where AI appears in research workflows

Artificial intelligence now touches almost every stage of evidence synthesis and critical appraisal. Tools appear in search term generation, reference deduplication, title/abstract prioritisation, full-text retrieval assistance, structured data extraction, plain-language summarisation, and manuscript scoring.

Capabilities differ sharply by task. Low-risk uses – suggesting MeSH synonyms or formatting citations – tolerate occasional errors. High-stakes steps – deciding which studies enter a meta-analysis, judging ROB 2 domains, or assigning GRADE certainty – require accountable human reviewers who can defend decisions in a viva, journal club, or guideline panel.

Reporting guidelines, publishers, and universities increasingly expect transparency. PRISMA 2020 and its search-reporting extension (PRISMA-S), Cochrane methods guidance on machine learning classifiers, and university misconduct policies all push toward explicit reporting of when AI was used and how errors were controlled.

Students should separate 'productivity assistance' from 'methodological judgement'. The former can save time; the latter determines whether your review or appraisal is scientifically defensible.

2. AI for literature screening

Systematic review searches in clinical medicine often return 5,000–50,000 records after deduplication. Manual dual screening of every record is the most rigorous approach, but it is slow. Machine learning classifiers trained on a subset of human decisions can rank abstracts by predicted relevance, letting reviewers focus first on likely includes.

Cochrane and Campbell collaborations have published methods for integrating active learning into review workflows. The classifier is retrained as reviewers label more records, improving prioritisation over time. This reduces workload; it does not eliminate the need for two independent humans to make final inclusion decisions in rigorous reviews.

Validation matters. Report the tool (e.g. Rayyan AI, ASReview, custom model), training set size, performance metrics (sensitivity, specificity), and how many records were screened manually versus deprioritised. A classifier trained on one clinical topic may fail on another without retraining.
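The performance metrics named above can be computed from a validation sample in which humans have labelled every record. The following is a minimal sketch only: the function name, field names, and the eight example labels are hypothetical and do not come from Rayyan, ASReview, or any other tool.

```python
def screening_metrics(human_include, model_include):
    """Compare classifier predictions with human decisions on a validation sample.

    Both arguments are lists of booleans, one entry per record, in the same order.
    """
    if len(human_include) != len(model_include):
        raise ValueError("Both lists must cover the same records")

    pairs = list(zip(human_include, model_include))
    tp = sum(1 for h, m in pairs if h and m)          # relevant, ranked as relevant
    fn = sum(1 for h, m in pairs if h and not m)      # relevant, missed by the model
    tn = sum(1 for h, m in pairs if not h and not m)  # irrelevant, ranked as irrelevant
    fp = sum(1 for h, m in pairs if not h and m)      # irrelevant, ranked as relevant

    return {
        "tp": tp, "fn": fn, "tn": tn, "fp": fp,
        # None when the sample contains no relevant (or no irrelevant) records
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Hypothetical validation sample of eight records
human = [True, True, False, False, False, True, False, False]
model = [True, False, False, True, False, True, False, False]

result = screening_metrics(human, model)
print(f"sensitivity={result['sensitivity']:.2f}, specificity={result['specificity']:.2f}")
# prints: sensitivity=0.67, specificity=0.80
```

In screening, the false-negative count (`fn`) deserves the most attention, because each one is a relevant study the classifier would have pushed down the queue.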

False negatives are the critical risk: a relevant study buried in the 'unlikely' pile may never be seen. Conservative workflows keep human review of borderline-scored records, set thresholds that favour sensitivity over workload savings, or use AI only to order records rather than exclude them unseen.
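The 'order, do not exclude' approach can be sketched as a sort that returns every record. This is an illustration under assumptions: the `score` field stands in for whatever relevance score a classifier produces, and the record IDs and values are invented.

```python
def prioritise(records):
    """Return every record, highest predicted relevance first; nothing is dropped."""
    ordered = sorted(records, key=lambda r: r["score"], reverse=True)
    if len(ordered) != len(records):
        # Guard against a later change that accidentally filters records out
        raise RuntimeError("Prioritisation must not change the number of records")
    return ordered


# Hypothetical classifier scores for four records
records = [
    {"id": "rec-001", "score": 0.12},
    {"id": "rec-002", "score": 0.91},
    {"id": "rec-003", "score": 0.47},
    {"id": "rec-004", "score": 0.05},
]

queue = prioritise(records)
print([r["id"] for r in queue])
# prints: ['rec-002', 'rec-003', 'rec-001', 'rec-004']
```

Low-scoring records stay in the queue, so reviewers still reach them; the classifier changes only the order in which they are seen.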

3. AI for data extraction

Large language models can draft extraction tables from PDFs: sample sizes, effect estimates, confidence intervals, and subgroup labels. This is seductive because extraction is tedious and error-prone even for experienced reviewers.

Documented failure modes are serious. Models hallucinate numbers that do not appear in the paper, swap baseline and follow-up values, misread multi-panel table footnotes, and confuse standard errors with standard deviations. A single wrong hazard ratio propagates directly into a meta-analysis pooled estimate.

Best practice treats AI extraction as a first draft only. Two independent human extractors should still verify against the source PDF, with a third resolver for discrepancies – the same standard Cochrane expects without AI. Some teams use AI to highlight candidate table cells, not to populate final fields unsupervised.
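The dual-extraction check described above can be sketched as a field-by-field comparison that hands disagreements to the third resolver. The field names and values below are hypothetical, and the numeric tolerance is an assumption you would set for your own review.

```python
import math


def find_discrepancies(extractor_a, extractor_b, tolerance=1e-9):
    """Return the fields on which two independent extractions disagree.

    Each argument maps a field name to the value one reviewer recorded.
    A field missing from one extraction is reported as a discrepancy.
    """
    discrepancies = {}
    for field in sorted(set(extractor_a) | set(extractor_b)):
        a = extractor_a.get(field)
        b = extractor_b.get(field)
        both_numeric = isinstance(a, (int, float)) and isinstance(b, (int, float))
        if both_numeric:
            same = math.isclose(a, b, abs_tol=tolerance)
        else:
            same = a == b
        if not same:
            discrepancies[field] = (a, b)
    return discrepancies


# Hypothetical extractions of one trial by two reviewers
reviewer_1 = {"n_randomised": 240, "hazard_ratio": 0.82, "ci_lower": 0.70, "ci_upper": 0.96}
reviewer_2 = {"n_randomised": 240, "hazard_ratio": 0.28, "ci_lower": 0.70, "ci_upper": 0.96}

to_resolve = find_discrepancies(reviewer_1, reviewer_2)
print(to_resolve)
# prints: {'hazard_ratio': (0.82, 0.28)}
```

Agreement between two extractors does not prove the value is right, so every field still needs checking against the source PDF. If an AI draft is used, treat it as an input to one reviewer, not as one of the two independent extractions.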

Report extraction AI in your methods: model name, version date, prompts or pipeline, validation sample, and error rate on a held-out set of papers. Supervisors and journals increasingly ask for this detail.

4. Why generic chat fails at ROB 2 and AMSTAR 2

Critical appraisal requires three steps that generic chat tools tend to skip: (1) identify the true study design, (2) map it to the correct official framework, and (3) work through signalling questions to produce auditable domain judgements. Chat models optimise for fluent prose, not reproducible algorithm-guided ratings.

Common errors include applying ROB 2 to observational cohort studies, conflating CONSORT reporting with ROB 2 risk of bias, inventing signalling questions that sound official but are not, and giving an overall 'moderate quality' gut score without domain-level reasoning.

Session variability is a reproducibility problem. The same PDF uploaded twice may yield different ROB 2 domains flagged or different AMSTAR 2 item ratings. Supervisors, examiners, and journal clubs cannot verify a chat transcript the way they can verify a completed ROB 2 Excel tool or a StrataResearch structured export.

Framework-aligned appraisal tools encode routing rules explicitly: RCT → ROB 2, non-randomised intervention → ROBINS-I, systematic review → AMSTAR 2 + PRISMA + ROBIS, diagnostic accuracy → QUADAS-2. Outputs should be stable, exportable, and tied to named domains – not a single narrative paragraph.
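Explicit routing of this kind can be sketched as a lookup table that refuses to guess when the design is unknown. This is an illustration of the idea only, not StrataResearch's implementation; the dictionary keys and function name are assumptions.

```python
# Study design -> official frameworks, as listed in the paragraph above
FRAMEWORK_ROUTES = {
    "rct": ["ROB 2"],
    "non-randomised intervention": ["ROBINS-I"],
    "systematic review": ["AMSTAR 2", "PRISMA", "ROBIS"],
    "diagnostic accuracy": ["QUADAS-2"],
}


def route_study(design):
    """Return the frameworks for a study design, or fail loudly if none is defined."""
    key = design.strip().lower()
    if key not in FRAMEWORK_ROUTES:
        # No silent fallback: an unrecognised design needs a human decision
        raise ValueError(f"No routing rule for study design: {design!r}")
    return list(FRAMEWORK_ROUTES[key])


print(route_study("RCT"))
# prints: ['ROB 2']
print(route_study("Systematic review"))
# prints: ['AMSTAR 2', 'PRISMA', 'ROBIS']
```

Raising an error for an unlisted design is deliberate. A chat tool will usually produce some appraisal for a cohort study when asked for ROB 2, whereas a rule table makes the mismatch visible.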

5. Hallucination, bias, and training-data limits

Generative models predict likely text, not verified facts. They may cite papers that do not exist, misattribute findings to the wrong trial, or smooth over contradictions between abstract and methods. In appraisal, this creates false confidence – a well-written but wrong bias assessment.

Training data lag means recent trials, corrections, and retractions may be mishandled. Models also reflect biases in the literature they were trained on: English-language dominance, over-representation of positive results, and under-representation of global health evidence.

Privacy and confidentiality matter for unpublished manuscripts, theses, and hospital quality-improvement reports. Pasting identifiable patient series or embargoed results into public chat tools may breach ethics or journal policy even when the tool feels 'private'.

Universities are updating academic integrity guidance: undisclosed AI authorship of appraisal sections may constitute misconduct. When AI assists your workflow, disclose it in methods or acknowledgements and keep your own verified worksheets.

6. Appropriate uses of AI in student projects

Appropriate: brainstorming PICO synonyms before librarian review; deduplicating EndNote libraries; prioritising screening queues inside dual human review; drafting plain-language summaries after you have completed manual appraisal; checking grammar and structure.

Appropriate with caution: structured extraction drafts verified by two humans; comparing your manual ROB 2 sheet against a framework-aligned tool to catch missed domains.

Inappropriate: submitting AI-generated ROB 2 or AMSTAR 2 tables without reading the paper; using chat output as your only appraisal document; letting AI decide final inclusion in a systematic review; citing AI-fabricated references.

Coursework markers reward transparent methodology. A short methods note – 'We used [tool] to prioritise screening; all includes were confirmed by two reviewers' – is stronger than hiding AI use or overclaiming automation.

7. What journals and guidelines expect

PRISMA 2020 and extensions emphasise reproducible search and selection. If AI changes who sees which records, that must be reported with enough detail for replication.

GRADE Working Group guidance stresses that certainty ratings must follow documented downgrades across five domains – risk of bias, inconsistency, indirectness, imprecision, and publication bias – not model-generated adjectives.
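The logic of a documented downgrade can be sketched as simple arithmetic over the five GRADE domains (risk of bias, inconsistency, indirectness, imprecision, publication bias). This sketch assumes the conventional starting points (high for randomised trials, low for observational studies) and leaves out the upgrading criteria; the function and variable names are hypothetical, and the real judgement of how many levels to downgrade remains a human decision.

```python
LEVELS = ["very low", "low", "moderate", "high"]
DOMAINS = (
    "risk of bias",
    "inconsistency",
    "indirectness",
    "imprecision",
    "publication bias",
)


def grade_certainty(randomised, downgrades):
    """Apply documented downgrades to a starting certainty level.

    downgrades maps a GRADE domain to 0, 1, or 2 levels; unlisted domains count as 0.
    """
    for domain, steps in downgrades.items():
        if domain not in DOMAINS:
            raise ValueError(f"Unknown GRADE domain: {domain!r}")
        if steps not in (0, 1, 2):
            raise ValueError(f"Downgrade for {domain!r} must be 0, 1, or 2")

    start = 3 if randomised else 1  # index into LEVELS: high or low
    index = max(0, start - sum(downgrades.values()))  # cannot fall below "very low"
    return LEVELS[index]


# Hypothetical outcome from randomised trials, downgraded once in two domains
rating = grade_certainty(True, {"risk of bias": 1, "imprecision": 1})
print(rating)
# prints: low
```

Every rating produced this way is traceable to named domains, which is the property a free-text adjective such as 'moderate quality' lacks.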

Medical journal editors (e.g. JAMA, BMJ, Lancet family) increasingly publish editorial policies on generative AI: typically allowing assisted writing with disclosure, but requiring human accountability for scientific content and data.

For student work, follow your university's research integrity code. When in doubt, ask your supervisor before using AI on inclusion, extraction, or appraisal judgements.

8. Framework-aligned AI appraisal

StrataResearch is purpose-built for manuscript appraisal rather than open-ended chat. Study-type detection routes PDFs to ROB 2, ROBINS-I, AMSTAR 2, PRISMA, QUADAS-2, or GRADE-aligned pathways. Outputs include per-domain judgements, limitations, and structured scores you can export for journal club slides or coursework appendices.

The design goal is reproducibility within a defined framework – not a generic essay about strengths and weaknesses. You still read the paper; the tool forces explicit domain coverage you might skip in a rushed manual worksheet.

Strata Lite offers a free signed-in abstract check (study type and domain signal) before committing tokens to full manuscript analysis. Quick analysis supports pay-per-use full PDF scoring without a subscription.

Compare StrataResearch output to your manual ROB 2 or AMSTAR 2 worksheet on the same paper. Discrepancies are learning opportunities – they show where reporting was ambiguous or where you missed a domain.

9. Practical checklist before using AI on a review

Before adopting any AI step, ask: Could an error here change which studies enter the synthesis or how bias is rated? If yes, keep humans in the loop with verification and conflict resolution.

Document the tool name, version, date, and exact role (prioritisation vs final decision). Run a small validation sample and record error types.

Ensure PRISMA counts still reconcile after AI-assisted screening. Ensure extracted numbers match PDFs before meta-analysis.
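A reconciliation check can be sketched as a few subtractions along the flow diagram. This is a simplified version with assumed field names; it ignores records from other sources and reports not retrieved, which a full PRISMA 2020 flow diagram also tracks, and the counts are invented.

```python
def check_prisma_flow(counts):
    """Return a list of inconsistencies in a simplified PRISMA flow; empty means it reconciles."""
    problems = []

    expected_screened = counts["identified"] - counts["duplicates_removed"]
    if counts["screened"] != expected_screened:
        problems.append(
            f"screened is {counts['screened']}, expected {expected_screened}"
        )

    expected_full_text = counts["screened"] - counts["excluded_at_screening"]
    if counts["full_text_assessed"] != expected_full_text:
        problems.append(
            f"full_text_assessed is {counts['full_text_assessed']}, expected {expected_full_text}"
        )

    expected_included = counts["full_text_assessed"] - counts["excluded_at_full_text"]
    if counts["included"] != expected_included:
        problems.append(
            f"included is {counts['included']}, expected {expected_included}"
        )

    return problems


# Hypothetical counts for one review
flow = {
    "identified": 5200,
    "duplicates_removed": 1200,
    "screened": 4000,
    "excluded_at_screening": 3850,
    "full_text_assessed": 150,
    "excluded_at_full_text": 118,
    "included": 32,
}

print(check_prisma_flow(flow))
# prints: []
```

If AI prioritisation meant some records were never screened by a human, they need their own count in the flow; folding them into 'excluded at screening' would make the numbers reconcile while hiding what happened.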

For appraisal, prefer framework-structured tools over chat for marked coursework. Keep evidence (PDFs, worksheets, exports) in case your marker asks how you reached each judgement.
