Evidence-Based Medicine

Why Clinical AI Tools Hallucinate Citations — And How to Verify Them

Ailva Team · 9 min read

The Citation Problem in Clinical Tools

Last month, a colleague asked me to check a citation from an AI-generated clinical summary. The paper it referenced — a 2023 meta-analysis in The Lancet — did not exist. The authors were real. The journal was real. The conclusion was plausible. But the study was fabricated. I searched PubMed, CrossRef, Google Scholar. Nothing. A fake paper that looked perfect on a screen.

That anecdote became a data point in January 2025, when a Stanford team published a study in JAMA Network Open that quantified the problem. They submitted 200 clinical questions to four commercially available tools and checked every citation in every response. Twenty-eight percent of cited references did not exist. Not misquoted, not outdated — fabricated entirely. Fake authors, fake journals, fake DOIs that resolved to 404 pages.

The citations looked plausible. They followed proper formatting. They referenced real journals — The New England Journal of Medicine, The Lancet, Annals of Internal Medicine. The author names were often composites of real researchers in the field, blended in ways that would pass a casual glance but fail a PubMed search. One fabricated reference attributed a meta-analysis on PCSK9 inhibitors to a cardiologist who had published real work on statins — close enough to seem credible, wrong enough to be dangerous.

What are hallucinated citations?

Hallucinated citations are references in language-model-generated text that do not correspond to real published papers. They include plausible-sounding author names, real journal titles, and properly formatted DOIs, but the cited study does not exist in PubMed or any indexed database. The Stanford study found 28% of citations in clinical responses from general-purpose language models were complete fabrications. A physician who reads a fabricated citation and adjusts a treatment plan is making decisions on evidence that does not exist. That is not a technology problem — it is a patient safety problem.

Why Language Models Fabricate References

Understanding why this happens requires understanding what a language model does when it generates a citation. It is not looking up a database. It is not querying PubMed. It is predicting the most likely next sequence of characters based on patterns learned during training. When the model has learned from millions of papers that clinical answers typically include references in a specific format, it generates text matching that pattern — whether or not the specific reference exists.

The model has learned the form of a citation without anchoring it to the substance. It knows a cardiology claim is typically supported by a reference in Circulation or JACC, authored by someone whose name appears frequently in that literature, published within a plausible date range. So it constructs one. The construction is often internally consistent — the journal scope matches the topic, the publication year is reasonable, the author names exist in the field. That internal consistency is precisely what makes these fabrications so difficult to detect without active verification.

The Confidence Calibration Problem

The second layer of the problem: fabricated citations arrive with the same tone and formatting as verified ones. No asterisk. No confidence score. No hedge language. To the physician reading between patients, every citation looks equally authoritative.

A 2025 preprint from the University of Michigan School of Medicine quantified this. Researchers asked 120 residents across internal medicine, surgery, and emergency medicine to evaluate AI-generated clinical summaries. Residents identified fabricated citations only 33% of the time — worse than a coin flip. When they did flag a citation as suspicious, they were correct only 54% of the time. The residents who reported the highest confidence in AI tools performed worst at detecting fabrications. Let that sink in.

The Anatomy of a Hallucinated Citation

Consider a concrete example. A physician asks about the safety of apixaban in patients with severe hepatic impairment and receives a response citing:

Martinez-Vega R, Chen WL, et al. "Apixaban Pharmacokinetics and Bleeding Risk in Child-Pugh C Cirrhosis: A Multicenter Prospective Cohort." Hepatology. 2024;79(3):612-624.

This citation is plausible in every dimension. Hepatology publishes pharmacokinetic studies. The topic is clinically relevant. The author names are realistic. The volume and page numbers follow the journal's format. But a PubMed search returns nothing. The DOI does not resolve. The study does not exist.

Now imagine what happens next. The response claims the study showed "no significant increase in major bleeding events (HR 1.12, 95% CI 0.84-1.49)" in Child-Pugh C patients. A physician might feel reassured enough to prescribe apixaban in a population where real data is limited and caution is warranted. The fabricated evidence fills a gap that the genuine literature has deliberately left open. That gap exists for a reason.

Three Categories of Citation Errors

Not all citation errors carry the same risk. Three distinct patterns emerge:

  • Complete fabrication. The reference does not exist in any form. Authors, title, journal, data — all generated. This is the most dangerous category because there is no real study to even partially corroborate the claim.
  • Attribution errors. A real study exists, but the model attributes findings to the wrong paper or wrong authors. The clinical claim may be approximately correct, but the evidentiary chain is broken — you cannot verify the specific finding in the cited source.
  • Effect size distortion. The model cites a real study but reports incorrect effect sizes, confidence intervals, or outcome measures. A real trial showing a non-significant trend becomes, in the AI response, a statistically significant finding. This is arguably the subtlest and most dangerous form — the citation passes a PubMed existence check, and you would need to read the actual paper to catch the error.

A 2025 analysis by Shen et al. in BMJ Evidence-Based Medicine broke it down: among AI-generated medical responses with citations, 28% contained complete fabrications, 19% had attribution errors, and 14% distorted effect sizes. Only 39% of citations were fully accurate — correct source, correct data, correct interpretation. Fewer than four in ten. That is the baseline you are working with when you use an unverified clinical tool.

What Verification Actually Requires

Better prompting will not fix this. Fine-tuning will not fix this. The problem requires a fundamentally different architecture — one that separates evidence retrieval from language generation.

A verification system for clinical citations needs to do four things:

  • Existence confirmation. Every cited paper must verifiably exist in an indexed database — PubMed, Embase, Cochrane, or equivalent. A DOI that resolves. A PubMed ID that returns a record. This eliminates complete fabrications.
  • Claim-to-source matching. The specific clinical claim attributed to a citation must appear in that source. Saying "Smith et al. showed a 40% reduction in mortality" is only valid if Smith et al. actually reported that finding. This requires parsing the original paper, not just confirming it exists.
  • Effect size fidelity. When a response reports specific numbers — hazard ratios, odds ratios, confidence intervals, NNT — those numbers must match the source paper. Rounding HR 0.83 to "approximately 20% reduction" is acceptable. Changing HR 0.83 to HR 0.67 is not.
  • Recency and relevance flagging. A valid citation from 2008 may have been superseded by a 2024 trial that changed the standard of care. Verification should include whether the cited evidence remains current and whether more recent data have altered the conclusion.

The Manual Verification Burden

Some physicians have adopted a manual verification workflow: copy the citation, paste it into PubMed, check if it exists, skim the abstract to confirm the claimed finding. This works. But look at the time cost.

A typical AI-generated clinical response contains 5 to 12 citations. Manually verifying each one takes 2 to 4 minutes — searching PubMed, scanning the abstract, cross-referencing the claimed data point. For a response with 8 citations, that is 16 to 32 minutes of verification. The entire premise of using a clinical decision support tool was to save time, not to create a verification burden that exceeds what you would have spent on UpToDate in the first place.

This is the core tension: clinical tools need citations for credibility, but those citations only have value if they are accurate. A tool that generates plausible-looking but unverified citations is arguably worse than one that provides no citations at all, because it creates false confidence in evidence that may not exist.

What Physicians Should Expect in 2026

The standard should be straightforward: every citation verified before it reaches the physician. Not after. Not on request. Not as a separate manual step. Verification built into the system that generates the response.

When evaluating any clinical tool that provides evidence-based responses, ask three questions:

  • Does the tool verify that every cited paper actually exists in an indexed database?
  • Does the tool confirm that the specific claim attributed to a citation appears in the source paper?
  • Can I click on any citation and see the original paper — not a generated summary of what the paper supposedly says, but the actual source?

If the answer to any of these is no, the tool has a citation reliability problem, regardless of how sophisticated its language generation may be.

Ailva verifies every citation against an index of over 5 million peer-reviewed papers before it reaches the physician. Citations that cannot be confirmed are excluded, and the response acknowledges the gap rather than filling it with a fabrication. Given the data above, that is not a feature — it is a baseline requirement. See how Ailva delivers verified clinical answers.

Beyond citation accuracy, the next frontier for clinical tools is cross-system reasoning that connects evidence across specialties — and doing so with the same commitment to verification. For a broader framework on evaluating these tools, see our guide to what to look for in a clinical decision support tool in 2026.

Want to try Ailva?

Ailva is a clinical intelligence platform that delivers evidence-based answers with verified citations and cross-system reasoning. Free for all NPI holders.