The 2026 AI Citation Hallucination Benchmark: ChatGPT vs Claude vs Perplexity vs Elicit
A cross-tool benchmark of citation fabrication rates across ChatGPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Perplexity, Elicit, and Consensus. Preliminary results.
AI citation fabrication is now the single most discussed integrity issue in academic AI policy. Every university writing center in 2026 has at least one case study in its handbook where a student submitted a reference list that looked immaculate and was, on inspection, at least half invented. We have spent the last six months designing a cross-tool benchmark that tries to put numbers on a problem that has been described mostly through anecdote.
This post is a preview of that work. The methodology below is what we will preregister. The numbers below are from an internal pilot of 500 prompts and a partial scoring pass. We are publishing now because the academic community has been asking for comparative data for eighteen months and because the pilot results are already stable enough to guide tool selection. The full peer-reviewed release, with open data, full statistical analysis, and external scoring, is targeted for late 2026.
Note
Preliminary results. The full methodology is pending OSF preregistration and the complete dataset is pending Zenodo release under CC-BY-4.0. All per-tool numbers in this post are based on an internal pilot of 500 prompts scored by our team. They are directionally reliable but should not be cited as final until the peer-reviewed release. CiteDash numbers in particular should be read as illustrative of the retrieval-first architecture rather than as a defended measurement.
Why We Ran This Benchmark
Existing studies of AI citation fabrication have two recurring limitations. First, most studies test only one or two models, usually a single GPT variant, which makes cross-tool comparison impossible. Second, most studies use narrow prompt sets, often in a single discipline, which makes the results hard to generalise. The published range for ChatGPT fabrication rates runs from roughly 30% to over 70% depending on who tested it, what they asked, and when; a spread that wide tells librarians and policy committees almost nothing about what to actually ban or permit.
The three questions we wanted to answer were simple, and all three matter for policy:
- How do the leading general-purpose chatbots compare to each other when asked the same academic queries?
- How do purpose-built academic tools compare to general-purpose chatbots on the same set?
- Does the gap hold up across disciplines, or is it a STEM-versus-humanities story?
The point of the benchmark is not to crown a winner. The point is to produce a reproducible measurement that a writing center director can cite when writing an AI-use policy, or that a librarian can hand to a faculty committee when asked to recommend tools.
Methodology
The full protocol runs to twelve pages; what follows is the summary that non-specialists need.
Query set
We constructed 500 academic research queries covering ten disciplines, 50 queries each: biomedical sciences, physical sciences, computer science and engineering, psychology, education, sociology, economics, history, literature and literary studies, and philosophy. Within each discipline the queries were split into five categories:
- Literature scoping ("What does recent research say about X?")
- Specific claim support ("Provide three citations supporting the claim that X causes Y.")
- Methodology references ("What is the standard reference for methodology X?")
- Obscure topic queries (deliberately narrow sub-field questions)
- Contested topic queries (questions with genuine scholarly disagreement)
Query construction was done by two of our team members with graduate research backgrounds and independently reviewed by a third. The full query list will be published under CC-BY-4.0.
Tools tested
Six AI tools, all tested in the same two-week window in March 2026 using the then-current default configuration of each:
- ChatGPT-4o (OpenAI, default configuration, no browsing unless volunteered)
- Claude 3.5 Sonnet (Anthropic, default configuration)
- Gemini 1.5 Pro (Google, default configuration with Search enabled)
- Perplexity Pro (Academic focus mode)
- Elicit (default Literature Review workflow)
- Consensus (default search mode)
CiteDash was included as a seventh arm for internal reference, but we flag this in every relevant table. Self-benchmarking is not credible without external replication, which is exactly why the full release will be externally scored.
Scoring procedure
Each tool's response to each query was collected as raw output. The scoring team then extracted every citation in the output and applied a three-step verification pipeline.
Step one: DOI resolution. Every DOI in the output was resolved through doi.org and cross-checked against the CrossRef REST API. A DOI that failed to resolve, or that resolved to a paper whose metadata did not match the citation, was marked as a hallucination at step one.
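For readers who want to run the same check on their own reference lists, a minimal step-one sketch looks like the following. It is illustrative rather than our production scorer: the function name is ours, non-CrossRef DOIs (DataCite, for instance) need a separate lookup, and the real rubric uses fuzzier title matching than the crude normalisation here.

```python
import requests

def doi_checks_out(doi: str, cited_title: str) -> bool:
    """Step-one sketch: does the DOI resolve, and does its CrossRef record
    look like the citation? Borderline cases went to manual adjudication."""
    # doi.org answers resolvable DOIs with a redirect to the publisher page.
    handle = requests.head(f"https://doi.org/{doi}", allow_redirects=False, timeout=10)
    if handle.status_code not in (301, 302, 303):
        return False
    # Cross-check the metadata against the CrossRef REST API.
    meta = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if meta.status_code != 200:
        return False
    titles = meta.json()["message"].get("title") or [""]
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())
    return norm(cited_title) in norm(titles[0]) or norm(titles[0]) in norm(cited_title)
```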
Step two: metadata lookup. For citations without DOIs, or where the DOI check was inconclusive, we queried the Semantic Scholar Graph API and the OpenAlex API for the exact title. A title that returned no hit across either database, and that was not findable on Google Scholar with a relaxed query, was marked as a hallucination.
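Step two is equally scriptable. Again a sketch under the same caveats: the helper name is ours, rate limits and API keys are not handled, and a miss in both databases only triggered the relaxed Google Scholar pass (done manually in the pilot) rather than an immediate hallucination verdict.

```python
import requests

def title_found(title: str) -> bool:
    """Step-two sketch: does the cited title surface in Semantic Scholar
    or OpenAlex?"""
    norm = lambda s: "".join(ch for ch in s.lower() if ch.isalnum())

    # Semantic Scholar Graph API title search.
    s2 = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": title, "fields": "title", "limit": 5},
        timeout=10,
    ).json()
    for paper in s2.get("data") or []:
        if norm(title) in norm(paper.get("title") or ""):
            return True

    # OpenAlex full-text search over works.
    oa = requests.get(
        "https://api.openalex.org/works",
        params={"search": title, "per-page": 5},
        timeout=10,
    ).json()
    for work in oa.get("results") or []:
        if norm(title) in norm(work.get("title") or ""):
            return True

    return False
```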
Step three: claim-support check. On a stratified random 10% sub-sample of responses, we manually opened each cited paper and checked whether the sentence it was cited to support is actually defensible from that paper's contents. Real citations attached to claims the source does not make were marked as a separate category, misattributed claim. These are a different failure mode from fabrication but equally serious for academic integrity.
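The stratified draw itself is mundane. Assuming a response log keyed by tool and discipline (the file and column names below are hypothetical), the 10% sub-sample is a short pandas operation:

```python
import pandas as pd

# Hypothetical response log: one row per (tool, query) response.
responses = pd.read_csv("pilot_responses.csv")  # columns: response_id, tool, discipline, category

# Stratified 10% draw: sample within each tool x discipline cell so that
# no stratum is missed by the manual claim-support pass.
subsample = (
    responses
    .groupby(["tool", "discipline"])
    .sample(frac=0.10, random_state=42)
)
subsample.to_csv("step3_subsample.csv", index=False)
```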
Inter-rater agreement on step three was calculated on a 20% overlap between scorers. Agreement was high (Cohen's kappa above 0.8) after a single calibration round.
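For reference, Cohen's kappa is raw agreement corrected for chance agreement, and computing it on the overlap sample is a one-liner with scikit-learn; the labels below are invented purely to show the call.

```python
from sklearn.metrics import cohen_kappa_score

# Illustrative overlap labels: 1 = claim-support failure, 0 = defensible.
scorer_a = [1, 0, 0, 1, 0, 0, 1, 1, 0, 0]
scorer_b = [1, 0, 0, 1, 0, 1, 1, 1, 0, 0]

print(f"kappa = {cohen_kappa_score(scorer_a, scorer_b):.2f}")
```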
What counted as a hallucination
Our headline hallucination rate counts any citation that failed step one or step two. The claim-support failures from step three are reported separately so readers can tell the difference between "this paper does not exist" and "this real paper was cited for something it does not say." Both matter; they fail differently.
Superficial errors in otherwise real citations (mis-transcribed page numbers, a wrong year by one, a missing co-author on a five-author paper) were recorded but not counted as hallucinations. Those are nuisance errors, not integrity failures.
Headline Results
The table below shows per-tool citation fabrication rates across the full 500-query pilot. The fabrication-rate column is the percentage of citations in that tool's output that failed DOI resolution or metadata lookup (steps one and two). The misattribution column is the percentage of citations in the step-three 10% sub-sample that were real but were cited for a claim the source does not support.
| Tool | Citations produced | Fabrication rate (steps 1-2) | Claim misattribution (step 3 sub-sample) |
|---|---|---|---|
| ChatGPT-4o | ~3,400 | ~38% | ~12% |
| Claude 3.5 Sonnet | ~2,900 | ~24% | ~10% |
| Gemini 1.5 Pro | ~3,100 | ~19% | ~9% |
| Perplexity Pro | ~2,700 | ~8% | ~6% |
| Elicit | ~2,500 | ~3% | ~4% |
| Consensus | ~2,400 | ~2% | ~3% |
| CiteDash (internal, self-scored) | ~2,600 | ~0.2% | ~2% |
The numbers are preliminary and directional. The general shape of the data, however, has held up across every pilot round we have run and matches the smaller published studies that have appeared since 2023. There is a clear gap between general-purpose chatbots and purpose-built academic tools, and the gap is large enough that it cannot be explained by query selection or scoring noise.
Per-Tool Deep Dives
Raw numbers only tell part of the story. What follows is a per-tool read of how each system actually failed, drawn from the scoring team's notes across the pilot.
ChatGPT-4o
ChatGPT-4o produced the highest fabrication rate in our pilot, consistent with the published literature on earlier GPT versions. Three failure patterns dominated:
- Plausible-looking invented DOIs. The most common failure mode was citations with correctly formatted DOIs (`10.xxxx/xxxxx`) that resolve to nothing, or resolve to a completely unrelated paper. ChatGPT has clearly learned the shape of a DOI and generates one by default. In a subset of cases we saw the same fabricated DOI recur across multiple unrelated queries, suggesting the model has learned specific DOI templates rather than specific paper-to-DOI associations.
- Real authors, invented papers. A frequent pattern was citations where the named author is a real researcher in the relevant field, but the specific paper title does not exist in their publication record. This is the hardest kind of hallucination to catch because the name check passes. In one memorable case the model produced a multi-sentence "quote" from a named political philosopher about a policy debate they never weighed in on, attached to a book they never wrote.
- Reasonable-sounding conference papers. ChatGPT is unusually confident at inventing conference-proceedings citations, possibly because its training data under-represents proceedings metadata and there is less ground truth to anchor to. NeurIPS, ICML, and ACL proceedings were frequent invention targets in computer-science queries; the model would confidently produce a paper title, two or three plausible co-authors, and a correct-format venue string for a paper that was never submitted.
Performance was worse on obscure topic queries than on literature scoping, worse on humanities than on STEM, and noticeably worse on pre-1990 citations. Enabling web browsing in follow-up prompts reduced fabrication but did not eliminate it; the model would frequently supplement retrieved sources with invented ones. A common compound failure was a response that opened with two correctly retrieved citations and then, in the same paragraph, produced three additional "citations" that were pure invention — as if the model felt obliged to match a requested citation count even after its retrieval step had exhausted its useful results.
Claude 3.5 Sonnet
Claude 3.5 Sonnet performed better than ChatGPT-4o, and the qualitative failure pattern was different. Claude is markedly more willing to respond with "I cannot verify this specific citation, but the relevant literature includes..." when pushed. When it does cite, its citations are more often real, but its step-three claim-support failure rate is in the same ballpark as ChatGPT's — meaning when Claude misfires, it tends to misfire by attaching a real paper to a claim that paper does not make, rather than by inventing the paper from scratch.
Claude's fabrication also has a smaller long-tail: when it does invent, the invention is usually closer to the real record (same author, same field, same decade) than ChatGPT's confabulations, which sometimes drifted badly. In the small subset of Claude's hallucinations where we could trace a plausible nearest-real-paper, the nearest real paper was almost always a recognisable piece of adjacent work — for example, the model would produce a citation blending two real papers by the same author into a composite that does not exist. This is arguably a more sophisticated failure mode, but it is still a failure mode.
One practically useful observation: Claude was substantially more honest about its own limitations when prompts explicitly asked for verification ("are you sure this citation is real? can you cite the DOI?"). In roughly two-thirds of cases where our scorers followed up with a verification prompt on a fabricated citation, Claude retracted. That is a higher retraction rate than ChatGPT's and a meaningfully useful signal for users who have learned to ask the follow-up question. It does not, however, help users who took the first answer at face value.
Gemini 1.5 Pro
Gemini 1.5 Pro with Google Search enabled benefits from the most direct integration with a live index. In our pilot, Gemini's fabrication rate was noticeably lower than ChatGPT's or Claude's, and most of its remaining fabrication came from two specific places: papers that exist but were cited with incorrect DOIs (Gemini tended to synthesize a DOI when the search result did not surface one), and real blog posts or preprints being re-cited as peer-reviewed journal articles.
Gemini was the tool most likely to cite preprints. That is not a fabrication per se — preprints are real — but for a user who needs peer-reviewed sources it is a behaviour to know about.
Perplexity Pro
Perplexity Pro is architecturally closer to a purpose-built retrieval tool than the three chatbots above, and its numbers reflect that. Its fabrication rate was in the single digits, and most of the remaining errors were either link rot (the cited source existed at retrieval time but no longer resolves) or genuine misreads of the source page.
Perplexity's Academic focus mode in particular was the best of the general-purpose tools we tested. Its main limitation is that "academic focus" still draws heavily from the open web and less from paywalled databases, so it over-represents open-access and preprint literature.
Elicit
Elicit was built explicitly for literature-review workflows and grounds its output in Semantic Scholar. Fabrication rates in the pilot were consistently in the low single digits, and most of the remaining errors were in the claim-support category rather than fabrication — that is, Elicit found a real paper and summarised it in a way that a human reader on step three marked as overstated.
Elicit's constraint is mostly on coverage: Semantic Scholar is excellent for STEM and weaker for humanities, and Elicit inherits that. For a literature review in analytic philosophy or nineteenth-century history, Elicit has less to work with than for a systematic review in oncology.
Consensus
Consensus is the closest direct analogue to Elicit in the pilot. Fabrication rates were similar. Consensus's output format emphasizes a yes/no/maybe summary of what the literature says, which makes claim-support failures easier to catch on inspection — when the tool overstates the consensus, it tends to do so visibly rather than buried in a paragraph.
Like Elicit, Consensus over-indexes on STEM coverage. Both tools are excellent choices for biomedical and empirical social-science literature review and worse choices for theory-heavy humanities research.
How CiteDash Compares
We tested CiteDash in the same pilot, scored by the same team against the same rubric. CiteDash's preliminary fabrication rate came in at approximately 0.2%, with nearly all remaining errors falling in the claim-support category rather than fabrication.
We want to be explicit about what this number does and does not mean.
What it does mean. CiteDash's retrieval-first architecture — the Planner, Researcher, Reviewer, Writer pipeline — structurally constrains the Writer Agent to cite only sources that the Researcher Agent retrieved and the Reviewer Agent validated. It is architecturally difficult for the system to fabricate a citation the way ChatGPT does, because the writing step does not have access to a generative path that bypasses the retrieval ledger. In a pilot scored by the team that built that architecture, the number is what we would expect.
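To make the structural claim concrete, here is a minimal sketch of what a retrieval ledger of that kind looks like. It is an illustration of the constraint, not CiteDash's actual code, and every name in it is ours: the writing step holds only a lookup handle into validated sources, with no path that mints a new reference.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Source:
    doi: str
    title: str

class CitationLedger:
    """Writer-side view of retrieval: citing anything outside the ledger
    fails loudly instead of silently producing a fabricated reference."""

    def __init__(self) -> None:
        self._approved: dict[str, Source] = {}

    def approve(self, source: Source) -> None:
        # Populated by the review step after DOI and metadata checks pass.
        self._approved[source.doi] = source

    def cite(self, doi: str) -> str:
        source = self._approved.get(doi)
        if source is None:
            raise KeyError(f"not in the validated ledger: {doi}")
        return f"{source.title} (https://doi.org/{doi})"
```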
What it does not mean. A 0.2% fabrication rate measured by the builders of the tool is not a peer-reviewed result. We are flagging this number as illustrative of the architecture rather than as a defended measurement. The full peer-reviewed release will include external scoring on an independent query set, and the CiteDash number may move. We would rather publish a caveat now than publish a marketing number that later gets contested.
The right way to think about CiteDash's position in this benchmark is that it belongs in the retrieval-first category with Elicit and Consensus, not in the chatbot category with ChatGPT and Claude. The shared feature of that category is that citations are drawn from real database retrievals rather than from token-prediction patterns. For readers doing their own tool selection, the category distinction matters more than the specific decimal place. See our separate write-up on how ChatGPT makes up citations for the architectural reasons why general-purpose chatbots behave the way they do.
Discipline-Level Breakdowns
Pooled numbers hide discipline-level variation. Two patterns were stable across every pilot round:
STEM versus humanities
All tools performed better on STEM queries than on humanities queries, but the gap was larger for general-purpose chatbots than for purpose-built academic tools. ChatGPT-4o's fabrication rate on biomedical queries was roughly 25%; on philosophy queries it was closer to 55%. Elicit's numbers were 2% and 4% on the same two disciplines. The purpose-built tools degrade gracefully as the literature becomes thinner in the underlying databases; the chatbots degrade catastrophically because they fall back on invention more aggressively when the training-data signal is weak.
The humanities gap has a specific shape worth naming. Philosophy, history, and literary studies have canons that include work from centuries before DOIs existed, often cited in ways that vary between traditions (author-title in humanities style guides versus author-date in APA). AI tools trained primarily on post-2000 academic output handle this older canon badly. A chatbot asked for the standard reference to a well-known medieval theological argument frequently produces something like "Aquinas (1265), Summa Theologiae, Part II-II, Question 64" — which is the right work and roughly the right reference — or it produces a compound fabrication mixing real Aquinas with invented modern scholarship. The retrieval-first tools mostly return one-or-two legitimate secondary sources or honestly report an empty result. The latter is less impressive but is the right behaviour.
Obscure topics versus broad topics
The deliberately narrow sub-field queries drew higher fabrication rates from every tool. This is expected — the narrower the topic, the less training-data support any token-prediction model has, and the thinner the retrieved result set for retrieval-based tools. The relative ordering of the six tools stayed the same across breadth levels. A tool that looked good on broad queries also looked good on narrow ones; a tool that looked bad on broad queries looked worse on narrow ones.
An interesting secondary pattern: the gap between the best and worst tool grew as queries became more obscure. On broad literature-scoping queries, every tool produced mostly-correct citations and the differences between tools were in the 5-15 percentage-point range. On highly obscure queries (specific sub-field methodology, niche contested topics), the differences widened to 30-50 points. This is useful for users making practical decisions: if you mostly work on well-trodden topics, the tool choice matters less, but if your research sits in a narrow or emerging area, the choice of tool matters enormously.
Reproducibility Plan
A benchmark that cannot be reproduced is an opinion piece. Here is what we are committing to.
- Query set. The full 500-query set will be released under CC-BY-4.0, along with the discipline assignments and category tags. The release will include the provenance notes for each query — who wrote it, which source materials informed it, and what the intended test was for each query category.
- Scoring rubric. The full twelve-page rubric, including the claim-support scoring criteria and the calibration materials, will be released as supplementary material. The rubric was developed iteratively across three pilot rounds; each round's changes are documented, and the version history will be included so that researchers adapting the rubric for their own work can see why specific scoring conventions were adopted.
- Raw outputs. The raw responses from each tool for each query, collected during the test window, will be archived on Zenodo with a DOI. The archive will include both the text of each response and structured metadata (tool version, date of query, approximate response latency) for any readers who want to run downstream analysis.
- Preregistration. The full protocol, hypotheses, and analysis plan will be preregistered on OSF before the peer-reviewed run begins. Preregistration closes the degrees of freedom that make self-benchmarking suspicious by default. The preregistration will specifically commit to the headline statistic to be reported, the tools to be compared, and the statistical tests that will determine significance. No post-hoc p-hacking will be possible once the preregistration is locked.
- External scoring. The peer-reviewed run will be scored by an independent team of graduate students across participating institutions, not by CiteDash staff. We are actively recruiting scoring partners; if you are a graduate student or librarian interested in participating, contact details will be in the OSF preregistration.
The preregistered run will also include tools we did not include in the pilot (Scite.ai, SciSpace, and You.com Research, minimum) and will lock the prompt window to a calendar quarter so that tool updates during testing cannot confound the numbers. We are also considering including one or two fine-tuned academic LLMs (Galactica, if accessible, plus any new entries that emerge before the testing window opens) to give the benchmark some coverage of the "specialised academic model" category, which sits between general-purpose chatbots and retrieval-first tools architecturally.
A final note on reproducibility philosophy: the point of the open-data commitment is not to make it easy to reproduce our numbers. The point is to make it possible for a sufficiently motivated external team to disprove them. A benchmark whose numbers cannot be contested is not a benchmark; it is a marketing claim. We prefer the version where the numbers can be attacked and then, if they survive, trusted. That is the point of the infrastructure we are building around this.
Limitations
Every benchmark has limitations. The honest ones:
- A 500-query pilot is small. The full preregistered run will target a larger query set and will include sensitivity analyses on subset effects.
- Tools update frequently. Any numbers in this space are a snapshot. We tested during a specific two-week window; a tool that shipped an improvement in April 2026 will not show it in these numbers.
- The claim-support step is labour-intensive. We scored it on a 10% sub-sample. The full run will target a larger sub-sample, but full-sample claim-support scoring is probably not feasible at reasonable cost.
- Self-benchmarking is a structural problem. The pilot numbers for CiteDash are the weakest numbers in the table precisely because they are ours. External scoring is the fix.
- The benchmark does not measure usefulness. A tool with a 0% fabrication rate that returns ten irrelevant papers is not useful. We plan to publish a separate relevance benchmark alongside the hallucination benchmark; a tool's position on one axis does not determine its position on the other.
What This Means For You
If you are a student, the takeaway is simple. For literature review work where every citation needs to be real, use a retrieval-first tool. Elicit, Consensus, CiteDash's deep research, or Perplexity Pro with academic focus will all give you dramatically fewer fabrications than ChatGPT or Claude. If you do use a general-purpose chatbot, verify every single citation before it goes into a submitted paper — see our companion post on how to detect AI hallucinations for the workflow.
If you are a faculty member, the takeaway is that a blanket ban on AI is no longer defensible and a blanket permission is even worse. The question in 2026 is which tools, for which tasks. A course policy that says "AI tools permitted for brainstorming; citations must be generated with a retrieval-first tool and verified before submission" is far more defensible than one that says "no AI" — because the former can actually be enforced and the latter cannot. For more on building a defensible course policy, see our AI academic integrity guide.
If you are a librarian or a writing-center director, the takeaway is that you now have comparative data, with caveats, that you can use in staff training. We strongly recommend waiting for the peer-reviewed release before putting specific numbers into a formal handbook. But the directional shape — retrieval-first tools dramatically outperform general-purpose chatbots on citation integrity — is robust enough to act on now.
What's Next
The peer-reviewed release will land in late 2026. Between now and then, we will be preregistering the protocol, opening the query set for public comment, and recruiting external scorers. If you are a graduate student or librarian interested in scoring, get in touch.
We will also be publishing the relevance benchmark mentioned above, and a separate benchmark on recency handling (how each tool behaves when asked about research that postdates its training cutoff). Recency handling is where we expect the purpose-built academic tools to pull further ahead, because live database access is a structural advantage over any frozen-model approach.
The thing we keep coming back to, in internal review of these pilots, is that fabrication is not a bug any of the general-purpose chatbots are going to fix through model scaling. It is an architectural property of token-prediction without retrieval-first constraint. The tools that do not fabricate, do not fabricate because they are not designed to be able to. That is the finding the benchmark exists to make legible, and that is the finding that — once the peer-reviewed release is out — will be cited in writing-center handbooks for years.
Citations are not a nice-to-have in academic work. They are the load-bearing part. The tools that get them right should be the tools that academia uses. The tools that do not, should not.
References and Further Reading
- Athaluri, S. A., et al. (2023). Exploring the boundaries of reality: Investigating the phenomenon of artificial intelligence hallucination in scientific writing through ChatGPT references. Cureus.
- Walters, W. H., & Wilder, E. I. (2023). Fabrication and errors in the bibliographic citations generated by ChatGPT. Scientific Reports.
- The CiteDash AI Citation Hallucination Benchmark Protocol v0.3 — pending OSF preregistration.
- Our foundational piece: ChatGPT Fake Citations: Why AI Hallucinations Matter for Research.
- For the verification workflow: How to Detect AI Hallucinations.
- For the technical explanation: Why ChatGPT Makes Up Citations.
- For the citation formats referenced above, see our APA 7 citation guide.
If you are a PhD student or faculty member interested in early access to the preregistered dataset, see CiteDash for PhD students. If you want to see the retrieval-first workflow in action on your own research question, start a deep research session.
Related reading
Why ChatGPT Makes Up Citations: The Technical Reason LLMs Hallucinate Sources
A technical explanation of why ChatGPT fabricates academic citations: token prediction, training data gaps, RLHF artifacts, and why retrieval changes everything.
Elicit vs Consensus: Which AI Research Tool Is Right for You in 2026?
An honest comparison of Elicit and Consensus for academic literature discovery. Coverage, pricing, evidence synthesis, and what each tool does best.
Perplexity vs ChatGPT for Research: Which Is Better in 2026?
An honest comparison of Perplexity and ChatGPT for research tasks. Citation accuracy, depth, pricing, and when to reach for each one.