Preliminary results — peer review pending
The numbers on this page are from our internal pilot. The full peer-reviewable study is pending OSF preregistration and an independent third-party rater pass. Treat headline figures as directional; the final, archived Zenodo dataset will supersede them.
Citation Hallucination Benchmark 2026
How often do AI research tools make up academic citations? We ran 500 prompts across 10 disciplines on 7 tools to find out. Preliminary pilot numbers below; methodology and reproducibility plan follow.
Why we ran this
Citation fabrication is a well-documented problem in general-purpose AI chatbots, but public figures vary wildly (anywhere from 10% to 70% depending on the query set). There is no shared benchmark that academic librarians, faculty, or journalists can point students to when explaining why some AI tools are safer than others for citation-dependent work. We built one we could share.
The benchmark is designed to be reproducible: the full query set, raw tool outputs, and our classification labels will ship under CC-BY 4.0 on Zenodo. Other researchers can verify our numbers, extend to new tools, or run the same methodology at a later date to track change over time.
Preliminary results
Numbers below are from our pilot run. The asterisk on CiteDash indicates internal measurement; the final peer-reviewable study will include a blinded third-party rater pass to control for self-evaluation bias.
| Tool | Fabrication rate | Claim-support rate | Notes |
|---|---|---|---|
| ChatGPT-4o (no browsing) | ~23% | ~61% | Highest fabrication rate in the pilot; worse on niche topics. |
| Claude 3.5 Sonnet | ~14% | ~72% | Better than ChatGPT-4o; still hallucinates when confident. |
| Gemini 1.5 Pro | ~18% | ~64% | Mixed: sometimes retrieves, sometimes generates. |
| Perplexity Pro (Academic) | ~14% | ~78% | Retrieval-first helps; still misattributes claims. |
| Elicit | ~3% | ~88% | Retrieval-first design dramatically reduces fabrication. |
| Consensus | ~2% | ~90% | Peer-reviewed source prioritisation; very low fabrication. |
| CiteDash | ~0.2%* | ~92%* | Pilot number only; low rate attributed to a multi-agent verification pass. |
*CiteDash numbers are from internal measurement and will be re-verified by an independent rater in the peer-review pass. Do not cite the CiteDash number without noting this caveat.
Methodology
Query set
500 prompts stratified across 10 disciplines (medicine, psychology, education, economics, biology, computer science, law, history, sociology, environmental science; 50 prompts each). Prompts are worded to invite citation-backed answers, e.g. “Summarise recent research on spaced repetition effectiveness in K-12 learning, with at least 10 citations.”
The query set is versioned and will be released in full alongside the preregistered study. Anyone can re-run the benchmark on their own tool or model with the same prompts.
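To make re-runs concrete, here is a minimal sketch of the benchmark loop. It assumes the query set ships as JSONL with `id`, `discipline`, and `prompt` fields (an illustrative schema; the Zenodo deposit will document the exact format) and that `query_tool` is your own wrapper around the tool under test:

```python
# Minimal sketch of re-running the benchmark against your own tool.
# The JSONL schema and the query_tool callable are assumptions for
# illustration, not part of the released pipeline.
import json

def load_prompts(path: str) -> list[dict]:
    """Load the versioned prompt set, one JSON object per line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

def run_benchmark(prompts: list[dict], query_tool) -> list[dict]:
    """Send every prompt to query_tool (your tool's API wrapper) and
    record raw outputs for later citation extraction and labelling."""
    outputs = []
    for p in prompts:
        outputs.append({
            "prompt_id": p["id"],
            "discipline": p["discipline"],
            "raw_output": query_tool(p["prompt"]),
        })
    return outputs
```

Swap `query_tool` for your tool's API call; everything downstream (citation extraction, classification, labelling) consumes the recorded raw outputs.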
Tools tested
Seven tools covering the common AI-research tool classes: two general-purpose chatbots (ChatGPT-4o, Claude 3.5 Sonnet), one multi-modal assistant (Gemini 1.5 Pro), one general-web retrieval tool with academic mode (Perplexity Pro), and three retrieval-first academic tools (Elicit, Consensus, CiteDash).
Fabrication classification
- DOI resolution: every citation with a DOI is resolved via the CrossRef API. If the DOI returns a 404, the citation is classified as fabricated. (Runnable sketches of these checks follow this list.)
- Title + author fuzzy match: for citations without DOIs, we search Semantic Scholar, OpenAlex, and PubMed. A paper must match on title substring and at least one author-year combination to be considered real.
- Claim-support check: a 20% stratified sample of “real” citations is manually verified by two independent raters. Each rater reads the cited paper’s abstract and decides whether it actually supports the claim the AI attributed to it.
- Inter-rater reliability: we report Krippendorff’s alpha on the manual sample. Target: α ≥ 0.8. Below that, we re-rate and discuss.
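To make the existence checks concrete, here is a minimal sketch of the first two steps, using the public CrossRef REST API and the Semantic Scholar Graph API (OpenAlex and PubMed lookups follow the same pattern and are omitted for brevity). The helper names, the surname heuristic, and the matching logic are ours for illustration, not the released pipeline:

```python
# Sketch of the existence checks: DOI resolution via CrossRef, then a
# title/author/year match against Semantic Scholar for citations
# without DOIs. Rate-limiting and polite-pool headers omitted for brevity.
import requests
from urllib.parse import quote

def doi_resolves(doi: str) -> bool:
    """A DOI that CrossRef cannot find (HTTP 404) marks the citation fabricated."""
    r = requests.get(f"https://api.crossref.org/works/{quote(doi)}", timeout=10)
    return r.status_code == 200

def matches_semantic_scholar(title: str, authors: list[str], year: int) -> bool:
    """Real if some search hit contains the cited title as a substring and
    shares at least one author surname in the cited year (simplified
    surname heuristic; the production matcher is stricter)."""
    r = requests.get(
        "https://api.semanticscholar.org/graph/v1/paper/search",
        params={"query": title, "fields": "title,authors,year", "limit": 10},
        timeout=10,
    )
    r.raise_for_status()
    for paper in r.json().get("data", []):
        title_ok = title.lower() in (paper.get("title") or "").lower()
        hit_surnames = {a["name"].split()[-1].lower() for a in paper.get("authors", [])}
        author_year_ok = paper.get("year") == year and any(
            a.split()[-1].lower() in hit_surnames for a in authors
        )
        if title_ok and author_year_ok:
            return True
    return False
```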
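And a companion sketch for the inter-rater reliability step, using the `krippendorff` package from PyPI (one of several ways to compute the statistic; the ratings matrix below is toy data for illustration):

```python
# Inter-rater reliability check on the manual claim-support sample.
import numpy as np
import krippendorff  # pip install krippendorff

# Rows are raters, columns are sampled citations; 1 = claim supported,
# 0 = not supported, np.nan = not rated by that rater.
ratings = np.array([
    [1, 0, 1, 1, 0, np.nan, 1],
    [1, 0, 1, 0, 0, 1,      1],
])

alpha = krippendorff.alpha(reliability_data=ratings,
                           level_of_measurement="nominal")
if alpha < 0.8:
    print(f"alpha = {alpha:.2f} < 0.80: re-rate and reconcile disagreements")
else:
    print(f"alpha = {alpha:.2f}: acceptable agreement")
```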
Reproducibility & open data
- Preregistration: the methodology, classification rules, and analysis plan will be locked on OSF before the final data collection runs. Any deviations will be flagged in the published paper.
- Dataset: 500 prompts + raw tool outputs + labels + verification notes will be deposited on Zenodo under CC-BY 4.0, with a DOI. Researchers can cite the dataset independently of this writeup.
- Analysis code: Python scripts for DOI resolution, Semantic Scholar queries, and label aggregation will be published on GitHub alongside the dataset (a minimal aggregation sketch follows this list).
- Rater instructions: the manual claim-support rubric will be included in the Zenodo deposit so other teams can replicate the human-verification step.
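As a preview of the label-aggregation step, here is a minimal pandas sketch. The column names are an assumed schema; the deposited dataset will document the real one:

```python
# Illustrative label aggregation: per-tool fabrication and claim-support
# rates from the labelled citations. Assumed columns: tool, citation_id,
# fabricated (0/1), claim_supported (0/1, NaN outside the 20% manually
# rated sample; pandas' mean skips NaN by default).
import pandas as pd

labels = pd.read_csv("labels.csv")

summary = labels.groupby("tool").agg(
    n_citations=("citation_id", "count"),
    fabrication_rate=("fabricated", "mean"),
    claim_support_rate=("claim_supported", "mean"),
)
print(summary.round(3))
```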
Limitations
- Results reflect the tools’ state as of April 2026. Model versions change frequently; expect the fabrication rates to move with each major release.
- The query set skews toward English-language academic topics. Non-English citation performance is not measured.
- We measure surface fabrication (does the paper exist?) and surface support (does the abstract match?) but not deeper accuracy issues like whether the AI’s summary is a fair reading of the paper’s findings.
- Tool configurations matter. ChatGPT with browsing enabled behaves differently from ChatGPT without. We report each tool in a specified configuration; running it differently will produce different numbers.
How to cite this work
Once the Zenodo dataset is deposited, versioned citations will use the DOI. Until then, use the following provisional forms:
APA 7:
CiteDash Editorial. (2026). Citation Hallucination Benchmark 2026 — Methodology & Preliminary Results. CiteDash. https://citedash.ai/benchmark/citation-hallucination-2026
MLA 9:
CiteDash Editorial. “Citation Hallucination Benchmark 2026 — Methodology & Preliminary Results.” CiteDash, 2026, citedash.ai/benchmark/citation-hallucination-2026.
Chicago:
CiteDash Editorial. “Citation Hallucination Benchmark 2026 — Methodology & Preliminary Results.” CiteDash, April 16, 2026. https://citedash.ai/benchmark/citation-hallucination-2026.
Further reading
- Long-form writeup with per-tool deep dives
- How to detect AI hallucinations in your own citations
- Why ChatGPT fabricates citations (the technical explanation)
- Responsible AI use in academic research
Contact
Methodological feedback, suggestions for additional tools to include, and peer-review critique are all welcome. Email benchmark@citedash.ai. We’ll respond to substantive methodology questions within 48 hours and track all feedback publicly on the OSF project page once it’s live.
Frequently asked questions
- What is the Citation Hallucination Benchmark?
- A structured evaluation of how often leading AI research tools fabricate academic citations when asked to support claims. We run 500 prompts across 10 disciplines on 7 tools (ChatGPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Perplexity Pro, Elicit, Consensus, and CiteDash) and measure fabrication rate, claim-support rate, and source quality.
- Are the numbers on this page final?
- No. The numbers published here are preliminary results from our internal pilot. The full peer-reviewable study is pending OSF preregistration. Until the preregistration and a third-party rater pass are complete, treat the headline numbers as directional, not definitive.
- How do you measure a fabricated citation?
- A citation is classified as fabricated if (a) its DOI does not resolve via CrossRef, or (b) the paper cannot be matched in Semantic Scholar, OpenAlex, or PubMed on title, authors, and year within a reasonable fuzzy-match tolerance. A second rater verifies a 20% stratified sample to catch false positives.
- Will the dataset be open?
- Yes. The query set, tool outputs, and classification labels will be published under CC-BY 4.0 on Zenodo with a DOI, so other researchers can replicate or extend the methodology. The preregistration will go on OSF before the final data collection runs.
- Why did you publish preliminary results at all?
- Because the fabrication problem affects students and researchers now, not after peer review. We felt the pilot results were directionally clear enough to share with caveats, while being explicit that this is not yet the final peer-reviewed study. Feedback on the methodology is welcome before the full run.
- How can I cite this work in a paper?
- Use the suggested citation at the bottom of this page (APA, MLA, and Chicago formats provided). Once the Zenodo dataset is deposited, we'll update this page with the DOI and versioned citation so every version is independently citeable.