Why ChatGPT Makes Up Citations: The Technical Reason LLMs Hallucinate Sources
A technical explanation of why ChatGPT fabricates academic citations: token prediction, training data gaps, RLHF artifacts, and why retrieval changes everything.
Every piece on AI citation hallucination eventually reaches the question: why. Why does a model that can write a passable undergraduate essay produce reference lists where half the citations are invented? Why does the invention look so plausible? And why has OpenAI, with all its engineering resources, not simply fixed it?
The answer is architectural. Hallucination is not a bug in ChatGPT. It is the expected behaviour of a pure token-prediction system asked to produce content whose correctness depends on retrieval. Our foundational piece on ChatGPT fake citations covered what this looks like from a user's perspective. This piece is the technical explanation for researchers who want to understand why the problem is so persistent — and what actually fixes it.
How Large Language Models Generate Text
At the core of every large language model is a deceptively simple operation: predict the next token, given the tokens that came before. A token is roughly a word or a word-fragment — "research" is typically a single token, "retrieval-augmented" splits into several, and a punctuation mark is often a token of its own.
The model sees the prompt plus any output produced so far, and assigns a probability distribution over the possible next tokens in its vocabulary. Then it samples (or greedily picks) one token from that distribution, appends it to the output, and runs the process again. That is the entire generation loop. A 2,000-word essay is the same loop run a few thousand times.
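To make the loop concrete, here is a toy sketch of it in Python. The `model` and `tokenizer` objects are hypothetical stand-ins rather than any vendor's actual API; the point is only the shape of the computation.

```python
import numpy as np

def softmax(logits):
    exps = np.exp(logits - np.max(logits))
    return exps / exps.sum()

def generate(model, tokenizer, prompt, max_tokens=200, temperature=0.8):
    """Toy next-token generation loop. `model` and `tokenizer` are
    hypothetical stand-ins, not a specific library's API."""
    tokens = tokenizer.encode(prompt)
    for _ in range(max_tokens):
        # The model maps the context so far to a score for every token in the
        # vocabulary. Nothing else is consulted: no database, no citation index.
        logits = model(tokens)                       # shape: [vocab_size]
        probs = softmax(logits / temperature)
        next_token = int(np.random.choice(len(probs), p=probs))  # sample one token
        tokens.append(next_token)
        if next_token == tokenizer.eos_token_id:     # stop at end-of-sequence
            break
    return tokenizer.decode(tokens)
```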
Two consequences of this architecture matter for citations.
There is no fact database inside the model
A language model does not store discrete facts as retrievable records the way a traditional database does. What it has, instead, is a distributed representation — a vast matrix of weights — that collectively encodes the statistical patterns of its training corpus. When the model generates "Einstein published general relativity in 1915," it does not look up a record for Einstein; it produces a sequence of tokens that happens to be the most likely continuation given the context. The factual correctness of the output is emergent from the training distribution, not guaranteed by the architecture.
For well-represented facts (Einstein, 1915, general relativity), the emergence is reliable. For poorly-represented facts, or for facts that require combining multiple pieces of information the training data did not co-locate, the emergence fails silently. The model still produces output. The output is still plausible. It is just no longer reliable.
Every token is generated the same way, whether it is real or invented
The token-by-token generation process is indifferent to truth. The model does not know, while producing a DOI, whether the DOI corresponds to a real paper. The generation of a real DOI and the generation of a fake DOI are literally the same operation — sampling from a probability distribution shaped by training data. There is no internal flag that distinguishes "I am retrieving this from a reliable memory" from "I am confabulating this to match the pattern."
This is why the hallucination-detection problem is so hard for the model itself to solve. Asking ChatGPT "are you sure this citation is real?" sometimes produces a retraction, but the retraction is itself a generated output. The model does not actually have privileged access to its own confidence about specific facts; the appearance of self-correction is another pattern-match.
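To see why, it helps to look at what the model does expose during generation: per-token probabilities and nothing more. A toy continuation of the sketch above, using the same hypothetical `model` and `tokenizer` stand-ins:

```python
def doi_with_token_probs(model, tokenizer, context, max_tokens=25):
    """Greedily generate a DOI-shaped continuation and record each chosen
    token's probability. Toy code, same hypothetical stand-ins as above."""
    tokens = tokenizer.encode(context + " DOI: ")
    generated, chosen_probs = [], []
    for _ in range(max_tokens):                      # toy stop condition
        probs = softmax(model(tokens))
        t = int(np.argmax(probs))                    # greedy pick
        tokens.append(t)
        generated.append(t)
        chosen_probs.append(float(probs[t]))
    # These probabilities are everything the model has. There is no field that
    # records whether the DOI was remembered or invented to fit the pattern;
    # a fabricated DOI can be emitted with probabilities just as high as a real one.
    return tokenizer.decode(generated), chosen_probs
```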
Why Citations Look Real But Aren't
A citation has a characteristic structure: author(s), year, title, journal, volume, pages, DOI. Each of those fields follows a recognisable format. The model has seen millions of citations in this format during training. When it generates a citation, it is generating text that matches the format — and the format is the easy part.
The content is the hard part, and the content is where hallucinations happen. Consider what generating a citation actually requires the model to produce:
- Author names plausible for the field.
- A paper title that sounds like a real paper in that sub-field.
- A journal name that publishes on that topic.
- A year consistent with the state of the literature.
- A volume and page range consistent with that journal's publication schedule.
- A DOI in the 10.xxxx/xxxxx format.
The model learns each of these distributions from training data. It can produce a convincingly field-specific author name, because it has seen which names cluster with which topics. It can produce a journal name that is real and relevant, because it has seen that journal mentioned in similar contexts. It can produce a plausible-looking DOI, because it has seen the DOI format thousands of times.
What the model cannot do, in pure token prediction, is guarantee that this specific combination of author-title-journal-year-DOI corresponds to a real paper. Each element is individually plausible. The joint plausibility is illusory. The model is sampling each field from the local distribution given the preceding context, not retrieving a linked record from a paper database.
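A toy illustration of what per-field sampling looks like, with invented placeholder pools standing in for what the model has absorbed about a sub-field (nothing here refers to real papers or people):

```python
import random

# Invented placeholder pools standing in for field-specific patterns the
# model has absorbed from training data. All values are for illustration only.
AUTHORS  = ["Nguyen, T.", "Okafor, C.", "Lindqvist, S.", "Marchetti, A."]
TITLES   = ["Heuristic reasoning under time pressure",
            "Framing effects in sequential risky choice"]
JOURNALS = ["Cognition", "Psychological Review"]

def fabricate_citation():
    """Sample each citation field independently from its own plausible pool.
    Every field is locally plausible; nothing ties them to a real record."""
    first_page = random.randint(1, 900)
    return {
        "authors": random.sample(AUTHORS, k=2),
        "title":   random.choice(TITLES),
        "journal": random.choice(JOURNALS),
        "year":    random.randint(1998, 2015),
        "volume":  random.randint(20, 130),
        "pages":   (first_page, first_page + random.randint(5, 30)),
        "doi":     f"10.{random.randint(1000, 9999)}/{random.randint(100000, 999999)}",
    }

print(fabricate_citation())
```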
A fabricated citation is therefore not a random jumble. It is a coherent piece of writing in the citation genre, populated with field-appropriate content, that happens not to correspond to a real publication. That is exactly why it fools readers. The surface features of a real citation are the features the model is best at reproducing.
The Role of Training Data
Training data composition determines which citations the model can produce correctly. Three patterns in training data drive the observed hallucination behaviour.
Over-representation bias
Highly-cited papers are over-represented in training data, because every citation to them in another paper adds to the distributional signal. When you ask about a well-known foundational paper in cognitive psychology, the model often produces the correct citation, because the correct citation has appeared thousands of times in its training corpus in approximately its correct form.
For a paper that has been cited three times in total, or for a preprint that was released after training cutoff, the model has almost no signal to reconstruct the correct citation. It fabricates, because fabrication is the only operation its architecture supports in the absence of signal.
Recency failure
Training data has a cutoff date. Papers published after that cutoff are not in the training distribution, and the model cannot produce correct citations to them except by coincidence. Newer models push the cutoff forward, but the problem does not go away — every model has a recency horizon beyond which its citation output becomes almost entirely fabricated.
Metadata gaps
Even for papers that are in training data, the metadata may not be. A paper might be extensively cited in other papers' prose but with inconsistent reference formatting — different volume numbers, missing DOIs, varying author-order conventions. The model learns the paper's content but not a canonical reference format for it. When asked to cite it, the model fills in the metadata fields from the local distribution rather than from a consistent record, and the filled-in fields are often wrong.
This is why real papers often appear with wrong DOIs, wrong volume numbers, or wrong page ranges in AI output. The paper exists. The model knows it exists. But the model does not know the exact metadata, so it invents plausible metadata — and plausible metadata is, most of the time, wrong metadata.
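This particular failure is also the easiest to catch from the outside, because DOIs are registered in a public index. A minimal sketch of a metadata check against the public CrossRef REST API (the works endpoint shown is the one CrossRef documents; the DOI at the end is a placeholder, not a real identifier):

```python
import requests

def check_doi(doi: str) -> dict | None:
    """Look up a DOI in the public CrossRef registry.
    Returns registered metadata, or None if the DOI does not resolve."""
    resp = requests.get(f"https://api.crossref.org/works/{doi}", timeout=10)
    if resp.status_code != 200:
        return None            # not registered: the DOI is likely fabricated
    record = resp.json()["message"]
    return {
        "title":   (record.get("title") or [""])[0],
        "journal": (record.get("container-title") or [""])[0],
        "volume":  record.get("volume"),
        "pages":   record.get("page"),
    }

# Compare the registered metadata against whatever the chatbot produced.
print(check_doi("10.1234/placeholder-doi"))  # placeholder DOI, for illustration only
```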
RLHF and the "Helpful Illusion" Problem
Modern chatbots are not purely the raw language models that come out of pre-training. They are fine-tuned using Reinforcement Learning from Human Feedback (RLHF), a process where human raters score model outputs and the model is trained to produce output that scores well.
RLHF makes models noticeably more pleasant to interact with. It also introduces a specific failure mode relevant to citations: the model is rewarded for being helpful, and "I do not know" is rarely scored as helpful.
Rewarded behaviours
Human raters, shown two model outputs, tend to prefer the one that confidently answers the question over the one that hedges. This preference is not unreasonable — it reflects a genuine use case where users want direct answers. But when aggregated over millions of training examples, it shifts the model toward confident output as a default.
For citation queries, this means the model has been trained to produce a citation even when its internal distribution over possible citations is diffuse — that is, even when it does not actually know. A pre-RLHF base model might be more willing to produce "I am not confident in the specific reference, but the relevant literature includes..." The RLHF-tuned model, in contrast, has learned that such output scores lower and produces a specific-looking citation instead.
The honesty-helpfulness tension
OpenAI, Anthropic, and Google are all aware of this tension. All three have adjusted their training to push toward more honesty and more willingness to refuse or hedge. The progress is real — Claude 3.5 Sonnet, in our benchmark, was more willing than ChatGPT-4o to respond with "I cannot verify this specific citation" — but none of the adjustments eliminate the underlying dynamic. The reward signal for helpfulness is still there, and the model still tilts toward confident output on average.
This is why users who ask ChatGPT "are you sure this citation is real?" sometimes see it retract and sometimes see it double down. The retract-or-double-down behaviour is itself shaped by RLHF, not by any actual privileged access to truth.
Why Web-Browsing Modes Help (But Don't Fix It)
OpenAI, Anthropic, and Google have all added retrieval-augmented generation (RAG) capabilities to their flagship chatbots. ChatGPT can browse the web. Claude can search. Gemini has Google Search integration. Does this fix the citation problem?
Partially, and in specific ways. Not completely, and it would not completely fix it even if retrieval were perfect.
What RAG does
In a RAG setup, before the model generates a response, a separate retrieval step pulls in real documents relevant to the query. Those documents are loaded into the model's context window. The generation step then conditions on both the prompt and the retrieved documents, which dramatically increases the probability that the output reflects real sources.
For citation queries specifically, RAG helps because it surfaces actual paper metadata into the context. If the retrieval finds a real paper with a real DOI, the model is much more likely to cite that paper correctly than to fabricate an alternative.
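A minimal sketch of the pattern, with hypothetical `search` and `generate` functions standing in for whichever retrieval backend and model a given product uses:

```python
def answer_with_rag(query: str, search, generate, k: int = 5) -> str:
    """Retrieval-augmented generation in outline. `search` and `generate`
    are hypothetical stand-ins, not a specific product's internals."""
    # 1. Retrieval: pull real documents (with real metadata) for the query.
    documents = search(query, limit=k)

    # 2. Put the retrieved material into the model's context window.
    context = "\n\n".join(
        f"[{i + 1}] {doc.title} ({doc.year}). DOI: {doc.doi}\n{doc.abstract}"
        for i, doc in enumerate(documents)
    )

    # 3. Generation is still token-by-token prediction, but it now conditions
    #    on the retrieved text, which raises the odds that cited metadata is real.
    prompt = (
        "Answer the question using the numbered sources below, citing them by number.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    return generate(prompt)
```

Nothing in step 3 forces the output to stay inside the retrieved sources, which is exactly where the remaining failure modes come from.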
Why RAG does not fully fix it
Three failure modes persist even with retrieval:
Retrieval miss. If the retrieval step does not find a relevant paper — because the query was too obscure, the right database was not searched, or the returned results were off-topic — the generation step falls back on its training-data patterns. And the fallback is the original hallucination problem, unchanged.
Generation drift. The model can invent details that were not in the retrieved documents. It might cite a retrieved paper correctly for the first claim, then invent a citation for the second claim when no retrieved document supported it. The retrieval step constrains what can be cited correctly; it does not constrain what will be cited.
Mis-retrieval. Retrieval can return documents that seem relevant to the query but do not actually support the specific claim the model is about to generate. The model cites the retrieved document anyway, producing a Type-2 hallucination — real source, wrong claim. See our verification workflow for how to catch this.
RAG is a real improvement. It is not a full solution. The numbers in our benchmark bear this out: Gemini 1.5 Pro with Search had a lower fabrication rate than Claude 3.5 Sonnet without search, but still a materially non-zero one. Retrieval alone, bolted onto a chatbot, is not the endpoint.
The Retrieval-First Alternative
The architectural pattern that structurally eliminates citation fabrication is not retrieval-augmented generation but retrieval-first generation. The distinction matters.
What retrieval-first means
In a retrieval-first system, the retrieval step is load-bearing. The generation step is constrained to operate only on retrieved content. The system does not have a fallback path where the generator can invent a citation if retrieval fails — if retrieval fails, the generator reports no result or asks for a re-query.
Tools built this way include Elicit, Consensus, and CiteDash's deep research pipeline. The operating principle in all three is that the generator cannot cite what retrieval did not return. There is no generator-only mode. The constraint is architectural, not a preference that can be overridden under pressure.
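In code, the difference from the RAG sketch above is one structural choice: there is no path on which the generator runs without retrieved sources, and nothing outside the retrieved set can survive as a citation. A minimal sketch, again with hypothetical stand-ins rather than any named tool's actual implementation:

```python
def answer_retrieval_first(query: str, search, generate, k: int = 5) -> dict:
    """Retrieval-first generation in outline. If retrieval returns nothing,
    the system says so instead of falling back to the bare generator."""
    documents = search(query, limit=k)
    if not documents:
        # No generator-only fallback: the honest output is "no result".
        return {"answer": None, "note": "No sources retrieved; try re-phrasing the query."}

    allowed_ids = {doc.id for doc in documents}
    draft = generate(query, sources=documents)   # the generator sees only retrieved sources

    # Architectural constraint: reject any citation whose id was never retrieved.
    stray = [c for c in draft.citations if c.source_id not in allowed_ids]
    if stray:
        return {"answer": None, "note": "Draft cited an unretrieved source; regenerate."}

    return {"answer": draft.text, "citations": list(draft.citations)}
```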
Post-generation verification
On top of retrieval-first generation, the best academic AI pipelines add a post-generation verification step: a separate model checks, after generation, that every citation in the output corresponds to a retrieved source and that every claim is supported by the cited source. CiteDash's Reviewer Agent is this layer.
The combination — retrieval-first generation plus post-generation verification — is what closes the hallucination failure modes that RAG alone leaves open. A retrieved source that the writer used correctly passes verification. A retrieved source that the writer used incorrectly gets flagged. A citation that was not actually retrieved cannot be produced at all.
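A sketch of that verification layer under the same assumptions; `supports(claim_text, source)` stands in for a separate model call or entailment check, and the claim-and-citation structure on the draft is hypothetical:

```python
def verify_output(draft, retrieved_sources, supports) -> list:
    """Post-generation verification in outline. Checks that (1) every citation
    maps to a retrieved source and (2) every cited claim is supported by it.
    `supports(claim_text, source)` is a hypothetical stand-in for a separate
    model or classifier that judges support."""
    sources_by_id = {source.id: source for source in retrieved_sources}
    problems = []

    for claim in draft.claims:                       # each claim carries its citations
        for citation in claim.citations:
            source = sources_by_id.get(citation.source_id)
            if source is None:
                problems.append((claim.text, citation, "citation not in retrieved set"))
            elif not supports(claim.text, source):
                problems.append((claim.text, citation, "source does not support the claim"))

    return problems   # empty means every citation is retrieved and every claim supported
```

The second check is the one that catches the real-source-wrong-claim case that retrieval alone cannot rule out.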
Why general-purpose chatbots do not adopt this architecture
The short answer is that general-purpose chatbots are optimised for a different task. ChatGPT needs to answer a wide range of queries — coding, creative writing, math problems, factual questions, brainstorming — and most of those queries do not benefit from a retrieval-first constraint. A retrieval-first architecture would make ChatGPT worse at writing a poem.
What this means in practice is that the general-purpose product is structurally not the right tool for citation-dependent work. This is not a criticism of ChatGPT; it is a statement about tool fit. For citation-dependent work, you want a tool whose architecture is aligned with the task. For brainstorming, a general-purpose chatbot is fine. The categorical distinction is what readers should take away. See our citation generator for a lightweight retrieval-first option for single-citation lookup.
The Road Ahead
Hallucination rates will continue to drift down over the next few years as models get larger, training data gets cleaner, RLHF gets more nuanced, and retrieval integration gets tighter. We expect to see ChatGPT's fabrication rate on common queries fall meaningfully. We expect to see bigger improvements on obscure queries as retrieval coverage improves.
What we do not expect, short of an architectural shift, is a world in which general-purpose chatbots can be trusted as primary citation sources for academic work. The token-prediction architecture is not going to be replaced by something fundamentally different inside a commercial chatbot any time soon. Hallucination rates will go down. They will not hit zero.
The long-run equilibrium in academic AI is therefore likely to be a market with two product categories: general-purpose chatbots, used for brainstorming and writing assistance, with known hallucination rates that are improving but not zero; and purpose-built retrieval-first academic tools, used for anything where citation integrity is load-bearing, with structurally near-zero fabrication. Users who understand the distinction will use both. Users who do not will get burned.
If you want the comparative numbers, see our 2026 benchmark. If you want the user-side verification workflow, see how to detect AI hallucinations. If you are a faculty member writing a course policy, see our academic integrity guide for how to translate the architectural distinction into enforceable rules.
The deeper point is simple. When someone asks "why does ChatGPT make things up," the temptation is to say "because it is a bug." That framing is wrong, and more importantly it is misleading about what the fix looks like. ChatGPT makes things up because it is a token-prediction system, token-prediction systems fill gaps with plausible inventions, and plausible inventions are indistinguishable at the output level from real content. The fix is not a better token-prediction system. The fix is an architecture in which retrieval is load-bearing and verification is independent. That is a different product. Once you see the distinction, most of the confusion in the public conversation about AI hallucinations resolves.
Understanding the architecture is not just academic. It tells you which tools to use for which tasks, which failure modes to watch for, and why the fixes that look plausible in press releases often fall short in practice. In 2026, that understanding is part of literate AI use.
Related reading
The 2026 AI Citation Hallucination Benchmark: ChatGPT vs Claude vs Perplexity vs Elicit
A cross-tool benchmark of citation fabrication rates across ChatGPT-4o, Claude 3.5 Sonnet, Gemini 1.5 Pro, Perplexity, Elicit, and Consensus. Preliminary results.
ChatGPT Fake Citations: Why AI Hallucinations Matter for Research
ChatGPT fabricates citations that look real but don't exist. Learn why this matters for academic research and how to verify AI-generated references.
Perplexity vs ChatGPT for Research: Which Is Better in 2026?
An honest comparison of Perplexity and ChatGPT for research tasks. Citation accuracy, depth, pricing, and when to reach for each one.