Getting the answer right is no longer enough
A growing body of work in artificial intelligence is shifting the focus from whether a model can answer a question to whether it can prove where the answer came from. New research highlighted by The Decoder suggests many leading systems still struggle with that second part.
Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory created a benchmark called CiteVQA to measure both answer accuracy and source attribution in document question answering. Their conclusion is uncomfortable for anyone hoping to rely on AI in high-stakes settings: a model can produce the correct answer and still point to the wrong evidence.
The team calls this failure mode “attribution hallucination.” In practice, that means an AI system may sound trustworthy because its final response is accurate, while the citation offered in support does not actually justify the answer.
Why citation quality matters
Standard document-analysis benchmarks such as DocVQA and MMLongBench-Doc typically grade the final answer. That leaves a major blind spot. A model may have reasoned from the source material, but it may also have guessed from prior knowledge, pattern matching or partial cues in the prompt.
In many consumer uses, that distinction can be overlooked. In law, medicine, finance and auditing, it cannot. The paper argues that traceability is what makes an AI output usable in the first place. If a system cannot reliably identify the paragraph, table or figure that supports its answer, a polished response may still be operationally unsafe.
CiteVQA is designed to expose that gap directly. A page number is not enough. Models are required to identify precise source locations inside the document, down to the specific supporting element.
A harder test than ordinary document QA
The benchmark includes 1,897 questions across 711 PDFs from seven subject areas, with 451 documents in English and 260 in Chinese. The average document length is 40.6 pages, making its documents substantially longer than those in many existing document benchmarks.
Rather than relying on fully manual labeling, the researchers built an automated pipeline. Documents are broken into individual elements, then models trace chains of evidence. The pipeline tests whether each cited component is truly necessary by removing the components one by one and checking whether the model can still answer. If it cannot, that evidence is treated as essential.
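The necessity check can be pictured as a leave-one-out loop over pieces of evidence. The sketch below is only an illustration under stated assumptions: the function name essential_elements, the can_answer callback and the toy element labels are hypothetical and do not come from the paper, whose actual pipeline may differ.

```python
from typing import Callable, List


def essential_elements(
    elements: List[str],
    can_answer: Callable[[List[str]], bool],
) -> List[str]:
    """Return the elements whose removal makes the question unanswerable.

    `can_answer` is a hypothetical stand-in for querying a model with the
    remaining context and checking its answer against the reference.
    """
    essential = []
    for i, element in enumerate(elements):
        # Drop exactly one element and keep everything else.
        remaining = elements[:i] + elements[i + 1:]
        if not can_answer(remaining):
            essential.append(element)
    return essential


def toy_can_answer(context: List[str]) -> bool:
    # Assumption for the demo: the answer needs both of these items.
    return "para_12" in context and "table_3" in context


elements = ["para_1", "para_12", "figure_2", "table_3"]
print(essential_elements(elements, toy_can_answer))  # ['para_12', 'table_3']
```

In this toy setup, removing para_1 or figure_2 leaves the question answerable, so only the two items the answer depends on are flagged as essential.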
The core metric is Strict Attributed Accuracy. Under that scoring, a model gets credit only when both parts succeed: the answer is correct and the citation lands on the right supporting material. A correct answer paired with a wrong citation scores zero.
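As a rough illustration of that scoring rule, the sketch below compares plain answer accuracy with a strict score over a few made-up results. The Result fields, the boolean correctness flags and the sample data are assumptions for illustration; the benchmark's actual criteria for matching answers and citations are not described here.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Result:
    answer_correct: bool    # hypothetical flag: answer matches the reference
    citation_correct: bool  # hypothetical flag: citation hits the supporting element


def answer_accuracy(results: List[Result]) -> float:
    """Percentage of correct answers, ignoring citations."""
    return 100.0 * sum(r.answer_correct for r in results) / len(results)


def strict_attributed_accuracy(results: List[Result]) -> float:
    """Percentage of items with a correct answer AND a correct citation."""
    hits = sum(r.answer_correct and r.citation_correct for r in results)
    return 100.0 * hits / len(results)


# Made-up sample: three correct answers, only one backed by the right citation.
results = [
    Result(answer_correct=True, citation_correct=True),
    Result(answer_correct=True, citation_correct=False),
    Result(answer_correct=True, citation_correct=False),
    Result(answer_correct=False, citation_correct=False),
]

print(answer_accuracy(results))             # 75.0
print(strict_attributed_accuracy(results))  # 25.0
```

The two correct answers with wrong citations count toward answer accuracy but contribute nothing to the strict score, which is the gap the benchmark is built to surface.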
Top models still lose substantial ground
Twenty current models were evaluated. The best-performing system, Gemini-3.1-Pro-Preview, scored 76 out of 100 on the strict metric. That is strong relative performance, but it still leaves a substantial gap between the best available result and reliable near-perfect attribution.
The benchmark also exposed a notable difference between answer quality and evidence quality. GPT-5.4 reportedly scored 87.1 on raw answer performance, but that dropped to 59 once correct citation was required. In other words, the model often knew what to say without consistently showing where in the document the answer came from.
Open-source systems fared much worse in the reported results. Qwen3-VL-235B-A22B, described as the strongest freely available model in the comparison, reached 22.5. Smaller open models mostly landed below 10. The researchers characterize that level of performance as extremely risky for regulated industries.
Finding the right page is still a major hurdle
One of the clearest messages from the benchmark is that many models struggle even before the finer-grained citation task begins. They often fail to identify the correct page, which makes accurate paragraph- or figure-level attribution even harder.
That matters because users often interpret citations as a built-in safety feature. In reality, a citation format can conceal a weak retrieval step. A system that attaches evidence-looking references to a response may appear more reliable than one that answers without references, even if the evidence is wrong.
CiteVQA suggests the industry should be more careful about treating source-linked output as inherently trustworthy. Attribution has to be measured, not assumed.
A benchmark aimed at practical trustworthiness
The study’s significance is less about declaring one model the winner and more about redefining the target. If AI is going to be used for professional reading, compliance review, due diligence or evidence-based assistance, the bar cannot stop at fluent summaries and mostly correct answers.
What matters is whether a model can retrieve the exact support it claims to be using. The benchmark makes that visible and quantifiable. It also shows that current systems, including top-tier ones, remain uneven on this front.
That does not mean document AI is unusable. It does mean deployment decisions should distinguish between “answering well” and “grounding well.” CiteVQA frames those as separate capabilities, and the results suggest the second is still lagging.
For enterprise buyers, regulators and teams building AI into research workflows, that is likely the main takeaway. The next competitive frontier in document intelligence may not be producing more confident prose. It may be proving, with precision, that the prose is anchored to the right line in the right source.
This article is based on reporting by The Decoder.
Originally published on the-decoder.com


