
Researchers from Peking University and the Shanghai Artificial Intelligence Laboratory released CiteVQA, a benchmark of 1,897 questions tied to 711 PDFs that requires exact paragraph, table, or figure citations.
Researchers at Peking University and the Shanghai Artificial Intelligence Laboratory published CiteVQA, a benchmark designed to expose what they call “attribution hallucination”: models frequently give correct answers while pointing to document passages that do not support those answers. The dataset forces models to provide an exact paragraph, table, or figure as evidence rather than a page number, revealing weaknesses that matter for auditability and compliance in regulated industries.
CiteVQA contains 1,897 questions linked to 711 PDF documents across seven subject areas, with 451 documents in English and 260 in Chinese; the documents average 40.6 pages each. The dataset and its evaluation pipeline were built fully automatically so that checks of both answer correctness and citation traceability can scale, and every question is tied to a specific document element rather than a coarse location.
The authors introduce Strict Attributed Accuracy (SAA) as the core metric: a model scores points only when its answer is correct and the cited element precisely matches the ground-truth evidence; a correct answer with a wrong citation scores zero. Their automated pipeline breaks documents into elements (paragraphs, tables, figures), has models trace chains of evidence, and then removes documents one by one to identify which pieces are truly necessary to derive the answer. The pipeline includes runs with models such as Gemini 3.0 Flash to trace evidence chains.
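The scoring rule described above can be illustrated with a minimal Python sketch. This is not the authors' implementation: it assumes a binary correctness judgment per answer and a single ground-truth element per question, and the names `Prediction`, `cited_element`, and `gold_element`, along with the element ID format, are hypothetical. The paper's actual grading may be more fine-grained.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Prediction:
    """One model response (hypothetical schema, for illustration only)."""
    answer_correct: bool  # assumed to come from an external answer judge
    cited_element: str    # element ID the model cited, e.g. "doc7/table2"
    gold_element: str     # ground-truth evidence element ID


def answer_accuracy(preds: Sequence[Prediction]) -> float:
    """Share of correct answers on a 0-100 scale, ignoring citations."""
    if not preds:
        return 0.0
    return 100.0 * sum(p.answer_correct for p in preds) / len(preds)


def strict_attributed_accuracy(preds: Sequence[Prediction]) -> float:
    """Simplified SAA on a 0-100 scale.

    A response counts only if the answer is correct AND the cited element
    matches the ground-truth element exactly.
    """
    if not preds:
        return 0.0
    hits = sum(
        p.answer_correct and p.cited_element == p.gold_element for p in preds
    )
    return 100.0 * hits / len(preds)


# Toy data: three correct answers, but only two cite the right element.
toy = [
    Prediction(True, "doc1/para4", "doc1/para4"),    # counts
    Prediction(True, "doc1/table2", "doc1/table2"),  # counts
    Prediction(True, "doc2/para9", "doc2/fig1"),     # right answer, wrong cite
    Prediction(False, "doc3/fig2", "doc3/fig2"),     # right cite, wrong answer
]

print(answer_accuracy(toy), strict_attributed_accuracy(toy))
```

On this toy set, answer-only accuracy is 75.0 while the strict score is 50.0, the same kind of gap the benchmark is designed to surface.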
Twenty models were evaluated on CiteVQA, and results show a wide gap between raw answer quality and evidence alignment. The top SAA performer was Gemini-3.1-Pro-Preview, which scored 76 out of 100. GPT-5.4 demonstrated a large discrepancy: 87.1 on raw answer quality but only 59 once correct citations were required. The strongest open model, Qwen3-VL-235B-A22B, reached 22.5, while most smaller open models scored below 10, underscoring a substantial divide between commercial and open offerings on citation accuracy.
Models also struggled to locate the correct page or element even when they found the answer. The Gemini 3 series identified the correct page in over 87% of cases, while Qwen3 located it in just under 58%. Tasks that require synthesizing multiple documents reduced recall for Gemini-3.1-Pro-Preview from about 69% down to 55%. Performance varied by layout and task: academic papers with clean formats produced the highest scores, newspapers and magazines capped even top models near 63 points, and math tasks tended to score better because the evidentiary cues are more explicit.
The paper argues that benchmarks grading only final answers miss a critical traceability requirement: an AI that gives a correct answer but cannot point to verifiable evidence undermines audit trails needed in law, healthcare and finance. CiteVQA provides a systematic test developers and evaluators can use to measure whether models truly “show their work” and to prioritize citation accuracy alongside answer quality.