Benchmark shows leading AI agents default to memorized facts and falter at real‑time web research


5/31/2026, 8:10:20 AM

Researchers at Harbin Institute of Technology and Xiaohongshu used LiveBrowseComp, a benchmark of 335 questions tied to a 90‑day recency window, to test 11 AI search agents and found that many rely on memorized knowledge rather than live web research.

A new study by Harbin Institute of Technology in collaboration with Xiaohongshu finds that several leading AI search agents rely heavily on internal memory rather than conducting fresh web research. The team released LiveBrowseComp, a time‑limited benchmark built from 335 human‑written questions, each of which depends on at least one fact from the 90 days before the question was created; the design is intended to reveal whether agents actually consult recent online sources. The result matters because models that appear knowledgeable may simply be regurgitating recent training data, which means benchmark scores can overstate their real‑time research capabilities.
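The recency rule described above can be illustrated with a small check. This is only a sketch of the stated criterion, not the authors' tooling: the function name, the `fact_dates` input and the inclusive window boundaries are all assumptions made for illustration.

```python
from datetime import date, timedelta

WINDOW_DAYS = 90  # window length as reported in the article


def depends_on_recent_fact(created: date, fact_dates: list[date]) -> bool:
    """Return True if at least one supporting fact is dated within the
    90 days before the question was created.

    Hypothetical helper: treating both window edges as inclusive is an
    assumption, since the article does not specify the boundaries.
    """
    window_start = created - timedelta(days=WINDOW_DAYS)
    return any(window_start <= d <= created for d in fact_dates)
```

A question whose supporting facts are all older than the window would fail this check, which is the kind of question the benchmark is designed to exclude.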

LiveBrowseComp was run alongside earlier BrowseComp tests across eleven agents, including frontier models named in the paper: GPT‑5.4, Gemini 3.1 Pro, Claude Sonnet 4.6, DeepSeek V4‑Pro, Kimi K2.6 and MiniMax M2.5. The benchmark intentionally excluded globally prominent events to reduce training‑data leakage and drew on curated source sets (film databases, game directories, security vulnerability registers and earthquake catalogs) to make requests repeatable and verifiable within the 90‑day window.

In baseline ablations that removed all browsing tools, several models still scored well from parametric memory alone. MiniMax M2.5 solved 44.5% of BrowseComp tasks from memory, and Kimi K2.6 reached 62% on the Chinese BrowseComp‑ZH variant. Those results show that strong performance on time‑sensitive benchmarks can reflect recent memorization rather than active retrieval or evidence‑based reasoning.

The authors introduce the term intrinsic knowledge dependence (IKD) to characterize this reliance on internalized facts. A second, telling ablation left a search interface available but stripped supporting documents from the index; every model then performed worse than with no tools at all. For example, MiniMax fell from 44.5% to 8.0%, and Kimi dropped from 25.5% to 2.3% on the tested variant. The drops suggest that, absent confirmatory evidence, a search interface can actively steer agents away from otherwise correct memory‑based answers.
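To put those drops in perspective, the figures quoted above can be converted into absolute and relative losses. The snippet below is a plain arithmetic sketch using only the four percentages from the article; the function name and the dictionary layout are illustrative assumptions.

```python
def ablation_drop(no_tools: float, stripped_index: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop as a fraction)."""
    absolute = no_tools - stripped_index
    relative = absolute / no_tools
    return absolute, relative


# (score without tools, score with search but a stripped index), as reported
scores = {"MiniMax": (44.5, 8.0), "Kimi": (25.5, 2.3)}

for name, (no_tools, stripped) in scores.items():
    abs_drop, rel_drop = ablation_drop(no_tools, stripped)
    print(f"{name}: -{abs_drop:.1f} points ({rel_drop:.0%} relative)")
# MiniMax: -36.5 points (82% relative)
# Kimi: -23.2 points (91% relative)
```

In relative terms, both models lost more than four-fifths of the accuracy they had achieved from memory alone.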

Path analysis traces the mechanism: more than half of agent queries are driven by the model’s own reasoning rather than by previously discovered hits, and when relevant sources do appear, agents incorporate them into final answers less than one‑third of the time. LiveBrowseComp’s 90‑day window, curated sources and controlled ablations (no tools, and tools with stripped indexes) offer a repeatable methodology to separate memorized knowledge from true browsing and to measure how well agents integrate evidence.
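The two path-analysis measures mentioned above could be computed from annotated agent traces roughly as follows. This is a hedged sketch, not the paper's method: the `Query` and `Run` record types, their field names and both function names are hypothetical, and the article does not describe how the authors label queries or detect evidence use.

```python
from dataclasses import dataclass


@dataclass
class Query:
    # Hypothetical label: True if the query builds on an earlier search hit,
    # False if it comes from the model's own reasoning.
    derived_from_hit: bool


@dataclass
class Run:
    queries: list[Query]
    relevant_source_seen: bool   # a relevant source appeared in the results
    source_used_in_answer: bool  # that source was reflected in the final answer


def reasoning_driven_share(runs: list[Run]) -> float:
    """Fraction of all queries that were not derived from an earlier hit."""
    queries = [q for run in runs for q in run.queries]
    if not queries:
        return 0.0
    return sum(not q.derived_from_hit for q in queries) / len(queries)


def evidence_uptake_rate(runs: list[Run]) -> float:
    """Among runs where a relevant source appeared, fraction that used it."""
    seen = [run for run in runs if run.relevant_source_seen]
    if not seen:
        return 0.0
    return sum(run.source_used_in_answer for run in seen) / len(seen)
```

Under this framing, the article's findings correspond to a reasoning-driven share above 0.5 and an evidence uptake rate below one-third.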

Sources

The Decoder AI · 5/31/2026
