Fact-Checkers Warn AI Answers Are Often Wrong, Forcing Continued Human Review

5/27/2026, 1:44:02 AM

Fact-checkers and studies find large language models and AI search overviews frequently produce inaccurate answers, prompting newsrooms and builders to keep humans in the loop and to run regular benchmarks and audits.

Fact-checkers say large language models and AI-powered search overviews are frequently incorrect, a problem that is forcing newsrooms, verification teams and product builders to preserve human review to avoid amplifying errors. This matters because roughly half of Americans report using AI to find information or generate ideas, which increases the risk that plausible-sounding but false outputs will spread. A veteran fact-checker describes routine interactions that illustrate the gap between plausibility and accuracy: when asked about its sources, a chatbot returned a recipe for vegan cream cheese, an answer that sounded credible but was irrelevant and unverified. The writer argues that these models largely repackage human-created content and can mislead when factual truth is at stake.

Organizations fighting misinformation are using AI to process massive volumes of content and to surface candidate claims for review. One initiative built tools that analyze social posts, comments and podcast transcripts to flag specific assertions; those tools are now in use in more than 40 countries. Mark Frankel, the initiative’s head of public affairs, underscores the limits of automation: "You definitely need a human being."

Several recent studies and benchmarks quantify the scale of errors. Since 2018 nearly 17,000 arXiv papers have examined large language models and their reliability. A March 2025 Tow Center study found more than 60 percent of responses from AI-powered search engines were inaccurate, while a BBC study placed chatbot error rates near 45 percent. One professional fact-checker says the AI Overviews that search engines return are essentially unusable about a third of the time.

On formal benchmarks, models show mixed performance: Claude scored about 73 percent accuracy on RealFactBench, while Grok was not assessed in that evaluation. OpenAI’s SimpleQA, released in October 2024, posed over 4,000 single-answer questions to measure model performance.

Those error rates have concrete implications for how fact-checking and publishing operate. The writer contrasts an "old-school" fact-checking workflow (meticulous line-by-line annotations, tracking of primary sources, and layered legal and ethical review) with a faster, peer-review-style verification designed to keep up with news cycles. AI can accelerate the identification of potentially false claims, but it has not supplanted the detailed human work needed to confirm context, sourcing and legal exposure.

For product teams and builders, the practical takeaway is to treat AI as a triage and discovery layer, not an authoritative verifier. Systems that ingest AI Overviews or model outputs should route flagged claims into human-in-the-loop pipelines that require primary-source checks, direct outreach and documented decisions. Teams should incorporate routine audits, accuracy benchmarks, and legal or ethical review; run tests such as RealFactBench or SimpleQA; log model error types; and build rapid human review channels so that AI scales detection without replacing human judgment.
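The routing pattern described above might look roughly like the following minimal Python sketch. It is illustrative only: the names `Claim`, `ReviewQueue`, `triage`, `record_decision` and `model_agreement_rate`, and the error-type labels, are hypothetical and not drawn from any tool mentioned in the article. The sketch assumes the model only flags candidate claims, that a human records every verdict, and that a decision cannot be logged without a primary source.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)  # compare claims by identity, not by field values
class Claim:
    text: str
    model_verdict: str                    # what the model suggested, e.g. "false"
    flagged: bool                         # True if the model surfaced it for review
    human_verdict: Optional[str] = None   # set only by a human reviewer
    primary_source: Optional[str] = None  # source the reviewer actually checked
    error_type: Optional[str] = None      # hypothetical label for a model mistake


@dataclass
class ReviewQueue:
    pending: List[Claim] = field(default_factory=list)
    decided: List[Claim] = field(default_factory=list)
    error_log: Counter = field(default_factory=Counter)

    def triage(self, claims: List[Claim]) -> None:
        """Queue model-flagged claims for human review; nothing is auto-published."""
        for claim in claims:
            if claim.flagged:
                self.pending.append(claim)

    def record_decision(self, claim: Claim, human_verdict: str,
                        primary_source: str,
                        error_type: Optional[str] = None) -> None:
        """Store a documented human decision; a primary source is mandatory."""
        if not primary_source:
            raise ValueError("a primary source is required to record a decision")
        claim.human_verdict = human_verdict
        claim.primary_source = primary_source
        claim.error_type = error_type
        if error_type:
            self.error_log[error_type] += 1  # tally model error types for audits
        self.pending.remove(claim)
        self.decided.append(claim)

    def model_agreement_rate(self) -> Optional[float]:
        """Share of decided claims where the model matched the human verdict."""
        if not self.decided:
            return None
        agree = sum(1 for c in self.decided
                    if c.model_verdict == c.human_verdict)
        return agree / len(self.decided)


if __name__ == "__main__":
    queue = ReviewQueue()
    a = Claim("Claim A", model_verdict="false", flagged=True)
    b = Claim("Claim B", model_verdict="false", flagged=True)
    queue.triage([a, b])
    queue.record_decision(a, "false", "official report (placeholder)")
    queue.record_decision(b, "true", "interview notes (placeholder)",
                          error_type="misread context")
    print(queue.model_agreement_rate(), dict(queue.error_log))
```

Keeping the agreement rate and the error tally in the same structure as the decisions means a periodic audit can be run from the review log itself. A real pipeline would also need persistence, reviewer identity and legal sign-off, which this sketch leaves out.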

Sources

WIRED AI · 5/26/2026
