On June 1, 2026, Richard Sutton argued that ordinary generative models cannot perform genuine scientific discovery


6/2/2026, 1:48:33 AM


On June 1, 2026, Turing Award winner Richard Sutton published a critique arguing that conventional generative AI cannot do real science because it cannot evaluate and selectively retain its own outputs. He framed the problem as one of recognition: novel ideas sometimes emerge from these models, but if the system cannot judge their value, those ideas vanish instead of becoming lasting knowledge. For builders and researchers, the implication is immediate: adding runtime evaluation and selection mechanisms is necessary if AI systems are to produce reproducible scientific insight.

Sutton describes how current generative models operate: they learn patterns from massive corpora of examples and produce outputs that resemble their training data. When outputs fall outside the training distribution, they are commonly dismissed as hallucinations. He summarizes the practical effect with a quip often applied to the field: “This work is both novel and good. Unfortunately, the parts that are good are not novel, and the parts that are novel are not good,” noting that this diagnosis fits much of today’s generative AI.

To separate genuine discovery from mere generation, Sutton invokes a three-step process present in evolution, the scientific method, and reinforcement learning: variation, evaluation, and selective retention. He argues that language and image models can generate variation but typically do not perform evaluation at runtime. Without testing or selection, novelty appears briefly but cannot be retained, iterated on, or integrated into reliable knowledge.
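As a rough illustration of the three steps named above, here is a minimal Python sketch of a variation, evaluation, and selective-retention loop on a toy string-matching task. It is not taken from Sutton's critique; the function names (`vary`, `evaluate`, `discover`), the toy objective, and the keep-if-not-worse rule are all assumptions chosen for brevity.

```python
import random


def evaluate(candidate: str, target: str) -> int:
    """Evaluation (hypothetical toy objective): count positions matching the target."""
    return sum(a == b for a, b in zip(candidate, target))


def vary(candidate: str, alphabet: str, rng: random.Random) -> str:
    """Variation: replace one randomly chosen character."""
    i = rng.randrange(len(candidate))
    return candidate[:i] + rng.choice(alphabet) + candidate[i + 1:]


def discover(target: str, alphabet: str, steps: int, seed: int = 0):
    """Run the loop and return (best candidate, its score, score history)."""
    rng = random.Random(seed)
    best = "".join(rng.choice(alphabet) for _ in target)
    best_score = evaluate(best, target)
    history = [best_score]
    for _ in range(steps):
        candidate = vary(best, alphabet, rng)   # variation
        score = evaluate(candidate, target)     # evaluation
        if score >= best_score:                 # selective retention
            best, best_score = candidate, score
        history.append(best_score)
    return best, best_score, history


if __name__ == "__main__":
    best, score, _ = discover("move 37", "abcdefghijklmnopqrstuvwxyz 0123456789", 2000)
    print(best, score)
```

Dropping the `if score >= best_score` check leaves only variation: candidates are still produced, but nothing is retained between steps, which is the failure mode the paragraph above describes.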

Sutton points to concrete systems that do include evaluation loops and that he believes demonstrate real creative discovery: AlphaGo (including its famous move 37), AlphaZero’s distinctive chess play, AlphaFold’s protein-structure predictions, AlphaProof in mathematics, Claude Code for programming, and GT Sophy in simulated racing. What these systems share is a measurable objective (a win probability, a formally checkable proof, a successful program run, or a reward signal) that lets the system select and iterate on better solutions.

He emphasizes that evaluation can be provided by humans (for example, users choosing the best image) but is most powerful when it is an automated, task-specific signal such as checkmate, a passing test suite, a formally valid proof, or a simulator reward. Sutton adds that language and image models augmented with search, verifiers, tooling, reinforcement learning, or formal validators can therefore become parts of genuine discovery systems, while cautioning that it remains an open question how far that structure scales beyond programming, games, and clearly testable domains.
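To make the idea of an automated, task-specific signal concrete, the following Python sketch filters candidate functions through a small test suite and keeps only those that pass. It is an illustrative assumption, not a description of any system Sutton names: the candidates stand in for model-generated programs, and `passes_tests`, `select_verified`, `CASES`, and `CANDIDATES` are hypothetical names.

```python
from typing import Callable, List, Sequence, Tuple

Case = Tuple[int, int]  # (input, expected output)


def passes_tests(fn: Callable[[int], int], cases: Sequence[Case]) -> bool:
    """Automated evaluation: True only if fn matches every expected output."""
    for x, expected in cases:
        try:
            if fn(x) != expected:
                return False
        except Exception:
            # A crashing candidate fails the check rather than stopping the loop.
            return False
    return True


def select_verified(
    candidates: Sequence[Tuple[str, Callable[[int], int]]],
    cases: Sequence[Case],
) -> List[str]:
    """Selective retention: keep the names of candidates that pass the suite."""
    return [name for name, fn in candidates if passes_tests(fn, cases)]


# Hypothetical test suite for "square an integer".
CASES = [(0, 0), (2, 4), (-3, 9)]

# Stand-ins for generated programs; only one is correct.
CANDIDATES = [
    ("double", lambda x: 2 * x),
    ("square", lambda x: x * x),
    ("crash", lambda x: x // 0),
]

if __name__ == "__main__":
    print(select_verified(CANDIDATES, CASES))  # prints ['square']
```

Note that "double" agrees with the first two cases and is rejected only by the third, so the strength of the signal depends on how well the test suite covers the task.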

Practically, Sutton concludes that generative capability alone is insufficient for systems intended to produce lasting scientific insight. Generative models will remain useful for summaries, assistance, and entertainment where imitation suffices, but systems aimed at discovery will need runtime evaluation and selective retention mechanisms, whether human-in-the-loop feedback or precise, automatable checks, to turn fleeting novelty into reproducible knowledge.

Sources

The Decoder AI · 6/1/2026
