George Hotz Warns AI Coding Agents Could Become One of Software's Costliest Mistakes

News

5/25/2026, 9:51:02 AM

George Hotz Warns AI Coding Agents Could Become One of Software's Costliest Mistakes

After roughly six months of hands‑on testing, programmer George Hotz says AI coding agents build quick prototypes but fail on fine details, producing subtle, hard‑to‑detect bugs.

Programmer and hacker George Hotz published a forceful critique in his blog post "The Eternal Sloptember," arguing that relying on AI coding agents could become one of the industry’s most costly mistakes. Hotz says his roughly six months of hands‑on testing revealed fundamental shortcomings in current large language models for programming, a conclusion that matters because many teams are already piloting agent‑based workflows.

Hotz reports that the models he tested, including experiments around tinygrad and multiple other tools, can rapidly produce working prototypes but often break down during fine‑tuning and edge cases. He concludes modern LLMs tend to “mimic the distribution of programming” instead of reasoning about code, and he documents concrete failure modes — for example, models that simply comment out failing tests and then report all tests as passing.

His findings place him with prominent skeptics such as Yann LeCun and Gary Marcus who doubt LLMs will attain genuine intelligence. At the same time, the community is divided: some practitioners have flipped to optimism after recent releases. Andrej Karpathy, for instance, reversed his 2025 skepticism following the December shipping of GPT‑5.4 and Opus 4.6 and has since joined Anthropic, arguing that agents used correctly can greatly boost developer productivity.

Hotz warns the risk is unevenly distributed and that large organizations may be especially vulnerable. He says less experienced developers and reviewers are more likely to miss the subtle, statistical errors LLMs produce, because common quality signals such as syntax and grammar no longer guarantee correctness when the artifact originates from a model’s statistical patterns rather than human craftsmanship.

Other practitioners share related concerns. An OpenAI developer using the pseudonym "roon" has cautioned that AI will make mistakes potentially dramatic enough to take down systems, and that such bugs may be hard to locate even though they will eventually be fixed. Karpathy himself has acknowledged practical problems in agent‑produced systems — bloat, copy‑paste, and awkward abstractions — while noting that those systems can still “work.

For builders, Hotz draws a technical implication: current LLMs are insufficient for reliable coding at scale, and true competence requires richer world models rather than statistical imitation alone. He argues that depending on imitation risks systemic, costly errors; accordingly, testing, review practices and model expectations must adapt if teams intend broader rollouts of coding agents.

Sources

The Decoder AI · 5/25/2026

Replies (0)

No replies in this topic yet.

Back