AutoTTS agent discovers test‑time scaling that cuts compute by ~70% while matching accuracy

5/24/2026, 8:26:34 AM

A cross‑institutional team used AutoTTS, an offline simulation environment, to let the Claude Code coding agent search for test‑time scaling controllers. The discovered controller matched self‑consistency accuracy on AIME and HMMT while reducing token use by about 70%.

An automated search using AutoTTS and the Claude Code coding agent found a test‑time scaling controller that matches the accuracy of standard self‑consistency on AIME and HMMT while cutting token usage by roughly 70 percent. That sharply reduces compute per query compared with the baseline, which generates 64 answers in parallel and votes on them. The result matters because it shows teams can substantially lower inference cost without sacrificing performance on challenging math benchmarks.
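For readers unfamiliar with the baseline, self‑consistency amounts to sampling several answers and taking a majority vote, with cost equal to the tokens spent on every sample. The sketch below is only an illustration: the function name, the (answer, token count) input shape, and the toy data are assumptions, not code or results from the AutoTTS work.

```python
from collections import Counter
from typing import List, Tuple


def self_consistency(samples: List[Tuple[str, int]]) -> Tuple[str, int]:
    """Majority-vote over sampled answers and report the total token cost.

    `samples` holds (final_answer, tokens_spent) pairs, one per sampled
    solution. The input shape is an assumption made for this sketch.
    Ties go to the answer that was sampled first.
    """
    if not samples:
        raise ValueError("need at least one sampled answer")
    votes = Counter(answer for answer, _ in samples)
    winner = votes.most_common(1)[0][0]
    # The baseline pays for every sample, whether or not it agrees with the vote.
    total_tokens = sum(tokens for _, tokens in samples)
    return winner, total_tokens


# Toy data for illustration only (the baseline in the article uses 64 samples).
toy_samples = [("42", 100), ("41", 120), ("42", 90)]
answer, cost = self_consistency(toy_samples)
```

Because every sample is paid for in full, any controller that can stop early or skip unpromising samples has room to save tokens relative to this baseline.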

The researchers built AutoTTS as an offline search environment that pre‑generates many solution paths from a language model and stores them so candidate controllers can be evaluated without repeatedly invoking the model. Claude Code iteratively reviewed prior proposals, wrote new controller candidates in code, and exposed only one high‑level controller per proposal to limit combinatorial blowup; full logs from each run were used to guide later proposals.
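The idea of evaluating controllers against stored paths can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the AutoTTS implementation: `StoredPath`, `replay`, `evaluate`, and the sequential "draw another path?" controller interface are all hypothetical names and simplifications chosen for this example.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass(frozen=True)
class StoredPath:
    answer: str  # final answer of a pre-generated solution (hypothetical field)
    tokens: int  # tokens the model spent producing it (hypothetical field)


# Hypothetical interface: given the answers drawn so far, return True to draw more.
Controller = Callable[[List[str]], bool]


def replay(paths: List[StoredPath], controller: Controller) -> Tuple[str, int]:
    """Replay cached paths under a controller; no model call is made."""
    if not paths:
        raise ValueError("no stored paths for this question")
    seen: List[str] = []
    spent = 0
    for path in paths:
        # Always draw the first path, then ask the controller before each new one.
        if seen and not controller(seen):
            break
        seen.append(path.answer)
        spent += path.tokens
    return Counter(seen).most_common(1)[0][0], spent


def evaluate(cache: Dict[str, List[StoredPath]], gold: Dict[str, str],
             controller: Controller) -> Tuple[float, float]:
    """Return (accuracy, mean tokens per question) over the cached questions."""
    correct, total_tokens = 0, 0
    for qid, paths in cache.items():
        answer, spent = replay(paths, controller)
        correct += int(answer == gold[qid])
        total_tokens += spent
    return correct / len(cache), total_tokens / len(cache)


def use_all(seen: List[str]) -> bool:
    """Baseline-like controller: consume every stored path."""
    return True


def lead_margin_controller(margin: int) -> Controller:
    """Toy controller: stop once the top answer leads by `margin` votes."""
    def keep_going(seen: List[str]) -> bool:
        ranked = Counter(seen).most_common(2)
        runner_up = ranked[1][1] if len(ranked) > 1 else 0
        return ranked[0][1] - runner_up < margin
    return keep_going


# Toy cache for illustration only.
cache = {"q1": [StoredPath("7", 100), StoredPath("7", 100),
                StoredPath("3", 100), StoredPath("7", 100)]}
gold = {"q1": "7"}
```

On this toy cache, `evaluate(cache, gold, use_all)` spends all 400 tokens, while `evaluate(cache, gold, lead_margin_controller(2))` stops after two agreeing paths. Since both calls only read the cache, many controller variants can be scored cheaply, which is the property the article attributes to the offline setup.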

Across four model sizes and the two math benchmarks, the agent’s discovered controller delivered comparable or better accuracy at lower token usage than hand‑written methods, and accuracy remained steady even as compute per query fell dramatically. The team reports these savings held up relative to the standard self‑consistency baseline that samples 64 answers in parallel and aggregates them by voting. Practical search costs were low because the offline simulation lets thousands of controller variants run against stored paths rather than the live model: the full discovery run cost roughly $40 and completed in about 160 minutes. That efficiency suggests immediate savings for builders balancing inference cost against performance when tuning test‑time scaling.

The discovered controller also generalized beyond the original setup: it transferred to a different model (DeepSeek-R1-Distill-Llama-8B) and to a non‑math benchmark, GPQA-Diamond, demonstrating that the agent’s policy can carry over across architectures and task domains.

Mechanically, the program coordinates solution paths by tracking how the model’s confidence shifts across rounds: it opens more paths when confidence barely changes, avoids spawning new paths when confidence rises quickly, allocates extra compute to paths that align with the current majority, and prunes diverging paths only after repeated negative evidence. The authors note humans probably would not have designed this exact coordination, and an ablation study shows the result depends substantially on two key design choices within the AutoTTS setup.
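The four rules described above could be expressed as a single per-round decision function. The sketch below is a loose illustration, not the discovered controller: the definition of confidence as the majority's vote share, every threshold value, and the names `PathState`, `Decision`, and `decide` are assumptions made for this example. The article also does not say what happens when confidence moves moderately or falls, so that branch is a placeholder.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PathState:
    answer: str       # the path's current tentative answer
    strikes: int = 0  # consecutive rounds spent disagreeing with the majority


@dataclass
class Decision:
    spawn: int         # number of new paths to open this round
    extend: List[int]  # indices of paths that receive extra compute
    prune: List[int]   # indices of paths to drop


def decide(paths: List[PathState], prev_confidence: float, *,
           flat_eps: float = 0.02, fast_rise: float = 0.10,
           spawn_batch: int = 2, max_strikes: int = 3) -> Tuple[Decision, float]:
    """One round of coordination; all thresholds are illustrative assumptions."""
    if not paths:
        raise ValueError("need at least one active path")
    votes = Counter(p.answer for p in paths)
    majority, count = votes.most_common(1)[0]
    # Assumption: "confidence" is the share of paths agreeing with the majority.
    confidence = count / len(paths)
    delta = confidence - prev_confidence

    if abs(delta) <= flat_eps:
        spawn = spawn_batch  # confidence barely changed: open more paths
    elif delta >= fast_rise:
        spawn = 0            # confidence rising quickly: do not spawn
    else:
        spawn = 1            # placeholder; this case is not described in the article

    extend, prune = [], []
    for i, path in enumerate(paths):
        if path.answer == majority:
            path.strikes = 0
            extend.append(i)  # extra compute goes to majority-aligned paths
        else:
            path.strikes += 1
            if path.strikes >= max_strikes:
                prune.append(i)  # prune only after repeated negative evidence
    return Decision(spawn, extend, prune), confidence


# Toy round: two paths agree, one has already disagreed twice.
toy_paths = [PathState("5"), PathState("5"), PathState("9", strikes=2)]
toy_decision, toy_confidence = decide(toy_paths, prev_confidence=0.66)
```

In the toy round, confidence stays near two thirds, so the sketch opens more paths, extends the two agreeing paths, and prunes the dissenting one only because it has now disagreed three rounds in a row.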

Sources

The Decoder AI · 5/24/2026
