Tilde Research Releases Aurora Optimizer to Fix Muon's Neuron‑Death Flaw


5/12/2026, 8:49:14 AM



Tilde Research has released Aurora, a new optimizer designed to correct a structural failure in the Muon algorithm that can silently disable large numbers of neurons during training. The project ships open‑source code, a 1.1B‑parameter pretraining experiment, and what the team reports as a new state‑of‑the‑art result on a modified nanoGPT speedrun benchmark. Aurora matters because it aims to preserve the numerical precision of Muon’s polar‑factor step while preventing the uneven per‑neuron updates that destabilize tall MLP blocks.

Muon’s central operation is computing the polar factor of the gradient matrix. For a thin SVD G = UΣVᵀ, Muon forms polar(G) = UVᵀ and applies the orthogonalized gradient as an update W ← W − η UVᵀ. Practical, large‑scale implementations compute the polar factor using matmul‑only iterative algorithms rather than full SVDs, which preserves efficiency but can introduce subtle per‑row behavior in tall weight matrices.
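To make the polar‑factor step concrete, the following is a minimal NumPy sketch, not Muon's actual implementation. It compares the SVD definition polar(G) = UVᵀ with a matmul‑only approximation. The iteration shown is the textbook cubic Newton‑Schulz scheme, used here as a stand‑in; production implementations may use differently tuned coefficients and lower precision. The function names, matrix shape and step count are illustrative assumptions.

```python
import numpy as np

def polar_svd(G):
    """Reference polar factor U V^T from the thin SVD G = U S V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt

def polar_newton_schulz(G, steps=30):
    """Matmul-only approximation using the cubic Newton-Schulz iteration.

    Assumption: textbook coefficients (1.5, -0.5), not a tuned variant.
    """
    # Frobenius scaling puts every singular value in (0, 1], inside the
    # convergence region of the iteration.
    X = G / np.linalg.norm(G)
    for _ in range(steps):
        # Each step pushes all singular values of X toward 1.
        X = 1.5 * X - 0.5 * X @ (X.T @ X)
    return X

rng = np.random.default_rng(0)
G = rng.standard_normal((64, 16))   # tall: 64 rows (neurons), 16 columns

P_ref = polar_svd(G)
P_it = polar_newton_schulz(G)
max_err = np.abs(P_ref - P_it).max()
print(f"max abs difference between SVD and iterative polar factor: {max_err:.2e}")
```

The iterative version needs only matrix multiplications, which is why it suits accelerators; the price is that accuracy depends on the step count and on how small the smallest singular value of G is.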

Tilde’s analysis attributes the failure mode to row‑norm anisotropy in tall weight matrices common in SwiGLU‑based MLP blocks. When strict orthogonalization is enforced, rows receive uneven update magnitudes: some neurons get large updates while others are under‑updated, creating a feedback loop that Tilde calls a ‘death spiral.’ The team reports that by the 500th training step more than one in four neurons can be effectively dead, and that this inactivity propagates to downstream layers and impairs training.
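The uneven per‑row updates can be reproduced on a toy problem. The sketch below is an illustration under stated assumptions, not Tilde's experiment: it builds a synthetic tall gradient in which a quarter of the rows are artificially weak, orthogonalizes it, and inspects the row norms of the result. For a tall matrix G = UΣVᵀ, the squared row norms of UVᵀ equal the row leverage scores of G, and these always sum to the number of columns n. The shapes and the 0.05 scaling factor are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 256, 32                        # tall: m neurons (rows), n input features

# Assumption: a quarter of the neurons receive much weaker gradients.
row_scale = np.where(np.arange(m) < m // 4, 0.05, 1.0)
G = rng.standard_normal((m, n)) * row_scale[:, None]

U, _, Vt = np.linalg.svd(G, full_matrices=False)
P = U @ Vt                            # polar factor, i.e. the orthogonalized update

update_row_norms = np.linalg.norm(P, axis=1)
leverage = update_row_norms ** 2      # row leverage scores of G
uniform = np.sqrt(n / m)              # row norm if every neuron were updated equally

weak = update_row_norms[: m // 4].mean()
strong = update_row_norms[m // 4 :].mean()
print(f"uniform target {uniform:.3f}, weak rows {weak:.3f}, strong rows {strong:.3f}")
```

Because the total leverage is fixed at n, rows with weak gradients receive proportionally small updates while the rest absorb the budget, which is the kind of imbalance that can feed on itself over many steps.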

Before Aurora, variants tried to mitigate the issue by loosening orthogonality. NorMuon led prior modded‑nanoGPT speedruns by adding a row‑normalization step (an inverse RMS scaling of the polar factor) that often pulls updates away from strict orthogonality. As an intermediate remedy, the paper proposes U‑NorMuon: instead of normalizing rows to unit norm, it rescales the rows of a tall m×n matrix to √(n/m), the root‑mean‑square row norm of any column‑orthogonal matrix of that shape. In 340M‑parameter experiments U‑NorMuon outperformed both Muon and NorMuon, eliminated the observed neuron‑death pattern and stabilized leverage scores across layers, but it achieves this by forcibly altering the polar factor.
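The tradeoff between the two row rescalings can be sketched numerically. This is a simplified illustration and not the NorMuon or U‑NorMuon code: it rescales each row of the polar factor of a tall m×n gradient to a fixed norm and ignores any running statistics the real optimizers may keep. The helper names `rescale_rows` and `orth_error` are hypothetical.

```python
import numpy as np

def rescale_rows(P, target, eps=1e-12):
    """Rescale every row of P to the given Euclidean norm (hypothetical helper)."""
    norms = np.linalg.norm(P, axis=1, keepdims=True)
    return P * (target / np.maximum(norms, eps))

def orth_error(X):
    """Frobenius distance of X^T X from the identity (0 for orthonormal columns)."""
    return np.linalg.norm(X.T @ X - np.eye(X.shape[1]))

rng = np.random.default_rng(0)
m, n = 256, 32
# Assumption: synthetic gradient with uneven row magnitudes.
G = rng.standard_normal((m, n)) * rng.uniform(0.05, 1.0, size=(m, 1))
U, _, Vt = np.linalg.svd(G, full_matrices=False)
P = U @ Vt                                        # exact polar factor

unit_rows = rescale_rows(P, 1.0)                  # rows normalized to unit norm
scaled_rows = rescale_rows(P, np.sqrt(n / m))     # rows rescaled to sqrt(n/m)

frob_sq_unit = np.sum(unit_rows ** 2)             # m: total update energy inflated
frob_sq_scaled = np.sum(scaled_rows ** 2)         # n: same energy as the polar factor

print(f"orthogonality error: polar {orth_error(P):.2e}, "
      f"unit rows {orth_error(unit_rows):.2f}, "
      f"sqrt(n/m) rows {orth_error(scaled_rows):.2f}")
```

Rescaling to √(n/m) keeps the squared Frobenius norm at n, matching a column‑orthogonal matrix, and gives every neuron an equal share of the update. The columns are no longer exactly orthonormal, though, which illustrates what it means to alter the polar factor by force.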

Aurora is presented as a way to reconcile those tradeoffs. The optimizer is framed as a steepest‑descent formulation under two joint constraints that preserve polar‑factor precision while enforcing more uniform per‑neuron updates, avoiding the forced perturbation that U‑NorMuon applies. For model builders the release serves as both a diagnostic and a practical path forward: monitor row norms and leverage anisotropy in tall MLPs, use U‑NorMuon as a targeted quick fix, and adopt Aurora when a precision‑preserving solution is needed. The team’s open experiments and code back each of these steps.

Sources

MarkTechPost AI · 5/12/2026
