
Google DeepMind has introduced Decoupled DiLoCo, a new distributed training architecture for large language models designed to run across geographically distant data centers. The core idea is to move away from a fully synchronous scheme in which thousands of identical accelerators must constantly wait for one another. At the next scale of models, such rigid synchronization becomes not only expensive but also fragile: a single local failure can slow down or halt a large part of the training run.
Decoupled DiLoCo divides the training process into separate computational "islands" called learner units. Each group keeps training locally, while data exchange between groups happens asynchronously. The approach builds on the ideas of Pathways and DiLoCo: Pathways provided the infrastructure for asynchronous data flow, while DiLoCo reduced the volume of communication between data centers. The new version combines these properties so that distributed training is not bottlenecked by wide-area network latency.
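The pattern of local training followed by an infrequent merge can be illustrated with a toy example. This is a minimal sketch of a DiLoCo-style inner/outer loop, not DeepMind's implementation: the function names, the quadratic objective, and the use of plain gradient descent for both loops are assumptions made for illustration, and the learners here merge in lockstep rounds, so the asynchronous exchange itself is not modeled.

```python
def local_steps(params, target, steps, lr):
    """Run `steps` gradient-descent steps on 0.5 * (p - t)^2 per coordinate.

    `target` stands in for one learner's local data shard (hypothetical).
    """
    p = list(params)
    for _ in range(steps):
        p = [pi - lr * (pi - ti) for pi, ti in zip(p, target)]
    return p


def outer_update(global_params, learner_params, outer_lr):
    """Average the learners' pseudo-gradients (global - local) and apply them."""
    n = len(learner_params)
    new_params = []
    for i, g in enumerate(global_params):
        pseudo_grad = sum(g - lp[i] for lp in learner_params) / n
        new_params.append(g - outer_lr * pseudo_grad)
    return new_params


def train(targets, rounds=20, inner_steps=50, inner_lr=0.1, outer_lr=1.0):
    """Each round: every learner trains locally, then one merge happens."""
    global_params = [0.0] * len(targets[0])
    for _ in range(rounds):
        learner_params = [
            local_steps(global_params, t, inner_steps, inner_lr) for t in targets
        ]
        global_params = outer_update(global_params, learner_params, outer_lr)
    return global_params


# Two learners with different local optima; the merged model settles between them.
merged = train([[1.0, 2.0], [3.0, 4.0]])
```

The point of the structure is that only one exchange happens per `inner_steps` local steps, so communication between learners is far rarer than in a scheme that synchronizes on every step.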
The main practical advantage is resilience to hardware failures. In its experiments, Google DeepMind applied chaos engineering: the team deliberately disabled entire learner units and checked whether the system could continue training. Decoupled DiLoCo kept the remaining clusters doing useful work and then reintegrated the recovered nodes. In tests with Gemma 4 models, this mode delivered model quality comparable to traditional methods while tolerating hardware failures better.
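The failure-injection idea can be mimicked in a few lines. The sketch below is a hypothetical simulation, not the test harness DeepMind used: in every round each learner is independently taken offline with probability `fail_prob`, the merge uses only the results that arrived, and a learner that comes back simply restarts from the current global value. All names and numbers are illustrative assumptions.

```python
import random


def simulate(num_learners=4, rounds=30, fail_prob=0.3, seed=0):
    """Toy chaos test: train a single scalar toward `target` despite outages."""
    rng = random.Random(seed)
    target = 5.0
    global_param = 0.0
    contributions = []  # how many learners reported in each round

    for _ in range(rounds):
        # Chaos step: decide independently which learners are up this round.
        alive = [rng.random() >= fail_prob for _ in range(num_learners)]

        updates = []
        for is_up in alive:
            if not is_up:
                continue  # a disabled learner contributes nothing this round
            # A running or recovered learner starts from the current global value.
            local = global_param
            for _ in range(10):
                local -= 0.1 * (local - target)
            updates.append(local)

        # Merge whatever arrived; if nobody reported, keep the old value.
        if updates:
            global_param = sum(updates) / len(updates)
        contributions.append(len(updates))

    return global_param, contributions


final_param, contributions = simulate()
```

Because the merge never waits for a missing learner, an outage reduces the number of contributions in a round rather than stalling the run.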
The second important aspect is network requirements. According to DeepMind, Decoupled DiLoCo requires orders of magnitude less bandwidth than classical synchronous schemes. In one production experiment, researchers trained a 12-billion-parameter model across four US regions over links of approximately 2–5 Gbit/s. That level is closer to existing inter-data-center connectivity than to purpose-built ultra-high-speed infrastructure. By overlapping data exchange with long computational phases, the system avoids blocking waits and trains more than 20 times faster than with conventional synchronization.
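A rough back-of-envelope check makes these bandwidth figures concrete. The sketch below estimates how long one full copy of a 12-billion-parameter model takes to cross a 2–5 Gbit/s link and whether that transfer fits inside the compute window between exchanges. The 2 bytes per parameter, the step time, and the sync interval are assumptions for illustration, not figures from DeepMind, and the real system may send less than a full copy.

```python
def transfer_seconds(num_params, bytes_per_param, link_gbit_per_s):
    """Seconds to send one full copy of the parameters over the link."""
    bits = num_params * bytes_per_param * 8
    return bits / (link_gbit_per_s * 1e9)


def hidden_behind_compute(transfer_s, steps_between_syncs, step_seconds):
    """True if the transfer fits inside the compute window between syncs."""
    return transfer_s <= steps_between_syncs * step_seconds


NUM_PARAMS = 12e9      # model size quoted in the article
BYTES_PER_PARAM = 2    # assumption: 16-bit payload per parameter

slow = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, 2)  # 2 Gbit/s link
fast = transfer_seconds(NUM_PARAMS, BYTES_PER_PARAM, 5)  # 5 Gbit/s link

# Hypothetical schedule with 1-second training steps.
every_step = hidden_behind_compute(slow, steps_between_syncs=1, step_seconds=1.0)
every_500 = hidden_behind_compute(slow, steps_between_syncs=500, step_seconds=1.0)
```

Under these assumptions a full transfer takes about 96 seconds at 2 Gbit/s and about 38 seconds at 5 Gbit/s. That is far too slow to repeat on every step, but it fits comfortably inside a window of several hundred steps, which is why infrequent, overlapped exchange suits ordinary inter-region links.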
For the industry, this matters as more than just another training optimization. If such schemes prove reliable at larger scale, laboratories will be able to use disparate computing resources, mix hardware generations such as TPU v6e and TPU v5p, and place training closer to available power. Decoupled DiLoCo does not remove the need for quality data and engineering control, but it demonstrates that the bottleneck for future models is not only the number of chips but also the architecture of the entire training system.