
OpenAI has published the Multipath Reliable Connection (MRC) specification through the Open Compute Project and released a companion research paper, making the protocol and technical details publicly available. The documents describe a networking approach aimed at improving data transfers inside very large AI clusters used for training frontier models, and set out implementation guidance for hardware and software teams evaluating multipath designs.
MRC is built to break the conventional single‑path model by sending packets for a single transfer across hundreds of network paths simultaneously. It distributes traffic over a multi‑plane fabric to avoid concentration of flows in the core and to reduce congestion. The protocol is also designed to identify link or switch failures and to reroute traffic on a microsecond timescale, rather than waiting for slower recovery mechanisms to converge.
OpenAI highlights the practical impact of that speed: conventional network fabrics can take seconds or even tens of seconds to restabilize after failures, long enough to disrupt large synchronous training jobs. By contrast, microsecond‑scale reroutes keep transfers running through short, transient faults and reduce variability in throughput and latency, both of which can otherwise throttle synchronized training workloads that span many accelerators.
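The spraying and rerouting behavior described above can be illustrated with a toy model. This is a minimal sketch, not the path-selection algorithm from the MRC specification, which this article does not detail. The class `PathSprayer`, its methods, the uniform random path choice, and the figure of 256 paths are all hypothetical choices made for illustration.

```python
import random


class PathSprayer:
    """Toy model (hypothetical, not the MRC algorithm): spread the packets of
    one transfer over many paths and skip any path marked as failed."""

    def __init__(self, num_paths, seed=0):
        # Path IDs the sender currently believes are usable.
        self.healthy = list(range(num_paths))
        # Seeded generator so the demo is repeatable.
        self.rng = random.Random(seed)

    def mark_failed(self, path_id):
        # Remove a path from the rotation as soon as a failure is detected,
        # so later packets avoid it without waiting for fabric-wide recovery.
        if path_id in self.healthy:
            self.healthy.remove(path_id)

    def send(self, num_packets):
        # Choose a healthy path per packet; return {path_id: packets sent}.
        if not self.healthy:
            raise RuntimeError("no healthy paths left")
        counts = {}
        for _ in range(num_packets):
            path = self.rng.choice(self.healthy)
            counts[path] = counts.get(path, 0) + 1
        return counts


sprayer = PathSprayer(num_paths=256)   # 256 is an assumed path count
before = sprayer.send(10_000)
sprayer.mark_failed(3)                 # simulate a detected link failure
after = sprayer.send(10_000)
print("paths used before failure:", len(before))
print("failed path used afterwards:", 3 in after)
```

The point of the sketch is that the transfer never depends on one path: once a path is marked failed, the sender simply stops choosing it, and the remaining paths absorb the traffic.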
Beyond failure handling, MRC’s multi‑plane architecture also allows a flatter network topology. OpenAI says the design can connect more than 100,000 GPUs using only two tiers of Ethernet switches rather than the three or four tiers that 800 Gb/s networks typically require. That reduction in tiers cuts the number of core switches and interconnects needed, which lowers component count and power consumption and thus reduces overall network cost for hyperscale clusters.
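A rough calculation shows how a two-tier fabric could reach that scale. The sketch below uses the standard sizing rule for a non-blocking two-tier leaf-spine network built from identical switches: with `radix` ports per switch, each leaf uses half its ports for endpoints and half for uplinks, which gives `radix * radix / 2` endpoint ports. The 51.2 Tb/s switch capacity and the idea of running each plane on 100 Gb/s links are assumptions for illustration, not figures from this article.

```python
def two_tier_endpoints(radix: int) -> int:
    """Max endpoint ports in a non-blocking two-tier leaf-spine fabric
    built from switches that each have `radix` ports."""
    down_ports_per_leaf = radix // 2  # half of each leaf faces endpoints
    max_leaves = radix                # each spine port connects to one leaf
    return down_ports_per_leaf * max_leaves


SWITCH_CAPACITY_GBPS = 51_200  # assumed switch capacity, not from the article

for link_speed_gbps in (800, 100):  # 100 Gb/s per-plane links are an assumption
    radix = SWITCH_CAPACITY_GBPS // link_speed_gbps
    print(f"{link_speed_gbps} Gb/s links: radix {radix}, "
          f"{two_tier_endpoints(radix)} endpoints per plane")
# prints:
# 800 Gb/s links: radix 64, 2048 endpoints per plane
# 100 Gb/s links: radix 512, 131072 endpoints per plane
```

Under these assumptions, slower links on the same switch raise the port count, and two tiers then reach past 100,000 endpoints in each plane. Several parallel planes would be needed to restore the full per-GPU bandwidth.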
OpenAI reports that MRC is already in production across its largest NVIDIA GB200 supercomputers, including a site on Oracle Cloud Infrastructure in Abilene, Texas, and Microsoft’s Fairwater supercomputers. In one operational example, OpenAI says it rebooted four tier‑1 switches during the training of a recent frontier model used for ChatGPT and Codex without coordinating downtime for running training jobs, illustrating the protocol’s resilience in live training conditions.
Development of MRC included engineering contributions from AMD, Broadcom, Intel, Microsoft, and NVIDIA alongside OpenAI. The published specification and research paper provide the technical detail builders need to evaluate the multipath model and to consider implementing MRC concepts in the hardware and software stacks that support large‑scale model training.