Google introduced Virgo Network — a revolutionary network architecture designed for hyperscale data centers. This innovation will become the core for the company's own AI supercomputer, designed to provide the necessary infrastructure for the exponential growth of machine learning over the next decade. Modern foundational AI models demand unprecedented performance, and traditional general-purpose networks can no longer cope with the increasing number of parameters and data volume.
Virgo Network addresses this problem by offering a scalable network built on highly radial switches, which significantly reduces the number of network layers by increasing the number of ports on each switch. It utilizes a flat two-layer non-blocking topology, which significantly reduces latency. Additionally, the architecture includes a multi-planar design with independent control planes for accelerator connectivity and is integrated with the Jupiter network for efficient "north-south" traffic processing, providing a comprehensive solution.
The era of artificial intelligence demands a fundamental rethinking of physical cloud infrastructure, especially in terms of networking technologies. Existing network solutions are unable to effectively handle the challenges characteristic of modern AI workloads: massive scaling, explosive bandwidth growth, synchronized traffic bursts, and strict low-latency requirements. The Virgo architecture is purposefully designed to address these critical issues, providing massive bisectional bandwidth and deterministically low latency. These characteristics are vital for both efficient distributed training of large-scale models and their high-performance serving. The strategic separation of the architecture into independent control planes also offers significant advantages, allowing each network segment to be developed and updated autonomously. This approach significantly accelerates the innovation cycle and adaptation to the constantly evolving demands of AI.
The Virgo architecture is organized into three specialized tiers that function as a single high-performance computing domain, ensuring seamless interaction. The first tier is a high-performance network optimized for tightly coupled communication between accelerators within a single "pod" — a basic unit for placing computing resources. The second tier is a dedicated RDMA network specifically designed for horizontal scaling and efficient interaction between different pods, which is critical for training ultra-large models.
The third, outer tier is the high-capacity Jupiter network, providing reliable and fast access to distributed data storage and shared computing resources. A key aspect of Google's success is its co-development approach: the Virgo network infrastructure is created in close conjunction with each new generation of machine learning hardware accelerators. This ensures optimal alignment of the network architecture with the supported hardware, providing maximum performance even for the most demanding AI workloads and fostering future innovations in AI.
Sources
Replies (0)
No replies in this topic yet.