Researchers Deploy Qwen3 235B on NVIDIA Blackwell GB200 NVL72 Racks to Boost Throughput and Cut Costs


5/12/2026, 9:21:11 PM


A research team put post‑trained Qwen3 235B into production on NVIDIA GB200 NVL72 racks and reports that it maintains model accuracy while serving a portion of live traffic at higher throughput and lower cost. The team adapted its in‑house inference stack to exploit Blackwell tensor cores and the rack’s high‑bandwidth interconnects. The approach targets deployments of large mixture‑of‑experts (MoE) models that need higher throughput per dollar.

Each GB200 NVL72 rack is built from 18 nodes, with each node containing two ARM‑based NVIDIA Grace CPUs and four Blackwell GPUs, for a total of 72 GPUs per rack. Each GPU has 180 GB of HBM. The GPUs are linked via NVLink and 18 NVLink Switch ASICs that provide roughly 1,800 GB/s of peer bandwidth, while ConnectX‑7 InfiniBand adapters supply up to 400 Gb/s of intra‑ and inter‑rack bandwidth.
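The rack figures above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch: the `RackSpec` class and its field names are hypothetical, the numbers are the ones quoted in the article, and the bandwidth ratio compares nominal peak figures only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RackSpec:
    """Per-rack figures as quoted in the article (hypothetical container)."""
    nodes: int = 18
    gpus_per_node: int = 4
    cpus_per_node: int = 2
    hbm_gb_per_gpu: int = 180
    nvlink_gbytes_per_s: float = 1800.0    # quoted in gigabytes per second
    infiniband_gbits_per_s: float = 400.0  # quoted in gigabits per second

    @property
    def gpus(self) -> int:
        return self.nodes * self.gpus_per_node

    @property
    def total_hbm_gb(self) -> int:
        return self.gpus * self.hbm_gb_per_gpu

    @property
    def infiniband_gbytes_per_s(self) -> float:
        # Convert bits to bytes so both links use the same unit.
        return self.infiniband_gbits_per_s / 8

    @property
    def nvlink_to_infiniband_ratio(self) -> float:
        return self.nvlink_gbytes_per_s / self.infiniband_gbytes_per_s


rack = RackSpec()
print(rack.gpus)                        # 72
print(rack.total_hbm_gb)                # 12960
print(rack.infiniband_gbytes_per_s)     # 50.0
print(rack.nvlink_to_infiniband_ratio)  # 36.0
```

Note the unit mismatch in the quoted figures: NVLink is given in gigabytes per second and InfiniBand in gigabits per second, so the nominal gap between the two links is about 36x rather than 4.5x.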

The deployment explicitly disaggregates the prefill and decode roles: InfiniBand carries prefiller→decoder transfers, and NVLink is reserved for in‑node and in‑rack communication. Blackwell delivered kernel‑by‑kernel improvements over the prior Hopper generation, with more streaming multiprocessors, higher memory bandwidth and lower peer‑to‑peer latencies, traits the team says make GB200 NVL72 racks especially well suited to large MoE models.
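To get a feel for what a prefiller→decoder transfer costs over InfiniBand, here is a rough back-of-envelope sketch. Several inputs are assumptions not stated in the article: the layer count (94) and head dimension (128) are taken from the publicly released Qwen3 235B configuration, the KV cache is assumed to be stored in 2-byte values, and the link is assumed to run at its full 400 Gb/s with no protocol overhead. The function names are hypothetical.

```python
def kv_bytes_per_token(num_layers: int, num_kv_heads: int,
                       head_dim: int, bytes_per_value: int) -> int:
    """Size of one token's KV cache entry across all layers."""
    # The factor 2 accounts for storing both keys and values.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def transfer_seconds(num_tokens: int, bytes_per_token: int,
                     link_gbits_per_s: float) -> float:
    """Idealized transfer time at full link rate (no overhead modeled)."""
    link_bytes_per_s = link_gbits_per_s * 1e9 / 8
    return num_tokens * bytes_per_token / link_bytes_per_s


# Assumed model shape: 94 layers, 4 KV heads, head_dim 128, 2-byte values.
per_token = kv_bytes_per_token(num_layers=94, num_kv_heads=4,
                               head_dim=128, bytes_per_value=2)
seconds = transfer_seconds(num_tokens=6000, bytes_per_token=per_token,
                           link_gbits_per_s=400.0)

print(per_token)                  # 192512
print(f"{seconds * 1e3:.1f} ms")  # 23.1 ms
```

Under these assumptions a 6,000-token prompt produces a little over 1 GB of KV cache, which is an idealized lower bound on transfer time rather than a measured figure.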

To maximize throughput while keeping decoding steady, the team applied a distinct parallelism strategy to each role. Prefiller nodes combine tensor and expert parallelism (TP=4, EP=4), sharding the model across the four GPUs of a single GB200 node and distributing projections and experts across ranks. The team rejected wider sharding (for example, an 8‑way all‑reduce) because the incremental throughput gains did not justify the added communication and host‑side complexity. They also accepted a replication tradeoff tied to Qwen3 235B’s architecture: the model has only four key‑value heads, so the k/v projections must be replicated whenever eight devices are used. The team treated that replication as simpler than more aggressive weight sharding, which would have increased coordination overhead.
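The replication tradeoff follows directly from dividing a fixed number of key-value heads across tensor-parallel ranks. The sketch below illustrates the arithmetic only; `kv_head_layout` is a hypothetical helper, not part of the team's stack.

```python
def kv_head_layout(num_kv_heads: int, tp: int) -> tuple[int, int]:
    """Return (kv_heads_per_rank, replication_factor) for a TP degree.

    If there are at least as many KV heads as ranks, heads are split evenly
    and nothing is replicated. Otherwise each rank holds one head and every
    head is copied onto tp // num_kv_heads ranks.
    """
    if tp <= 0 or num_kv_heads <= 0:
        raise ValueError("tp and num_kv_heads must be positive")
    if tp <= num_kv_heads:
        if num_kv_heads % tp != 0:
            raise ValueError("num_kv_heads must be divisible by tp")
        return num_kv_heads // tp, 1
    if tp % num_kv_heads != 0:
        raise ValueError("tp must be a multiple of num_kv_heads")
    return 1, tp // num_kv_heads


# Qwen3 235B has four KV heads, per the article.
print(kv_head_layout(num_kv_heads=4, tp=4))  # (1, 1): one head per GPU
print(kv_head_layout(num_kv_heads=4, tp=8))  # (1, 2): each head stored twice
```

With TP=4 each GPU owns exactly one KV head, which is why a four-GPU node maps cleanly onto the model; at eight devices every head, and therefore its k/v projection weights and cache, exists in two copies.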

Several implementation details support the topology and sharding choices. A TransferEngine handles arbitrary, mismatched sharding between prefillers and decoders and supports in‑flight splitting and concatenation of KV caches laid out in HND order (heads leading, tokens contiguous). On the compute side, with an input length of around 6,000 tokens and 8 experts activated per token, each of the 128 experts receives roughly 6,000 × 8 / 128 = 375 routed tokens. That yields a dense MoE GEMM that saturates device compute and reduces the need for extra data parallelism.
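Both ideas in this paragraph can be illustrated with a small sketch. It does not reproduce the TransferEngine's API, which the article does not describe. `tokens_per_expert` and `reshard_kv_heads` are hypothetical helpers, routing is assumed to be perfectly uniform, and nested Python lists stand in for device buffers. Because HND order keeps each head's tokens contiguous, moving whole heads between ranks is a matter of slicing and concatenating along the leading axis.

```python
def tokens_per_expert(num_tokens: int, experts_per_token: int,
                      num_experts: int) -> float:
    """Average routed tokens per expert, assuming uniform routing."""
    return num_tokens * experts_per_token / num_experts


def reshard_kv_heads(src_shards: list, dst_ranks: int) -> list:
    """Redistribute whole KV heads from source ranks onto dst_ranks ranks.

    src_shards[r] is the list of heads held by source rank r, where each
    head is its own contiguous token sequence (HND order).
    """
    # Flatten to the global head order, concatenating the source shards.
    heads = [head for shard in src_shards for head in shard]
    n = len(heads)
    if dst_ranks <= n:
        if n % dst_ranks != 0:
            raise ValueError("head count must be divisible by dst_ranks")
        per_rank = n // dst_ranks
        return [heads[r * per_rank:(r + 1) * per_rank]
                for r in range(dst_ranks)]
    if dst_ranks % n != 0:
        raise ValueError("dst_ranks must be a multiple of head count")
    # More ranks than heads: replicate each head onto consecutive ranks.
    replication = dst_ranks // n
    return [[heads[r // replication]] for r in range(dst_ranks)]


print(tokens_per_expert(6000, 8, 128))  # 375.0

# Four prefiller ranks, one head each, three tokens per head.
prefill = [[[f"h{h}t{t}" for t in range(3)]] for h in range(4)]
decode2 = reshard_kv_heads(prefill, 2)  # concatenate: two heads per rank
decode8 = reshard_kv_heads(prefill, 8)  # replicate: each head on two ranks
print([len(shard) for shard in decode2])  # [2, 2]
print(decode8[0] == decode8[1])           # True
```

Real routing is rarely uniform, so 375 tokens per expert is an average; individual experts may see noticeably more or fewer.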

The authors note that many of the serving techniques they describe for Qwen3 235B also apply to planned migrations to the Qwen3.5 397B and 122B models. In their measurements, linear attention behaves closely enough to full self‑attention to inform parallelism planning for those models.

Sources

Perplexity Research · 5/12/2026
