
UC Berkeley’s UCCL team has released mKernel, a library of persistent CUDA kernels aimed at shrinking the growing cost of inter‑GPU communication in production AI. The authors cite measurements showing communication can consume 43.6% of forward pass time, 32% of end‑to‑end training time, and as much as 47% of execution in popular Mixture‑of‑Experts (MoE) models, making transfers a clear scaling bottleneck for large workloads. That scale sensitivity is the problem mKernel is explicitly built to address.
Rather than relying on host‑driven collectives, mKernel embeds communication logic inside GPU‑resident persistent kernels so compute and transfers can overlap at tile and chunk granularity. The kernels fuse intra‑node NVLink operations, GPU‑initiated RDMA for inter‑node transfers, and dense compute. Cooperative thread arrays (CTAs) self‑assign one of four roles (compute, intra‑comm, inter‑send, inter‑reduce), and a tunable SM allocation balances the work across streaming multiprocessors.
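The role split can be pictured with a small host-side model. This is a minimal sketch and not mKernel's code: the function name `assign_role`, the `split` values, and the choice of one persistent CTA per SM on an assumed 132-SM GPU are all illustrative assumptions. It only shows how a CTA index could be mapped to a role through cumulative ranges.

```python
from collections import Counter

# Role names follow the article; the underscore spelling is an assumption.
ROLES = ("compute", "intra_comm", "inter_send", "inter_reduce")

def assign_role(cta_id: int, split: dict[str, int]) -> str:
    """Map a CTA index to a role using cumulative ranges from the split."""
    upper = 0
    for role in ROLES:
        upper += split[role]
        if cta_id < upper:
            return role
    raise ValueError(f"cta_id {cta_id} exceeds the {upper} CTAs in the split")

# Hypothetical split: 132 persistent CTAs, one per SM (assumed SM count).
split = {"compute": 100, "intra_comm": 16, "inter_send": 8, "inter_reduce": 8}

roles = Counter(assign_role(i, split) for i in range(sum(split.values())))
print(dict(roles))
```

Changing the numbers in `split` is the model's stand-in for the tunable SM allocation: moving CTAs from `compute` to the communication roles trades arithmetic throughput for transfer throughput.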
The release positions this GPU‑centric approach as a response to the limits of the standard host orchestration model, where CPUs coordinate collectives via libraries such as NCCL or NVSHMEM while compute and communication run on separate CUDA streams. UCCL’s authors argue that host control fails to scale with modern GPU performance. They point to high‑density racks such as a GB300 NVL72 configuration (72 Blackwell Ultra GPUs, 36 Grace CPUs), which combine very high aggregate compute throughput with around 130 TB/s of intra‑rack NVLink bandwidth; at that scale, microsecond‑scale host overheads introduce visible pipeline bubbles.
For practitioners, mKernel promises finer‑grained overlap and fewer staging round‑trips: tiles are produced and sent as they are computed so downstream reductions or grouped GEMMs can begin immediately. MoE token dispatches are routed and processed on‑GPU without an intermediate CPU‑mediated buffer. The tunable SM split plus GPU‑driven RDMA are intended to support heterogeneous network hardware and to reduce CPU orchestration in multi‑node training pipelines.
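A simple timing model illustrates why tile-granular overlap matters. The sketch below is an idealized two-stage pipeline, not a measurement of mKernel: the function names, the tile count, and the per-tile costs are assumptions, and it ignores launch overhead and contention. It compares finishing all compute before any transfer against sending each tile as soon as it is produced.

```python
def serialized_time(n_tiles: int, compute: float, transfer: float) -> float:
    """All tiles are computed first, then all tiles are transferred."""
    return n_tiles * compute + n_tiles * transfer

def pipelined_time(n_tiles: int, compute: float, transfer: float) -> float:
    """Each tile is sent as soon as it is computed (two-stage pipeline).

    The first tile pays its full compute cost, the last tile pays its full
    transfer cost, and the slower stage sets the pace in between.
    """
    return compute + (n_tiles - 1) * max(compute, transfer) + transfer

# Hypothetical workload: 64 tiles, 10 us of compute and 6 us of transfer each.
n, c, t = 64, 10, 6
print("serialized:", serialized_time(n, c, t), "us")
print("pipelined: ", pipelined_time(n, c, t), "us")
```

In this model the pipelined total approaches the cost of the slower stage alone as the tile count grows, which is the effect that finer-grained overlap is meant to capture.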
The project delivers five concrete fused kernels that map common workflows into this model: AllGather + GEMM (ranks gather shards while a local GEMM consumes arriving tiles), GEMM + AllReduce (compute C = A @ B and push partial outputs into a reduction tree immediately), MoE Dispatch + GEMM (route tokens across all‑to‑all and run grouped GEMMs as tokens arrive), Ring Attention (rotate KV chunks while FlashAttention consumes the previously received chunk), and GEMM + ReduceScatter (compute and forward reduced output tiles to owner ranks without extra staging).
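The Ring Attention pattern in that list can be modeled in a few lines. This is a pure-Python sketch with scalar queries and keys, not the fused kernel: `ring_attention`, `full_attention`, and the sample data are hypothetical, and the partial results are merged with a FlashAttention-style online softmax. It checks that consuming KV chunks in ring-rotation order gives the same result as attending over all chunks at once.

```python
import math

def ring_attention(q, kv_chunks, rank):
    """Attend q over all KV chunks in the order a ring rotation delivers them."""
    world = len(kv_chunks)
    m, l, acc = float("-inf"), 0.0, 0.0  # running max, normalizer, weighted sum
    for step in range(world):
        # Each rank forwards its chunk to rank + 1, so after `step` rotations
        # this rank holds the chunk that started on rank - step.
        keys, values = kv_chunks[(rank - step) % world]
        scores = [q * k for k in keys]
        m_new = max(m, max(scores))
        scale = math.exp(m - m_new)  # rescale earlier partial results
        weights = [math.exp(s - m_new) for s in scores]
        l = l * scale + sum(weights)
        acc = acc * scale + sum(w * v for w, v in zip(weights, values))
        m = m_new
    return acc / l

def full_attention(q, kv_chunks):
    """Reference: softmax attention over every key and value at once."""
    keys = [k for ks, _ in kv_chunks for k in ks]
    values = [v for _, vs in kv_chunks for v in vs]
    scores = [q * k for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)

# Hypothetical data: 4 ranks, each owning a chunk of 2 (key, value) pairs.
kv_chunks = [
    ([0.1, -0.4], [1.0, 2.0]),
    ([0.7, 0.2], [3.0, 4.0]),
    ([-0.3, 0.5], [5.0, 6.0]),
    ([0.9, -0.1], [7.0, 8.0]),
]
q = 1.5
for rank in range(len(kv_chunks)):
    print(rank, ring_attention(q, kv_chunks, rank), full_attention(q, kv_chunks))
```

Because each step needs only the chunk received in the previous rotation, the next chunk's transfer can proceed while the current one is being consumed.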
The team evaluated mKernel on two testbeds, each with 2 nodes of 8 H200 GPUs, that differed only in inter‑node fabric. One configuration used AWS EFA/SRD with 16×200 Gb/s of EFA capacity per node; intra‑node links used NVLink. The team emphasizes that the RDMA path is GPU‑initiated and implemented from scratch, both to maximize performance and to accommodate heterogeneous devices, signaling an intent to support varied production interconnects rather than depend on a single vendor stack.
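For scale, the quoted EFA capacity can be converted into per-node and per-GPU figures. This is back-of-the-envelope arithmetic only: the even split across the 8 GPUs is an assumption, and the numbers are nominal line rates rather than achieved throughput.

```python
LINKS_PER_NODE = 16   # EFA links per node, from the article
GBIT_PER_LINK = 200   # Gb/s per link, from the article
GPUS_PER_NODE = 8     # H200 GPUs per node, from the article

node_gbit = LINKS_PER_NODE * GBIT_PER_LINK    # aggregate Gb/s per node
node_gbyte = node_gbit / 8                    # convert bits to bytes: GB/s per node
per_gpu_gbyte = node_gbyte / GPUS_PER_NODE    # assumes an even share per GPU

print(f"{node_gbit} Gb/s per node = {node_gbyte} GB/s, about {per_gpu_gbyte} GB/s per GPU")
```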