
NVIDIA published a 4‑bit pretraining methodology centered on NVFP4, a microscaling FP4 format supported natively on Blackwell Tensor Cores, and validated it by training a 12‑billion‑parameter hybrid Mamba‑Transformer on 10 trillion tokens.
NVIDIA demonstrated that a carefully designed 4‑bit training pipeline can run at long token horizons: a 12B hybrid Mamba‑Transformer was pretrained on 10 trillion tokens using a new NVFP4 microscaling format and associated training recipe. The team reports this as the longest publicly documented 4‑bit pretraining run and says downstream MMLU‑Pro 5‑shot accuracy (62.58%) closely tracks an FP8 baseline (62.62%).
NVFP4 changes standard microscaling by using 16‑element blocks with block scale factors encoded in E4M3 and adding an FP32 per‑tensor scale so block amax values map nearer FP4 limits. Compared with prior 32‑element MXFP4 blocks, NVFP4 narrows per‑block dynamic range but increases per‑block precision; NVIDIA notes that at least 6.25% of values in each block are represented near FP8 precision.
On Blackwell hardware NVFP4 GEMMs run substantially faster and smaller: FP4 GEMMs achieve roughly 4× BF16 throughput on GB200 and 6× on GB300 (translating to ~2× and ~3× speedups over FP8), and operand memory footprint is approximately halved versus FP8. NVFP4 support is exposed in NVIDIA’s Transformer Engine, enabling builders to target reduced memory and compute costs when the hardware is available.
The quantization scope and training stabilizers are explicit: only linear‑layer GEMMs (Fprop, Dgrad, Wgrad) run in NVFP4, while embeddings, output projection, normalization, non‑linearities and all attention components remain in BF16 or FP32. Model weights, weight gradients used for microbatch/replica accumulation, and optimizer state are kept in FP32; tensor‑parallel reductions run in BF16.
To avoid early divergence NVIDIA combines four stabilizing elements, each shown necessary by ablation. They keep about 16% of linear layers in BF16 (the first two and final eight of 62 blocks), apply a 16×16 Random Hadamard Transform with a shared random ±1 sign vector to Wgrad inputs, use two‑dimensional block scaling for weights, and employ stochastic rounding on gradients. Randomization effects were minimal at 1.2B but improved convergence on the 12B run.
Sources
Replies (0)
No replies in this topic yet.