Perplexity Integrates CuTeDSL into ROSE to Speed GPU Inference on NVIDIA Hopper and Blackwell

5/7/2026, 12:02:14 AM

Perplexity Research has integrated CuTeDSL kernels into its in‑house ROSE (Runtime — Optimized Serving Engine) inference engine to produce custom GPU kernels aimed at raising model throughput and efficiency on NVIDIA Hopper and Blackwell GPUs. The change is intended to shorten the cycle from model exploration to production deployment by enabling teams to implement, tune, and ship inference kernels tailored to state‑of‑the‑art transformer architectures.

ROSE serves as a general inference engine across Perplexity’s products and exposes an interface compatible with the NVIDIA Triton Inference Server. Originally built to run custom Llama variants for decoding and classification, it now supports large language models (LLMs) as well as ranking, classification, scoring, and other transformer‑based workloads through a common runtime. At the system level, ROSE supplies reusable engines that manage request scheduling, batching and chunking, device initialization, inter‑process and inter‑node communication, and weight loading. For LLMs, the engine handles batching, chunking, sampling, prefix matching, and key‑value (KV) storage allocation for both full and linear attention patterns; embedding engines implement simpler online batching semantics.
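
The article does not describe how ROSE implements prefix matching, so the following is only a minimal sketch of the general idea: a new prompt reuses KV blocks already computed for an earlier prompt that shares its leading tokens. The names `BLOCK_SIZE`, `insert_prompt`, and `matched_prefix_blocks` are hypothetical, and a plain dictionary stands in for whatever KV storage the engine actually uses.

```python
from typing import Dict, List, Tuple

BLOCK_SIZE = 4  # hypothetical number of tokens stored per KV block

# Maps "all tokens up to the end of a block" to a block id (assumed layout).
Cache = Dict[Tuple[int, ...], int]


def insert_prompt(prompt: List[int], cache: Cache) -> None:
    """Register every full block of `prompt`, giving new blocks fresh ids."""
    for end in range(BLOCK_SIZE, len(prompt) + 1, BLOCK_SIZE):
        cache.setdefault(tuple(prompt[:end]), len(cache))


def matched_prefix_blocks(prompt: List[int], cache: Cache) -> List[int]:
    """Return ids of cached blocks covering the longest block-aligned prefix."""
    block_ids: List[int] = []
    for end in range(BLOCK_SIZE, len(prompt) + 1, BLOCK_SIZE):
        key = tuple(prompt[:end])  # a block is only valid for its exact history
        if key not in cache:
            break  # first miss ends the reusable prefix
        block_ids.append(cache[key])
    return block_ids


if __name__ == "__main__":
    cache: Cache = {}
    insert_prompt([1, 2, 3, 4, 5, 6, 7, 8, 9], cache)
    # The first eight tokens match, so two blocks can be reused.
    print(matched_prefix_blocks([1, 2, 3, 4, 5, 6, 7, 8, 10, 11, 12, 13], cache))
```

Keying each block by its full token history keeps the lookup simple at the cost of memory; a production engine would more plausibly hash blocks or use a trie, which this sketch does not attempt.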

The engine’s custom layer collection wraps a mix of GPU kernels implemented with CuTeDSL, Triton, CUDA, CUTLASS, and cuBLAS. Perplexity continues to rely primarily on NVIDIA’s CUTLASS and cuBLAS kernels for matrix multiplication and attention primitives, whereas embedding, normalization, mixture‑of‑experts (MoE) routing/dispatch/combine, and activation kernels are part of ROSE’s own library and require heavier specialization to reach peak performance.
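
For readers unfamiliar with the MoE steps named above, here is a plain‑Python reference sketch of what routing, dispatch, and combine compute for a top‑k mixture of experts. It is not ROSE's kernel code and involves no GPU programming; the function names, the softmax over the selected experts, and the toy experts in the usage example are all assumptions made for illustration.

```python
import math
from typing import Callable, Dict, List, Tuple

Vector = List[float]


def route(logits: Vector, top_k: int) -> List[Tuple[int, float]]:
    """Routing: pick the top_k experts for one token and softmax their scores."""
    chosen = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)[:top_k]
    peak = max(logits[e] for e in chosen)  # subtract the max for numerical stability
    exps = [math.exp(logits[e] - peak) for e in chosen]
    total = sum(exps)
    return [(e, w / total) for e, w in zip(chosen, exps)]


def moe_forward(tokens: List[Vector],
                router_logits: List[Vector],
                experts: List[Callable[[Vector], Vector]],
                top_k: int) -> List[Vector]:
    # Dispatch: group (token index, weight) pairs by the expert that owns them.
    buckets: Dict[int, List[Tuple[int, float]]] = {}
    for t, logits in enumerate(router_logits):
        for e, w in route(logits, top_k):
            buckets.setdefault(e, []).append((t, w))

    # Combine: run each expert on its tokens and accumulate weighted outputs.
    out = [[0.0] * len(tok) for tok in tokens]
    for e, assigned in buckets.items():
        for t, w in assigned:
            for i, v in enumerate(experts[e](tokens[t])):
                out[t][i] += w * v
    return out


if __name__ == "__main__":
    toy_experts = [
        lambda x: [2.0 * v for v in x],  # hypothetical expert 0
        lambda x: [-v for v in x],       # hypothetical expert 1
    ]
    print(moe_forward([[1.0, 2.0]], [[0.0, 0.0]], toy_experts, top_k=2))
```

The arithmetic is simple; on a GPU the cost lies in the irregular, data‑dependent grouping that dispatch and combine require.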

Perplexity says it shifted to CuTeDSL because many inference kernels are mathematically straightforward but require very precise control over hardware execution to hit performance targets. CuTeDSL is a Python‑based domain‑specific language built on CuTe layout algebra and MLIR; it compiles just‑in‑time to optimized PTX while preserving low‑level control. The team moved away from an initial Triton‑centric implementation and now treats CuTeDSL as its primary GPU programming environment for custom kernels.

To cover a broad range of models and hardware trade‑offs, ROSE uses compile‑time specialization for variations such as hidden dimensions, RMS‑norm configurations (weight‑bias vs no‑bias), and activation choices. The CuTeDSL integration is positioned behind APIs like Sonar, Search, and Embeddings so researchers can iterate on architectures and quickly promote top candidates to production, while still leveraging NVIDIA kernels where they remain the best fit.
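
As a rough illustration of compile‑time specialization, the sketch below uses a cached Python factory as a stand‑in for a JIT compiler: each combination of hidden dimension, bias setting, and activation yields its own specialized function, built once and then reused. This is not CuTeDSL code and does not use its API; `specialize_rmsnorm`, `ACTIVATIONS`, and the fused RMS‑norm‑plus‑activation layout are hypothetical choices for the example.

```python
import math
from functools import lru_cache
from typing import Callable, List, Optional

# Hypothetical activation table; the article does not list ROSE's choices.
ACTIVATIONS = {
    "identity": lambda v: v,
    "relu": lambda v: max(v, 0.0),
    "silu": lambda v: v / (1.0 + math.exp(-v)),
}


@lru_cache(maxsize=None)  # stand-in for a JIT cache keyed by the configuration
def specialize_rmsnorm(hidden_dim: int, has_bias: bool, activation: str,
                       eps: float = 1e-6) -> Callable[..., List[float]]:
    """Build one RMS-norm variant with its configuration fixed up front."""
    act = ACTIVATIONS[activation]  # resolved once, when the variant is built

    def kernel(x: List[float], weight: List[float],
               bias: Optional[List[float]] = None) -> List[float]:
        assert len(x) == hidden_dim, "variant is tied to one hidden dimension"
        inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / hidden_dim + eps)
        y = [v * inv_rms * w for v, w in zip(x, weight)]
        if has_bias:  # fixed per variant; the no-bias variant never reads `bias`
            y = [v + b for v, b in zip(y, bias)]
        return [act(v) for v in y]

    return kernel


if __name__ == "__main__":
    norm = specialize_rmsnorm(2, False, "identity")
    print(norm([3.0, 4.0], [1.0, 1.0]))
    # A repeated request with the same configuration reuses the cached variant.
    print(specialize_rmsnorm(2, False, "identity") is norm)
```

In a real JIT setting the fixed parameters would let the compiler unroll loops and drop unused branches; here the closure only mimics the caching and configuration‑binding behavior.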

Sources

Perplexity Research · 5/6/2026
