Tutorial teaches PyTorch builders to read torch.profiler traces and map Python calls to CUDA kernels

5/30/2026, 9:36:47 AM

A beginner’s guide to torch.profiler, published on May 29, 2026, aims to help engineers close the gap between collecting traces and actually reading them. The tutorial presents a minimal, runnable example and explains why profiling matters: improving tokens per second for LLMs, shaving milliseconds off inference latency, and diagnosing training loops that underperform specifications. The goal is for teams to turn traces into concrete performance fixes.

The walkthrough begins with the simplest operation: def fn(x, w, b): return torch.add(torch.matmul(x, w), b). The authors provide a runnable script named 01_matmul_add.py and recommend annotating the operation with torch.profiler.record_function("matmul_add") so the event appears clearly in traces. The examples were executed on an NVIDIA A100-SXM4-80GB and wrapped in torch.profiler.profile configured with activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA] to capture both sides of the pipeline.
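The setup described above can be sketched as follows. This is a minimal illustration and not the authors' 01_matmul_add.py: the tensor sizes, the CPU-only fallback and the choice of sort column are assumptions added here so the snippet also runs on a machine without a GPU.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def fn(x, w, b):
    # The operation from the tutorial: a matmul followed by an add.
    return torch.add(torch.matmul(x, w), b)


# Assumption: fall back to CPU-only profiling when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

# Assumption: arbitrary small shapes chosen for illustration.
x = torch.randn(256, 256, device=device)
w = torch.randn(256, 256, device=device)
b = torch.randn(256, device=device)

with profile(activities=activities) as prof:
    # The label makes the region easy to find in the table and the trace.
    with record_function("matmul_add"):
        out = fn(x, w, b)
    if device == "cuda":
        # Wait for queued kernels so they finish inside the profiled region.
        torch.cuda.synchronize()

event_names = [evt.key for evt in prof.key_averages()]
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```

The printed table should contain a matmul_add row alongside the aten:: operators it encloses. On a GPU machine the table also gains CUDA time columns.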

Practically, the guide focuses on what readers need to do and recognize: how to set up torch.profiler to generate traces and tables, how to read the profiler table and the visual trace lanes, and how to identify the CPU lane, the GPU lane and the latency gaps between them. It also traces the full event chain from a Python call down to one or more CUDA kernels, making explicit the mapping between high‑level code and low‑level GPU work.
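One way to make the call-to-kernel chain concrete is to walk an exported trace file by hand. The sketch below uses only the standard library and a small hand-written trace, not real profiler output. It assumes the Chrome trace layout that PyTorch exports typically use: a traceEvents list, cat values such as cpu_op, cuda_runtime and kernel, timestamps in microseconds, and an args.correlation id linking a launch to its kernel. Field names can differ between PyTorch versions, and the kernel names here are hypothetical.

```python
import json

# Synthetic trace written by hand for illustration; real traces are far larger.
trace_json = json.dumps({"traceEvents": [
    {"ph": "X", "cat": "user_annotation", "name": "matmul_add", "ts": 100, "dur": 60},
    {"ph": "X", "cat": "cpu_op", "name": "aten::matmul", "ts": 105, "dur": 30},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel", "ts": 120, "dur": 5,
     "args": {"correlation": 1}},
    {"ph": "X", "cat": "kernel", "name": "hypothetical_gemm_kernel", "ts": 140, "dur": 20,
     "args": {"correlation": 1}},
    {"ph": "X", "cat": "cpu_op", "name": "aten::add", "ts": 137, "dur": 15},
    {"ph": "X", "cat": "cuda_runtime", "name": "cudaLaunchKernel", "ts": 145, "dur": 4,
     "args": {"correlation": 2}},
    {"ph": "X", "cat": "kernel", "name": "hypothetical_elementwise_kernel", "ts": 162, "dur": 6,
     "args": {"correlation": 2}},
]})


def map_launches_to_kernels(trace):
    """Pair each GPU kernel with its CPU-side launch and enclosing operator."""
    events = trace["traceEvents"]
    cpu_ops = [e for e in events if e.get("cat") == "cpu_op"]
    launches = {
        e["args"]["correlation"]: e
        for e in events
        if e.get("cat") == "cuda_runtime" and "correlation" in e.get("args", {})
    }
    rows = []
    for kernel in events:
        if kernel.get("cat") != "kernel":
            continue
        launch = launches.get(kernel.get("args", {}).get("correlation"))
        if launch is None:
            continue
        # The innermost CPU operator whose time span contains the launch call.
        parents = [op for op in cpu_ops
                   if op["ts"] <= launch["ts"] < op["ts"] + op["dur"]]
        parent = min(parents, key=lambda op: op["dur"])["name"] if parents else None
        rows.append({
            "op": parent,
            "kernel": kernel["name"],
            # Gap between the CPU launch call and the kernel starting on the GPU.
            "launch_to_start_us": kernel["ts"] - launch["ts"],
        })
    return rows


rows = map_launches_to_kernels(json.loads(trace_json))
for row in rows:
    print(row)
```

The launch_to_start_us column is the kind of gap between the CPU lane and the GPU lane that the guide teaches readers to spot visually.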

The post notes that most traces first appear as dense walls of events and explains common techniques for navigating them: annotating high‑level operations, running several warm-up iterations before measuring, and capturing both CPU and CUDA activities to observe scheduling and kernel launch behavior. It condenses these into operational rules: warm up GPUs with multiple iterations, annotate meaningful regions of code, and inspect both lanes to diagnose scheduling delays or inefficient kernel usage. A GPU kernel is a program running in parallel on GPU threads, while the CPU is responsible for scheduling and launching those kernels.
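The warm-up rule can be illustrated with a short sketch. It runs a few untimed iterations before the profiled region so that one-time startup costs do not dominate the trace. The iteration counts, tensor shapes and CPU-only fallback are assumptions chosen for illustration and do not come from the tutorial.

```python
import torch
from torch.profiler import ProfilerActivity, profile, record_function


def fn(x, w, b):
    return torch.add(torch.matmul(x, w), b)


# Assumption: fall back to CPU-only profiling when no GPU is present.
device = "cuda" if torch.cuda.is_available() else "cpu"
activities = [ProfilerActivity.CPU]
if device == "cuda":
    activities.append(ProfilerActivity.CUDA)

x = torch.randn(256, 256, device=device)
w = torch.randn(256, 256, device=device)
b = torch.randn(256, device=device)

WARMUP_ITERS = 3    # hypothetical value; tune for the workload
PROFILED_ITERS = 5  # hypothetical value

# Warm-up runs happen outside the profiler, so they never reach the trace.
for _ in range(WARMUP_ITERS):
    fn(x, w, b)
if device == "cuda":
    torch.cuda.synchronize()

with profile(activities=activities) as prof:
    for _ in range(PROFILED_ITERS):
        with record_function("matmul_add"):
            fn(x, w, b)
    if device == "cuda":
        torch.cuda.synchronize()

averages = prof.key_averages()
matmul_add = next(evt for evt in averages if evt.key == "matmul_add")
print(f"matmul_add recorded {matmul_add.count} times")
print(averages.table(sort_by="cpu_time_total", row_limit=8))
```

Only the annotated iterations inside the profile block are counted, which keeps the averaged timings representative of steady-state behavior.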

The authors flag what does and does not change when using torch.compile, helping teams decide where compiler‑level optimizations will affect runtime behavior versus where tracing still reveals the same bottlenecks. They position this piece as the opening installment of a multi‑part series: Part 2 will move from the isolated matmul+add to nn.Linear and a small MLP to motivate optimizations and inspect kernels beneath operators; Part 3 will apply the same diagnostic workflow to Large Language Models implemented with transformers.

The tutorial is aimed at readers with basic PyTorch familiarity who want to use profiling traces to drive concrete performance work.

Sources

Hugging Face Blog · 5/29/2026
