Asynchronous Continuous Batching cuts LLM GPU idle time by about 24%

5/14/2026, 5:04:12 PM

Rémi Ouazan Reboul, Pedro Cuenca and Aritra Roy Gosthipaty published a technical post on May 14, 2026 proposing an implementation path for asynchronous continuous batching to speed up LLM inference. They position the write-up as the second entry in a series on efficient inference and show that decoupling CPU batch preparation from GPU compute can eliminate sizable GPU idle gaps (about 24% of runtime in their representative profile), improving throughput and reducing cost.

The authors contrast the common synchronous continuous-batching loop with an asynchronous design. In a synchronous loop, the CPU prepares a batch and transfers inputs to the GPU, the GPU performs a forward pass and sampling, and the CPU then updates its state before the next batch can start. That strict alternation prevents CPU and GPU work from overlapping, so the GPU sits idle while the CPU handles updates.

Their measured profile generates 8,000 tokens with batch size 32 on an 8B model and reports total generation time of 300.6 seconds, of which 24.0% was GPU idle time waiting for CPU updates. The post underscores the cost sensitivity of such inefficiencies by noting H200 instances cost roughly $5 per hour on Inference Endpoints (≈ $120 per day), so wasted device time translates directly into wasted spend.
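The figures in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the last quantity rests on an assumption that is not in the post, namely that the 24% idle share would hold for a full day of generation.

```python
# Figures quoted in the post.
total_seconds = 300.6        # total generation time of the measured profile
gpu_idle_fraction = 0.24     # share of that time the GPU spent waiting on the CPU
hourly_rate_usd = 5.0        # approximate H200 price on Inference Endpoints

idle_seconds = total_seconds * gpu_idle_fraction
daily_cost_usd = hourly_rate_usd * 24

# Assumption (not from the post): the same idle share applies all day.
idle_cost_per_day_usd = daily_cost_usd * gpu_idle_fraction

print(f"GPU idle time in the profile: {idle_seconds:.1f} s")       # 72.1 s
print(f"Instance cost per day: ${daily_cost_usd:.2f}")              # $120.00
print(f"Cost of idle time per day: ${idle_cost_per_day_usd:.2f}")   # $28.80
```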

The asynchronous approach is conceptually straightforward: prepare batch N+1 on the CPU while the GPU computes batch N so that the two sides work in parallel and GPU busy time approaches 100%. Critically, the authors emphasize this improvement comes from coordination and scheduling at the software level rather than modifying model weights or kernel implementations, meaning teams can apply it to existing hardware stacks.
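The overlap described here can be illustrated with a toy model. This is a minimal sketch, not the authors' implementation: `prepare_batch` and `gpu_forward` are hypothetical stand-ins that only sleep, a single worker thread plays the role of the GPU, and the example deliberately ignores the fact that a real batch N+1 depends on tokens sampled by batch N.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def prepare_batch(step):
    """Hypothetical stand-in for CPU-side scheduling and input building."""
    time.sleep(0.01)
    return {"step": step, "inputs": [step] * 4}

def gpu_forward(batch):
    """Hypothetical stand-in for the GPU forward pass and sampling."""
    time.sleep(0.03)
    return [x + 1 for x in batch["inputs"]]

def run_sync(num_steps):
    """Strict alternation: the 'GPU' waits while the CPU prepares each batch."""
    outputs = []
    for step in range(num_steps):
        batch = prepare_batch(step)
        outputs.append(gpu_forward(batch))
    return outputs

def run_async(num_steps):
    """Prepare batch N+1 on the CPU while the 'GPU' worker computes batch N."""
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as gpu:
        batch = prepare_batch(0)
        for step in range(num_steps):
            future = gpu.submit(gpu_forward, batch)  # launch N, returns immediately
            if step + 1 < num_steps:
                batch = prepare_batch(step + 1)      # overlaps with batch N
            outputs.append(future.result())          # wait for batch N to finish
    return outputs

if __name__ == "__main__":
    for fn in (run_sync, run_async):
        start = time.perf_counter()
        fn(10)
        print(f"{fn.__name__}: {time.perf_counter() - start:.3f} s")
```

Both functions return the same outputs; only the scheduling differs. In the asynchronous variant the preparation time is hidden behind the simulated GPU work whenever preparation is the shorter of the two.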

Turning the idea into a robust system raises three engineering challenges the post addresses in detail: how to launch GPU work while immediately regaining CPU control, how to ensure each side has the data it needs at the right time, and how to prepare a dependent batch N+1 when it relies on tokens sampled by batch N. The post walks through an implementation designed from first principles to resolve each question.
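The third challenge, the dependency between consecutive batches, can be made concrete with a small simulation. The sketch below is an illustration under stated assumptions and not the approach taken in the post: it splits preparation into a token-independent part (here just positions, built by the hypothetical `prepare_metadata`) that can overlap with the in-flight step, and a token-dependent part that is filled in only once the previous step's samples are available. `gpu_step` is a hypothetical stand-in with a made-up sampling rule.

```python
from concurrent.futures import ThreadPoolExecutor

def gpu_step(token_ids, positions):
    """Hypothetical stand-in for forward pass + sampling: one new token per sequence."""
    return [(t + p) % 1000 for t, p in zip(token_ids, positions)]

def prepare_metadata(step, num_seqs):
    """CPU work that does not depend on sampled tokens (here: positions only)."""
    return [step] * num_seqs

def generate(first_tokens, num_steps):
    """Overlap token-independent preparation with the step that is in flight."""
    num_seqs = len(first_tokens)
    history = []
    with ThreadPoolExecutor(max_workers=1) as gpu:
        positions = prepare_metadata(0, num_seqs)
        pending = gpu.submit(gpu_step, first_tokens, positions)
        for step in range(1, num_steps):
            # This part of batch N+1 is built while batch N is still running.
            next_positions = prepare_metadata(step, num_seqs)
            # Only the input tokens depend on batch N, so wait for them last.
            sampled = pending.result()
            history.append(sampled)
            pending = gpu.submit(gpu_step, sampled, next_positions)
        history.append(pending.result())
    return history

print(generate([1, 2], 3))  # [[1, 2], [2, 3], [4, 5]]
```

In this toy version the CPU still blocks briefly on `pending.result()` before launching the next step. A GPU implementation might avoid even that wait by keeping the sampled tokens on the device, but that detail is an assumption here rather than something this summary confirms.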

For builders, the practical implication is concrete: continuous batching already reduces padding waste, and making it asynchronous targets the second major inefficiency, CPU/GPU alternation, which can account for nearly a quarter of runtime in tight generation loops. Eliminating those gaps can reduce generation time proportionally (the authors' optimistic projection shortens 300.6 s to about 228 s) and thus directly improves throughput and cost efficiency.
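The projection follows from removing the idle share from the measured runtime. The sketch below reproduces it and derives the implied speedup; the function names are illustrative, and the calculation assumes the best case in which all idle time is eliminated and nothing else changes.

```python
def projected_time(total_seconds, idle_fraction):
    """Runtime if every idle second were removed (best case)."""
    return total_seconds * (1.0 - idle_fraction)

def speedup(idle_fraction):
    """Throughput gain implied by removing the idle share."""
    return 1.0 / (1.0 - idle_fraction)

print(f"Projected time: {projected_time(300.6, 0.24):.1f} s")  # 228.5 s
print(f"Implied speedup: {speedup(0.24):.2f}x")                # 1.32x
```

So a 24% reduction in runtime corresponds to roughly 32% more tokens generated per unit of time on the same hardware.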

The post also references prior coverage of building blocks such as KV cache handling, FlashAttention, and attention masks, and provides instrumentation guidance. The authors point to a GitHub update and accompanying scripts to dump CPU/GPU spans so teams can reproduce the profiling graphs and validate asynchronous batching gains in their own inference stacks.

Sources

Hugging Face Blog · 5/14/2026
