Research & Papers · Hugging Face - Blog · May 14, 2026

Unlocking asynchronicity in continuous batching

This article explores asynchronous batching as a way to optimize LLM inference by disentangling CPU and GPU workloads. Running the two in parallel removes the idle gaps between batches and improves GPU utilization. The method can save close to a quarter of total runtime without kernel or model changes.

Author: Morein.ai Editorial

This is the second installment in a series on optimizing large language model (LLM) inference. The previous article introduced fundamental concepts such as the KV cache and FlashAttention, which this article builds on.

Maximizing GPU utilization is crucial given the cost of powerful GPUs like the H200. Continuous batching helps by scheduling tightly packed batches, but a significant inefficiency remains: CPU and GPU operations run synchronously. The CPU prepares a batch, the GPU computes it, and each sits idle while the other works. These idle gaps can account for nearly a quarter of the total runtime.

To address this, asynchronous batching disentangles CPU batch preparation from GPU batch computation, allowing both to run in parallel. This ensures the GPU is continuously engaged in useful work, eliminating idle time.

The core inefficiency of synchronous batching lies in these alternating waits: while the GPU computes, the CPU is idle, and vice versa. Because continuous batching processes hundreds of steps per second, even a short gap per step adds up to a considerable loss of throughput.

Profiling reveals that in synchronous batching, CPU and GPU activity never overlap. Up to 24% of generation time is spent with the GPU idle, waiting for the CPU. Eliminating this overhead could lead to a substantial speedup without any changes to kernels or models.
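To see what removing that idle share could be worth, here is a back-of-the-envelope timing model. It is a sketch under stated assumptions, not a measurement: the per-step costs (2.4 ms of CPU preparation, 7.6 ms of GPU compute) are hypothetical values chosen only so that the GPU idle share matches the 24% figure, and the function names are illustrative.

```python
def sync_time(steps, cpu_ms, gpu_ms):
    """Total time when CPU preparation and GPU compute strictly alternate."""
    return steps * (cpu_ms + gpu_ms)


def async_time(steps, cpu_ms, gpu_ms):
    """Total time when preparation of step N+1 overlaps compute of step N.

    Idealized two-stage pipeline: the first preparation cannot be hidden,
    every following step costs as much as the slower stage, and the last
    compute has nothing left to overlap with.
    """
    if steps == 0:
        return 0.0
    return cpu_ms + (steps - 1) * max(cpu_ms, gpu_ms) + gpu_ms


# Hypothetical per-step costs, picked to reproduce a 24% GPU idle share.
CPU_MS, GPU_MS, STEPS = 2.4, 7.6, 1000

idle_share = CPU_MS / (CPU_MS + GPU_MS)
speedup = sync_time(STEPS, CPU_MS, GPU_MS) / async_time(STEPS, CPU_MS, GPU_MS)
print(f"GPU idle share when synchronous: {idle_share:.0%}")
print(f"Ideal speedup from overlapping:  {speedup:.2f}x")
```

Under this model the best case is bounded by the slower stage: once preparation is fully hidden behind compute, further gains have to come from the GPU side.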

The idea is simple: prepare batch N+1 while batch N is computing. Implementing it raises technical challenges, but the asynchronous batching in the transformers library shows that they can be solved through careful coordination between the CPU and the GPU.
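The overlap of batch N+1 preparation with batch N compute can be illustrated with a small simulation. This is a minimal sketch and not the transformers implementation: `prepare_batch` and `compute_batch` are hypothetical stand-ins that sleep instead of doing real work, a background thread plays the role of the CPU-side scheduler, and the sketch assumes that preparing a batch does not depend on the output of the previous one.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def prepare_batch(step):
    """Hypothetical stand-in for CPU work: scheduling and building inputs."""
    time.sleep(0.01)
    return {"step": step, "tokens": [step, step + 1]}


def compute_batch(batch):
    """Hypothetical stand-in for the GPU forward pass."""
    time.sleep(0.03)
    return sum(batch["tokens"])


def run_sync(num_steps):
    """Prepare, then compute, one step at a time (no overlap)."""
    outputs = []
    for step in range(num_steps):
        batch = prepare_batch(step)
        outputs.append(compute_batch(batch))
    return outputs


def run_async(num_steps):
    """Prepare batch N+1 in the background while batch N is computing."""
    if num_steps == 0:
        return []
    outputs = []
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(prepare_batch, 0)
        for step in range(num_steps):
            batch = pending.result()  # wait for the batch prepared earlier
            if step + 1 < num_steps:
                # Start preparing the next batch before computing this one.
                pending = pool.submit(prepare_batch, step + 1)
            outputs.append(compute_batch(batch))
    return outputs


if __name__ == "__main__":
    for runner in (run_sync, run_async):
        start = time.perf_counter()
        runner(20)
        print(f"{runner.__name__}: {time.perf_counter() - start:.2f}s")
```

Both runners produce the same outputs; only the wall-clock time differs, because the simulated preparation is hidden behind the simulated compute. A real engine also has to handle the case where the next batch depends on tokens the current batch has not produced yet, which this sketch deliberately leaves out.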
