Unlocking asynchronicity in continuous batching
This article explores asynchronous batching as a way to optimize LLM inference by disentangling CPU and GPU workloads. Letting the two run in parallel removes the idle gaps between batches, which improves GPU utilization and boosts performance significantly. The approach can save roughly a quarter of total runtime without kernel or model changes.
This is the second installment in a series dedicated to optimizing large language model (LLM) inference. The previous article introduced fundamental concepts such as KV cache and FlashAttention, which are built upon here.
Maximizing GPU utilization is crucial given the cost of powerful GPUs like the H200. Continuous batching improves this by scheduling tightly packed batches, but a significant inefficiency remains: the synchronous nature of CPU and GPU operations. This means the CPU and GPU alternate, leading to idle periods where one waits for the other. These idle gaps can account for nearly a quarter of the total runtime.
To address this, asynchronous batching disentangles CPU batch preparation from GPU batch computation, allowing both to run in parallel. This ensures the GPU is continuously engaged in useful work, eliminating idle time.
The core inefficiency of synchronous batching lies in these alternating waits: while the GPU computes, the CPU is idle, and vice versa. Because continuous batching processes hundreds of steps per second, the small gap in each step adds up to a considerable loss of throughput.
Profiling reveals that in synchronous batching, CPU and GPU activity never overlap. A significant portion of generation time, up to 24%, is spent with the GPU idle while it waits for the CPU. Eliminating this overhead could yield a substantial speedup without any changes to kernels or models.
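As a rough back-of-the-envelope check of that figure, the sketch below models a generation run as repeated steps of CPU preparation and GPU computation. The per-step durations (2.4 ms on the CPU, 7.6 ms on the GPU), the step count, and the function names are illustrative assumptions chosen so that the GPU sits idle 24% of the time; they are not measurements from the article.

```python
def synchronous_runtime_us(cpu_us: int, gpu_us: int, steps: int) -> int:
    """CPU and GPU alternate, so every step pays for both phases."""
    return steps * (cpu_us + gpu_us)


def overlapped_runtime_us(cpu_us: int, gpu_us: int, steps: int) -> int:
    """Idealized overlap: batch N+1 is prepared while batch N computes.

    Only the first preparation is exposed; each later step costs
    whichever phase is slower, and the last step is GPU work only.
    """
    return cpu_us + (steps - 1) * max(cpu_us, gpu_us) + gpu_us


if __name__ == "__main__":
    # Hypothetical per-step costs in microseconds (assumed, not measured).
    cpu_us, gpu_us, steps = 2_400, 7_600, 100

    sync_total = synchronous_runtime_us(cpu_us, gpu_us, steps)
    async_total = overlapped_runtime_us(cpu_us, gpu_us, steps)

    print(f"synchronous: {sync_total} us")   # 1000000 us
    print(f"overlapped:  {async_total} us")  # 762400 us
    print(f"runtime saved: {1 - async_total / sync_total:.1%}")  # 23.8%
```

Under these assumed numbers, perfect overlap hides almost all of the CPU time, which is why the saving approaches, but does not quite reach, the 24% idle share. A real system would likely recover less, since the model ignores synchronization costs.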
The idea is simple: prepare batch N+1 on the CPU while the GPU computes batch N. Putting it into practice presents technical challenges, but asynchronous batching as implemented in the transformers library shows that they can be overcome through careful coordination between the CPU and the GPU.
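The overlap pattern itself can be illustrated with a minimal producer-consumer sketch. This is not the transformers implementation: `prepare_batch`, `compute_batch`, and `run_pipelined` are hypothetical stand-ins, no GPU is involved, and the sketch deliberately ignores any dependency of batch N+1 on the results of batch N.

```python
import queue
import threading


def prepare_batch(step: int) -> list[int]:
    """Stand-in for CPU-side work such as scheduling and building inputs."""
    return [step * 10 + i for i in range(4)]


def compute_batch(batch: list[int]) -> int:
    """Stand-in for the GPU forward pass on one batch."""
    return sum(batch)


def run_pipelined(num_steps: int) -> list[int]:
    # A one-slot queue bounds how far preparation can run ahead of compute.
    ready: queue.Queue = queue.Queue(maxsize=1)

    def producer() -> None:
        for step in range(num_steps):
            ready.put(prepare_batch(step))  # blocks while the slot is full
        ready.put(None)  # sentinel: no more batches

    worker = threading.Thread(target=producer, daemon=True)
    worker.start()

    results = []
    # While this loop computes one batch, the producer prepares the next.
    while (batch := ready.get()) is not None:
        results.append(compute_batch(batch))
    worker.join()
    return results


if __name__ == "__main__":
    print(run_pipelined(3))  # [6, 46, 86]
```

The bounded queue is a design choice for the sketch: it keeps preparation from racing arbitrarily far ahead while still letting the two stages overlap.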
