Yumo Xu | Website | X | LinkedIn
Jan 16 2026
If you’ve ever profiled RL post-training for LLMs and thought “why is the GPU idle so much?”, you’re not alone. This post is a systems tour of the RL post-training pipeline, with special attention to the hard **constraints**: places where progress stalls until an upstream dependency resolves. In particular, I’ll use GRPO terminology throughout: for each prompt, we sample $K$ completions, score them, form a within-group baseline, and compute per-sample advantages.
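To make the group structure concrete, here is a minimal sketch of the per-group advantage computation. The function name and the std-normalization are illustrative (whether you divide by the within-group standard deviation varies by implementation):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-sample advantages for one prompt's group of K completions.

    rewards: shape (K,), one scalar reward per sampled completion.
    The baseline is the within-group mean; many implementations also
    divide by the within-group std, as shown here.
    """
    baseline = rewards.mean()
    advantages = rewards - baseline
    return advantages / (rewards.std() + eps)

# One prompt, K = 4 completions scored by a reward function
print(grpo_advantages(np.array([1.0, 0.0, 0.5, 1.0])))
```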
In modern GRPO/PPO-style post-training, wall-clock time is dominated less by the backward pass and more by three constraints in the pipeline: group completion, policy freshness, and KV locality. By the end, you should be able to look at a trace, say which constraint dominates, and relax it with existing open-source offerings or your own solutions.
🌟 Feedback is very welcome as I continue refining this guide, especially pointers to missing systems, papers, or production anecdotes. Reach me on X / LinkedIn, or email me.
🪶 Revision history ✦ Jan 17 2026: Added Slime (Zilin et al., 2025) and Async GRPO from Nemo RL ✦ Jan 16 2026: Published initial draft
When people say “rollout is the bottleneck”, what they usually mean is that the pipeline has a few gating points where downstream work must wait. In GRPO/PPO-style post-training, three constraints typically dominate time-to-gradient: group completion, policy freshness, and KV locality (see the figure below).

The following table provides a high-level overview of these constraints, their symptoms, and potential ways to relax them (discussed in the rest of this post).
| Constraint | Symptom | Fix |
|---|---|---|
| C1: Group Completion | Heavy-tailed output lengths → a few stragglers starve the trainer. | Kill the tail bubbles with load balancing (sync) or group streaming (async) |
| C2: Policy Freshness | Periodic GPU idling around weight pushes, especially with long-tail generations. | Decouple rollout from training and make freshness a bounded variable (sync cadence + staleness threshold; see the sketch after the table) |
| C3: KV Locality (see Notes) | Prefill-heavy traces + high concurrency → KV pressure, fragmentation/thrashing, head-of-line blocking. | Use inference-engine features intentionally: prefix caching, chunked prefill, cache-aware scheduling (see the engine config sketch after the table) |
Notes: In the context of LLM inference, locality refers to the probability that the KV state needed to execute the next scheduled unit of work (e.g., a shared prefix or a decode continuation) is already resident in a fast, accessible place (ideally the same GPU), so you avoid re-prefill or an expensive state transfer.
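To make C2 concrete, here is a minimal sketch of a staleness bound on the rollout-to-trainer handoff. Everything here (the `Rollout` record, `MAX_STALENESS`, the filtering policy) is illustrative, not any particular framework’s API:

```python
from dataclasses import dataclass

MAX_STALENESS = 2  # illustrative bound: accept rollouts at most 2 weight pushes old

@dataclass
class Rollout:
    prompt_id: int
    policy_version: int   # weight version the sampler held when generating
    group: list           # K scored completions for this prompt

def filter_fresh(rollouts: list[Rollout], trainer_version: int) -> list[Rollout]:
    """Keep only rollouts sampled within MAX_STALENESS weight pushes.

    This turns freshness into a bounded variable: the trainer can run
    ahead of the samplers, but never consumes data that is too stale,
    and samplers never block on a synchronous weight push.
    """
    return [r for r in rollouts
            if trainer_version - r.policy_version <= MAX_STALENESS]
```

And for C3, assuming a recent vLLM (flag names may differ across versions; the model name is a placeholder), enabling prefix caching and chunked prefill looks roughly like this:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    enable_prefix_caching=True,    # reuse KV blocks for shared prompt prefixes
    enable_chunked_prefill=True,   # split long prefills so decode isn't starved
)
# K = 8 completions per prompt, GRPO-style
params = SamplingParams(n=8, temperature=1.0, max_tokens=512)
outputs = llm.generate(["<prompt>"], params)
```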