Yumo Xu | Website | X | LinkedIn
Jan 16 2026
If you’ve ever profiled RL post-training for LLMs and thought “why is the GPU idle so much?”, you’re not alone. This post is a systems tour of the RL post-training pipeline, with special attention to the hard **constraints**: places where progress stalls until an upstream dependency resolves. In particular, I’ll use GRPO terminology throughout: for each prompt, we sample $K$ completions, score them, form a within-group baseline, and compute per-sample advantages.
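To make the group structure concrete, here is a minimal sketch of the per-group advantage computation. The function name and the std-normalization are illustrative (whether you divide by the within-group standard deviation varies by implementation):

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """Per-sample advantages for one prompt's group of K completions.

    rewards: shape (K,), one scalar reward per sampled completion.
    The baseline is the within-group mean; many implementations also
    divide by the within-group std, as shown here.
    """
    baseline = rewards.mean()
    advantages = rewards - baseline
    return advantages / (rewards.std() + eps)

# One prompt, K = 4 completions scored by a reward function
print(grpo_advantages(np.array([1.0, 0.0, 0.5, 1.0])))
```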
In modern GRPO/PPO-style post-training, wall-clock time is dominated less by the backward pass and more by three constraints in the pipeline: group completion, policy freshness, and KV locality. By the end, you should be able to look at a trace, say which constraint dominates, and relax it with existing open-source offerings or your own solutions.
🌟 Feedback is very welcome as I continue refining this guide, especially pointers to missing systems, papers, or production anecdotes. Reach me on X / LinkedIn, or email me.
🪶 Revision history ✦ Jan 17 2026: Added Slime (Zilin et al., 2025) and Async GRPO from Nemo RL ✦ Jan 16 2026: Published initial draft
When people say “rollout is the bottleneck”, what they usually mean is that the pipeline has a few gating points where downstream work must wait. In GRPO/PPO-style post-training, three constraints typically dominate time-to-gradient: group completion, policy freshness, and KV locality (see the figure below).

The following table provides a high-level overview of these constraints, their symptoms, and potential ways to relax them (discussed in the rest of this post).
| Constraint | Symptom | Fix |
|---|---|---|
| C1: Group Completion | Heavy-tailed output lengths → a few stragglers starve the trainer. | Kill the tail bubbles with load balancing (sync) or group streaming (async) |
| C2: Policy Freshness | Periodic GPU idling around weight pushes, especially with long-tail generations. | Decouple rollout from training and make freshness a bounded variable (sync cadence + staleness threshold; see the sketch after the table) |
| C3: KV Locality (see Notes) | Prefill-heavy traces + high concurrency → KV pressure, fragmentation/thrashing, head-of-line blocking. | Use inference-engine features intentionally: prefix caching, chunked prefill, cache-aware scheduling (see the engine config sketch after the table) |
Notes: In the context of LLM inference, locality refers to the probability that the KV state needed to execute the next scheduled unit of work (e.g., a shared prefix or a decode continuation) is already resident in a fast, accessible place (ideally the same GPU), so you avoid re-prefill or an expensive state transfer.
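To make C2 concrete, here is a minimal sketch of a staleness bound on the rollout-to-trainer handoff. Everything here (the `Rollout` record, `MAX_STALENESS`, the filtering policy) is illustrative, not any particular framework’s API:

```python
from dataclasses import dataclass

MAX_STALENESS = 2  # illustrative bound: accept rollouts at most 2 weight pushes old

@dataclass
class Rollout:
    prompt_id: int
    policy_version: int   # weight version the sampler held when generating
    group: list           # K scored completions for this prompt

def filter_fresh(rollouts: list[Rollout], trainer_version: int) -> list[Rollout]:
    """Keep only rollouts sampled within MAX_STALENESS weight pushes.

    This turns freshness into a bounded variable: the trainer can run
    ahead of the samplers, but never consumes data that is too stale,
    and samplers never block on a synchronous weight push.
    """
    return [r for r in rollouts
            if trainer_version - r.policy_version <= MAX_STALENESS]
```

And for C3, assuming a recent vLLM (flag names may differ across versions; the model name is a placeholder), enabling prefix caching and chunked prefill looks roughly like this:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
    enable_prefix_caching=True,    # reuse KV blocks for shared prompt prefixes
    enable_chunked_prefill=True,   # split long prefills so decode isn't starved
)
# K = 8 completions per prompt, GRPO-style
params = SamplingParams(n=8, temperature=1.0, max_tokens=512)
outputs = llm.generate(["<prompt>"], params)
```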