Yumo Xu
Dec 1 2025
CISPO (Clipped IS-weight Policy Optimization) is the RL algorithm behind recent long-context reasoning models like MiniMax-M1 and M2. It also shows up in Meta’s ScaleRL recipe from The Art of Scaling Reinforcement Learning Compute for LLMs, which reports higher asymptotic performance and better compute efficiency than GRPO/DAPO-style baselines in large-scale RL.
So how does CISPO work? MiniMax-M1 describes it with a claim that sounds almost too simple:
“Rather than clipping the token updates as in PPO/GRPO, we instead clip the importance sampling weight….”
At first glance this sounds like a naming trick: PPO also "clips the importance sampling ratio", right? So why are the two considered fundamentally different? This post walks through what's actually going on from the gradient's point of view, which is where the important difference lives.
In LLM RL algorithms, we often optimize a sequence model token-by-token. For a question $q$, the policy $\pi_\theta$ generates a response $o_i = (o_{i,1}, \dots, o_{i,T})$. We define:
Importance sampling (IS) ratio per token:
$$ \textcolor{magenta}{r_{i,t}(\theta)} =
\frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})} {\pi_{\theta_{\text{old}}}(o_{i,t} \mid q, o_{i,<t})} $$
Advantage (in PPO), or group-relative advantage (in GRPO) per token: $\hat A_{i,t}$.
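To make the notation concrete, here is a minimal PyTorch sketch of these two quantities, assuming we already have per-token log-probs from the current and old policies. The tensor names, shapes, and the normalization constant are illustrative assumptions, not MiniMax's code.

```python
import torch

def per_token_is_ratio(logp_new: torch.Tensor, logp_old: torch.Tensor) -> torch.Tensor:
    # r_{i,t}(theta) = pi_theta(o_{i,t} | q, o_{i,<t}) / pi_theta_old(o_{i,t} | q, o_{i,<t}),
    # computed in log space for numerical stability.
    return torch.exp(logp_new - logp_old)

def group_relative_advantage(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # GRPO-style advantage: normalize each response's scalar reward within the
    # group of G responses sampled for the same question q. Shape: (G,).
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Toy example: a group of G = 4 responses, each T = 8 tokens long.
logp_new = torch.randn(4, 8)                    # log pi_theta(o_{i,t} | q, o_{i,<t})
logp_old = logp_new + 0.1 * torch.randn(4, 8)   # log-probs under the sampling policy
rewards  = torch.tensor([1.0, 0.0, 1.0, 0.0])   # one scalar reward per response

r = per_token_is_ratio(logp_new, logp_old)      # (4, 8) per-token IS ratios
A = group_relative_advantage(rewards)           # (4,) advantages, shared across a response's tokens
```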
Ignoring baselines and KL terms, the generic policy gradient has the shape:
$$ \nabla_\theta J \propto \sum_{i,t} w_{i,t} \nabla_\theta \log \pi_\theta(o_{i,t} \mid q, o_{i,<t}), $$
where $w_{i,t}$ is some weight derived from $\textcolor{magenta}{r_{i,t}(\theta)}$ and $\hat A_{i,t}$. Everything about credit assignment, i.e., which tokens really move the model, is encoded in these $w_{i,t}$.
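One common way to realize this gradient shape in code is a surrogate loss that multiplies detached per-token weights by token log-probs; autograd then produces exactly $\sum_{i,t} w_{i,t} \nabla_\theta \log \pi_\theta$. The sketch below assumes the weights and a padding mask are already computed; the names are illustrative, not any particular library's API.

```python
import torch

def weighted_pg_loss(logp_new: torch.Tensor, weights: torch.Tensor,
                     mask: torch.Tensor) -> torch.Tensor:
    # Surrogate loss whose gradient is -sum_{i,t} w_{i,t} * grad log pi_theta(o_{i,t} | ...).
    # Detaching the weights makes them act as constants, so all algorithm-specific
    # structure (IS correction, advantages, clipping) lives inside `weights`.
    return -(weights.detach() * logp_new * mask).sum() / mask.sum()
```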
In vanilla off-policy REINFORCE, $w_{i,t} = \textcolor{magenta}{r_{i,t}(\theta)} \hat A_{i,t}.$ You correct off-policy sampling with the IS ratio $\textcolor{magenta}{r_{i,t}(\theta)}$, and use whatever advantage estimate you like. High variance, but no additional structure.
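As a baseline for the later sections, here is a hedged sketch of that weight plugged into the surrogate above, assuming a GRPO-style advantage (one scalar per response, broadcast over its tokens). Note the IS ratio is left unclipped, which is where the variance comes from.

```python
import torch

def reinforce_is_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                      advantages: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    # Vanilla off-policy REINFORCE: w_{i,t} = r_{i,t}(theta) * A_i, with the IS
    # ratio left unclipped. `advantages` has one entry per response, shared by
    # every token of that response.
    r = torch.exp(logp_new - logp_old)        # (G, T) per-token IS ratios
    w = r * advantages.unsqueeze(-1)          # (G, T) token weights
    return -(w.detach() * logp_new * mask).sum() / mask.sum()
```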
In the rest of this blog, we will show how PPO/GRPO uses the idea of a trust region for training stability (Section 2), how this indirectly leads to under-training of fork tokens (Section 3), and how CISPO addresses it by going back to a REINFORCE-style objective with a different twist on clipping (Section 4).