Yumo Xu | Website | X | LinkedIn

April 29, 2026

Intro: Why OPD?

Modern post-training has a see-saw problem: Math RLVR shortens reasoning traces and hurts open-ended writing. RLHF buys preference alignment at the cost of strict instruction following. Tool-use RL drifts away from STEM benchmarks. When every specialization stage trades against the others, shipping one model that holds onto everything becomes difficult.

On-policy distillation (OPD) has emerged as a standard fix. The idea: sample trajectories from the student, then match a teacher's distribution along those rollouts via reverse KL. You get dense, token-level supervision that drops into a GRPO-style training loop almost unchanged. The natural extension is multi-teacher OPD (MOPD): make each capability's strongest checkpoint a teacher and let the student absorb them all at once. Teachers usually share a tokenizer and lineage with the student, so the engineering overhead stays small.
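To make "dense, token-level supervision" concrete, here is a minimal sketch of the per-token reverse-KL loss on student-sampled rollouts, assuming student and teacher logits over a shared vocabulary are already available; the function and tensor names are illustrative, not taken from any of the reports below.

```python
import torch
import torch.nn.functional as F

def reverse_kl_distill_loss(student_logits, teacher_logits, response_mask):
    """Per-token reverse KL(student || teacher) on student-sampled rollouts.

    student_logits, teacher_logits: [batch, seq_len, vocab_size] scored on the
        same token sequence (shared tokenizer).
    response_mask: [batch, seq_len], 1 on generated tokens, 0 on prompt/padding.
    """
    student_logp = F.log_softmax(student_logits.float(), dim=-1)
    teacher_logp = F.log_softmax(teacher_logits.float(), dim=-1)
    # KL(pi_student || pi_teacher) at each position, summed over the vocabulary.
    per_token_kl = (student_logp.exp() * (student_logp - teacher_logp)).sum(dim=-1)
    # Average over the tokens the student actually generated.
    return (per_token_kl * response_mask).sum() / response_mask.sum().clamp(min=1)
```

One common way to wire this into an existing RL stack is to treat the negative per-token KL as a dense signal in place of the sparse outcome reward, which is why it slots into a GRPO-style rollout-and-update loop with so little change.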

This post walks through four 2026 frontier reports that all converge on MOPD but deploy it differently: MiMo-V2-Flash (Jan), GLM-5 (Feb), Nemotron-Cascade 2 (Mar), and DeepSeek-V4 (Apr).

After a quick GRPO → OPD primer, we'll go through each in turn and discuss what's converged, what's diverged, and what's ahead.

Background

Original GRPO

GRPO minimizes the following loss:

$$ \small{ \begin{align*} \mathcal{J}_{\text{GRPO}}(\theta) &= -\mathbb{E}_{x \sim \mathcal{D},\, \{y_i\}_{i=1}^G \sim \pi_{\textcolor{red}{\text{infer}}}(\cdot \mid x; \theta_{\text{old}})} \left[ \frac{1}{G} \sum_{i=1}^G \frac{1}{|y_i|} \sum_{t=1}^{|y_i|} \min \left( r_{i,t}\, \textcolor{purple}{\widehat{A}_{i,t}},\ \text{clip} \left( r_{i,t}, 1 - \varepsilon, 1 + \varepsilon \right) \textcolor{purple}{\widehat{A}_{i,t}} \right) \right] \end{align*} } $$

where the PPO-like per-token importance sampling ratio is $r_{i,t}(\theta) = \frac{\pi_\theta(y_{i,t} \mid x, y_{i,<t})} {\pi_{\theta_{\text{old}}}(y_{i,t} \mid x, y_{i,<t})}$.

The group-relative advantage per token $\textcolor{purple}{\widehat A_{i,t}}$ is defined as:

$$ \textcolor{purple}{\widehat A_{i,t}} = \frac{R_i - \text{mean}(R_1, \ldots, R_G)} {\text{std}(R_1, \ldots, R_G)} $$

where $R_i$ is the outcome reward for the $i$th rollout in a group.
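Putting the ratio and the advantage together, here is a minimal sketch of a token-level GRPO loss; the shapes, names, and the small epsilon for numerical stability are my own choices for illustration rather than anything specified in the cited reports.

```python
import torch

def grpo_loss(logp_new, logp_old, rewards, loss_mask, group_size, eps=0.2):
    """Token-level GRPO loss for a batch of G rollouts per prompt.

    logp_new / logp_old: [B, T] log-probs of the sampled tokens under the current
        and rollout (old) policies, where B = num_prompts * group_size and
        rollouts from the same prompt are contiguous.
    rewards:   [B] scalar outcome reward R_i per rollout.
    loss_mask: [B, T], 1 on generated tokens, 0 on prompt/padding.
    """
    # Group-relative advantage: normalize each rollout's reward within its group.
    R = rewards.view(-1, group_size)                              # [num_prompts, G]
    adv = (R - R.mean(dim=1, keepdim=True)) / (R.std(dim=1, keepdim=True) + 1e-6)
    adv = adv.view(-1, 1).expand_as(logp_new)                     # broadcast to tokens

    # PPO-style clipped surrogate on the per-token importance ratio.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * adv
    per_token = torch.minimum(unclipped, clipped)

    # Length-normalized mean over each rollout's tokens, then mean over rollouts.
    per_rollout = (per_token * loss_mask).sum(dim=1) / loss_mask.sum(dim=1).clamp(min=1)
    return -per_rollout.mean()
```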

<aside> 💡

Note: Without loss of generality, this post sticks to the original GRPO objective for clarity. Common GRPO best practices, such as dropping the standard deviation from the advantage estimate or using a different length normalization, can still be applied on top (a brief sketch of the first follows this note).

</aside>
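As an example of that first variant, here is a small helper (illustrative only) with the standard-deviation term made optional; the length-normalization choice is an analogous one-line change in the loss sketch above.

```python
import torch

def group_advantage(rewards, group_size, use_std=True):
    """Group-relative advantage; set use_std=False for the mean-centering-only variant."""
    R = rewards.view(-1, group_size)
    centered = R - R.mean(dim=1, keepdim=True)
    if not use_std:
        return centered.view(-1)
    return (centered / (R.std(dim=1, keepdim=True) + 1e-6)).view(-1)
```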