RL[1/n]
2026-01-12
On-Policy Distillation (Thinking Machines Lab)
[Figure: from a shared prompt, two training signals compared: an RLVR rollout vs. SFT logit distillation]

- Problem with off-policy distillation (SFT on teacher outputs)
    - OOD: the student trains only in the teacher's contexts, never its own
    - when tested out of distribution, its errors compound, "diverging ever farther from the states it observed in training"
    - it imitates the teacher's style and confidence but not necessarily its factual accuracy
OPD: on-policy distillation
- sample trajectories from the student model and use a high-performing teacher to grade each token of each trajectory

- per-token reverse KL: $$ \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{teacher}}\right) = \mathbb{E}_{x \sim \pi_\theta} \Big[ \log \pi_\theta(x_{t+1}\mid x_{1:t}) - \log \pi_{\text{teacher}}(x_{t+1}\mid x_{1:t}) \Big] $$
- reward function: the negative reverse KL, so maximizing reward minimizes the per-token reverse KL
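A minimal sketch of how the per-token reverse KL above could be estimated on a student-sampled trajectory (pure Python for clarity; the function names are mine, and real implementations operate on batched logit tensors):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def per_token_reverse_kl(student_logits, teacher_logits, sampled_ids):
    """Monte-Carlo estimate of the per-token reverse KL on a trajectory
    sampled from the student: log pi_theta(x_t) - log pi_teacher(x_t),
    evaluated at each token the student actually emitted."""
    out = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_ids):
        s = log_softmax(s_row)[tok]
        t = log_softmax(t_row)[tok]
        out.append(s - t)
    return out  # negate and average over tokens for the reward/loss
```

Because the expectation is taken over the student's own samples, evaluating the log-probability difference at each sampled token is an unbiased per-token estimate; it is zero wherever the teacher agrees with the student and large where the teacher finds the student's token unlikely.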
Exp
Qwen3-8B-Base, math task, initialized with SFT on 400k prompts (SFT-400K)

| Method | AIME'24 | Teacher FLOPs | Student FLOPs | CE (compute efficiency) vs. SFT-2M |
|---|---|---|---|---|
| Initialization: SFT-400K | 60% | 8.5 × 10²⁰ | 3.8 × 10²⁰ | – |
| SFT-2M (extrapolated) | ~70% (extrapolated) | 3.4 × 10²¹ | 1.5 × 10²¹ | 1× |
| Reinforcement learning | 68% | – | – | ≈1× |
| On-policy distillation | 70% | 8.4 × 10¹⁹ | 8.2 × 10¹⁹ | 9–30× |

Exp
Tulu3-style internal assistant
- we could alternate between phases of fine-tuning on new data and distillation to recover assistant behavior, allowing the model to learn and stay up to date on knowledge over time
- https://arxiv.org/abs/2009.04416
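A hypothetical sketch of that alternating cycle (the `midtrain_step` / `distill_step` callables are placeholders of my own, not an API from the post):

```python
def continual_update(student, teacher, document_batches,
                     midtrain_step, distill_step):
    """Alternate knowledge injection and behavior recovery.

    Each round: midtrain on a batch of new documents (adds knowledge,
    tends to degrade chat behavior), then distill against a frozen
    teacher checkpoint to restore that behavior.
    """
    for docs in document_batches:
        student = midtrain_step(student, docs)    # learn new facts
        student = distill_step(student, teacher)  # recover assistant behavior
    return student
```

The design choice this encodes: distillation targets a fixed earlier checkpoint of the model itself (the "teacher"), so chat behavior can be recovered repeatedly without re-collecting SFT data after each midtraining phase.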
| Model | Internal QA eval (knowledge) | IF-eval (chat) |
|---|---|---|
| Qwen3-8B | 18% | 85% |
| + midtrain (100%) | 43% | 45% |
| + midtrain (70%) | 36% | 79% |
| + midtrain (70%) + distill | 41% | 83% |