RL [1/n]

2026-01-12

On-Policy Distillation - Thinking Machines Lab

  • Prompt

  • RLVR rollout

  • SFT: logit-distill

    • Problem
      • OOD drift: the student is trained on the teacher's outputs, i.e. in the teacher's contexts
        • at test time it encounters its own (out-of-distribution) states, "diverging ever farther from the states it observed in training"
      • the student imitates the teacher's style and confidence, but not necessarily its factual accuracy
  • OPD: On-policy distillation

    • sample trajectories from the student model and use a high-performing teacher to grade each token of each trajectory.

    • per-token reverse KL $$ \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{teacher}}\right) = \mathbb{E}_{x \sim \pi_\theta} \Big[ \log \pi_\theta(x_{t+1}\mid x_{1:t}) - \log \pi_{\text{teacher}}(x_{t+1}\mid x_{1:t}) \Big] $$
    • reward function: the negative per-token reverse KL, so RL training minimizes the divergence from the teacher
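The per-token reward can be sketched numerically. Everything below (shapes, logits, token counts) is toy data of my own, not from the post; the point is that tokens are sampled from the *student* and scored by the teacher:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_softmax(logits):
    # numerically stable log-softmax over the vocab axis
    z = logits - logits.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))

# Toy rollout: T sampled positions, vocab of size V (illustrative sizes).
T, V = 5, 8
student_logits = rng.normal(size=(T, V))
teacher_logits = rng.normal(size=(T, V))

# Tokens are sampled from the student's own policy (on-policy).
student_logp = log_softmax(student_logits)
tokens = np.array([rng.choice(V, p=np.exp(lp)) for lp in student_logp])

teacher_logp = log_softmax(teacher_logits)

# Monte Carlo estimate of the per-token reverse KL at the sampled tokens:
# log pi_student(x_t | context) - log pi_teacher(x_t | context).
idx = np.arange(T)
per_token_kl = student_logp[idx, tokens] - teacher_logp[idx, tokens]

# The per-token reward is its negative: high when the teacher also
# assigns high probability to what the student sampled.
reward = -per_token_kl
```

Because the expectation is over the student's own samples, tokens the student never produces contribute nothing, which is exactly what avoids the OOD drift problem described above.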
  • Exp

    • Qwen3-8B-Base Math Task

    • 400k SFT

    | Method | AIME'24 | Teacher FLOPs | Student FLOPs | Cost vs SFT-2M |
    | --- | --- | --- | --- | --- |
    | Initialization: SFT-400K | 60% | 8.5 × 10^20 | 3.8 × 10^20 | - |
    | SFT-2M (extrapolated) | ~70% (extrapolated) | 3.4 × 10^21 | 1.5 × 10^21 | - |
    | Reinforcement learning | 68% | - | - | ≈1× |
    | On-policy distillation | 70% | 8.4 × 10^19 | 8.2 × 10^19 | 9-30× cheaper |
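As a quick sanity check on those numbers (my own arithmetic, summing teacher + student FLOPs for each method):

```python
# Total compute for extrapolated SFT-2M vs on-policy distillation,
# taken from the table above (teacher FLOPs + student FLOPs).
sft_2m_total = 3.4e21 + 1.5e21   # SFT-2M (extrapolated)
opd_total    = 8.4e19 + 8.2e19   # on-policy distillation

speedup = sft_2m_total / opd_total
print(round(speedup, 1))  # ≈ 29.5
```

That ratio lines up with the high end of the quoted 9-30× range; the lower end presumably corresponds to a different accounting (e.g. student FLOPs only).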
  • Exp

    • tulu3 internal assistant
    • We could alternate between phases of fine-tuning on new data and distillation to recover behavior, allowing the model to keep learning and stay up to date on new knowledge over time.
    • https://arxiv.org/abs/2009.04416
    | Model | Internal QA Eval (Knowledge) | IF-eval (Chat) |
    | --- | --- | --- |
    | Qwen3-8B | 18% | 85% |
    | + midtrain (100%) | 43% | 45% |
    | + midtrain (70%) | 36% | 79% |
    | + midtrain (70%) + distill | 41% | 83% |
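The alternating schedule can be sketched as a loop; the functions and the dict "model" below are placeholders of my own, standing in for real midtraining and distillation steps:

```python
def midtrain(model, docs):
    # Placeholder: fine-tune on new documents (adds knowledge,
    # but in the experiment above this eroded chat behavior).
    return model | {"knowledge": model["knowledge"] + len(docs)}

def distill(student, teacher):
    # Placeholder: on-policy distillation from a frozen earlier
    # checkpoint to recover instruction-following behavior.
    return student | {"chat": teacher["chat"]}

model = {"knowledge": 0, "chat": 1.0}
behavior_anchor = dict(model)  # frozen earlier checkpoint as teacher

for docs in [["d1"], ["d2", "d3"]]:          # arriving batches of new data
    model = midtrain(model, docs)            # phase 1: learn new knowledge
    model = distill(model, behavior_anchor)  # phase 2: recover behavior
```

The key design point is that the distillation teacher is an earlier version of the model itself, so no external teacher is needed to keep chat quality intact across updates.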

Written by Yiran