RL[1/n]
2026-01-12
On-Policy Distillation (Thinking Machines Lab)
[Figure: from a shared prompt, two training signals compared: an RLVR rollout vs. SFT logit distillation]

- Problem with off-policy distillation (SFT on teacher outputs)
    - OOD: the student trains only in the teacher's contexts, never its own
    - when tested out of distribution, its errors compound, "diverging ever farther from the states it observed in training"
    - it imitates the teacher's style and confidence but not necessarily its factual accuracy
OPD: on-policy distillation
- sample trajectories from the student model and use a high-performing teacher to grade each token of each trajectory

- per-token reverse KL: $$ \mathrm{KL}\left(\pi_\theta \,\|\, \pi_{\text{teacher}}\right) = \mathbb{E}_{x \sim \pi_\theta} \Big[ \log \pi_\theta(x_{t+1}\mid x_{1:t}) - \log \pi_{\text{teacher}}(x_{t+1}\mid x_{1:t}) \Big] $$
- reward function: the negative reverse KL, so maximizing reward minimizes the per-token reverse KL
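A minimal sketch of how the per-token reverse KL above could be estimated on a student-sampled trajectory (pure Python for clarity; the function names are mine, and real implementations operate on batched logit tensors):

```python
import math

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(l - m) for l in logits))
    return [l - lse for l in logits]

def per_token_reverse_kl(student_logits, teacher_logits, sampled_ids):
    """Monte-Carlo estimate of the per-token reverse KL on a trajectory
    sampled from the student: log pi_theta(x_t) - log pi_teacher(x_t),
    evaluated at each token the student actually emitted."""
    out = []
    for s_row, t_row, tok in zip(student_logits, teacher_logits, sampled_ids):
        s = log_softmax(s_row)[tok]
        t = log_softmax(t_row)[tok]
        out.append(s - t)
    return out  # negate and average over tokens for the reward/loss
```

Because the expectation is taken over the student's own samples, evaluating the log-probability difference at each sampled token is an unbiased per-token estimate; it is zero wherever the teacher agrees with the student and large where the teacher finds the student's token unlikely.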
Exp
Qwen3-8B-Base, math task, initialized with SFT on 400k prompts (SFT-400K)

| Method | AIME'24 | Teacher FLOPs | Student FLOPs | CE (compute efficiency) vs. SFT-2M |
|---|---|---|---|---|
| Initialization: SFT-400K | 60% | 8.5 × 10²⁰ | 3.8 × 10²⁰ | – |
| SFT-2M (extrapolated) | ~70% (extrapolated) | 3.4 × 10²¹ | 1.5 × 10²¹ | 1× |
| Reinforcement learning | 68% | – | – | ≈1× |
| On-policy distillation | 70% | 8.4 × 10¹⁹ | 8.2 × 10¹⁹ | 9–30× |

Exp
Tulu3-style internal assistant
- we could alternate between phases of fine-tuning on new data and distillation to recover assistant behavior, allowing the model to learn and stay up to date on knowledge over time
- https://arxiv.org/abs/2009.04416
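A hypothetical sketch of that alternating cycle (the `midtrain_step` / `distill_step` callables are placeholders of my own, not an API from the post):

```python
def continual_update(student, teacher, document_batches,
                     midtrain_step, distill_step):
    """Alternate knowledge injection and behavior recovery.

    Each round: midtrain on a batch of new documents (adds knowledge,
    tends to degrade chat behavior), then distill against a frozen
    teacher checkpoint to restore that behavior.
    """
    for docs in document_batches:
        student = midtrain_step(student, docs)    # learn new facts
        student = distill_step(student, teacher)  # recover assistant behavior
    return student
```

The design choice this encodes: distillation targets a fixed earlier checkpoint of the model itself (the "teacher"), so chat behavior can be recovered repeatedly without re-collecting SFT data after each midtraining phase.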
| Model | Internal QA eval (knowledge) | IF-eval (chat) |
|---|---|---|
| Qwen3-8B | 18% | 85% |
| + midtrain (100%) | 43% | 45% |
| + midtrain (70%) | 36% | 79% |
| + midtrain (70%) + distill | 41% | 83% |