OPD: On-policy / Online Policy Distillation

[Figure: three training pipelines compared]

1. RLVR: Student Rollout → Verifier / Reward → Policy Gradient Update
2. Offline Distillation: Teacher Data / Output → Supervised Distillation → Student Update (off-policy, distribution shift)
3. OPD (On-policy / Online Policy Distillation): Student Rollout → Teacher / Distillation Service → Dense Feedback → Student Update. Forms of dense feedback: token logits, chunk feedback, verifier-gated signal, multi-teacher signal

Caption: The difference is not whether there is a rollout, but where the feedback comes from and how dense it is. OPD: on-policy data generation + teacher dense feedback, in a single pipeline.

OPD (On-policy / Online Policy Distillation) merges on-policy data generation and dense teacher feedback into a single pipeline, and introduces a Teacher / Distillation Service together with multi-model inference orchestration.
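To make the single-pipeline idea concrete, here is a minimal sketch of one OPD step in plain Python. It is illustrative only: the names `opd_step`, `toy_student`, and `toy_teacher` are hypothetical, the models are stand-in functions that return logits over a 4-token vocabulary, and the per-token reverse KL is an assumed choice of dense feedback (the source names token logits as one feedback form but does not specify a loss).

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    """KL(student || teacher) at one token position (assumed loss choice)."""
    p = softmax(student_logits)
    q = softmax(teacher_logits)
    return sum(pi * (math.log(pi) - math.log(qi))
               for pi, qi in zip(p, q) if pi > 0)

def opd_step(student, teacher, prompt, max_new_tokens, rng):
    """Student rolls out its own tokens; teacher scores every position."""
    tokens = list(prompt)
    per_token_loss = []
    for _ in range(max_new_tokens):
        s_logits = student(tokens)   # on-policy: prefix was sampled by the student
        t_logits = teacher(tokens)   # dense feedback: teacher logits on the same prefix
        per_token_loss.append(reverse_kl(s_logits, t_logits))
        probs = softmax(s_logits)
        next_token = rng.choices(range(len(probs)), weights=probs, k=1)[0]
        tokens.append(next_token)
    return tokens[len(prompt):], per_token_loss

# Hypothetical toy models over a 4-token vocabulary.
VOCAB = 4

def toy_student(tokens):
    return [0.0] * VOCAB             # uniform policy

def toy_teacher(tokens):
    logits = [0.0] * VOCAB
    logits[len(tokens) % VOCAB] = 2.0  # teacher prefers one token per position
    return logits

rng = random.Random(0)
rollout, losses = opd_step(toy_student, toy_teacher,
                           prompt=[1, 2], max_new_tokens=5, rng=rng)
mean_loss = sum(losses) / len(losses)
print(rollout, mean_loss)
```

Unlike RLVR, where the rollout would receive a single verifier reward, every sampled position here gets its own training signal. Unlike offline distillation, the prefixes the teacher scores come from the student's own distribution. In a real system the gradient of `mean_loss` would drive the student update, and the teacher call would go to a separate inference service.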