
From Trainer to Runtime Loop

[Figure: Core Runtime Loop]

Data flow: Agent App → Rollout Service → Trajectory Store → Reward / Judge → Trainer → Weight Sync → Rollout Service
Edge labels in the diagram: request / action, trajectory + logprob, episodes, reward / advantage, new policy, publish policy

Application / Agent Layer: Online Trace; Task / Tool / Env State (file system · sandbox · browser)
RL Orchestration: Scheduler / Queue; Observability; Environment Service; Teacher / Distillation
Model Runtime: Training Engine; Inference Engine (vLLM / SGLang); Checkpoint / Comm

Boundary change: Rollout handles generation forward; Trainer handles policy update (batch forward + backward).
Observability collects traces from Agent App, Trainer, Rollout Service, and other components (dashed lines).
Env State: the environment state an agent can read and write while executing a task (code repository, sandbox directory, browser page, etc.); it is not a workspace in the infra sense.
Trainer and Rollout on the main chain are orchestration roles; underneath, they run on the Training Engine and the Inference Engine respectively (gray dashed lines: Trainer → Training Engine, Rollout → Inference Engine).

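The system boundary expands from the trainer to the runtime. On the main chain, the Rollout Service and the Trainer are orchestration roles; the underlying Model Runtime provides the Inference Engine (generation forward) and the Training Engine (the forward / backward passes of the policy update, e.g. Megatron / FSDP). The stability of the training pipeline is determined by the data loop between services, and the Environment Service and the Teacher / Distillation Service become high-cost runtime services that need to be scheduled and observed. In the diagram, Env State refers to the environment state an agent can read and write while executing a task (file system, sandbox, browser page, etc.), not a workspace at the infra level such as Ray / K8s.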
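The data loop between services can be sketched in a few lines of Python. This is only a toy illustration under stated assumptions: every class and function name here (`RolloutService`, `TrajectoryStore`, `Trainer`, `run_loop`, `judge`) is hypothetical, generation and the policy update are stand-ins, and no real inference or training engine is called. It shows the order of hand-offs in the loop and one possible way to track which policy version produced each trajectory.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Trajectory:
    """One rollout record, tagged with the policy version that produced it."""
    actions: List[str]
    logprobs: List[float]
    policy_version: int
    reward: float = 0.0


@dataclass
class RolloutService:
    """Orchestration role; a real one would delegate generation to an inference engine."""
    policy_version: int = 0

    def generate(self, request: str) -> Trajectory:
        # Stand-in for generation forward: echo the request as a single action.
        return Trajectory(actions=[f"act:{request}"], logprobs=[-0.5],
                          policy_version=self.policy_version)

    def load_policy(self, version: int) -> None:
        # Stand-in for receiving published weights.
        self.policy_version = version


@dataclass
class TrajectoryStore:
    items: List[Trajectory] = field(default_factory=list)

    def put(self, traj: Trajectory) -> None:
        self.items.append(traj)

    def drain(self) -> List[Trajectory]:
        # Hand the whole buffer to the consumer and start a fresh one.
        batch, self.items = self.items, []
        return batch


@dataclass
class Trainer:
    """Orchestration role; a real one would run forward / backward on a training engine."""
    version: int = 0

    def update(self, batch: List[Trajectory]) -> int:
        # Stand-in for a policy update: bump the version only if there is data.
        if batch:
            self.version += 1
        return self.version


def run_loop(requests: List[str], steps: int,
             judge: Callable[[Trajectory], float]) -> List[Tuple[int, int]]:
    """Run the loop and return (batch size, rollout policy version) per step."""
    rollout, store, trainer = RolloutService(), TrajectoryStore(), Trainer()
    history = []
    for _ in range(steps):
        for req in requests:                  # Agent App -> Rollout Service
            store.put(rollout.generate(req))  # Rollout Service -> Trajectory Store
        batch = store.drain()
        for traj in batch:                    # Trajectory Store -> Reward / Judge
            traj.reward = judge(traj)
        new_version = trainer.update(batch)   # Reward / Judge -> Trainer
        rollout.load_policy(new_version)      # Trainer -> Weight Sync -> Rollout Service
        history.append((len(batch), rollout.policy_version))
    return history
```

Calling `run_loop(["a", "b"], steps=3, judge=lambda t: 1.0)` walks the loop three times with two requests per step. Storing `policy_version` on each trajectory is a design choice of this sketch, not something the diagram specifies; it is one way a trainer could tell whether a batch was generated by an older policy than the one being updated.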