Reading Notes on the MAI-Thinking-1 Technical Report

Published 2026-06-05, updated 2026-06-06. Category: Notes. Length: 23k characters; reading time ≈ 21 minutes.

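The technical report is titled MAI-Thinking-1: Building a Hill-Climbing Machine.

There is also an official introduction page, Introducing MAI-Thinking-1, and a MAI-Thinking-1 Model Card.

In one sentence: the report is organized around what Microsoft AI calls a hill-climbing machine. It covers data, the MoE architecture, the training system, the RL recipe, agent environments, evaluation, and safety red-teaming, and its focus is the system for iterating on model capability rather than the benchmark results of a single model.

MAI-Thinking-1 itself is a sparse MoE reasoning model with 35B active / 1T total parameters. The report stresses that it was trained from scratch, without distillation from third-party models. Pre-training uses 30T tokens drawn from public and licensed human-generated content, and no language-model-generated synthetic data is used during pre-training.

One distinction matters here: "no distillation" in the report mainly means that capabilities are not inherited from third-party models. Inside the RL stage, self-distillation is still used to recover or continue the RL climb.

These notes cover the following:

Model architecture: the MoE, attention, LatentMoE, and dropless MoE of MAI-Base-1;
Training infrastructure: the YOLO training framework, hardware cluster, scheduling, and operations;
Pre-training: scaling ladder, data composition, and training phases;
Post-training / RL: RL climb, and the data and training for STEM, agentic, and helpfulness-safety;
Evaluation: benchmarks, human preference evaluation, and safety evaluation.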

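Model architecture

The base model of MAI-Thinking-1 is called MAI-Base-1. It is a decoder-only Transformer with sparse MoE.

Core configuration:

Item	As reported
Model size	35B active / 1T total
Architecture	decoder-only Transformer
Layers	78 layers
FFN	dense FFN and MoE layers alternate
MoE	LatentMoE
Experts	512 experts
Experts activated per token	top-8
Attention	repeating cycle of 5 local attention layers + 1 global attention layer
Local window	512
GQA	8 KV heads
Tokenizer	o200k_base, vocab size 200019
Context	extended to 256K after mid-training

This structure mainly serves training efficiency, inference efficiency, and stability in large-scale training.

The attention stack does not use full attention in every layer. Instead, 5 local attention layers are paired with 1 global attention layer. Local attention uses RoPE with a window size of 512; global attention uses no position encoding. This lowers attention compute during training and shrinks the KV cache at inference, because a local layer only needs to cache the most recent 512 positions.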
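The KV-cache saving from the 5:1 layout can be estimated with a short calculation. This is a rough sketch, assuming that all 78 layers have an attention block, that every layer uses the same number of KV heads and head dimension, and that a local layer caches at most 512 positions; none of these details are confirmed by the report, and the function name is only illustrative.

```python
def kv_cache_ratio(context_len, n_layers=78, cycle=6, window=512):
    """Cached KV positions in the hybrid layout, relative to full attention in every layer."""
    n_global = n_layers // cycle      # one global layer per cycle of 6 -> 13 layers
    n_local = n_layers - n_global     # the other 65 layers use a sliding window
    hybrid = n_global * context_len + n_local * min(context_len, window)
    full = n_layers * context_len
    return hybrid / full


for ctx in (512, 16_384, 262_144):
    print(ctx, round(kv_cache_ratio(ctx), 3))
```

Under these assumptions the ratio approaches 1/6 as the context grows, since the global layers dominate the cache at long context.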

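The FFN part alternates dense FFN layers with MoE layers. Instead of MoE in every layer, it pairs high-sparsity MoE layers with zero-sparsity dense layers. According to the report, this interleaved layout scales similarly to a more uniform medium-sparsity MoE but achieves better wall-clock training efficiency.

The MoE uses LatentMoE: a shared down-projection is applied first, followed by all-to-all dispatch. Routing is still computed from the original representation, and each compressed representation is routed to 8 of the 512 experts. The report presents this as a design for reducing expert communication and compute cost.

Another engineering detail is dropless MoE. Many MoE implementations set an expert capacity and drop tokens that exceed it. The report says the team eventually converged on fully dropless MoE, with support for variable-message-size all-to-all and bounded memory usage. This design bears on training stability: when tokens are dropped, routing and load-balancing outcomes depend on the capacity setting.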
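The difference between capacity-capped and dropless dispatch can be shown on a toy example. This is a minimal sketch with top-1 routing and made-up assignments, not the YOLO implementation; `dispatch` and its arguments are hypothetical names.

```python
def dispatch(assignments, n_experts, capacity=None):
    """Group token indices by expert. With a capacity, overflow tokens are dropped."""
    buckets = {e: [] for e in range(n_experts)}
    dropped = []
    for token, expert in enumerate(assignments):
        if capacity is not None and len(buckets[expert]) >= capacity:
            dropped.append(token)          # capacity-capped: token is skipped by the MoE layer
        else:
            buckets[expert].append(token)  # dropless: per-expert message size simply grows
    return buckets, dropped


assignments = [0, 0, 0, 0, 1, 2, 0, 3]     # toy routing in which expert 0 is overloaded
capped, dropped = dispatch(assignments, n_experts=4, capacity=2)
dropless, none_dropped = dispatch(assignments, n_experts=4)
print(len(dropped), len(dropless[0]))
```

In the dropless case the per-expert message sizes are uneven, which is why variable-message-size all-to-all and a bound on memory usage become necessary.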

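Training infrastructure

This part covers the YOLO training framework, the underlying hardware cluster, the Kubernetes/Kueue/Ray scheduling and runtime stack, and the stability and observability systems built around goodput.

YOLO training framework

The training system in the report is called YOLO, short for You Only Launch Once.

YOLO is Microsoft AI's in-house large-scale training framework. It is built on PyTorch and covers pre-training, mid-training, SFT, and RL training. It implements the core pieces of the training loop: model definition, sharding, optimizer, dataloader, and checkpointing.

The report does not describe any implementation relationship between YOLO and DeepSpeed. Both address the same large-scale training problems (ZeRO, MoE, sharding, checkpointing), but that overlap does not show whether YOLO reuses DeepSpeed.

Training-system capabilities listed in the report:

custom FP8 GEMM, Grouped GEMM, and quantization kernels;
MoE support for dropless and capacity-capped modes, several load-balancing strategies, and router replay;
pipeline overlap of expert dispatch / compute / collect;
activation checkpointing plus activation offloading;
bitwise reproducibility;
checkpoints that save model weights, optimizer state, FP8 scaling history, dataloader progress, and RNG state;
goodput, not just MFU, treated as a production KPI.

The report states that MAI-Base-1 pre-training reached 90.0% goodput on 8K GPUs. Unlike MFU alone, goodput is affected by crash loops, node failures, link flaps, OOMs, checkpoint stalls, recomputation, slow startup, scheduling delays, and MFU drops after recovery.

From an infrastructure perspective, this shows that the report cares about more than single-step training throughput: fault recovery, reproducibility, checkpointing, and effective utilization over a long training run all count.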

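The report defines goodput as:

goodput = ideal training duration / actual wall-clock duration

Here, ideal training duration is the time needed to reach the same training progress if the run stepped steadily at its target state the whole time, with no failures, recomputation, startup waits, checkpoint stalls, MFU drops, or other overhead. Actual wall-clock duration is the wall-clock time actually spent from the start of training to its completion.

It can also be decomposed by overhead:

actual wall-clock duration = ideal training duration + total overhead

goodput = ideal training duration / (ideal training duration + total overhead)

Therefore:

overhead ratio = total overhead / actual wall-clock duration = 1 - goodput

The 90.0% goodput of MAI-Base-1 means that roughly 90% of the wall-clock time consumed corresponds to ideal training progress and roughly 10% is overhead of various kinds. The report further splits overhead into categories such as recomputation, non-stepping time, and MFU drop overhead.

The relationship to MFU is as follows. MFU measures how much of the GPUs' theoretical compute is used by model computation while the job is stepping. Goodput measures how much of the end-to-end wall-clock time turns into useful training progress. A run can therefore have high MFU and still have low goodput if it frequently fails, restarts, recomputes, or stalls on checkpoints.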

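The report also discloses several training-system metrics:

MAI-Base-1 pre-training reached 90.0% goodput on 8K GPUs;
total overhead was reduced to 51 hours;
recomputation was 6.5 hours, 15% of overhead;
non-stepping time was 14 hours, 27% of overhead;
MFU drop overhead was 18 hours, 35% of overhead, the largest remaining overhead category in the final run;
during architecture evolution, MFU on the early GB200 NVL64 setup rose from 18% to 22%;
GPU Direct RDMA gave roughly a 1.1x improvement in end-to-end training time;
a custom block-sparse attention backend gave roughly a 1.06x improvement in step time;
ZeRO-2 related optimizations gave roughly a 1.03x end-to-end improvement;
a custom Triton expert encode kernel raised HBM utilization from about 10% to about 80% and gave roughly a 1.03x end-to-end improvement.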
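The goodput definition and the overhead figures above can be cross-checked with a few lines of arithmetic. This is a sketch that assumes the 90.0% goodput and the 51 hours of total overhead describe the same run, which these notes do not confirm; the function names are illustrative.

```python
def goodput(ideal_hours, overhead_hours):
    return ideal_hours / (ideal_hours + overhead_hours)


def implied_ideal_hours(goodput_value, overhead_hours):
    # Solve goodput = ideal / (ideal + overhead) for ideal.
    return goodput_value * overhead_hours / (1.0 - goodput_value)


overhead = 51.0
ideal = implied_ideal_hours(0.90, overhead)
wall_clock = ideal + overhead

# Share of total overhead per category, using the hours quoted above.
hours = {"recomputation": 6.5, "non_stepping": 14.0, "mfu_drop": 18.0}
shares = {name: h / overhead for name, h in hours.items()}
print(round(ideal), round(wall_clock), {k: round(v, 3) for k, v in shares.items()})
```

Under that assumption the run would correspond to roughly 510 wall-clock hours. The computed shares match the quoted 27% and 35%, while 6.5 hours works out to about 13% rather than 15%, so either the hours or the percentage for recomputation may be rounded or measured against a different total.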

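On the inference deployment side, the report says that MAI-Thinking-1 running on MAIA-200 delivers more than 40% higher token generation throughput than a GB200-based deployment under the same rack power budget. This is a performance-per-watt figure, not peak throughput of a single accelerator.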

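Hardware and cluster architecture

The main training hardware for MAI-Thinking-1 is a single-site GB200 cluster operated by Microsoft inside the Microsoft Azure platform. The information here comes mainly from Section 6 (Cluster Environment) and Appendix K (Cluster Environment Details) of the report; external links are used only for background on hardware, networking, and scheduling components.

The report does not say that Azure Machine Learning was used. What it discloses is that MAI-Base-1 was pre-trained from scratch on a Microsoft-operated cluster within the Azure platform, starting from 8K GB200 GPUs, and that the scheduling and runtime stack includes Kubernetes, Kueue, the MAI control plane, Ray, NCCL, and the YOLO training framework. The rest of these notes treat this as Azure-based first-party infrastructure.

Figure. System architecture sketch compiled from the report text.

In the figure, GHR stands for guest health report: health information reported from inside a node's runtime environment, used to help decide whether the node should enter vendor maintenance. NPD is the Kubernetes Node Problem Detector, which reports node-side anomalies as Kubernetes node conditions or events. An admitted job is a training job that has passed cluster admission and scheduling and has been granted compute resources.

Hardware as described in the report:

Purpose	Hardware
Main pre-training run	8K NVIDIA GB200 GPUs
Pre-training / Mid-training 1	8,192 GB200 GPUs
Mid-training 2	4,096 GB200 GPUs
Early prototypes and experiments	H100
Development, validation, profiling, next-generation bring-up	H100 / GB200 / GB300
Inference deployment optimization	Microsoft MAIA-200

The GB200 and GB300 clusters are deployed in Microsoft first-party datacenters, which can be read as Azure datacenter resources that Microsoft owns or operates directly.

The report also mentions that these clusters are exposed to MAI through custom images maintained jointly with the Azure team. Given the context in Appendix K (cluster provisioning, node lifecycle, certification, and telemetry), these custom images look like node-level operating system images that may bundle the OS, GPU driver, RDMA/NCCL components, and diagnostic and telemetry agents. The report does not state whether they are OS images or container images.

The main training run is placed on a single logical cluster at one site, mainly to reduce experimental variance: one accelerator generation, stable rack health, stable scheduler behavior, and a predictable storage path.

In terms of hardware topology, GB200 / GB300 systems are deployed in rack-scale NVL72 units:

each rack is one 72-GPU NVLink domain;
NVLink / NVSwitch handles high-bandwidth scale-up communication inside the rack;
scale-out communication between racks uses InfiniBand RDMA;
for training stability, the report actually uses 64 GPUs per rack (NVL64), keeping spare capacity to tolerate node failures and unhealthy devices;
H100 systems are still used in lab environments, as 8-GPU nodes with node-local NVLink/NVSwitch and InfiniBand across nodes.

A simplified picture: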

GB200 rack / NVL72
  72 GPUs in one NVLink domain
  report training placement uses 64 GPUs per rack (NVL64)

multiple racks
  connected by InfiniBand RDMA

large training job
  keep expert all-to-all inside NVL64
  use cross-rack InfiniBand mainly for data parallel communication

This topology also shapes the model parallelism strategy. To improve GEMM efficiency, MAI-Base-1 chooses:

expert parallelism, EP = 64;
tensor parallelism, TP = 1;
expert all-to-all communication kept inside the NVL64 domain;
cross-rack InfiniBand used for data-parallel communication such as parameter all-gather and gradient reduce-scatter;
EP=64 + ZeRO-2 for pre-training and mid-training 1;
ZeRO-3 / FSDP enabled for mid-training 2;
context parallelism used in the mid-training phases.
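The layout above can be made concrete by mapping ranks to racks. This is a sketch under the assumption that ranks are numbered contiguously within a rack and that one expert-parallel group fills exactly one NVL64 placement; the report does not describe rank numbering, and the helper names are hypothetical.

```python
GPUS = 8192   # pre-training / mid-training 1
EP = 64       # expert-parallel degree, one NVL64 placement per rack
TP = 1        # no tensor parallelism


def placement(rank):
    """Map a global rank to (rack index, index inside the rack). Layout is an assumption."""
    return rank // EP, rank % EP


def ep_group(rank):
    """Ranks that exchange tokens via expert all-to-all: all inside one rack."""
    rack, _ = placement(rank)
    return list(range(rack * EP, (rack + 1) * EP))


def dp_group(rank):
    """Ranks at the same position in every rack: they communicate over InfiniBand."""
    _, local = placement(rank)
    return list(range(local, GPUS, EP))


n_racks = GPUS // (EP * TP)
print(n_racks, ep_group(70)[:2], dp_group(70)[:3])
```

With these numbers, 8,192 GPUs correspond to 128 NVL64 placements, so each expert shard has 128 data-parallel replicas whose gradient traffic crosses racks.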

Cluster partitioning and logical clusters

Appendix K contains a key statement: each site is partitioned into multiple Kubernetes clusters, typically one Kubernetes cluster per datacenter building.

This can be read as:

physical site / datacenter campus
  datacenter building A -> Kubernetes cluster A
  datacenter building B -> Kubernetes cluster B
  datacenter building C -> Kubernetes cluster C

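This partitioning does not mean that a training job can only run inside one building. It means the physical infrastructure is organized into several Kubernetes management units along building boundaries. The report also says that large jobs can cross the boundary of a single Kubernetes cluster when needed: nodes are universally routable across the compute environment, and workload pods use host networking to reduce overlay network overhead.

The layers involved can be separated as follows:

Layer	Meaning
site	a physical datacenter site or datacenter campus
datacenter building	one datacenter building within a site, usually with its own power, cooling, network, and rack layout
Kubernetes cluster	a software-level resource management unit, usually partitioned by building
logical cluster	the logical resource pool exposed to training and scheduling, containing GPU nodes and CPU support nodes
fleet-wide view	unified visibility and operations across clusters and scheduler backends

As noted above, the main training run sits on a single logical cluster at one site to reduce experimental variance. That logical cluster should not be equated with a single Kubernetes cluster, since Appendix K also says a site is usually split into multiple Kubernetes clusters by datacenter building. The report therefore does not disclose whether all 8K GB200 GPUs sit inside one Kubernetes cluster.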

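Scheduling and control plane

The control plane and scheduling architecture can be broken down as follows:

Kubernetes maintains cluster state;
Kueue handles quota, admission, priority, preemption, and topology-aware placement;
the MAI cluster-local control plane manages reservations, rack topology, quota coherence, and scheduling-readiness gates;
Ray runs the distributed runtime inside admitted jobs;
MAI drivers turn the topology produced by scheduling into actor placement, communication groups, and NCCL clique configuration.

The MAI control plane is not the name of a public Kubernetes plugin disclosed in the report. Based on Appendix K.3, it is better understood as a set of MAI-internal cluster-local controllers. They do not replace the scheduler; they maintain the state the scheduler needs, such as reservations, rack topology, quota coherence, and scheduling-readiness gates. The report does not say whether they are implemented as Kubernetes CRDs/controllers, scheduler plugins, or other internal components.

Kueue handles admission and topology-aware placement, while the MAI cluster-local control plane maintains the state that Kueue and the scheduler rely on, for example rack reservations, topology labels, and scheduling-readiness gates. The report specifically calls out the risk of rack fragmentation: if many small jobs randomly fill different racks, later large jobs struggle to obtain contiguous, topologically compact capacity. To counter this, the cluster-local control plane maintains soft rack reservations. Queues can have preferred racks; idle capacity can be borrowed, and reserved racks are reclaimed through reclaimWithinCohort when needed.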

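Ray runtime and training jobs

Ray is the distributed runtime inside an admitted job. After Kueue completes admission and placement, MAI drivers turn the resulting topology into Ray actor placement, communication groups, and NCCL clique configuration.

Different job types need different actors:

pre-training jobs mainly require strict learner availability;
RL jobs contain several kinds of actors, such as learners, inference servers, rollout workers, and routers;
MAI drivers monitor actor liveness, coordinate the training loop, and maintain checkpoint consistency across asynchronous components.

This is especially relevant to RL training. RL is not a single synchronous training loop; learners, inference services, rollouts, and rewards / graders all work together. The report places Ray inside admitted jobs. On that description, Ray's scope is mainly within the job, where it handles actor orchestration and runtime management, while cluster-level queueing, admission, and resource allocation remain with Kubernetes, Kueue, and the MAI control plane.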

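Certification and node lifecycle

The report stresses that physical topology and hardware health are first-class scheduling state. A node is not usable merely because it has been provisioned; it must pass certification. The purpose of certification is to keep bad nodes, degraded links, marginal storage, and silent-corruption risks out of the production training pool.

Certification proceeds in stages:

Stage	What is checked
single-node diagnostics	GPU, CPU cores, HCA, NVLink links, main memory
rack-level collectives	multi-node communication inside the rack and NVLink / NVSwitch behavior, checked with NCCL collectives
cross-rack InfiniBand validation	cross-rack paths, rails, leaf groups, spine-layer path diversity, and RDMA performance

The node lifecycle is shown in Figure 26 of the report:

Figure 26. Node lifecycle from MAI-Thinking-1 technical report

The state transitions can be read directly:

a new or repaired node first enters Init, and the certification controller then moves it to Certifying;
if certification passes, the node becomes Available; if it fails, the node becomes Impaired;
an Available node becomes Impaired after an NPD condition or a manual drain;
a false positive can return a node from Impaired to Available;
if automatic repair is needed, the node moves from Impaired to Auto Remediating (for example auto reboot, reset, or soft drain) and then back to Init for recertification;
if automatic repair is not enough, the node enters Vendor Maintenance through a GHR;
after a successful vendor repair the node enters Repaired and then returns to Init; if the repair fails, the node is Decommissioned.

Impaired means the node has been judged unhealthy or unfit to stay in the production training pool. Auto Remediating means the system first attempts an automatic fix such as a reboot, soft drain, or reset. The GHR (guest health report) usually comes from the guest OS or a node-level agent and supplements the platform-side hardware health signals. Runtime monitoring triggers state transitions based on signals such as NPD conditions, XID errors, ECC thresholds, NVLink degradation, InfiniBand link flaps, and storage faults. A repaired node never returns directly to Available; it goes back to Init and is certified again.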
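The lifecycle described above maps naturally onto a transition table. This is a sketch of how such a state machine might be encoded, following the transitions as summarized in these notes; the event names are hypothetical, and the choice to issue the GHR from the Impaired state is an assumption, since the notes do not say which state that edge starts from.

```python
TRANSITIONS = {
    ("Init", "start_certification"): "Certifying",
    ("Certifying", "cert_pass"): "Available",
    ("Certifying", "cert_fail"): "Impaired",
    ("Available", "npd_condition"): "Impaired",
    ("Available", "manual_drain"): "Impaired",
    ("Impaired", "false_positive"): "Available",
    ("Impaired", "auto_remediate"): "AutoRemediating",
    ("AutoRemediating", "remediated"): "Init",      # always recertify after a fix
    ("Impaired", "ghr"): "VendorMaintenance",       # assumed source state for the GHR edge
    ("VendorMaintenance", "repair_ok"): "Repaired",
    ("VendorMaintenance", "repair_fail"): "Decommissioned",
    ("Repaired", "recertify"): "Init",
}


def step(state, event):
    """Return the next state, or raise if the transition is not allowed."""
    try:
        return TRANSITIONS[(state, event)]
    except KeyError:
        raise ValueError(f"illegal transition: {state} --{event}-->") from None


def run(state, events):
    for event in events:
        state = step(state, event)
    return state


print(run("Init", ["start_certification", "cert_pass", "npd_condition", "auto_remediate", "remediated"]))
```

The table has no edge from Repaired or AutoRemediating straight to Available, which encodes the rule that every repaired node passes through certification again.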

The fault-handling sequence can be summarized as:


sequenceDiagram
participant Telemetry as Telemetry / Runtime Monitoring
participant Controller as Certification / Remediation Controller
participant Scheduler as Scheduler / Kueue
participant Node as Node
participant Vendor as Vendor / Datacenter Maintenance

Telemetry->>Controller: XID / ECC / NVLink / IB / storage fault
Controller->>Scheduler: mark node unschedulable / drain
Controller->>Node: attempt auto remediation

alt transient issue fixed
  Node-->>Controller: reboot / reset / soft drain succeeds
  Controller->>Node: reset to Init
  Controller->>Node: run certification
  Controller->>Scheduler: mark node Available
else persistent hardware issue
  Controller->>Vendor: GHR / guest health report
  Vendor->>Node: repair or replace
  Node-->>Controller: repaired
  Controller->>Node: reset to Init
  Controller->>Node: re-run certification
  Controller->>Scheduler: mark node Available if cert-pass
else maintenance fail
  Vendor->>Controller: maintenance fail
  Controller->>Scheduler: decommission node
end

Telemetry and observability

The report puts observability inside the control loop rather than treating it as dashboards only. Hardware telemetry, fabric health, storage behavior, scheduling state, and job progress determine whether capacity is admitted, drained, remediated, or returned to service.

Hardware health signals include:

GPU XID;
ECC;
thermals, power, clock throttling;
NVLink state, NVLink bit-error rate, chip-to-chip links;
InfiniBand device state;
local NVMe health;
PCIe errors;
driver state.

These signals are converted into Kubernetes node conditions and then feed the scheduling, triage, drain, and remediation controllers.

Job observability spans Kueue, Kubernetes, Ray, training logs, and experiment metadata. The report says operators can view, by namespace, pod, job, and restart index, the queue, priority, admission state, node placement, worker readiness, restart count, training configuration, step progress, and scoped logs. This makes it possible to tell apart scheduling delays, runtime failures, node failures, storage degradation, and application-level stalls.

Telemetry storage and querying are also layered:

System	Purpose
Datadog	near-real-time metrics and log search
Azure Managed Prometheus	in-cluster / cross-cluster time-series collection
Azure Data Explorer	long-retention logs, metrics, storage telemetry, cluster state
Azure Monitor	resource and Prometheus alerts

The hardware architecture is therefore more than "8K GB200 GPUs". It is a system designed around usable training capacity: NVLink inside the rack provides high-bandwidth local communication, InfiniBand between racks scales to many racks, the scheduler tries to preserve locality, and certification and telemetry keep bad nodes, bad links, and silent corruption out of the training pool.

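References

The facts in this section come mainly from Section 6 (Cluster Environment) and Appendix K (Cluster Environment Details) of the MAI-Thinking-1 technical report. The external links below are background references that explain the hardware, networking, and scheduling components mentioned in the report; they are not independent sources for MAI training details.

Link	Description
Microsoft Azure	Azure platform background
NVIDIA GB200 NVL72	GB200 / NVL72 hardware form factor
NVIDIA GB300 NVL72	GB300 / NVL72 hardware form factor
NVIDIA NVLink	background on NVLink / NVSwitch high-bandwidth interconnect
NVIDIA InfiniBand	background on InfiniBand / RDMA networking
Kubernetes	background on Kubernetes cluster state and control plane
Kueue	background on Kubernetes batch queueing, admission, and quota management
Ray	distributed runtime background
NCCL	background on the GPU collective communication library
Azure Maia	background on the Microsoft Maia series; the report does not disclose MAIA-200 chip specifications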

Pre-training

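This part covers the scaling ladder, the pre-training data composition, and the pre-training / mid-training phases.

Scaling ladder

The report repeatedly emphasizes the scaling ladder: architecture and data decisions are not judged on a single small-scale experiment but on whether the gain holds consistently as scale increases.

The team uses the scaling ladder for architecture and data ablations: models of different sizes are trained at a fixed number of tokens per active parameter (TPP), and the resulting scaling curves are compared. Most architecture ablations run at 100-200 TPP, which the report treats as close to Chinchilla-optimal, while the main run over-trains to 500-1000 TPP so that the model is better suited to inference-heavy use.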
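The TPP of the main run can be checked against the figures quoted elsewhere in these notes. This is a small sketch assuming TPP is counted against active parameters (35B) rather than total parameters; the function name is illustrative.

```python
def tokens_per_active_param(tokens, active_params):
    return tokens / active_params


tpp = tokens_per_active_param(30e12, 35e9)   # 30T tokens, 35B active parameters
print(round(tpp))
```

The result, roughly 857, falls inside the 500-1000 TPP over-training range described above.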

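The premise of this method is that an improvement on small models does not necessarily transfer to large models, and a data mixture that wins at small scale may not keep its ranking at large scale. The report therefore puts scalable validation at the center of pre-training decisions.

This is also part of the "hill-climbing machine": architecture, data, and training systems are all evaluated through the ladder plus efficiency gain, so that iteration can continue.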

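Pre-training data

MAI-Base-1 is pre-trained on 30T tokens. Data sources include:

web HTML;
web PDFs;
public GitHub code;
books and journals;
academic papers;
news;
multilingual text;
domain-specific materials.

The report discloses several data governance rules:

no open source training datasets are used;
no language-model-generated synthetic data is used for pre-training;
AI-generated content is removed from the collected sources as far as possible;
common machine learning data sites and repositories, such as huggingface.co, are excluded;
private customer data from Microsoft products and services is not used unless the user explicitly opts in or an applicable agreement allows it;
PII-risk and safety filtering is applied to the whole corpus.

The knowledge cutoff dates disclosed in the report are fairly detailed:

Source family	Knowledge cut off date
Web HTML pages	September 2025
Web PDFs	December 2025
Public GitHub Code	June 2025
Books and journals	March 2026

Judging from what the report discloses, the pre-training data strategy of MAI-Base-1 emphasizes clean, licensed, and human-generated data.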

The pre-training data composition is as follows (token counts in trillions):

Source family	Unique tokens	Training tokens	Mix	Avg. epochs
Code	7.4T	16.4T	54.6%	2.22x
STEM	2.2T	4.7T	15.8%	2.17x
Math	0.3T	1.6T	5.4%	5.28x
Books and journals	0.6T	0.9T	3.1%	1.65x
PDFs	2.7T	1.4T	4.7%	0.53x
Web text	8.1T	4.5T	14.9%	0.55x
Multilingual (other)	8.1T	0.5T	1.6%	0.06x
Total	29.2T	30.0T	100.0%	1.03x

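Several figures stand out:

code accounts for 54.6% and is the largest data source;
STEM + Math together account for 21.2%, and Math has the highest average number of sampling epochs at 5.28x;
the available unique tokens of Web text and PDFs are not fully consumed, with average epochs of 0.55x and 0.53x respectively;
multilingual (other) is only 1.6% of the training mix, but the report notes that domain-specific multilingual data is counted under the other categories.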
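The Mix and Avg. epochs columns follow from the two token columns: mix is training tokens over the 30T total, and average epochs is training tokens over unique tokens. This is a sketch that recomputes both from the rounded values in the table, so small differences from the reported numbers are expected.

```python
ROWS = {  # source family: (unique tokens in T, training tokens in T), from the table above
    "Code": (7.4, 16.4),
    "STEM": (2.2, 4.7),
    "Math": (0.3, 1.6),
    "Books and journals": (0.6, 0.9),
    "PDFs": (2.7, 1.4),
    "Web text": (8.1, 4.5),
    "Multilingual (other)": (8.1, 0.5),
}


def mix_and_epochs(rows):
    """Return {family: (mix in percent, average epochs)}."""
    total_train = sum(train for _, train in rows.values())
    return {
        name: (100.0 * train / total_train, train / unique)
        for name, (unique, train) in rows.items()
    }


stats = mix_and_epochs(ROWS)
for name, (mix, epochs) in stats.items():
    print(f"{name}: {mix:.1f}% mix, {epochs:.2f}x epochs")
```

An average epoch count below 1, as for Web text and PDFs, means only part of the unique pool is sampled; a count above 1, as for Math and Code, means the same tokens are seen repeatedly.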

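Mid-training data still comes from the pre-training corpus and introduces no new synthetic source. The target mixture disclosed in the report is 55% code, 35% STEM/math, and 10% background sources.

The training phases are specified as follows: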

Phase	Tokens	Context length	GB200 GPUs
Pre-training	30T	16,384	8,192
Mid-training 1	3.4T	65,536	8,192
Mid-training 2	150B	262,144	4,096

In other words, MAI-Base-1 first completes the 30T-token main pre-training at 16K context, then extends the context to 256K through 3.55T tokens of mid-training (3.4T at 64K plus 150B at 256K).

Post-training / RL

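The post-training / RL part covers the general RL recipe, the STEM climb, agentic coding / tool use, helpfulness / safety, and the final consolidation.

RL climb

Pre-training and mid-training give the model predictive ability and knowledge. Problem-solving strategy, allocation of reasoning tokens, tool use, preferences, and safety constraints are handled mainly in the RL climb.

The RL of MAI-Thinking-1 starts from a checkpoint that has had no exposure to reasoning traces, with the goal of letting the model develop reasoning ability from zero. The report attributes the stable RL climb to three mechanisms:

two simple but critical adjustments to GRPO;
self-distillation, used to resume the climb after a crash or a base policy update;
infrastructure improvements that reduce the numerical mismatch between training and inference.

Three specialists are trained:

STEM / competitive code;
agentic coding / tool use;
helpfulness / safety.

These specialists are then consolidated into the single MAI-Thinking-1 model.

GRPO has two main changes.

The first is adaptive entropy control. Rather than adding an explicit entropy bonus, it adjusts the clipping bound online according to a target entropy. If entropy is too low, the upper bound is relaxed so that the policy can raise the probability of alternative tokens more aggressively; if entropy is high enough, the trust region is tightened.

The second is an outer ratio clip. In standard PPO / GRPO, some branches of the objective are not clipped when the update moves "in the correcting direction". The report finds that these unclipped branches cause catastrophic gradient-norm spikes, so it adds a hard outer clip.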
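One plausible reading of the two GRPO changes can be sketched per token. This is an illustration in the spirit of dual-clip PPO, not the report's formula: the notes do not give the exact equations, and every name, default value, and the choice to cap only large ratios are assumptions.

```python
def clipped_surrogate(ratio, advantage, eps_low=0.2, eps_high=0.2, outer_clip=3.0):
    """PPO/GRPO-style clipped objective for one token, with a hard outer ratio clip."""
    ratio = min(ratio, outer_clip)   # outer clip: bounds the branch the inner clip leaves open
    inner = max(1.0 - eps_low, min(ratio, 1.0 + eps_high))
    return min(ratio * advantage, inner * advantage)


def update_eps_high(eps_high, entropy, target_entropy, step=0.01, lo=0.1, hi=0.4):
    """Widen the upper clip bound when entropy is below target, tighten it otherwise."""
    eps_high += step if entropy < target_entropy else -step
    return max(lo, min(hi, eps_high))


# With a negative advantage and a very large ratio, the standard objective is ratio * advantage,
# which is unbounded; the outer clip caps it.
print(clipped_surrogate(10.0, -1.0), clipped_surrogate(10.0, 1.0))
print(update_eps_high(0.2, entropy=0.1, target_entropy=0.3))
```

In the first call the inner clip alone would return -10.0, since `min` selects the unclipped term; the outer clip limits it to -3.0 and removes the gradient for ratios beyond the cap.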

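The reward structure is also kept uniform:

R = R_task + w_lang * R_lang - w_len * R_len

That is, the task reward plus a language-consistency reward, minus a length penalty. The language-consistency reward mitigates language drift in long-context RL; the length penalty controls reasoning length.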
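The reward formula translates directly into code. This is a minimal sketch: the weight values below are hypothetical placeholders, since the notes only state that the length weight is set to zero in the 128K extension stage.

```python
def total_reward(r_task, r_lang, r_len, w_lang, w_len):
    """R = R_task + w_lang * R_lang - w_len * R_len."""
    return r_task + w_lang * r_lang - w_len * r_len


# Hypothetical weights for illustration only.
early = total_reward(r_task=1.0, r_lang=1.0, r_len=0.5, w_lang=0.1, w_len=0.2)
# 128K extension stage: the length penalty is removed (w_len = 0).
extended = total_reward(r_task=1.0, r_lang=1.0, r_len=0.5, w_lang=0.1, w_len=0.0)
print(early, extended)
```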

Some RL training hyperparameters are also disclosed:

top-p sampling uses p = 0.97;
the maximum rollout length is initially capped at 8K tokens;
as training progresses, rollout length is extended in powers of 2, ending at 128K tokens;
the length penalty is removed in the 128K extension stage, that is, w_len = 0;
problem sampling uses G = 128 total rollouts with G_early = 16;
early pass-rate filtering uses [0.05, 0.8], and regular pass-rate filtering uses [0.1, 0.8];
5 gradient steps are taken between inference model updates;
rollouts that are stale by more than 8 inference updates are discarded, so a rollout lags by at most 40 gradient steps;
the global MoE load balancing coefficient in the RL stage is 1e-5;
self-distillation SFT uses 128K sequence length, global batch size 2048, AdamW weight decay 0.001, maximum learning rate 1.7e-5, minimum learning rate 5.2e-6, and warmup ratio 2%;
the self-distillation dropout rate is 0.15, and its MoE load balancing coefficient is 1e-2.
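The staleness bound and the pass-rate filters above can be written as two small predicates. This is a sketch of one way to apply them; whether the interval bounds are inclusive, and how the early and regular filters are sequenced, are assumptions, and the function names are hypothetical.

```python
GRAD_STEPS_PER_INFERENCE_UPDATE = 5
MAX_STALE_INFERENCE_UPDATES = 8


def is_fresh(rollout_version, current_version):
    """Keep a rollout unless it lags by more than 8 inference-model updates."""
    return current_version - rollout_version <= MAX_STALE_INFERENCE_UPDATES


def keep_problem(n_pass, n_rollouts, low, high):
    """Pass-rate filter: drop problems that are almost never or almost always solved."""
    rate = n_pass / n_rollouts
    return low <= rate <= high


max_lag_steps = GRAD_STEPS_PER_INFERENCE_UPDATE * MAX_STALE_INFERENCE_UPDATES
# Early filter on G_early = 16 rollouts, regular filter on G = 128 rollouts.
print(max_lag_steps, keep_problem(1, 16, 0.05, 0.8), keep_problem(120, 128, 0.1, 0.8))
```

With 16 early rollouts, a single success gives a pass rate of 0.0625, which clears the 0.05 lower bound; a problem solved in 120 of 128 rollouts is dropped as too easy.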

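STEM data

The STEM climb uses STEM Mix. The report says the team processed millions of documents and produced more than 5M samples, of which the hardest portion exceeds 550k (q, a) pairs.

The pipeline turns heterogeneous sources such as textbooks, academic PDFs, forum discussions, competition archives, and vendor problems into verifiable question-answer pairs.

The process includes:

OCR and cleanup;
boilerplate removal and text normalization;
building a hierarchical structure;
LLM annotation of question / answer spans;
QA pairing for questions and answers that appear separately;
annotation of question type, topic taxonomy, PII, and answer leakage;
rewriting multiple-choice questions, proofs, and similar formats as open-ended problems;
solving with multiple solvers at pass@k;
consensus grading;
difficulty rating;
discarding items with faulty ground truth.

The point of this part is to build training signals that are verifiable, appropriately difficult, diverse in topic, and backed by reliable ground truth. The report treats the verifier / grader / data pipeline as the core of STEM RL data construction.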

The sample distribution of STEM Mix is also disclosed.

By original problem format:

Problem format	Share
Open ended	56.1%
Proof	33.3%
MCQ	10.6%

By subject taxonomy:

Subject	Share
Mathematics	58.5%
Physics	13.2%
Chemistry	10.9%
Other	4.3%
Electrical Engineering	3.4%
Computer Science	2.6%
Mechanical Engineering	2.6%
Biology	1.9%
Mechanics of Materials	1.0%
Civil Engineering	0.9%
Economics	0.7%

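Multiple-choice and proof problems are converted to open-ended form during ingestion, and samples that cannot be converted are discarded. The report nonetheless keeps a small number of multiple-choice problems so that the model stays familiar with that format.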

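Agentic coding and tool use

The agentic climb trains the model to carry out multi-step tasks in an external environment: read code, edit files, run tests, observe failures, revise the fix, and try again.

The Sandbox Execution Environment (SEE) in the report starts a fresh container for each agentic task and destroys it when the task finishes. Containers are network-isolated by default to keep runs reproducible and to avoid side effects such as rate limits and fluctuations in external services. When network access is really needed, for example to install packages, it goes through a caching proxy and a domain allowlist.

An SWE RL problem is packaged as a self-contained container image:

repo checked out at a specified commit;
dependencies preinstalled;
problem statement;
unit tests / grader;
Bash tool;
String replace editor.

The model interacts with the container through tool calls. When it finishes, the grader runs the tests in the same container and assigns the reward.

This design extends SWE-bench-style tasks into RL environments that can be produced at scale. It is an environment factory for training, not a benchmark used only for evaluation.

The tool-use environments simulate API / MCP interactions in enterprise and consumer scenarios. Each problem contains a query, tool schemas, an initial state, and a grader. In some environments a single task exposes more than 50 tools, which trains the model to pick the right tool efficiently.

The team also synthesizes closed-world tool-use environments: they generate databases, tool definitions, and verifiable tasks, then execute and deduplicate them. Synthetic data is used here for constructing RL / tool-use environments, not for pre-training.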

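Helpfulness and safety

The report puts helpfulness and safety in the same RL framework instead of treating safety as a separate patch applied before release.

One focus is reward aggregation. Rewards for different objectives have different scales, so simple addition lets large-scale signals overwhelm small-scale ones, and objectives such as safety must not be offset by response quality. The report uses two strategies:

lexicographic reward shaping: a lower-priority reward only matters when the higher-priority rewards are tied;
gated reward application: a lower-priority reward is applied only after the higher-priority objective meets its minimum requirement.

Safety, for example, is a gated case: an unsafe response receives the minimum reward outright and gets no compensation for high response quality.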
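Both aggregation strategies can be illustrated in a few lines. This is a sketch under assumed score ranges and thresholds; the report's actual reward functions are not given in these notes, and the names below are hypothetical.

```python
def lexicographic_best(candidates):
    """candidates: list of (name, (high_priority_reward, low_priority_reward)).

    Python compares tuples lexicographically, so the low-priority reward
    only breaks ties on the high-priority one.
    """
    return max(candidates, key=lambda c: c[1])[0]


def gated_reward(safety_score, quality, safety_threshold=0.5, min_reward=0.0):
    """Quality counts only once the safety gate is passed (threshold is an assumption)."""
    if safety_score < safety_threshold:
        return min_reward
    return quality


responses = [("a", (1.0, 0.2)), ("b", (1.0, 0.9)), ("c", (0.0, 1.0))]
print(lexicographic_best(responses))
print(gated_reward(0.2, quality=0.9), gated_reward(0.8, quality=0.9))
```

Response "c" has the highest low-priority reward but loses on the high-priority one, and the unsafe response with quality 0.9 still receives the minimum reward.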

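Another detail is style training. Target styles in the report include warmth without sycophancy, scannable structure, tone calibrated to context, and avoiding long-winded openings. This is a usability-oriented training objective within post-training.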

Consolidation

The three specialists are finally merged into one model. The report gives the data mixture for consolidation SFT:

Capability	Sample weight	Token weight
STEM and Coding	56%	89%
Agentic Capability	11%	9%
General Helpfulness and Safety	33%	2%

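The table shows that general helpfulness / safety has a high sample weight (33%) but a very low token weight (2%), while STEM and Coding carries 89% of the tokens from 56% of the samples. Dividing token weight by sample weight gives a rough relative length per sample: about 1.6x the mixture average for STEM and Coding, about 0.8x for Agentic Capability, and about 0.06x for General Helpfulness and Safety. STEM / code samples are therefore the longest on average, and both reasoning and agentic trajectories are far longer than helpfulness / safety samples.

From the token-weight perspective, training a reasoning model cannot be judged by sample counts alone; the token budget and rollout length also matter.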
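The relative sample length implied by the consolidation table can be computed as token weight divided by sample weight. This is a sketch over the rounded percentages in the table, so the results are approximate.

```python
MIX = {  # capability: (sample weight %, token weight %), from the table above
    "STEM and Coding": (56, 89),
    "Agentic Capability": (11, 9),
    "General Helpfulness and Safety": (33, 2),
}


def relative_tokens_per_sample(mix):
    """Tokens per sample relative to the mixture-wide average (1.0 = average)."""
    return {name: tokens / samples for name, (samples, tokens) in mix.items()}


rel = relative_tokens_per_sample(MIX)
for name, value in rel.items():
    print(f"{name}: {value:.2f}x")
```

On these numbers a STEM and Coding sample is roughly 26 times longer than a General Helpfulness and Safety sample.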

Evaluation

The public benchmark results for STEM and agentic coding are shown below. MAI-Thinking-1 results are averages over 4 runs with temperature = 1 and top-p = 0.97 throughout. Agentic coding uses a total context length of 256K; the other evals in the table use a maximum of 256K output tokens.

Benchmark	MAI-Thinking-1	Sonnet 4.6	Opus 4.6	GPT 5.4	Kimi K2.6	DeepSeek V3.2	DeepSeek V4	GLM-5.1
AIME 2025	97.0	95.6	99.8	-	-	93.1	-	-
AIME 2026	94.5	-	-	-	96.4	-	-	95.3
HMMT Feb 2026	84.9	-	-	-	92.7	-	95.2	82.6
GPQA Diamond	84.2	89.9	91.3	92.8	90.5	82.4	90.1	86.2
LCB v6	87.7	-	-	-	89.6	83.3	93.5	-
Terminal-Bench 2.0	46.0	59.1	65.4	75.1	66.7	46.4	67.9	69.0
SWE-bench Verified	73.5	79.6	80.8	-	80.2	73.1	80.6	-
SWE-Bench Pro	52.8	-	53.4	57.7	58.6	-	55.4	58.4

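These results show that MAI-Thinking-1 is competitive across a broad set of benchmark categories but does not lead on every task. On Terminal-Bench 2.0 its 46.0 is below GPT 5.4, Opus 4.6, Kimi K2.6, DeepSeek V4, and others; on SWE-Bench Pro its 52.8 is close to Opus 4.6 (53.4) but below GPT 5.4, Kimi K2.6, DeepSeek V4, and GLM-5.1.

Judging from the report's own presentation, MAI-Thinking-1 is not positioned as the absolute top-1 on all benchmarks. It is positioned as a reasoning model that emphasizes an in-house training system, deployability, and the capacity for continued iteration.

Other public benchmark results follow. The Sonnet 4.6 numbers were produced by the report's authors in their own evaluation suite: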

Category	Benchmark	MAI-Thinking-1	Sonnet 4.6
Knowledge	MMLU Pro	85	87
Knowledge	SimpleQA Verified	31	29
Instruction Following	IF Bench	69	50
Instruction Following	Adv. IF	85	86
Instruction Following	Multi-Challenge	53	57
Long Context	GraphWalks <=128k	90	96
Tool Calling	BFCL v3	72	76
Safety	AIR-Bench	88	88
Safety	CyberSec Instruct	63	62
Safety	CyberSec Auto	63	56
Honesty	Long Fact	98	98
Honesty	Truthful QA	88	88
Health	HealthBench Prof.	35	38
Health	MedXpert QA	43	49

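This table has fewer comparison models than Table 11 because many labs do not report these benchmarks in their model cards or announcements. The report therefore gives only Sonnet 4.6 as the baseline.

The official introduction page and the report also provide a human side-by-side evaluation. The final evaluation set contains 1276 English tasks, 30% of which are multi-turn. Tasks come from expert-authored prompts and PII-filtered Microsoft consumer Copilot logs. The evaluation was carried out by native English raters managed by Surge AI.

The task distribution is: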

Task category	Share of prompts
Open QA	13-14%
Brainstorming and advising	13-14%
Content authoring	13-14%
Structured problem-solving	6-7%
Information extraction	6-7%
Academic help	6-7%
Insight generation	6-7%
Content summarization	6-7%
Task planning	5%
Context-based QA	5%
Other text analysis	5%
Personal support	3-4%
Entertainment	3-4%
Chit-chat	3-4%
Role-play	3-4%

Human eval results:

Metric	vs Sonnet 4.6	vs Opus 4.6
Overall side-by-side preference	0.07 ± 0.06	-0.07 ± 0.06
Instruction following delta	-0.01 ± 0.02	-0.04 ± 0.02
Factuality delta	-0.02 ± 0.02	-0.03 ± 0.02
Conciseness and relevance delta	0.11 ± 0.02	0.07 ± 0.02
Completeness delta	-0.01 ± 0.02	-0.02 ± 0.02
Style and tone delta	0.08 ± 0.02	0.05 ± 0.02

Overall win / tie / loss rates:

Comparison	Win	Tie	Loss
MAI-Thinking-1 vs Sonnet 4.6	49%	6%	45%
MAI-Thinking-1 vs Opus 4.6	43%	5%	52%

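In the human preference evaluation, MAI-Thinking-1 is slightly ahead of Sonnet 4.6 and slightly behind Opus 4.6. By dimension, its main advantages over Sonnet 4.6 come from conciseness/relevance and style/tone; instruction following, factuality, and completeness are roughly within the noise range.

On the safety side, the report discloses two internal metric directions:

safety / over-refusal: for low-risk prompts an over-refusal rate is computed, and helpfulness is reported as 1 - over-refusal rate; high-sensitivity items are scored by a 1-5 Likert safety judge, and a score > 3 counts as a safety pass;
jailbreak: 2.5K unique seed scenarios are collected from vendors, internal red-teaming, HarmBench, StrongREJECT, and other sources, then augmented into about 9.5K jailbreak prompts; the metric is attack success rate (ASR), where lower is better.

The report text does not give a per-item numeric table for the safety figure. It does state that MAI-Thinking-1 is better than or on par with Sonnet 4.6 in 5 of 8 categories, with the larger gains in CBRN, Self Harm, and Elections & Politics, and that its jailbreak ASR is comparable to Sonnet 4.6 and Opus 4.6.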

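Summary and takeaways

The core keyword of this report is machine.

The report does not revolve around a single technique such as attention, experts, tokens, or benchmark scores. It emphasizes the model development system: how data, model, training, RL, environments, and evaluation are organized into a process of continuous iteration.

The system can be split into several layers:

Data layer: clean, licensed, human-generated, governable;
Model layer: MoE + local/global attention + LatentMoE, serving training and inference efficiency;
Experiment layer: scaling ladder + efficiency gain, avoiding conclusions that hold only at small scale;
Training layer: YOLO, determinism, dropless MoE, goodput;
RL layer: stable GRPO, reward shaping, self-distillation;
Environment layer: SEE, SWE containers, closed-world tool-use environments;
Product layer: helpfulness, safety, style, instruction following;
Evaluation layer: public benchmarks, human preference, safety red-teaming.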

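From the training infrastructure perspective, several takeaways are clear.

First, training infrastructure needs an explicit target metric. The report treats goodput as a production KPI rather than looking only at MFU or single-step throughput. In a long training run, node failures, scheduling waits, restarts, checkpoint recovery, storage jitter, and communication degradation all add to wall-clock cost. The infrastructure target should therefore be closer to effective training time / total wall-clock time than to point-in-time hardware utilization.

Second, the scheduling system needs to understand hardware topology. A rack-scale system such as GB200 is not a pool of homogeneous GPUs; NVLink domains, rack boundaries, InfiniBand paths, and reserved spare capacity all affect training stability and communication cost. The value of Kubernetes / Kueue / the MAI control plane is not only launching Pods but organizing quota, reservations, rack locality, and topology-aware placement into a sustainable capability for scheduling large jobs.

Third, health checks and remediation belong in the scheduling control loop. In the report, certification, NPD conditions, GHR, telemetry, drain, auto remediation, and recertification jointly decide whether a node can enter the production training pool. If bad nodes, degraded links, and marginal storage are handled only by manual triage, goodput drops directly. A better approach is to turn health signals into schedulable state and to send repaired nodes through certification again.

Fourth, the boundary between the training framework and the cluster control plane should be clear. Kubernetes / Kueue / the MAI control plane handle resource admission, topology placement, and node state; Ray mainly handles actor orchestration and runtime management inside admitted jobs; YOLO handles the training loop, sharding, optimizer, checkpointing, and MoE-related optimizations. This layering reduces mixed responsibilities and makes it easier to optimize scheduling efficiency, job recovery, and training performance separately.

Fifth, RL environments increasingly resemble production systems. They are neither offline datasets nor simple judges; they are collections of task environments that can be executed, reproduced, scored, and scheduled concurrently. Continuously building such environments is an important condition for obtaining high-quality training signals. Training infrastructure therefore has to support not only large GPU jobs but also containerized task environments, tool calls, rewards / graders, and complex orchestration between rollouts and learners.

Sixth, reproducibility and recoverability affect model iteration speed. The report emphasizes determinism, checkpoint/restart, and saved state such as dataloader progress, RNG, and FP8 scaling history. For large-model training these capabilities are not engineering perfectionism; they are the basis for shortening recovery time, reducing experimental variance, diagnosing training anomalies, and keeping the RL climb moving steadily.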

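The report can therefore be read as a systematic disclosure by Microsoft AI of its own model training system: data, training systems, RL environments, and product evaluation form an iteration loop aimed at reasoning models.

Much remains undisclosed, for example the complete data sources, the token / compute budget of each phase, finer-grained RL rollout scale, reward model details, the actual number of agentic environments, and the full prompt distribution of the human eval. These details affect how outside readers judge the long-term efficiency of the hill-climbing machine.

Overall, the technical significance of MAI-Thinking-1 lies not only in one set of benchmark results but in whether Microsoft can connect from-scratch pretraining, in-house RL infrastructure, enterprise-scenario environments, and Foundry distribution into a stable iteration system. The iteration speed and capability limits of later MAI models will need to be assessed against future model releases, evaluation results, and real product performance.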

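Reading Notes on the MiniMax M2 Series Technical Report

Published 2026-06-04. Category: Notes. Length: 16k characters; reading time ≈ 14 minutes.

The MiniMax M2 series technical report is titled The MiniMax-M2 Series: Mini Activations Unleashing Max Real-World Intelligence.

NVIDIA also has an article on deployment-side performance: MiniMax M2.7 Advances Scalable Agentic Workflows on NVIDIA Platforms for Complex AI Applications.

In one sentence: the M2 series does not simply scale up parameters. It tries to combine a low-activation MoE backbone, an agent data pipeline, the agent-native RL system Forge, and long-context inference optimization into a training and deployment stack aimed at real agent workflows.

These notes cover four kinds of information:

model parameters and architecture;
the techniques used in pre-training / post-training;
what the Forge RL engineering system is;
where exactly the M2.7 metrics improve.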

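Model parameters

First, the basic configuration of the M2 backbone.

Item	As reported
Architecture	decoder-only Transformer + MoE
Total parameters	229.9B
Activated parameters per token	9.8B
Layers	62
Hidden size	3072
Vocab size	200064
MoE experts	256 fine-grained experts
Experts activated per token	8
Attention	full attention
Query heads	48
KV heads	8, GQA
Position embedding	RoPE
Native context	192K tokens
Pre-training tokens	29.2T

This configuration reflects a clear trade-off: total parameters are large, but activated parameters per token are held at about 10B. Agent tasks usually consume many tokens, since multi-turn tool calls, long context, observations, and file contents all increase context length. Per-token activation cost therefore directly affects the cost of inference and of RL rollouts.

For this reason, the parameter count of M2 should be read as a pair: 229.9B total parameters / 9.8B activated parameters.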

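MoE structure

The FFN layers of M2 use MoE. The report emphasizes three design choices:

fine-grained experts;
sigmoid gating;
expert bias.

Fine-grained experts means using more, smaller experts. M2 uses 256 experts and activates 8 per token. This increases the diversity of expert combinations and can also lower the variance of expert utilization across devices.

Sigmoid gating differs from the common softmax top-k gating in that softmax has a zero-sum constraint: when one expert scores high, the others are pushed down in relative terms. With sigmoid, each expert is scored independently. The report argues that this lets several experts be activated with high confidence at the same time and makes routing dynamics smoother.

Expert bias adds a per-expert learnable bias to the gating score to improve load balancing and reduce reliance on an auxiliary load-balancing loss.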
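The contrast between softmax and sigmoid gating, and the effect of an expert bias, can be seen on toy logits. This is a sketch, not the M2 router: applying the bias only when selecting experts while taking gate weights from the unbiased sigmoid scores is an assumption, as is the normalization over the chosen experts.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def route(logits, bias, k):
    """Sigmoid gating: pick top-k by score + bias, weight by normalized unbiased scores."""
    scores = [sigmoid(z) for z in logits]
    ranked = sorted(range(len(logits)), key=lambda e: scores[e] + bias[e], reverse=True)
    chosen = ranked[:k]
    norm = sum(scores[e] for e in chosen)
    return {e: scores[e] / norm for e in chosen}


logits = [2.0, 1.9, 0.0, -1.0]
print(softmax([4.0, 4.0, -1.0]))           # zero-sum: two strong experts share the mass
print([round(sigmoid(z), 3) for z in (4.0, 4.0, -1.0)])   # independent: both can be near 1
print(route(logits, [0.0, 0.0, 0.0, 0.0], k=2))
print(route(logits, [0.0, -1.0, 0.5, 0.0], k=2))          # bias steers load away from expert 1
```

In the last call the bias moves the second slot from expert 1 to expert 2 without changing the underlying scores, which is how a bias term can rebalance load without an auxiliary loss.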

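The report gives a small-scale ablation: 17.8B total parameters, 2B activated parameters, 500B training tokens.

Configuration	MATH	MMLU	ARC-C	KorBench	HumanEval
baseline	19.6	39.8	27.4	14.1	29.7
+ MTP	21.3	39.7	27.5	15.0	30.1
+ fine-grained experts	24.1	40.2	27.8	14.8	32.5

MATH (19.6 to 24.1) and HumanEval (29.7 to 32.5) show the clearest gains over the baseline, while MMLU and ARC-C barely move.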

Attention: full attention

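M2 uses full attention rather than hybrid SWA, linear attention, or sparse attention.

Full attention here should be distinguished from causal attention. M2 is a decoder-only language model, so attention is still causal: the current position can see only past positions and itself, not the future. Full attention means that, under the causal mask, each token can see the entire history rather than only a local sliding window.

That is: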

causal full attention:
token t can attend to token 1 ... t

causal sliding-window attention:
token t can attend to token max(1, t - W + 1) ... t
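The two attention patterns can be expressed as position ranges. This is a small sketch using 1-indexed positions and taking the window size W to be the number of positions visible to a token, including itself; that convention is an assumption, and the function name is illustrative.

```python
def attended_positions(t, window=None):
    """1-indexed positions that token t can attend to under a causal mask."""
    start = 1 if window is None else max(1, t - window + 1)
    return range(start, t + 1)


print(list(attended_positions(5)))             # causal full attention
print(list(attended_positions(5, window=3)))   # causal sliding-window attention
print(len(attended_positions(100_000, window=512)))
```

At position 100,000 a 512-token window exposes 512 positions, whereas full attention exposes all 100,000, which is the gap that matters for long-context retrieval.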

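The report says MiniMax ran several groups of continued pre-training experiments on hybrid SWA, covering the SWA / full attention ratio, RoPE settings, mixing within and across layers, sink tokens, and more. On retrieval, multi-hop reasoning, in-context learning, and long-context agent tasks, the SWA variants showed clear losses.

One comparison from the pre-training stage: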

Benchmark	full attention	hybrid SWA
HELMET ICL	75.8	72.7
MMLU	85.5	85.6
MATH	60.3	60.3
RULER 128K CWE	90.0	72.0
RULER 128K MQ	99.0	93.0
RULER 32K CWE	99.0	99.0
RULER 32K MQ	99.0	99.0
MTOB K-e Bleurt	60.0	45.0
MTOB e-k ChrF	44.8	27.2

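In this data, SWA and full attention are close on the 32K tasks and on MMLU and MATH. At 128K retrieval (RULER CWE 90.0 vs 72.0) and on long-context ICL and MTOB, the limited coverage of the local window starts to show.

For agent tasks, the long context usually contains the task description, tool outputs, failed attempts, file contents, and earlier reasoning state. If the attention mechanism cannot reliably access the full history, later planning and self-correction suffer.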

MTP

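M2 uses Multi-Token Prediction (MTP). Pre-training starts with a single MTP module, K = 1, with the MTP loss weight annealed from 0.3 to 0.1.

In the decay phase of continued pre-training, M2 extends the MTP module to K = 3 for multi-step speculative decoding. The new modules are not randomly initialized; they are initialized by copying weights from the main model. The report's explanation:

copy initialization converges faster;
random initialization leads to high loss and briefly disturbs the main model;
the main model is frozen first while the MTP modules are trained, and joint training starts after the loss stabilizes.

MTP therefore plays two roles in M2:

during training it provides a richer prediction signal;
during inference it serves as the draft path for speculative decoding.

This also ties into Forge. The policy keeps changing during RL rollouts, and if the MTP draft model does not adapt with it, the acceptance rate drops. The report says that in Forge the MTP modules are co-trained with the RL policy through a top-K KL divergence loss, which preserves the benefit of speculative decoding.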

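Pre-training data

Pre-training totals 29.2T tokens.

The report splits it into:

constant phase: 19.9T tokens;
decay phase: 9.3T tokens.

Data sources include web documents, academic literature, books, programming code, and structured QA. Code, math, and STEM are upsampled relative to their natural distribution.

Long-context extension is done in several stages:

8K -> 32K -> 192K

Long-context data mainly comes from:

high-quality code concatenation;
naturally long-form PDF documents;
thematically related document packing.

Judging from the structure of the report, M2 pre-training serves not only general base model capability but also provides a 192K-context backbone for the later agent post-training.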

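Post-training data

The part of the report with the most detail is post-training data collection.

The post-training data of the M2 series is not ordinary chat data. It consists largely of agent trajectories that come with a workspace, tools, an environment, and verifiable rewards or artifact-aligned feedback.

The main categories are:

Agentic Coding;
Agentic Cowork;
Reasoning-intensive tasks;
General conversation and writing;
Role-play and persona coherence.

One caveat: the report does not disclose the actual scale of the post-training data.

It describes many pipelines and data types but gives no hard numbers such as:

number of SFT tokens;
number of RL tokens;
number of agent trajectories;
sample counts for each of SWE / AppDev / Terminal / Cowork;
pass rate before and after rejection sampling;
domain mixing ratio for each stage.

The report uses descriptions such as large-scale, corpus, trajectories, and at scale. The hard scale numbers that are visible concentrate on pre-training: 29.2T tokens, of which 19.9T are in the constant phase and 9.3T in the decay phase.

What can be confirmed about post-training is therefore this: M2 discloses how the post-training data is constructed and which verification signals are used, but not the sample counts or token budgets of each data category.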

Agentic Coding

Agentic Coding 又分 SWE、AppDev、Terminal-Gym。

SWE pipeline 从 GitHub PR 和 issue 出发，过滤 merged PR、有测试的 PR，再由 agent 构建 Docker environment。之后按 PR 类型做 task routing，比如 bug fix、feature addition、performance optimization、test/refactor 等。

这里最重要的是 reward construction。不同任务类型用不同的验证信号：

bug fix: F2P / P2P tests；
feature addition: newly added test points；
performance optimization: stable and significant performance difference；
code review: secondary LLM consistency check。
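The bug-fix signal above can be sketched as a binary reward. This is a minimal sketch, assuming the usual SWE-bench-style reading of the terms: F2P (fail-to-pass) tests must pass after the patch, and P2P (pass-to-pass) tests must keep passing. The function name, the way the two test lists are obtained, and the all-or-nothing scoring are assumptions for illustration, not details from the report.

```python
from typing import Dict, List


def swe_bugfix_reward(f2p: List[str], p2p: List[str], results: Dict[str, bool]) -> float:
    """Hypothetical binary reward for a bug-fix task.

    f2p: tests expected to flip from failing to passing (assumed to come from the reference PR).
    p2p: tests that passed before the patch and must not regress.
    results: test name -> passed?, measured after applying the agent's patch.
    """
    if not f2p:
        return 0.0  # without fail-to-pass tests there is nothing to verify
    fixed = all(results.get(t, False) for t in f2p)  # a missing result counts as a failure
    kept = all(results.get(t, False) for t in p2p)
    return 1.0 if fixed and kept else 0.0


if __name__ == "__main__":
    print(swe_bugfix_reward(["test_bug"], ["test_old"], {"test_bug": True, "test_old": True}))
```

Treating a missing test result as a failure keeps a crashed test run from being rewarded by accident.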

AppDev 则是从零构建应用。它用 expert-in-the-loop 生成 meta queries 和 system prompts，再通过 Agent-as-a-Verifier 做 rejection sampling。AaaV 分三层验证：

execution layer: 文件、依赖、构建、服务启动、JS error；
interaction layer: Playwright 检查核心交互；
visual aesthetics layer: 布局、层次、配色、现代 UI 质量。

Terminal-Gym 则从 Stack Overflow 出发，筛选 terminal-compatible、scriptable、verifiable、Docker-relevant 的任务，然后生成 Dockerfile 和 test script，再做 query evolution 和 difficulty calibration。

这几个 pipeline 的共同点是：数据不只包含最终答案，还包含可执行环境，使结果能够被测试、运行或交互验证。

Agentic Cowork

Cowork 覆盖的东西更接近知识工作者任务：

deep search and open-web research；
knowledge-worker office tasks；
financial analysis and spreadsheet operations；
slide generation and editing。

这部分的 reward 更复杂。有些任务可以 deterministic check，比如 spreadsheet cell value match、formula recalculation；有些任务需要 rubric-based judge，比如 report、slides、open-ended financial reasoning。

这个方向是 M2.7 相比 M2.5 提升很明显的地方。后面的指标表里，GDPval-AA、MEWC v2、Finance Modeling Pro 都涨得很大。

Reasoning 和 general data

Reasoning data 主要强调 scaling：

query-side scaling: 扩展题目覆盖；
response-side scaling: 一个 query 采多个正确解法；
training-side scaling: 在固定计算预算下调 query expansion / response expansion 比例；
QA: query、verifier、answer、response 多阶段质量控制。

General conversation and writing 则用于保持通用对话、写作、多轮理解，以及 tool-augmented / tool-free 两种能力。

SFT: interleaved thinking

M2 的 SFT 目标之一是训练 interleaved thinking。

普通 long CoT 往往是：

thinking -> final answer

Agent 场景更像：

thinking -> tool call -> observation -> thinking -> tool call -> observation -> final answer

报告把它称为 Plan-Act-Reflect loop。关键点是 reasoning state persistence：前一轮的 thinking block 会保留在 history 里，进入下一轮上下文。

这和 stateless per-turn reasoning 不一样。如果每轮工具调用后都把之前的 reasoning 丢掉，模型就要反复重新推导状态，长程任务里容易 drift。

这个地方也能解释为什么 M2 更坚持 full attention：如果 thinking、action、observation 都保留在 192K context 里，模型必须能可靠地访问完整历史。

Forge

是什么

Forge 是 M2 系列的 agent-native RL training system。

它不是模型结构，也不是单个 RL 算法，而是一套让长程 agent trajectories 能进入 RL 训练闭环的工程系统。

普通 RLHF 或 response-level RL 常见输入是：

prompt -> response -> reward

Agent RL 的输入则像：

task
-> reasoning
-> tool call
-> observation
-> reasoning
-> tool call
-> observation
-> artifact / test result / final answer
-> reward

轨迹可能有几十 K 到 192K tokens，中间还有工具调用、Docker、browser、spreadsheet、file system、context management、sub-agent delegation。这要求训练系统处理长序列、异步环境和复杂状态转移。

Forge 的系统拆分为：

Agent Side
  run agent loop
  manage context
  call tools / envs
  produce trajectories

Middleware
  Gateway Server
  Data Pool

Training / Inference Side
  Rollout Engine
  Train Engine

Agent Side 只负责跑任务和记录轨迹。Rollout Engine 负责高吞吐生成。Train Engine 负责策略更新。Gateway 和 Data Pool 把二者解耦。

一个重要抽象是：Forge 把 LLM generation interface 作为 policy 边界，边界之外的 tool execution、context management、memory access、agent harness 都视为 environment dynamics。

这样它就可以支持两类 agent：

white-box agent: 训练系统知道它怎么做 context management；
black-box agent: 训练系统只看到每次发给模型的 state 和模型输出。

black-box support 很重要。因为真实 agent scaffold 往往各有各的 context rewrite、memory、sub-agent、tool protocol。如果要求所有 agent 都白盒接入，系统扩展性会很差。

RL 算法和 reward

M2 系列 RL 使用 CISPO，Clipped Importance Sampling Policy Optimization。报告里给了 objective 和 importance sampling ratio 的 clipped form。

我先不展开公式，抓几个工程上更重要的点：

第一，训练样本是 (state, action) pair。一个 action 是一次 LLM completion，可以包含 reasoning、tool invocation、context operation、sub-agent communication 等。

第二，credit assignment 仍然按完整 episode 做。也就是说，每个 step 的 advantage 要结合整条 trajectory 的结果。

第三，reward 不是只有最终 outcome，而是 composite reward：

r_t = alpha * process_reward_t
    + beta * speed_reward_t
    + performance_reward_t

process reward 用于中间行为，比如工具调用格式、语言混杂、reasoning 结构。

speed reward 用于 wall-clock completion time。这个点很 agent：两个 trajectory 都完成任务，但一个串行慢慢跑，一个并行调用工具，后者对产品更有价值。

performance reward 则是最终任务质量，比如测试通过、artifact 正确、rubric score。
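A small sketch of the composite reward, assuming each of the three signals has already been normalized to [0, 1]. The class name, the field names, and the values of `alpha` and `beta` are placeholders; the report does not disclose the actual coefficients or the normalization.

```python
from dataclasses import dataclass


@dataclass
class StepSignals:
    process: float      # e.g. tool-call format / language-mixing score (assumed in [0, 1])
    speed: float        # e.g. normalized wall-clock score, higher is faster (assumed)
    performance: float  # e.g. test pass rate or rubric score (assumed in [0, 1])


def composite_reward(s: StepSignals, alpha: float = 0.1, beta: float = 0.1) -> float:
    """r = alpha * process + beta * speed + performance (coefficients are placeholders)."""
    return alpha * s.process + beta * s.speed + s.performance


if __name__ == "__main__":
    # Both trajectories solve the task; the one that calls tools in parallel finishes sooner.
    serial = StepSignals(process=1.0, speed=0.2, performance=1.0)
    parallel = StepSignals(process=1.0, speed=0.9, performance=1.0)
    print(composite_reward(serial), composite_reward(parallel))
```

With the task outcome tied, the speed term is the only thing separating the two trajectories, which is the behavior the paragraph on speed reward describes.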

第四，使用 mixed-domain RL。每个 stage 同时混合 reasoning、coding、agent、general 四类数据。这样可以降低单一 agent task RL 造成的 catastrophic forgetting。

Windowed FIFO

Agent rollout 有一个很现实的问题：任务完成时间差异巨大。

simple API task: seconds
coding task: minutes
ML engineering task: hours

如果严格 FIFO，训练会被长任务卡住。
如果谁先完成就训练谁，前期 batch 会被短任务和简单任务主导，后期才出现长任务和难任务，训练分布会漂。

Forge 用 Windowed FIFO 做折中：

queue = [T0, T1, T2, T3, T4, T5, ...]
window = [T0, T1, T2, T3]

只允许训练系统消费 window 内已经完成的 trajectories。window 内可以乱序取，window 外即使已经完成也不能提前进入训练。

论文里举的窗口大小例子是 W = 0.3N。这个策略牺牲一点点绝对吞吐，换来更稳定的数据分布。
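The scheduling rule can be sketched as a small class. This is an illustrative sketch, not Forge's implementation: the class and method names are hypothetical, and it assumes the window covers the `window` oldest queue positions starting at the oldest unconsumed task, sliding forward only when tasks at the head have been consumed.

```python
from typing import List, Optional, Set


class WindowedFIFO:
    """Hypothetical windowed-FIFO scheduler over tasks in submission order."""

    def __init__(self, task_ids: List[str], window: int):
        self.order = list(task_ids)  # oldest first
        self.window = window
        self.head = 0                # index of the oldest task not yet consumed
        self.completed: Set[str] = set()
        self.consumed: Set[str] = set()

    def mark_completed(self, task_id: str) -> None:
        self.completed.add(task_id)

    def fetch(self, max_items: Optional[int] = None) -> List[str]:
        # Only tasks inside the window are visible to the trainer.
        visible = self.order[self.head:self.head + self.window]
        ready = [t for t in visible if t in self.completed and t not in self.consumed]
        if max_items is not None:
            ready = ready[:max_items]
        self.consumed.update(ready)
        # Slide the window only past tasks at the head that have been consumed.
        while self.head < len(self.order) and self.order[self.head] in self.consumed:
            self.head += 1
        return ready


if __name__ == "__main__":
    q = WindowedFIFO(["T0", "T1", "T2", "T3", "T4", "T5"], window=4)
    q.mark_completed("T2")
    q.mark_completed("T5")   # finished early, but outside the window
    print(q.fetch())         # T2 is consumable out of order; T5 has to wait
```

Under this assumed sliding rule a slow task at the head delays at most the tasks beyond the window, while anything inside the window can still be consumed out of order.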

Prefix tree merging

Prefix tree merging 是 Forge 里最有工程味的优化之一。报告称它最高能带来 40x training speedup，并降低显存占用。

Agent RL 训练中，很多 samples 共享长前缀。例如同一个 rollout group 里可能有：

sample 1 = long context + response A
sample 2 = long context + response B
sample 3 = long context + response C

或者同一条 agent trajectory 被拆成多个 step：

s1 -> a1
s2 = s1 + a1 + obs1 -> a2
s3 = s2 + a2 + obs2 -> a3

如果每个 sample 独立 forward，long context 或历史轨迹会被反复计算。

Prefix tree merging 把这些序列组织成一棵 prefix tree：

long shared context
├── response A
├── response B
└── response C

或者：

context
└── action1 + obs1
    └── action2 + obs2
        └── action3 + obs3

共享 prefix 只 forward 一次。到分叉点后，再分别计算 branch。forward 结束后，根据元数据把 tree 拆回原始 sample，loss 仍然按 sample 独立计算。

它成立的原因是 causal attention：prefix token 的 hidden states 不依赖后续 branch tokens。后面的 token 可以看前面，前面的 token 不会看后面。

概念伪代码：

samples = [
    ctx + resp_a,
    ctx + resp_b,
    ctx + resp_c,
]

tree = build_prefix_tree(samples)

def forward_node(node, parent_state):
    state = model_forward_segment(
        tokens=node.tokens,
        parent_state=parent_state,
    )

    for child in node.children:
        forward_node(child, state)

forward_node(tree.root, None)

loss = 0
for sample in samples:
    logits = reconstruct_logits(sample, tree)
    loss += compute_loss(logits, sample.labels)

loss.backward()

真实实现会复杂很多，要处理 attention mask、position ids、loss mask、MoE routing、activation checkpointing、分布式并行和 backward graph。但核心就是把训练 batch 从独立 sequence list 改成共享前缀树。

这个优化和普通 sequence packing 不同。sequence packing 主要减少 padding；prefix tree merging 则是避免重复计算公共历史。对于 192K context 的 agent RL，这类重复计算会带来较高开销。
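To make the saving concrete, the sketch below builds a token-level trie and compares how many token positions need a forward pass with and without merging. It is a counting illustration only: `Node`, `build_prefix_tree`, and `count_tokens` are hypothetical helpers, and a real implementation would merge at segment granularity and deal with attention masks, position ids, and the backward pass.

```python
from typing import Dict, List


class Node:
    def __init__(self) -> None:
        self.children: Dict[int, "Node"] = {}


def build_prefix_tree(samples: List[List[int]]) -> Node:
    """Insert every token sequence into a trie; shared prefixes share nodes."""
    root = Node()
    for tokens in samples:
        node = root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())
    return root


def count_tokens(root: Node) -> int:
    """Number of trie nodes below the root, i.e. token positions forwarded once."""
    total, stack = 0, [root]
    while stack:  # iterative traversal, so long sequences do not hit the recursion limit
        node = stack.pop()
        total += len(node.children)
        stack.extend(node.children.values())
    return total


if __name__ == "__main__":
    ctx = list(range(1000))  # stand-in for a long shared context
    samples = [ctx + [5001, 5002], ctx + [6001, 6002], ctx + [7001, 7002]]
    independent = sum(len(s) for s in samples)
    merged = count_tokens(build_prefix_tree(samples))
    print(independent, merged)
```

In this toy case the three samples cost 3006 token positions when forwarded independently and 1006 when merged; the ratio grows with the length of the shared prefix and with the number of branches.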

Rollout 侧推理优化

Forge 还做了几类 inference acceleration。

MTP speculative decoding

M2 的 MTP modules 可以生成 draft tokens，再由 main model 验证。RL 期间 policy 会更新，所以 MTP modules 也要跟着 co-train，否则 draft acceptance rate 会下降。

Prefill-decode disaggregation

把 prefill 和 decode 分开调度。MoE 模型里 prefill 和 decode 的计算形态不同，混在一起容易互相干扰。拆开后可以分别采用更适合的 parallelism 策略。

Global L3 KV cache pool

Agent 多轮交互里有大量共享 prefix。Forge 使用分布式 KV cache pool，提高 prefix cache hit rate。router 会在 queue delay 和 cache migration cost 之间做权衡。

从报告描述看，Forge 里的 rollout engine 不只是离线采样服务，而是包含长上下文、KV cache、MoE serving、speculative decoding、prefill/decode separation、多版本权重同步等能力的推理系统。

训练系统性能

报告明确给出的硬数字是：prefix tree merging 最高可以达到 40x training speedup，同时降低 memory consumption，使更长 sequence 和更大 batch size 成为可能。

其他 Forge 优化更多是定性描述，比如：

Windowed FIFO 用于在 rollout throughput 和 distributional consistency 之间折中；
MTP speculative decoding 用于提升 rollout generation throughput；
prefill-decode disaggregation 用于提升 global throughput 并降低 tail latency；
global L3 KV cache pool 用于提升 prefix cache hit rate。

但是论文没有给出这些优化各自的 ablation 表，比如没有列出 Windowed FIFO 前后吞吐、GPU utilization、tail latency、KV hit rate、MTP acceptance rate 等数字。

实现架构推断

下面是基于报告描述的工程推断，不是官方源码，也不是论文披露的完整实现。


sequenceDiagram
participant T as Task Queue
participant A as Agent Runner
participant E as Tool / Env Servers
participant G as Gateway Server
participant R as Rollout Engine
participant D as Data Pool
participant Tr as Train Engine

T->>A: task / env spec
loop agent rollout
  A->>G: completion request<br/>state + tools + metadata
  G->>R: normalized request<br/>model version attached
  R-->>G: completion<br/>tokens + logprobs
  G-->>A: action / tool call
  A->>E: execute tool call
  E-->>A: observation / artifact
  A->>D: state / action / observation
  E->>D: verification / reward signal
  R->>D: model version / logprobs
end
D->>D: Windowed FIFO<br/>filtering / batching
D->>Tr: training batch
Tr->>Tr: prefix tree merging<br/>CISPO update
Tr-->>R: updated weights

图里的关键边界是 Gateway：Agent Side 可以保持 scaffold 差异，Training / Inference Side 则通过统一的 completion 接口接收请求、记录元数据，并把 rollout 数据回流到 Data Pool。

Prefix tree merging 可以单独画成下面这个形态：


flowchart LR
b1["before: ctx + a"]
b2["before: ctx + b"]
b3["before: ctx + c"]

ctx["after: shared ctx"]
a["branch a"]
b["branch b"]
c["branch c"]

b1 -. same prefix .-> ctx
b2 -. same prefix .-> ctx
b3 -. same prefix .-> ctx

ctx --> a
ctx --> b
ctx --> c

共享 prefix 只做一次 forward，分叉后的 response segment 分别计算；forward 结束后再根据元数据还原到原始 sample 计算 loss。

根据 Forge 的训练需求，Data Pool 可能需要记录这些字段：

trajectory_id
task_id
domain
model_version
states
actions
observations
token_ids
old_logprobs
process_rewards
final_reward
wall_clock_time
tool_calls
artifact_paths
verification_result

Train Engine 的处理流程可以抽象为：

trajectories = data_pool.fetch_windowed_fifo_batch()

samples = []
for traj in trajectories:
    for step in traj.steps:
        samples.append({
            "input_ids": step.state_tokens,
            "target_ids": step.action_tokens,
            "old_logprobs": step.old_logprobs,
            "advantage": compute_advantage(traj, step),
        })

batch = prefix_tree_merge(samples)
loss = cispo_loss(batch)

loss.backward()
optimizer.step()

rollout_engine.sync_weights(model)

这里的关键是 old logprobs 和 model version。RL rollout 和 training 之间一定存在 policy lag，所以需要知道 trajectory 是哪个旧 policy 采样出来的，再通过 importance sampling ratio 做修正。
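The role of `old_logprobs` can be shown with a clipped importance-sampling weight. The sketch below reflects one reading of a CISPO-style objective, in which the clipped ratio is treated as a constant (stop-gradient) coefficient on the token log-probability. That reading, the function name, and the epsilon values are assumptions; the report's exact objective and normalization are not reproduced here.

```python
import math
from typing import List, Tuple


def cispo_surrogate(
    new_logprobs: List[float],
    old_logprobs: List[float],
    advantages: List[float],
    eps_low: float = 0.2,   # placeholder clip bounds, not values from the report
    eps_high: float = 0.2,
) -> Tuple[List[float], float]:
    """Return per-token coefficients and the scalar loss for one batch of tokens."""
    coefs = []
    for new_lp, old_lp, adv in zip(new_logprobs, old_logprobs, advantages):
        ratio = math.exp(new_lp - old_lp)  # pi_theta / pi_old for the sampled token
        clipped = min(max(ratio, 1.0 - eps_low), 1.0 + eps_high)
        # In an autograd framework this product would be detached (stop-gradient).
        coefs.append(clipped * adv)
    loss = -sum(c * lp for c, lp in zip(coefs, new_logprobs)) / len(coefs)
    return coefs, loss


if __name__ == "__main__":
    new_lp = [math.log(0.5), math.log(0.9)]
    old_lp = [math.log(0.5), math.log(0.3)]  # second token was sampled by a stale policy
    print(cispo_surrogate(new_lp, old_lp, [1.0, 1.0]))
```

The second token's ratio is about 3, so its weight is capped at `1 + eps_high`. Under this reading, a token sampled by a stale policy still contributes a gradient, with bounded weight.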

性能数据

M2 系列报告里其余性能数据分布比较散，和 Forge 训练系统性能分开看更清晰。

Agent task 运行设置

这些不是系统吞吐，但可以反映任务成本：

agent trajectories 最长可到 192K tokens，并可能包含 thousands of intermediate actions；
rollout completion time 从 seconds 到 hours；
Terminal-Bench 2.0 使用 8 vCPU / 16GB sandbox，2 小时 wall-clock timeout，4 trials；
MLE Bench Lite 对 22 个 competitions 运行，每个 competition 在 single-A30 sandbox 中跑 24 小时，最终取 3 个 independent 24-hour trials 的平均 medal rate；
VIBE-Pro、HyperTask、MM Claw、MEWC v2、Finance Modeling Pro 等多项 agent / artifact benchmark 使用 3 trials。

Self-evolution 内部效率

M2.7 在 RL team workflow 中吸收 30% 到 50% 的 daily iteration workload；
对内部 programming scaffold 做 100-round autonomous iteration；
引入 loop detection 和更好的参数组合后，内部评估有 30% performance gain。

这部分属于内部系统和内部评测，不是外部可复现 benchmark。

部署侧推理性能

来自 NVIDIA 技术博客，而不是 M2 论文主体。NVIDIA 提到在 Blackwell Ultra GPU 上，针对 MiniMax M2 系列在 vLLM / SGLang 集成 QK RMSNorm kernel 和 FP8 MoE kernel 后，在 1K/1K ISL/OSL dataset 上：

vLLM throughput 最高提升 2.5x；
SGLang throughput 最高提升 2.7x。

这个数据属于部署工程部分，不应和 Forge 训练系统性能混在一起。Forge 是 post-training / RL infrastructure；NVIDIA 这里讲的是 open-source inference framework 的 serving optimization。

指标

论文 Table 4 给了 M2.7、M2.5 和几个闭源 frontier baseline 的对比。这里只摘 M2.7 和 M2.5。

Benchmark	M2.7	M2.5
SWE-bench Pro	56.2	55.4
SWE-bench Multilingual	76.5	74.1
Multi-SWE-bench	52.7	51.3
NL2Repo	39.8	26.6
Terminal-Bench 2.0	57.0	51.7
MLE Bench Lite	66.6	51.5
VIBE-Pro	55.6	54.2
HyperTask	67.6	59.4
BrowseComp	77.8	76.3
Wide Search	75.2	70.3
RISE	64.3	50.2
GDPval-AA	50.0	35.0
Toolathlon	46.3	38.3
MM Claw	62.7	57.6
MEWC v2	63.3	49.8
Finance Modeling Pro	57.0	33.8
AIME 2026	94.2	87.2
GPQA-Diamond	89.8	85.2
SciCode	47.0	43.0
IFBench	76.0	72.0
AA-LCR	72.0	65.0
HLE	28.0	19.0
MMLU-Pro	81.8	85.2

从表中可以看到：

M2.7 相比 M2.5 的大幅提升集中在 agent / cowork / office / MLE；
Finance Modeling Pro 从 33.8 到 57.0；
GDPval-AA 从 35.0 到 50.0；
MEWC v2 从 49.8 到 63.3；
MLE Bench Lite 从 51.5 到 66.6；
MMLU-Pro 从 85.2 降到 81.8。

这说明 M2.7 不是所有传统静态知识 benchmark 都提升。报告更强调的是：agent data pipeline 和 Forge RL 对真实 workflow benchmark 的提升。

M2.7 的 self-evolution

报告里还提到 M2.7 的 self-evolution。

MiniMax 的说法是，M2.7 可以在内部 Model Iteration System 里帮助 RL 团队：

profile ongoing runs；
read logs；
diagnose metric anomalies；
debug code；
adjust configs；
generate reports；
modify agent scaffold。

报告称它可以吸收 RL 团队日常 30% 到 50% 的 iteration workload。另一个例子是，M2.7 对内部 programming scaffold 做了 100-round autonomous iteration，引入 loop detection 和更好的参数组合，在内部评估上带来 30% performance gain。

这部分高度依赖内部工作流和内部评测，应视为官方披露的内部案例，而不是外部可复现实验结论。

总结

M2 系列报告的重点不只是单个 benchmark，而是一套完整路线：

low-activation MoE backbone
-> long-context full attention
-> MTP for training signal and speculative decoding
-> verifiable agent trajectory data
-> interleaved thinking SFT
-> Forge agent-native RL
-> rollout / training / serving co-optimization

如果只看参数，M2 是一个 229.9B total / 9.8B activated 的 MoE 模型。
如果看训练，它是一个围绕可验证 agent trajectories 做 post-training 的模型。
如果看工程，Forge 才是这篇报告里很关键的东西：它把 agent loop、推理服务、轨迹存储、reward、RL trainer 和权重同步接成一个系统。

这也是 M2 系列和很多只讲模型结构的技术报告不同的地方。它把模型能力放在完整 agent workflow 里讲，重点不是“模型会不会回答”，而是“模型能不能在环境里把事情做完，并且这个训练闭环能不能规模化”。

Ray Direct Transport (RDT)

Posted 2026-05-23 · Category: Notes · Word count: 3.7k · Reading time ≈ 3 min

Ray 默认把 object 放进 Plasma object store——每个节点上一个基于 共享内存 的本地 store 进程，worker 通过它读写 object。ray.put()、task / actor 返回值等先落本机 Plasma；跨节点时再由 Ray 的 ownership / scheduling 层协调 fetch，但 各节点本地内存层仍是 Plasma。内存不够时会 spill 到磁盘（默认在 session 临时目录下），需要时再 restore 回 Plasma；ray memory 里的 Plasma memory usage 就是这一层。

task / actor 消费 object 时要反序列化。对 CUDA torch.Tensor 来说，默认路径意味着 GPU → CPU（进 Plasma）→ GPU 的来回拷贝，在 actor 间频繁传 tensor 时开销很大。


graph LR
subgraph N1["节点 1"]
  A1["Producer Actor<br/>GPU tensor"]
  B1["Plasma Store"]
end
subgraph N2["节点 2"]
  A2["Consumer Actor"]
  B2["Plasma Store"]
end
A1 -->|"GPU to CPU, 入 Plasma"| B1
B1 -->|"跨节点 fetch"| B2
B2 -->|"拷贝到 GPU"| A2
A1 -->|"RDT 使用 Gloo/NCCL/NIXL<br/>在 actor 间 send/recv"| A2
style B1 fill:#eee,stroke:#ccc,color:#999
style B2 fill:#eee,stroke:#ccc,color:#999
style A1 fill:#ddeeff,stroke:#338
style A2 fill:#ddeeff,stroke:#338

上：传统 Plasma 路径需多次 CPU/GPU/内存拷贝；下：RDT 经 Gloo/NCCL/NIXL 在 actor 间 send/recv，绕开 Plasma。

Ray Direct Transport (RDT) 是在 ObjectRef 语义上做的增强：tensor 留在 producer actor 侧（GPU 上），consumer 需要时由 Ray 协调两端做 send/recv，绕开 Plasma object store 的序列化与拷贝。底层可选 Gloo / NCCL / NIXL——Gloo、NCCL 是 collective 库，需先建 collective group，再在 group 内走 p2p 传输；NIXL 则是基于 UCX 的 p2p RDMA，无需预建 group，且 ray.get 可走 one-sided 取回。

RDT 目前仍是 alpha，API 和限制都可能变；下文基于 Ray 2.55 文档整理。

基本用法

在返回 torch.Tensor 的 actor method 上加 @ray.method(tensor_transport=...)：

import torch
import ray
from ray.experimental.collective import create_collective_group

@ray.remote
class MyActor:
    @ray.method(tensor_transport="gloo")
    def random_tensor(self):
        return torch.randn(1000, 1000)

    def sum(self, tensor: torch.Tensor):
        return torch.sum(tensor)

sender, receiver = MyActor.remote(), MyActor.remote()
group = create_collective_group([sender, receiver], backend="torch_gloo")

tensor = sender.random_tensor.remote()
result = receiver.sum.remote(tensor)
print(ray.get(result))

decorator 只加在产出 tensor 的方法上，消费方不用加（除非它也要返回 RDT tensor）。
tensor 存在 producer actor 里，不是 Plasma object store。
传给另一个 actor 时，Ray 自动用指定 transport 做 send/recv。
返回值若未标注 RDT，仍走默认 Plasma object store（上例 sum 的标量结果）。

嵌套结构、多 tensor 返回值也支持，Ray 会递归识别其中的 torch.Tensor。

三种 transport

transport	场景	collective group	备注
`gloo`	CPU tensor	需要，`backend="torch_gloo"`	无 GPU 也能跑通 demo
`nccl`	NVIDIA GPU	需要，`backend="nccl"`	actor 需 `num_gpus=1`，tensor 在 `.cuda()`
`nixl`	CPU / GPU	不需要	基于 UCX 的 p2p RDMA；`ray.get` / `ray.put` 也可走 NIXL

Gloo / NCCL 是 collective 语义，使用前必须 create_collective_group，且 backend 与 tensor_transport 一致。NIXL 更灵活，actor 环境装好 nixl 即可，适合跨节点 p2p。

NCCL 版几乎就是 Gloo 版三处替换：tensor_transport="nccl"、backend="nccl"、tensor 放 GPU。

NIXL 额外支持 driver 侧 ray.put(t, _tensor_transport="nixl")，以及 consumer 内 ray.get(ref) 直接经 NIXL 取回。

collective transport 的 ray.get 若 caller 不在 group 里会报错，需配置 _use_object_store=True 回退。

与 Plasma object store 的语义差异

RDT object 是可变的。 Ray 只持有 tensor 引用，不做 immutable copy。producer 若仍持有同一块 tensor 并在 in-place 修改，后续 consumer 可能看到被改过的数据。这与 Ray Core 默认「actor 返回即拷贝」的行为不同。

传回 同一个 producer actor 时零拷贝，只是引用；若同时再传给别的 actor，in-place 修改会影响 Ray 内部持有的那份，Ray 会打印 warning。

需要 producer 再次写同一块 tensor 时，用 ray.experimental.wait_tensor_freed(tensor) 等 Ray 释放所有引用；注意此时 driver 不要再 ray.get 持有该 ref，否则会死锁。

限制

当前 alpha 状态

仅 torch.Tensor，仅 Ray actor（不含普通 task）。
不支持 asyncio（tracking issue）。
Gloo / NCCL：
- 只有 创建 collective group 的进程 能提交返回 / 传递 RDT object 的 actor task。
- RDT ObjectRef 不能序列化后跨进程传递，只能作为 同 group 内 actor task 的直接参数。
- 每个 actor 在同一 transport 下同时只能属于一个 group。
- 不支持 ray.put。
NIXL：同一 actor 上若先后存两个 object、tensor 集合有重叠但不完全相同，当前有已知问题；需等第一个 ObjectRef 出 scope 后再存第二个。

系统级传输错误：Gloo/NCCL collective 失败会 销毁 group 并 kill actor；NIXL 会 abort 并在依赖 task / ray.get 处抛异常。超时可调 RAY_rdt_fetch_fail_timeout_milliseconds。

与 RL 训推 infra 的关系

RL 里 actor 间传 rollout buffer、logits、hidden states 若走默认 Plasma object store，GPU 数据会被反复拉到 CPU。RDT 把这条路径收成 actor 间 direct transport，和 NCCL collective、NIXL RDMA 对齐，适合 多 actor 流水线（例如 rollout actor → trainer actor）且 tensor 较大的场景。但 alpha 阶段的 collective group 创建进程限制、可变语义、以及仅 actor 支持，使用前要先对照 workload 评估是否适用。

参考

model-ds-series

Posted 2026-05-08 · Updated 2026-05-25 · Word count: 2.2k · Reading time ≈ 2 min

想着在这个时间点上回顾下模型和模型训练 infra 发展的经历，就以 DS 的技术报告为例吧。

DS 系列模型：

DeepSeek LLM — DeepSeek LLM 7B、DeepSeek LLM 67B Dense（2024/01/05） arXiv:2401.02954
DeepSeek-Coder — 1.3B / 6.7B / 33B（2024/01/25） arXiv:2401.14196
DeepSeekMoE — MoE 语言模型系列 2B / 16B / 145B 等（2024/01/11） arXiv:2401.06066
DeepSeekMath — DeepSeekMath-7B（2024/02/05） arXiv:2402.03300
DeepSeek-V2 — 第二代 MoE 通用大模型（2024/05/07） arXiv:2405.04434
DeepSeek-V2.5 — 通用与代码能力合流迭代（2024/09/06；无单独 arXiv 技术报告，架构见 V2）官方说明
DeepSeek-V3 — 第三代 MoE 通用大模型（2024/12/27） arXiv:2412.19437
DeepSeek-R1、DeepSeek-R1-Zero（2025/01/22） arXiv:2501.12948

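Next comes the framework implementation. It is not that this was impossible before; it was simply a huge undertaking. With models to help, the barrier has dropped not only for writing code but also for understanding SOTA work, so it is now practical to survey the field widely in spare time and become a "full-stack" engineer for the new era.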

DeepSeek LLM

LLaMA

2T tokens pre train

1M sft, RLHF, SFT → RM → PPO 这条经典 RLHF pipeline

DPO, 不显式训一个单独的 RM、也不做 RL 循环，直接用 偏好对 数据（同一条 prompt 下，人类更喜欢回答 A 而不是 B）去更新语言模型。

SFT -> DPO
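The DPO update described above boils down to a logistic loss on a log-probability margin against a frozen reference model. Below is a minimal sketch for a single preference pair, using the standard DPO loss; the function and argument names, and `beta=0.1`, are illustrative and are not DeepSeek LLM's settings.

```python
import math


def dpo_loss(
    pi_chosen: float,     # log pi_theta(chosen | prompt), summed over tokens
    pi_rejected: float,   # log pi_theta(rejected | prompt)
    ref_chosen: float,    # log pi_ref(chosen | prompt), frozen reference
    ref_rejected: float,  # log pi_ref(rejected | prompt)
    beta: float = 0.1,    # placeholder value
) -> float:
    """-log sigmoid(beta * [(pi_c - ref_c) - (pi_r - ref_r)]) for one pair."""
    margin = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    print(dpo_loss(-1.0, -2.0, -1.5, -1.5))
```

No reward model or rollout loop appears anywhere: the only inputs are log-probabilities of the two responses under the policy and the reference.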

模型架构

a Pre-Norm structure with RMSNorm (Zhang and Sennrich, 2019) function

using SwiGLU (Shazeer, 2020) as the activation function

Rotary Embedding (Su et al., 2024) for positional encoding

Grouped Query Attention (GQA)

AdamW optimizer (Loshchilov and Hutter, 2017), with the following hyperparameters: 𝛽1 = 0.9, 𝛽2 = 0.95, and weight_decay = 0.1

HAI-LLM = Megatron 式多并行 + Flash Attention + ZeRO-1 省优化器显存 + 通信计算重叠 + 融合 kernel；数值上用 bf16 算、fp32 攒梯度保稳；最后在 softmax/CE 上用 in-place 省 logits 显存。

DeepSeek-Coder

代码生成和代码补全，专门进行 FIM 代码补全训练

employ HuggingFace Tokenizer library 使用 Byte Pair Encoding 技术在训练语料子集上进行训练得到

Built on the DeepSeek LLM architecture and training techniques: decoder-only Transformer, RoPE, GQA, FlashAttention v2

AdamW 优化器

并行策略实现仍然用的自研的 HAI-LLM 框架

超参数 (Hyperparameter)	DeepSeek-Coder 1.3B	DeepSeek-Coder 6.7B	DeepSeek-Coder 33B
隐藏层激活函数 (Hidden Activation)	SwiGLU	SwiGLU	SwiGLU
隐藏层维度 (Hidden size)	2048	4096	7168
中间层维度 (Intermediate size)	5504	11008	19200
隐藏层数 (Hidden layers number)	24	32	62
注意力头数 (Attention heads number)	16	32	56
注意力机制 (Attention)	Multi-head	Multi-head	Grouped-query (8)
批次大小 (Batch Size)	1024	2304	3840
最大学习率 (Max Learning Rate)	$5.3 \times 10^{-4}$ ( $5.3\text{e-}4$ )	$4.2 \times 10^{-4}$ ( $4.2\text{e-}4$ )	$3.5 \times 10^{-4}$ ( $3.5\text{e-}4$ )

DeepSeek-Coder 主系列（1.3B / 6.7B / 33B）：from scratch pt 2T -> Base（技术报告：pre-training 含 FIM、context 扩至 16K；DeepSeek-Coder 仓库 README Model Training 两阶段 1.8T@4K + 200B@16K）；SFT 2B -> Instruct

DeepSeek-Coder-v1.5（仅 7B）：CPT from DeepSeek-LLM 7B，2T@4K（next-token，无 FIM/16K）

DeepSeekMoE

更小（多）的专家 + 共享专家

2B 实验

DeepSeekMath

DeepSeek-V2

DeepSeek-V2.5

DeepSeek-V3

DeepSeek-R1

PPO

Posted 2026-04-18 · Updated 2026-05-05 · Word count: 16k · Reading time ≈ 14 min

1. PPO

Proximal Policy Optimization Algorithms

两阶段循环

为什么可以“多轮”

通常情况下，如果对同一批数据进行多轮优化，策略会因为更新过头而崩溃。但 PPO 引入了 Clipped Objective（裁剪目标函数）：

安全护栏：在每一轮优化中，PPO 会计算新策略和采样时的旧策略的概率比。如果这个比值超出了设定的范围（比如 $0.8 \sim 1.2$ ），梯度就会被“截断”。
效果：这确保了即使在这一批数据上反复“薅羊毛”优化，新策略也不会跑得离旧策略太远，从而保证了训练的稳定性。

1.1. 采样阶段 (Sampling Phase)

动作：让当前的策略 $\pi_{\theta_{old}}$ 在环境中运行一段时间。
产出：收集一批轨迹数据（包括状态 $s$ 、动作 $a$ 、奖励 $r$ 等）。
性质：这些数据是“新鲜”的，反映了当前策略的行为模式。

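During this phase the network parameters are frozen (that is, $\theta_{old}$ ).

Actor (policy network): selects actions in the environment according to its probability distribution.
Data collection: store $(s_t, a_t, r_t, s_{t+1})$ in a temporary buffer.
Goal: collect enough trajectory steps (for example, 2048 time steps).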

1.1.1. 计算“标签” (Preprocessing)

在开始训练前，利用收集到的数据计算两个关键值：

$\hat{A}_t$ (Advantage)：优势函数，用来衡量这个动作比平均水平好多少。
$R_t$ (Returns)：这一步动作带来的累积奖励。

注意到

$r_t(\theta)$ ：新旧策略概率比（用于 Actor）。
$\hat{A}_t$ ：优势估计（用于 Actor，决定更新方向）。
$R_t$ ：回报目标值（用于 Critic，提升估值精度）。

如果只用即时奖励 $r_t$ 作为目标，Critic 就会变得非常“短视”。即时奖励 $r_t$ ：只代表当前这一步的好坏。回报目标 $R_t$ ：代表从当前时刻起，在轨迹剩余部分上累计（折现）后的总回报。目标是让 Critic 具备“向前看”的能力，因此用 $V(s_t)$ 去拟合这个 $R_t$ 。

R_t = \hat{A}_t + V(s_t)

$R_t$ (Returns)：作为 Critic 网络的监督信号（标签）。

计算逻辑：通过 $\hat{A}_t$ （优势）与采样时旧的 $V(s_t)$ 相加得到： $R_t = \hat{A}_t + V(s_t)$ 。
物理意义：它代表了在当前策略下，从状态 $s_t$ 开始预期能获得的折现总奖励。Critic 的优化目标就是让预测值 $V_\theta(s_t)$ 尽可能接近这个 $R_t$ 。

这意味着：

先用 GAE 算出了优势估计 $\hat{A}_t$ 。
通过 $\hat{A}_t + V(s_t)$ ，便可反向推导出这一步动作对应的“目标回报” $R_t$ 。
价值损失 (Value Loss) 就变成了： $MSE(V_{new}(s_t), R_t)$ 。

1.1.2. 小结

Actor：利用 $\hat{A}_t$ （相对好坏）来决定 $\theta$ 的更新方向。
Critic：利用 $R_t$ （绝对得分）来修正自己对世界的认知。

1.2. 优化阶段 (Optimization Phase)

多轮优化 (Several Epochs)：

动作：将刚才采样的这一批数据反复输入神经网络进行多次梯度更新。
关键点：在传统的 On-policy 算法（如普通的策略梯度）中，这批数据更新一次就必须扔掉。但 PPO 允许在同一批数据上跑 3 轮、5 轮甚至 10 轮（Epochs）。

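Split the buffer into smaller mini-batches and train for $K$ epochs (for example, $K=10$ ). Within each epoch:

Compute the probability ratio $r_t(\theta)$ : the action probability under the current $\theta$ divided by the probability under the sampling-time $\theta_{old}$ .
Apply $CLIP$ : if $r_t(\theta)$ drifts too far from 1 (for example, by more than 20%), truncate it.
Gradient update: backpropagate to update $\theta$ .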

Why $r_t(\theta)$ permits multiple epochs: PPO is built on importance sampling, which is the theoretical basis that lets an on-policy method reuse one batch in a mildly off-policy way.

理论背景：优化目标是新策略 $\pi_\theta$ ，但训练数据来自旧策略 $\pi_{\theta_{old}}$ 的采样分布。
补偿机制：通过概率比率 $r_t(\theta)$ ，对数据分布偏差做重要性采样修正。
约束：重要性采样要求两个分布不能差太远，否则方差会爆炸。这正是 $L^{CLIP}$ 存在的根本原因——它在数学上维护了重要性采样的有效区间。

1.2.1. 优势估计

通常采用 GAE (Generalized Advantage Estimation)。

简单来说，优势函数 $\hat{A}_t$ 的目标是回答：“在状态 $s_t$ 下采取动作 $a_t$ ，比平均情况（即 Baseline）好多少？”

1.2.1.1. 计算时序差分残差（Temporal Difference Error）

首先计算每一个时间步的即时偏差 $\delta_t$ 。它衡量了“实际观测到的奖励 + 下一步的估值”与“当前估值”之间的差距：

\delta_t = r_t + \gamma V(s_{t+1}) - V(s_t)

$r_t$ ：当前步获得的奖励。
$V(s_{t+1})$ ：神经网络（Critic）对下一步状态的估值。
$V(s_t)$ ：神经网络（Critic）对当前状态的估值。

1.2.1.2. 累加衰减

For time steps $t \in [0, T)$ , the advantage estimate $\hat{A}_t$ does not look only at the current step. It folds in every future $\delta$ , with exponential decay:

\hat{A}_t = \delta_t + (\gamma\lambda)\delta_{t+1} + (\gamma\lambda)^2\delta_{t+2} + \cdots + (\gamma\lambda)^{T-1-t}\delta_{T-1}

这里有两个关键的超参数：

$\gamma$ (Gamma)：折扣因子（通常 0.99），决定了对远期奖励的重视程度。
$\lambda$ (Lambda)：GAE 因子（通常 0.95），用于在偏差（Bias）和方差（Variance）之间做权衡。

实现时，逆序（t）计算

如果 $\lambda = 0$ ： $\hat{A}_t = \delta_t$ 。这叫 1-step TD。它很稳定（方差小），但如果 $V$ 函数估值不准，它就会错得离谱（偏差大）。
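If $\lambda = 1$ : $\hat{A}_t$ becomes the discounted sum of rewards from the current step to the truncation point $T$ (plus the bootstrapped value at the truncation point), minus $V(s_t)$ . This follows the observed rewards closely (low bias), but environment randomness makes the values jump around (high variance).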

这就是 $\lambda$ 用于在偏差（Bias）和方差（Variance）之间做权衡的物理意义。PPO 选取 $\lambda = 0.95$ 它在“相信神经网络的估值”和“相信实际观测到的奖励”之间取了一个折中。

1.2.1.3. 标准化 (Advantage Normalization)

在算出 $T$ 个时间步的所有 $\hat{A}_t$ 后，工程上通常会进行一次标准化处理：

\hat{A}_t = \frac{\hat{A}_t - \text{mean}(\hat{A})}{\text{std}(\hat{A}) + 10^{-8}}

稳定梯度：在一个 Batch 中，优势值的数值跨度可能很大。标准化后，它们的均值为 0，标准差为 1。
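Closing the loop: with zero mean, each batch contains both "better than average" actions (positive, probability raised) and "worse than average" ones (negative, probability lowered), although not necessarily in equal numbers. Keeping the gradient scale consistent in this way helps Adam converge stably.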

1.2.1.4. 总结计算流程

r_t(\theta) = \frac{\pi_\theta(a_t | s_t)}{\pi_{\theta_{old}}(a_t | s_t)}

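Run $T$ sampling steps and collect every reward $r_t$ and state value $V(s_t)$ (along with the old-policy log-probabilities needed later for the ratio $r_t(\theta)$ ).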
从后往前计算（这样可以用 $A_{t+1}$ 算出 $A_t$ ）：

$A_t = \delta_t + (\gamma\lambda) A_{t+1}$

对整个 Batch 进行标准化。
将算好的 $\hat{A}$ 输入 $L^{CLIP}$ 进行优化。
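The flow above can be written as a short pure-Python routine. This is a minimal sketch for a single trajectory segment with no terminal state inside it; `last_value` is the critic's bootstrap estimate $V(s_T)$ , and the function names are illustrative.

```python
from typing import List, Tuple


def compute_gae(
    rewards: List[float],
    values: List[float],   # V(s_0) ... V(s_{T-1}) recorded at sampling time
    last_value: float,     # V(s_T), used to bootstrap the final step
    gamma: float = 0.99,
    lam: float = 0.95,
) -> Tuple[List[float], List[float]]:
    """Return (advantages, returns) using the backward recursion A_t = delta_t + gamma*lam*A_{t+1}."""
    T = len(rewards)
    advantages = [0.0] * T
    next_adv, next_value = 0.0, last_value
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * next_value - values[t]  # TD residual
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
        next_value = values[t]
    returns = [a + v for a, v in zip(advantages, values)]    # R_t = A_t + V(s_t)
    return advantages, returns


def normalize(xs: List[float], eps: float = 1e-8) -> List[float]:
    """Batch-level advantage normalization: zero mean, unit standard deviation."""
    mean = sum(xs) / len(xs)
    std = (sum((x - mean) ** 2 for x in xs) / len(xs)) ** 0.5
    return [(x - mean) / (std + eps) for x in xs]


if __name__ == "__main__":
    adv, ret = compute_gae([1.0, 1.0, 1.0], [0.0, 0.0, 0.0], last_value=0.0, gamma=1.0, lam=1.0)
    print(adv, ret, normalize(adv))
```

The returns are computed from the unnormalized advantages; only the copy fed to $L^{CLIP}$ is normalized.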

1.2.2. 损失函数

优势估计 $\hat{A}_t$ 和概率比率 $r_t(\theta)$ 都准备好了，进入 PPO 执行阶段构建 Loss 函数并进行参数更新

Adam 更新时并不是「三个互不相关的 loss 各算各的」，而是把 策略裁剪项、价值拟合项、熵项 合成 一个标量目标（再按实现约定取正/取负）做一次反传。

总损失函数 $L^{CLIP+VF+S}_t$ 通常长这样：

L_t^{total}(\theta) = L_t^{CLIP}(\theta) - c_1 L_t^{VF}(\theta) + c_2 S[\pi_\theta](s_t)

其中 价值项前面是减号、熵项前面是加号（与 $c_1,c_2$ 一起决定「多拟合价值 / 多鼓励随机」的强度）。如框图里写「L_clip + L_V + 熵」往往只是 并列这三类成分都会进同一轮梯度，不表示数学上三项都是同号的「单纯相加」——具体 + / - 以代码里 loss = … 的写法为准（常见做法是对「要最大化的 surrogate（代理目标：用样本可算的式子近似真实策略改进）」整体取负再交给优化器 minimize）。

L^{CLIP}(\theta) = \hat{\mathbb{E}}_t \left[ \min \left( r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1 - \epsilon, 1 + \epsilon) \hat{A}_t \right) \right]

这三个部分分工明确：

$L_t^{CLIP}(\theta)$ (策略损失)：利用 $\hat{A}_t$ 和 $r_t(\theta)$ 进行裁剪优化。直观上，它会在优势为正时提高动作概率，同时用 clip 限制单次更新幅度，避免策略变化过快。
$L_t^{VF}(\theta)$ (价值损失)：通常是均方误差 $MSE(V_\theta(s_t), V_{target})$ 。它负责让 Critic 的状态价值预测更贴近目标回报（更准确）。
$S[\pi_\theta](s_t)$ （entropy bonus，熵奖励 / 熵正则项）：在目标里加一项与策略分布的 Shannon 熵 成正比的奖励，鼓励动作分布别太快变成「几乎总选某几个动作」；直观上就是 多留一点随机性、减缓过早收敛（文献与代码里也常写 entropy regularization、policy entropy）。
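A numeric sketch of how the three terms combine, using plain floats instead of tensors. Function names are illustrative. The combined value is the objective to maximize, so an optimizer would minimize its negative.

```python
from typing import List


def ppo_clip_term(ratio: float, advantage: float, eps: float = 0.2) -> float:
    """min(r * A, clip(r, 1 - eps, 1 + eps) * A) for a single sample."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)


def ppo_objective(
    ratios: List[float],
    advantages: List[float],
    values: List[float],      # V_theta(s_t), current critic predictions
    returns: List[float],     # R_t targets
    entropies: List[float],   # per-state policy entropy
    c1: float = 0.5,
    c2: float = 0.01,
    eps: float = 0.2,
) -> float:
    """L_CLIP - c1 * L_VF + c2 * S, averaged over the batch."""
    n = len(ratios)
    l_clip = sum(ppo_clip_term(r, a, eps) for r, a in zip(ratios, advantages)) / n
    l_vf = sum((v - R) ** 2 for v, R in zip(values, returns)) / n
    entropy = sum(entropies) / n
    return l_clip - c1 * l_vf + c2 * entropy


if __name__ == "__main__":
    # The clip only caps the gain when the ratio moves in the direction the advantage favors.
    for r, a in [(1.5, 1.0), (1.5, -1.0), (0.5, 1.0), (0.5, -1.0)]:
        print(r, a, ppo_clip_term(r, a))
```

The four cases show the asymmetry of the `min`: a ratio of 1.5 with a negative advantage is not clipped, so a move in the wrong direction is penalized in full.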

MSE 均方误差

1.2.2.1. 执行 Adam 更新

Adam 优化器, 梯度下降 (Gradient Descent)

有了总损失后，流程如下：

计算梯度：对总损失关于参数 $\theta$ 求导（即之前提到的 $L \text{ wrt } \theta$ ）。
反向传播：将梯度传回神经网络。
参数更新：Adam 优化器根据动量和自适应学习率微调 $\theta$ 。

进入 $K$ 个 Epoch 的循环

针对同一批采样数据（那 $NT$ 个样本），反复进行 $K$ 次上述的“计算 Loss -> 更新参数”过程。

在第 1 遍时： $r_t(\theta_{old}) = 1$ ，概率比退化为恒等映射，更新等价于未做分布修正的常规梯度步。
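On the $K$ -th pass: the parameters have already changed several times, so $r_t(\theta)$ may have drifted well away from 1. This is where clipping matters: for a sample whose ratio has already moved past the clip range in the direction its advantage favors, the clipped term is constant in $\theta$ , so that sample's policy gradient is zero and the model is kept from running away.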

注：虽然 PPO 的理论目标是最大化奖励，但在代码实现中，通常会对总目标函数取负值，将其转化为最小化损失，从而用 Adam 等优化器做梯度更新。

1.2.2.2. 更新旧策略

$\theta_{old} \leftarrow \theta$

当 $K$ 次迭代结束，这一批数据的价值就被“榨干”了。
此时，把当前的最新参数 $\theta$ 赋值给 $\theta_{old}$ 。然后清空缓存的数据，回到环境里，开启下一轮 $N \times T$ 的数据采集。

1.3. Hyperparameters 参考

参数	常用值	作用
$\epsilon$	$0.1 \sim 0.2$	裁剪阈值，限制单次更新步长
$\gamma$	$0.99$	长期奖励折扣因子
$\lambda$	$0.95$	GAE 平衡因子
$c_1$	$0.5$	价值损失权重（MSE 权重）
$c_2$	$0.01$	entropy coefficient（熵项权重）：调大则更鼓励探索、策略更「散」；调小则更贴 reward、更易早收敛
$K$	$3 \sim 10$	每个 Batch 的重复训练次数（Epochs）

2. RLHF 中的 PPO

在很多 LLM 对齐/偏好优化的工程实现里，会看到 “PPO + reference model（参考模型）”。这很容易让人误以为 reference model 是 PPO 论文（Schulman 2017）的一部分；但严格来说，它是 RLHF 场景下额外加入的约束/正则，用来防止策略为了刷 reward 而跑飞（reward hacking、语言退化、分布崩坏等）。

2.1. RLHF 训练 flow

SFT → RM → PPO

可以把最常见的 RLHF 流程理解成三段：

SFT：用高质量指令数据把模型先教会“基本说话方式”，得到 $\pi_{\text{SFT}}$ ；它常常也会作为后面的 $\pi_{ref}$ （冻结参考模型）。
Reward Model（RM）：用偏好数据训练一个打分器 $R(x,y)$ （或 $r_\phi(x,y)$ ），用于刻画“在相同输入下，哪些输出更受偏好”。
PPO-RLHF：从 $\pi_{\text{SFT}}$ 初始化可训练策略 $\pi_\theta$ ，用 PPO 提高 $R$ ，同时用 KL-to-reference 把 $\pi_\theta$ 拴在 $\pi_{ref}$ 附近。

而 PPO-RLHF 的实现，通常就是把“文本生成”当成一条轨迹上的序列决策，然后复用前边提到的 PPO 两阶段循环：

自回归 MDP（最常见的设定）：第 $t$ 步的“动作”是下一个 token $y_t$ ；状态可以抽象成 $(x,y_{<t})$ 。
Rollout：用 $\pi_{\theta_{old}}$ 采样一批 completions（得到 token 轨迹与 logprob）。
Reward / shaping：把 RM 分数与 KL shaping 组合成每步可用的标量回报信号（工程上常见是把 KL 摊到 token；RM 可能是序列末一次性给分，也可能有更细的 shaping，取决于实现）。
- reward shaping 在这里可以直观理解为：不只给“最后好不好”的稀疏信号，而是额外构造/改写一组更密、更及时的逐步回报，让 PPO 在生成过程中更容易学、也更可控；其中 per-token 的 KL 项就是很典型的 shaping。
- RM shaping 则更具体：指把 reward model 的偏好信号从“只在结尾给一次分”，扩展成更稠密的过程性反馈（例如分段打分、对关键子结构/步骤给增量奖励、或把可验证规则与 RM 组合成逐步项）。不同系统差异很大；设计不当也可能让模型去“刷 RM shaping”而不是真正提升偏好质量，因此通常仍会配合 KL-to-reference 与谨慎的系数/裁剪。
Optimization：在同一批数据上算优势（GAE），再按实现把 $L^{CLIP}$ 、value loss、entropy bonus 合成 一个标量 loss 做 $K$ 个 epoch；最后更新 $\theta_{old}\leftarrow\theta$ ，进入下一轮 rollout。

一句话总结：RM 给方向， $\pi_{ref}$ + KL 给长期护栏，PPO（尤其 clipping）给短期稳定更新。

2.1.1. 模块框图

下图按常见实现，把模块分成两类：

一类是 在进入 PPO 对齐之前就已经训好、此阶段通常不再更新的模型：reference（多为 SFT 得到的 $\pi_{ref}$ ，作锚点）与 reward model（在偏好数据上训好的打分器）。工程图里常把它们画成 external / frozen：参数固定，只提供 KL-to-reference 与 reward / scoring 等信号。

另一类是 当前正在被 PPO 更新的策略网络：Rollout 用 $\pi_{\theta_{old}}$ 采样——它不是另一套独立权重，而是与 Actor 同一组参数、上一轮留下的策略快照。Actor 与 Critic（常为同一 backbone 上的 policy / value head）把各步里由 RM、KL 等拼出的标量 reward 写成 $r_t$ ，再经 GAE、PPO loss 反传更新 $\theta$ ；最后 $\theta_{old}\leftarrow\theta$ ，进入下一轮 rollout。

PPO-RLHF 模块关系

2.2. 两个“旧策略”不要混

PPO 里几乎总会涉及旧策略，但它通常指的是：

$\pi_{\theta_{old}}$ （PPO 的 old policy）：上一轮采样用的策略快照，用于重要性采样比率 $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ 它是 每一轮都会更新 的。

而 RLHF 工程里常说的 reference model 一般是：

$\pi_{ref}$ （RLHF 的 reference policy/model）：冻结的锚点模型（常见做法是 SFT 后的模型），用于给当前策略加一个 “别偏太远” 的约束；它通常在一段训练期间 保持不变 或更新频率很低。

2.3. KL-to-reference：把“别跑飞”写进目标

以 PPO-RLHF 常见写法为例，会把 reward 加上一个 KL 惩罚（或等价的 reward shaping）：

R'(x, y) = R(x, y) - \beta \, \mathrm{KL}\left(\pi_\theta(\cdot|x)\ \|\ \pi_{ref}(\cdot|x)\right)

这里的符号可以按“一条 RLHF 训练样本”来理解：

$x$ ：prompt / 输入上下文（用户问题、题目、对话历史等）
$y$ ：response / 输出序列（模型在 $x$ 条件下生成的整段回答 token 序列）
$R(x, y)$ ：在输入 $x$ 下输出 $y$ 的奖励（来自 reward model、规则打分等）
$\mathrm{KL}(\pi_\theta(\cdot|x)\ \|\ \pi_{ref}(\cdot|x))$ ：在同一个输入 $x$ 条件下，当前策略相对 reference policy 的分布偏离程度

于是 PPO 实际最大化的是 “奖励 - 偏离 reference 的代价”。直觉上：

如果只追 $R$ ：模型会倾向于钻 reward 的空子，偏离语言先验越来越大。
加上 KL：reference model 提供了一个长期锚点，PPO 的 clipped update 提供了一个短期的“每步别迈太大”，两者一起让训练更稳。

备注：不同实现里 KL 可能以多种形式进入（显式 KL penalty、或把 per-token logprob 差写进 reward），但核心都是 “把策略拴在 $\pi_{ref}$ 附近”。

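A common engineering view is to spread the KL over individual tokens. Let the output sequence be $y=(y_1,\dots,y_T)$ ; then the sequence-level logprob difference decomposes as

\log \pi_\theta(y|x) - \log \pi_{ref}(y|x) = \sum_{t=1}^T \Big(\log \pi_\theta(y_t|x,y_{<t}) - \log \pi_{ref}(y_t|x,y_{<t})\Big)

In an autoregressive language model, time step $t$ here is usually the step that generates the $t$ -th token (that is, the token index):

$y_t$ : the token sampled at step $t$
$y_{<t}=(y_1,\dots,y_{t-1})$ : the prefix generated before step $t$ (empty at step 1)

So $T$ is the length of the output sequence in tokens. This time axis can differ from the "one environment step" axis of classic RL: in LLM text generation, one step usually means generating one more token.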

如果只关心当前采样到的这条序列（on-policy 轨迹）上的惩罚，那么很多实现会定义一个 token 级别的“KL 代价”：

r^{KL}_t \triangleq -\beta\Big(\log \pi_\theta(y_t|x,y_{<t}) - \log \pi_{ref}(y_t|x,y_{<t})\Big)

这里的 $\log \pi_\theta(y_t|x,y_{<t})$ （logprob）就是：策略模型在时间步 $t$ 给出的“下一个 token”的条件概率分布 $\pi_\theta(\cdot|x,y_{<t})$ 中，取到实际 token $y_t$ 的概率再取对数（通常取自然对数）。

然后把它加进每一步的 reward（reward shaping）。这样累加起来就是序列级别的 logprob 差惩罚：

\sum_{t=1}^T r^{KL}_t = -\beta\Big(\log \pi_\theta(y|x) - \log \pi_{ref}(y|x)\Big)

直觉上：如果某个 token 在当前策略下的概率比 reference 更大（ $\log \pi_\theta - \log \pi_{ref} > 0$ ），那它会产生负的 shaping reward（惩罚），从而抑制策略在该方向上“越走越远”。
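The token-level shaping can be sketched in a few lines, assuming the RM score arrives once at the last token, which is one of the arrangements mentioned above. The function name and `beta` value are illustrative.

```python
from typing import List


def shaped_token_rewards(
    policy_logprobs: List[float],  # log pi_theta(y_t | x, y_<t) for the sampled tokens
    ref_logprobs: List[float],     # log pi_ref(y_t | x, y_<t) for the same tokens
    rm_score: float,               # sequence-level reward model score
    beta: float = 0.1,             # placeholder KL coefficient
) -> List[float]:
    """Per-token reward: -beta * (logprob difference), plus the RM score on the final token."""
    if not policy_logprobs:
        raise ValueError("expected at least one generated token")
    rewards = [-beta * (p - r) for p, r in zip(policy_logprobs, ref_logprobs)]
    rewards[-1] += rm_score
    return rewards


if __name__ == "__main__":
    print(shaped_token_rewards([-1.0, -2.0], [-1.5, -2.0], rm_score=1.0))
```

Summing the output recovers the RM score minus $\beta$ times the sequence-level logprob difference, matching the identity above.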

2.4. 推荐阅读

Ouyang et al., 2022. Training language models to follow instructions with human feedback (InstructGPT).（SFT → RM → PPO，以及 KL/reference 的由来）
Stiennon et al., 2020. Learning to summarize with human feedback.（更早期、端到端的 RLHF 案例）
Ziegler et al., 2019. Fine-Tuning Language Models from Human Preferences.（偏好优化 + KL 正则的直观版本）

扩展（对比视角，理解“reference 并非 PPO 专属”）：

Rafailov et al., 2023. Direct Preference Optimization (DPO).（绕开 RM 与 PPO，但同样体现 anchor/reference 的思想）

3. GRPO：从 PPO / RLHF 再往前走一小步

前文将 PPO 概括为“稳定的策略更新框架”，将 RLHF 概括为“RM + KL-to-reference + PPO”的常见落地形态。进一步地，在 数学推理 / 可验证奖励 这类场景里，训练目标仍然可以用 PPO 的 clipped objective，但 优势（advantage）与 baseline 的估计往往会变得更棘手。

GRPO（Group Relative Policy Optimization） 是在 DeepSeekMath 里提出的、PPO 的一个变体：动机之一是让 RL 在 LLM 场景里更省资源，同时处理 “reward 往往只在序列末出现、但 value 需要 token 级别监督” 这类不匹配。

这一节我按“从 PPO 视角推出来”的方式把 GRPO 的核心写清楚：它仍然用 PPO 的 ratio + clip 来做稳定更新，但把 critic/value baseline 换成了「同题采样组内」的相对基线。

仍然很 PPO：整体还是围绕 clipped ratio 的策略更新思路在转（可以把它理解成“骨架仍在 PPO”）。
关键变化：去掉 value模型 / critic：GRPO 不再额外训练一个与 policy 同量级的 value function 来给每个 token 做 baseline。
用 group 做相对基线：对同一个问题 $q$ ，先从旧策略采样一组输出 $\{o_1,\dots,o_G\}$ ，再用 组内相对比较 来构造优势（论文强调这与 reward model 常见的“同题对比训练”更一致）。
KL 处理方式也可能不同：论文里也讨论了与 PPO 场景下 KL penalty 不同的正则化思路（读 4.1 小节时对照实现会更清晰）。

3.1. 训练数据形态：同一个 prompt 采样一组

把单条 RLHF 样本写成 $(q, o)$ ：

$q$ ：问题 / prompt
$o$ ：一次完整的输出序列（completion），长度为 $T$

GRPO 的一个关键设定是：对同一个 $q$ ，从旧策略采样 $G$ 条输出：

o_1,\dots,o_G \sim \pi_{\theta_{old}}(\cdot|q)

然后对每条输出打分得到标量奖励（常见是 sequence-level）：

r_i \triangleq R(q, o_i),\quad i=1,\dots,G

在数学推理/可验证任务里， $R$ 往往是 rule-based（对/错、部分分、格式约束等）或 “RM + verifier + 规则” 的组合；它通常更像“末端一次性”信号，而不是密集的 token-level reward。

3.2. 组内相对基线：用同题均值/标准差构造 advantage

在 PPO 里我们常用 critic 给 baseline： $\hat{A}_t \approx Q(s_t,a_t)-V(s_t)$ 。但对 LLM 这种“奖励末端给、价值却要逐 token 监督”的设置，value 训练既贵又容易引入偏差（尤其当 reward 很稀疏）。

GRPO 的做法是：不训练 $V$ ，直接在同一个 $q$ 的组内做相对标准化。最直观的一种（也最常见的工程落地）是：

\mu_q = \frac{1}{G}\sum_{j=1}^G r_j,\qquad \sigma_q = \sqrt{\frac{1}{G}\sum_{j=1}^G (r_j-\mu_q)^2} + \varepsilon

A_i \triangleq \frac{r_i - \mu_q}{\sigma_q}

解释一下这个 $A_i$ ：

它是“相对优势”：同一道题里，高于组均值的输出会得到正优势（鼓励），低于均值的得到负优势（抑制）。
它天然做了尺度归一：不同题的 reward 尺度可能不同（0/1、0/100、log score…），用组内标准差能把不同样本的梯度量级拉到同一量纲。

This is analogous to the advantage normalization described earlier, except that the normalization is done within the group for one prompt rather than across the whole batch.

一个常见的细节：虽然 $r_i$ 是序列级分数，但优化是 token 级 logprob。工程上通常直接把 同一个 $A_i$ 广播到这条序列的每个 token，等价于“这条输出整体好/坏，整条序列的 token 都一起被上调/下调概率”，写成：

A_{i,t} \equiv A_i,\quad t=1,\dots,T_i
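The group-relative advantage is small enough to write out directly. This sketch follows the mean and standard deviation formulas above, with $\varepsilon$ added to the standard deviation; the function name is illustrative.

```python
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """A_i = (r_i - mean) / (std + eps), computed within one prompt's sampled group."""
    g = len(rewards)
    mean = sum(rewards) / g
    std = (sum((r - mean) ** 2 for r in rewards) / g) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    # Four completions for the same question, scored 0/1 by a verifier.
    print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
    # A group where every completion is correct carries no learning signal.
    print(group_relative_advantages([1.0, 1.0, 1.0]))
```

Each $A_i$ is then broadcast to every token of completion $o_i$ . When all rewards in a group are equal, every advantage is zero and the group contributes no policy gradient.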

3.3. 还是 PPO：ratio + clip 的策略更新目标

对每条输出 $o_i=(y_{i,1},\dots,y_{i,T_i})$ ，我们可以写出 token 级的 ratio（沿用 PPO 的重要性采样比）：

r_{i,t}(\theta) \triangleq \frac{\pi_\theta(y_{i,t}\mid q, y_{i,<t})}{\pi_{\theta_{old}}(y_{i,t}\mid q, y_{i,<t})}

那么 GRPO 的“PPO 样式” clipped objective 可以写成（对所有样本、所有 token 求平均）：

L^{GRPO}(\theta)= \hat{\mathbb{E}}_{i,t}\left[ \min\Big( r_{i,t}(\theta) A_i,\; \text{clip}(r_{i,t}(\theta),1-\epsilon,1+\epsilon)A_i \Big) \right]

It has exactly the same shape as the $L^{CLIP}$ in section 1.2.2, except that:

PPO 的 $\hat{A}_t$ 来自 GAE + critic（或至少来自某种时间分解）
GRPO 的 $A_i$ 来自 同题 group 的相对基线（没有 critic）

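3.4. KL regularization: a reference model can still serve as a long-term guardrail

GRPO does not rule out the "reference model + KL" guardrail. The usual options remain: add KL to the reward as shaping, or add it directly to the objective.

Reusing the token-level logprob difference from earlier, for each token: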

r^{KL}_{i,t} \triangleq -\beta\Big(\log \pi_\theta(y_{i,t}\mid q,y_{i,<t})-\log \pi_{ref}(y_{i,t}\mid q,y_{i,<t})\Big)

Then write the sequence's effective reward as:

r_i' = r_i + \sum_{t=1}^{T_i} r^{KL}_{i,t}

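Then compute the within-group $\mu_q,\sigma_q,A_i$ from $r_i'$ (or treat KL as a separate penalty term; both appear in practice). Intuitively, the group-relative baseline gives a stable update direction without a critic, and KL keeps the policy from drifting too far.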
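A minimal sketch of the reward-shaping variant, assuming the per-token log-probs of the policy and the reference model were recorded for the sampled output. The function name and the value of `beta` are illustrative assumptions.

```python
import numpy as np

def kl_shaped_reward(task_reward, logp_policy, logp_ref, beta=0.05):
    """r_i' = r_i + sum_t r^KL_{i,t}, with r^KL = -beta * (logp_policy - logp_ref)."""
    logp_policy = np.asarray(logp_policy, dtype=np.float64)
    logp_ref = np.asarray(logp_ref, dtype=np.float64)
    per_token_kl_reward = -beta * (logp_policy - logp_ref)
    return task_reward + per_token_kl_reward.sum()

# A correct answer (reward 1.0) whose two tokens are each 0.5 nats more
# likely under the policy than under the reference model.
shaped = kl_shaped_reward(1.0, [-1.0, -2.0], [-1.5, -2.5], beta=0.1)
print(shaped)  # task reward minus beta times the summed log-ratio
```

Because the penalty is summed over tokens, longer outputs can accumulate a larger penalty, which is one reason some implementations prefer a separate per-token KL term in the loss.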

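3.5. Comparison with PPO

Putting GRPO and PPO-RLHF into one mental model:

Similarity (both are PPO-like): both use ratio + clip, so both allow multiple epochs of updates on the same batch of rollouts.
Difference 1, baseline source:
- PPO: $V(s)$ (the critic) provides the baseline, and the advantage has temporal structure (GAE)
- GRPO: the same-question group's mean / standard deviation provides the baseline, and the advantage behaves more like a same-question ranking / comparison signal
Difference 2, resource vs. stability trade-off:
- Dropping the critic usually saves memory and compute and makes the implementation more direct
- But it leans more heavily on the group size $G$, how well the reward discriminates between outputs, and the strength of the KL / clip guardrails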

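3.6. Implementation notes

$G$ matters: if $G$ is too small, $\mu_q,\sigma_q$ are noisy; if it is too large, rollout cost goes up. In practice $G$ is a knob that trades sampling for stability.
Guarding $\sigma_q$: when all rewards in a group are equal (all wrong or all right), $r_i-\mu_q=0$ and the standard deviation is 0, so without $\varepsilon$ the advantage is 0/0. With $\varepsilon$ the advantage is exactly 0 and the group contributes no policy-gradient signal.
Length bias: if the same $A_i$ is broadcast to tokens and then averaged over tokens, long sequences contribute more terms. A common mitigation: average over tokens within each sequence first, then over samples (or normalize each sample by its length).
Cache the old policy's logprobs: as in PPO, the rollout must store $\log \pi_{\theta_{old}}(y_{i,t}|q,y_{i,<t})$ so that the optimization phase can compute the ratio.
Compute the baseline within the group: the unit for the baseline is the group for one $q$; do not mix different prompts when computing the mean / standard deviation, or the relative comparison loses its meaning.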
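The length-bias point can be checked with a toy example. The sketch below compares the two aggregation orders on one short and one long sequence; the function names and the per-token values are made up for illustration.

```python
import numpy as np

def token_mean(per_token_terms):
    """Average over every token in the batch: long sequences get more weight."""
    return np.concatenate(per_token_terms).mean()

def sequence_mean(per_token_terms):
    """Average within each sequence first, then across sequences."""
    return np.mean([seq.mean() for seq in per_token_terms])

short_good = np.full(2, 1.0)   # 2 tokens, each contributing +1
long_bad = np.full(8, -1.0)    # 8 tokens, each contributing -1

print(token_mean([short_good, long_bad]))     # dominated by the long sequence
print(sequence_mean([short_good, long_bad]))  # both sequences weigh the same
```

Neither choice is neutral: sequence-level averaging shrinks the per-token weight in long outputs, so the right aggregation depends on what behavior the training run should favor.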

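3.7. Recommended reading

Shao et al., 2024. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models. (introduces GRPO and its motivation)
If you care about the lineage of critic-free PPO variants and relative advantages, GRPO can be read as "contrastive preference signal + KL guardrail + clipped update": it treats comparison / ranking as the source of the advantage instead of explicitly learning $V$.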

Ray Job Init Containers in KubeRay: Lifecycle Nuances

Posted 2026-03-24 · Updated 2026-03-31 · Category: Notes · Word count: 1.2k · Reading time ≈ 1 min

In KubeRay, a Ray Job is hosted and managed by a Ray Cluster, and the hard part is usually how to align the lifecycle of the Ray Job with that of the Ray Cluster.

[Figure: rayjob-raycluster]

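Start with a generic Job: init containers are commonly used to pull data, deliver configuration, and so on; if init fails, the whole Job fails, which is intuitive.

A Ray Job is different: it is not itself the set of Pods that actually run in the cluster. The resource entity is the Ray Cluster, so the Ray Job's init container also lands on the Cluster side. As a result, Job-side init semantics get tied to the Cluster's init / bootstrap semantics.

From the Cluster's point of view: retrying a failed init until bootstrap succeeds often makes sense, because the cluster has to come up first.
From the Job's point of view: a Job is a one-off task, so the more reasonable expectation on init failure is to fail fast rather than keep retrying along with the Cluster.

It was only while using an LLM to break down the issue, sort through the scenarios, and verify each one against the code and production behavior that this became clear: Ray Job lifecycle management for init containers is hard to implement with one simple, direct rule.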

1h vibe issue, 8h vibe coding
70M tokens cost

https://github.com/ray-project/kuberay/issues/4637

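Two typical modes:

New Ray Cluster: the Ray Job's init container takes effect. On this path it is effectively an init container running on the Ray Cluster, on the same chain as the Cluster's bootstrap.
Existing Ray Cluster: the Ray Job's init container does not take effect. The Job only consumes the existing Cluster, and no separate init round runs for this Job.

The Cluster's own lifecycle also needs to be looked at separately:

Cluster created by the Job: the Ray Job's delete rule decides whether to delete the Cluster / Workers etc. after the Job ends; by default the Cluster is kept.
Reusing an existing Cluster: the end of the Ray Job does not change the Cluster's lifecycle (the Cluster may keep serving other workloads or be managed elsewhere).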
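The two modes and their lifecycle consequences can be written down as a tiny decision table. This is only a mental model of the behavior described above, written in plain Python; `JobLifecycle` and `lifecycle_for` are hypothetical names and do not correspond to any KubeRay API or CRD field.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class JobLifecycle:
    init_container_runs: bool          # does the Ray Job's init container take effect?
    job_end_may_delete_cluster: bool   # can the Job's delete rule remove the Cluster?

def lifecycle_for(uses_existing_cluster: bool) -> JobLifecycle:
    """Summarize the behavior described in the notes (hypothetical helper)."""
    if uses_existing_cluster:
        # The Job only consumes the Cluster: no init round, no lifecycle change.
        return JobLifecycle(init_container_runs=False,
                            job_end_may_delete_cluster=False)
    # The Job creates its own Cluster: init runs as part of Cluster bootstrap,
    # and the delete rule decides what happens after the Job ends.
    return JobLifecycle(init_container_runs=True,
                        job_end_may_delete_cluster=True)

print(lifecycle_for(uses_existing_cluster=False))
print(lifecycle_for(uses_existing_cluster=True))
```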

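Back to a failing custom init in a Ray Job: this happens on the dedicated Cluster that the Job created and is bound to. Keeping that Cluster around indefinitely after the Job ends is semantically awkward. The more reasonable pattern is usually to keep the scene briefly (to collect logs and diagnose the problem) and then reclaim the temporary Cluster automatically. This is essentially a short-lived "one Job + one temporary Cluster" deployment, a completely different mental model from a long-lived shared Cluster.

Note, however, that Ray Job does not currently support this: after a custom init fails, the Job stays in the initializing state, and the system does not yet "keep the scene briefly, then reclaim the Cluster" on its own. There is room for more vibe issue, followed by vibe coding based on the issue.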

rl infra

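Posted 2025-10-12 · Updated 2026-03-24 · Category: Notes · Word count: 302 · Reading time ≈ 1 min

Not yet organized systematically; a rough take for now: the overall architecture is maturing and deployments are generally decoupled. Beyond separating training from inference, there is now layering between Agent applications and the training/inference platform, training/inference APIs (such as the Tinker API), and training/inference frameworks.

In this setup, two things are commonly done in parallel. First, use OpenTelemetry (mainly spans) for standardized tracing, capturing model / Agent behavior trajectories and feeding them back into the RL training loop. Second, use an LLM Proxy to unify the model API that the Agent side calls; during training it forwards requests to the model currently being updated, which serves as the inference side of RL, so that the application's call path and the training-time model serving path do not diverge.

As for trajectory logging and feeding data back into training, it is intuitively not fundamentally different from the older search / recommendation pipeline: online logging and instrumentation → data feedback → offline experiments and training.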

clusterd

Posted 2025-06-01 · Updated 2026-03-19 · Word count: 907 · Reading time ≈ 1 min

https://www.hiascend.com/document/detail/zh/mindcluster/70rc1/clustersched/dlug/mxdlug_007.html

The ConfigMaps fall into the following types:

cmDevice: ns, kube-system; cmName, mindx-dl-deviceinfo-{NodeName}; reported by device-plugin
cmNode: ns, mindx-dl; cmName, mindx-dl-nodeinfo-{NodeName}; reported by nodeD
cmPingMesh: ns, cluster-system; cmName, pingmesh-config;
cmSuperPodDevice: ns, cluster-system; cmName, super-pod-{SuperPodId}; maintained by clusterD
- in particular, {RAS_NET_ROOT_PATH}/cluster/super-pod-{SuperPodId}/super-pod-{SuperPodId}.json; maintained by clusterD
cmPublicFault: mc-consumer-publicfault=true label;
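For quick lookups, the naming rules in the list above can be captured in a few helpers. This is a convenience sketch built only from the namespaces and name patterns listed here; the function names are hypothetical and are not part of clusterD or any MindCluster component.

```python
def device_info_cm(node_name: str) -> tuple[str, str]:
    """(namespace, name) of the per-node ConfigMap reported by device-plugin."""
    return ("kube-system", f"mindx-dl-deviceinfo-{node_name}")

def node_info_cm(node_name: str) -> tuple[str, str]:
    """(namespace, name) of the per-node ConfigMap reported by nodeD."""
    return ("mindx-dl", f"mindx-dl-nodeinfo-{node_name}")

def super_pod_cm(super_pod_id: str) -> tuple[str, str]:
    """(namespace, name) of the per-super-pod ConfigMap maintained by clusterD."""
    return ("cluster-system", f"super-pod-{super_pod_id}")

# The pingmesh configuration is a single, fixed ConfigMap.
PINGMESH_CM = ("cluster-system", "pingmesh-config")

# "worker-1" and "0" are made-up identifiers for illustration.
print(device_info_cm("worker-1"))
print(node_info_cm("worker-1"))
print(super_pod_cm("0"))
```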

The cmDevice ConfigMap mindx-dl-deviceinfo-{NodeName} is reported by device-plugin and contains:

DeviceInfoCfg
SwitchInfoCfg

The cmPublicFault ConfigMap contains:

PublicFault

pingmesh-config holds either the global pingmesh task configuration or the task configuration for a specific super pod ID:

{
    "activate": "on",
    "task_interval": 5
}

The node annotations include:

product-serial-number
superPodID
baseDeviceInfos
serverType
serverIndex

transparent huge page

Posted 2024-06-30 · Updated 2026-03-19 · Word count: 567 · Reading time ≈ 1 min

THP

https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/

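Larger pages mean fewer TLB entries are needed to cover the same memory, so there are fewer TLB misses; since the page table walk on a TLB miss is expensive, this is an optimization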
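The effect of page size on TLB coverage is simple arithmetic: the memory a TLB can map without a miss is the number of entries times the page size. The sketch below uses an assumed entry count of 1024 purely for illustration; real TLB sizes vary by CPU and by level.

```python
KiB = 1024
MiB = 1024 * KiB
GiB = 1024 * MiB

def tlb_reach_bytes(entries: int, page_size: int) -> int:
    """Memory covered by a fully populated TLB."""
    return entries * page_size

ENTRIES = 1024  # assumed value, not a figure for any specific CPU

reach_4k = tlb_reach_bytes(ENTRIES, 4 * KiB)   # regular 4 KiB pages
reach_2m = tlb_reach_bytes(ENTRIES, 2 * MiB)   # 2 MiB huge pages

print(reach_4k // MiB, "MiB with 4 KiB pages")
print(reach_2m // GiB, "GiB with 2 MiB pages")
```

With the same number of entries, 2 MiB pages cover 512 times more memory than 4 KiB pages, which is where the reduction in TLB misses comes from.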

With THP the OS has to find physically contiguous memory for each huge page; if it cannot, it starts to compact, reclaim or page out other pages;

That process is expensive and could cause latency spikes (up to seconds)

cat /proc/buddyinfo

Each column represents the number of pages of a certain order which are
available. In this case, there are 0 chunks of 2^0*PAGE_SIZE available in
ZONE_DMA, 4 chunks of 2^1*PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4*PAGE_SIZE
available in ZONE_NORMAL, etc...

https://andorian.blogspot.com/2014/03/making-sense-of-procbuddyinfo.html
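A short parser makes the column layout of `/proc/buddyinfo` easier to reason about. The sketch below works on an illustrative sample string rather than a live system, and it assumes 4 KiB base pages with 2 MiB huge pages, so a huge page needs a free chunk of order 9 or higher.

```python
SAMPLE = """\
Node 0, zone      DMA      0      4      5      4      4      3      1      0      0      0      0
Node 0, zone   Normal      1      0      0      1    101      8      2      1      0      3      1
"""

def parse_buddyinfo(text):
    """Map (node, zone) to a list where index k is the free-chunk count of order k."""
    zones = {}
    for line in text.strip().splitlines():
        parts = line.split()
        node = int(parts[1].rstrip(","))   # "0," -> 0
        zone = parts[3]
        zones[(node, zone)] = [int(x) for x in parts[4:]]
    return zones

def free_chunks_at_or_above(counts, order):
    """Number of free chunks large enough to satisfy an allocation of this order."""
    return sum(counts[order:])

HUGE_PAGE_ORDER = 9  # 2^9 * 4 KiB = 2 MiB (assumes 4 KiB base pages)

zones = parse_buddyinfo(SAMPLE)
for key, counts in zones.items():
    print(key, free_chunks_at_or_above(counts, HUGE_PAGE_ORDER))
```

A zone with many low-order chunks but none at order 9 or above is fragmented from THP's point of view, which is the situation that triggers compaction.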

model

Posted 2024-02-24 · Updated 2026-04-19 · Word count: 1.6k · Reading time ≈ 1 min

https://wangcong.net/article/FPandBP.html

pathways

https://blog.research.google/2022/04/pathways-language-model-palm-scaling-to.html

a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators.

few-shot

TPU v4 Pods

Pipelining is typically used with DCN

word to vector, This vector represents the word’s meaning and context within the given language

embedding layer, lookup table

Positional encoding

https://medium.com/@tech-gumptions/transformer-architecture-simplified-3fb501d461c8

This means that the output of a layer is added to the initial input, allowing the model to learn to only make small changes to the input

The decoder’s job is to produce the English sentence based on both the original French sentence and the bits of the English sentence it has generated so far.

Input Embedding: Just as with the Encoder, the input to the Decoder (which is the target sequence during training) is first embedded into continuous vectors.

Note that the look-ahead mask matters most during training, when the whole target sequence is fed in at once. During inference, tokens are generated one at a time, so future words do not exist yet and each position still attends only to the words before it.
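A minimal NumPy sketch of the decoder's look-ahead (causal) mask: position t may attend only to positions up to and including t. The function names are illustrative, and the all-zero scores stand in for real query-key products.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    """Boolean lower-triangular mask: True where attention is allowed."""
    return np.tril(np.ones((n, n), dtype=bool))

def masked_softmax(scores: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Softmax over the last axis with disallowed positions set to -inf."""
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)

n = 4
weights = masked_softmax(np.zeros((n, n)), causal_mask(n))
print(weights)  # row t spreads its weight evenly over positions 0..t
```

Each row still sums to 1, and every entry above the diagonal is exactly 0, so no information flows from later positions to earlier ones.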

To summarize, the Decoder in the Transformer architecture processes its input through self-attention, cross-attention with the Encoder’s output, and position-wise Feed-Forward networks, repeatedly for each stacked block, culminating in a final output sequence after the softmax operation.

https://jalammar.github.io/illustrated-transformer/

https://nlp.seas.harvard.edu/2018/04/03/attention.html

https://jalammar.github.io/illustrated-gpt2/

PPO / GRPO / importance sampling / rejection sampling

How Megatron saves checkpoints