In KubeRay, a Ray Job is hosted and managed by a Ray Cluster. The real difficulty is usually how to align the lifecycle of the Ray Job with that of the Ray Cluster.

(figure: rayjob-raycluster)

First consider a generic Kubernetes Job: init containers are commonly used to pull data, push configuration, and so on; if an init container fails, the whole Job fails. That part is intuitive.

A Ray Job is different: it is not "the set of Pods actually running in the cluster"; the resource entity is the Ray Cluster, and the so-called Ray Job init containers also live on the Cluster side. As a result, the Job-side init semantics get tied to the Cluster's init/bootstrap semantics:

  • From the Cluster's perspective: retrying a failed init until bootstrap succeeds often makes sense; the cluster has to come up first.
  • From the Job's perspective: a Job is a one-shot task, so the more reasonable expectation on init failure is to fail fast, not to keep retrying along with the Cluster.

It was only while using an LLM to break down the issue, map out the scenarios, and verify each one against the code and production behavior that I gradually understood: the lifecycle management of Ray Job init containers is hard to capture with one simple, one-shot rule.

1h vibe issue, 8h vibe coding
70M tokens cost

https://github.com/ray-project/kuberay/issues/4637

Two typical modes:

  • Creating a new Ray Cluster: the Ray Job's init containers take effect; on this path they are effectively init containers running on the Ray Cluster, on the same chain as the Cluster's bootstrap.
  • Using an existing Ray Cluster: the Ray Job's init containers do not take effect; the Job only consumes the existing Cluster and does not run another round of init for this Job.

The Cluster's own lifecycle also needs to be looked at separately:

  • A Cluster created by the Job: the Ray Job's delete rule decides whether the Cluster / Workers are deleted after the Job finishes; the default is to keep the Cluster.
  • Reusing an existing Cluster: the Ray Job finishing does not change the Cluster's lifecycle (the Cluster may keep serving other tasks or be managed elsewhere).

Back to the failure of a Ray Job's custom init: this happens on the dedicated Cluster the Job created and bound to. If the Cluster is kept around indefinitely after the Job ends, the semantics are indeed awkward; a more reasonable pattern is usually to keep the scene for a short while (to collect logs and diagnose the problem), then automatically reclaim the temporary Cluster. This is essentially a short-lived "one Job + one temporary Cluster" deployment style, a completely different mental model from a long-lived shared Cluster.

Note, though, that Ray Job does not currently support this: after a custom init failure, the system does not yet automatically "keep the scene briefly, then reclaim the Cluster". One can keep vibing on the issue, then vibe-code on top of it.

Not yet systematically organized; just a rough judgment for now: the overall architecture is maturing, and deployment forms are generally decoupling. It's not just training/inference separation; there is also a layering into Agent applications, training/inference platforms, training/inference APIs (such as the Tinker API), and training/inference frameworks.

In this form, two things are commonly done in parallel. One is standardized tracing with OpenTelemetry (mainly spans) to capture model / Agent behavior trajectories and feed them back into the RL training loop. The other is a unified LLM Proxy for the model APIs the Agent uses: at training time it forwards requests to the model currently being updated, which then serves as the inference side of RL, avoiding a mismatch between the application-side call path and the training-time model-serving path.

On trajectory recording and training feedback: intuitively this is not fundamentally different from the old search/recommendation playbook: online logging and instrumentation → data flowing back → offline experiments and training.

https://www.hiascend.com/document/detail/zh/mindcluster/70rc1/clustersched/dlug/mxdlug_007.html

There are several kinds of ConfigMaps:

  • cmDevice: ns kube-system; cmName mindx-dl-deviceinfo-{NodeName}; reported by device-plugin
  • cmNode: ns mindx-dl; cmName mindx-dl-nodeinfo-{NodeName}; reported by nodeD
  • cmPingMesh: ns cluster-system; cmName pingmesh-config
  • cmSuperPodDevice: ns cluster-system; cmName super-pod-{SuperPodId}; maintained by clusterD
    • In particular, {RAS_NET_ROOT_PATH}/cluster/super-pod-{SuperPodId}/super-pod-{SuperPodId}.json is also maintained by clusterD
  • cmPubicFault: selected by the mc-consumer-publicfault=true label

The cmDevice ConfigMap mindx-dl-deviceinfo-{NodeName}, reported by device-plugin, contains the following:

  • DeviceInfoCfg
  • SwitchInfoCfg

The cmPubicFault ConfigMap contains the following:

  • PublicFault

pingmesh-config holds either the configuration for the global pingmesh task or a task configuration scoped to a specific superpodid, e.g.:

{
  "activate": "on",
  "task_interval": 5
}

The node annotations include the following:

  • product-serial-number
  • superPodID
  • baseDeviceInfos
  • serverType
  • serverIndex

THP

https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/

Larger pages mean each TLB entry covers more memory, so fewer TLB misses; since a page-table walk on a TLB miss is expensive, this is an optimization.

THP makes the OS allocate large contiguous regions of memory; when no such region is available, the OS starts to compact, reclaim, or page out other pages.

That process is expensive and could cause latency spikes (up to seconds)
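The current THP mode is exposed in /sys/kernel/mm/transparent_hugepage/enabled as a line like `always [madvise] never`, with the active mode in brackets. A minimal sketch of reading it (the sample string is hard-coded since the file's content is host-dependent):

```python
def current_thp_mode(enabled_line: str):
    # The active mode is the bracketed token, e.g. "always [madvise] never" -> "madvise".
    for token in enabled_line.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return None

# Sample content of /sys/kernel/mm/transparent_hugepage/enabled:
print(current_thp_mode("always [madvise] never"))  # -> madvise
```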

cat /proc/buddyinfo

Each column represents the number of free chunks of a certain order that are available. In this case, there are 0 chunks of 2^0 * PAGE_SIZE available in ZONE_DMA, 4 chunks of 2^1 * PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4 * PAGE_SIZE available in ZONE_NORMAL, etc…
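A quick sketch of turning one /proc/buddyinfo line into free bytes for a zone (the k-th count is the number of free blocks of 2^k pages; a 4 KiB page size is assumed and the sample line is made up):

```python
def parse_buddyinfo(line: str, page_size: int = 4096):
    # Line format: "Node 0, zone   Normal  <order-0 count> <order-1 count> ..."
    # The k-th count is the number of free blocks of 2**k * page_size bytes.
    parts = line.split()
    zone = parts[3]
    counts = [int(c) for c in parts[4:]]
    free_bytes = sum(c * (2 ** k) * page_size for k, c in enumerate(counts))
    return zone, free_bytes

# Made-up sample: 1 order-0 block and 2 order-3 blocks free in ZONE_NORMAL.
zone, free = parse_buddyinfo("Node 0, zone Normal 1 0 0 2")
print(zone, free)  # -> Normal 69632
```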

https://andorian.blogspot.com/2014/03/making-sense-of-procbuddyinfo.html

https://wangcong.net/article/FPandBP.html

pathways

https://blog.research.google/2022/04/pathways-language-model-palm-scaling-to.html

a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators.

few-shot

TPU v4 Pods

Pipelining is typically used with DCN

word to vector: the vector represents the word's meaning and context within the given language

embedding layer: essentially a lookup table
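A minimal sketch of "embedding layer as lookup table" (toy vocabulary and dimensions, pure Python):

```python
import random

random.seed(0)
vocab = {"the": 0, "cat": 1, "sat": 2}
d_model = 4
# The embedding layer is just a (trainable) matrix: one row vector per token id.
table = [[random.gauss(0.0, 0.02) for _ in range(d_model)] for _ in vocab]

def embed(token_ids):
    # Lookup, not matrix multiplication: index directly into the table.
    return [table[t] for t in token_ids]

vectors = embed([vocab["cat"], vocab["sat"]])
```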

Positional encoding

https://medium.com/@tech-gumptions/transformer-architecture-simplified-3fb501d461c8
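The usual sinusoidal positional encoding from the Transformer paper, sketched in a few lines (pure Python, even d_model assumed):

```python
import math

def positional_encoding(seq_len: int, d_model: int):
    # PE[pos, 2i]   = sin(pos / 10000^(2i/d_model))
    # PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))
    pe = [[0.0] * d_model for _ in range(seq_len)]
    for pos in range(seq_len):
        for i in range(0, d_model, 2):
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(4, 8)
```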

This means that the output of a layer is added to the initial input, allowing the model to learn to only make small changes to the input
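The residual connection described above, as a toy sketch (the sublayer here is a made-up stand-in for attention / FFN):

```python
def sublayer(x):
    # Hypothetical sublayer: a fixed scaling, standing in for attention or FFN.
    return [0.1 * v for v in x]

def residual_block(x):
    # output = sublayer(x) + x, so the sublayer only needs to learn a small delta.
    return [a + b for a, b in zip(sublayer(x), x)]

print(residual_block([1.0, 2.0]))
```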

The decoder’s job is to produce the English sentence based on both the original French sentence and the bits of the English sentence it has generated so far.

Input Embedding: Just as with the Encoder, the input to the Decoder (which is the target sequence during training) is first embedded into continuous vectors.

It’s important to note that this masking matters during training, when the full target sequence is available; during autoregressive inference the future tokens have not been generated yet, so there is nothing for the mask to hide.

To summarize, the Decoder in the Transformer architecture processes its input through self-attention, cross-attention with the Encoder’s output, and position-wise Feed-Forward networks, repeatedly for each stacked block, culminating in a final output sequence after the softmax operation.
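The decoder's self-attention masking above is just a lower-triangular "may position i see position j" matrix; a minimal sketch:

```python
def causal_mask(seq_len: int):
    # mask[i][j] is True when position i may attend to position j (only j <= i).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

for row in causal_mask(3):
    print(row)
# -> [True, False, False]
#    [True, True, False]
#    [True, True, True]
```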

https://github.com/NVIDIA/go-nvml

The nvml.h file is a direct copy of nvml.h from the NVIDIA driver. Since the NVML API is guaranteed to be backwards compatible, we should strive to keep this always up to date with the latest.

https://github.com/xlab/c-for-go.git

golang cgo

https://www.rectcircle.cn/posts/go-static-compile-and-cgo

https://chai2010.cn/advanced-go-programming-book/ch2-cgo/ch2-05-internal.html

PoC env: Windows 11 + WSL2 Ubuntu 18.04

~/projects/go-nvml ❯ nvidia-smi
Sun Dec 17 18:57:57 2023
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 545.29.04 Driver Version: 546.17 CUDA Version: 12.3 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 NVIDIA GeForce RTX 3070 Ti On | 00000000:06:00.0 On | N/A |
| 0% 33C P8 13W / 290W | 1139MiB / 8192MiB | 1% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+

+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 23 G /Xwayland N/A |
+---------------------------------------------------------------------------------------+

test code

package main

import (
	"fmt"
	"log"
	"os"

	"github.com/NVIDIA/go-nvml/pkg/nvml"
)

func getNvidiaDeviceCount() {
	ret := nvml.Init()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to initialize NVML: %v", nvml.ErrorString(ret))
	}
	count, ret := nvml.DeviceGetCount()
	if ret != nvml.SUCCESS {
		log.Fatalf("Unable to get device count: %v", nvml.ErrorString(ret))
	}
	fmt.Printf("%d\n", count)
}

func main() {
	args := os.Args
	if len(args) < 2 {
		fmt.Println("hello")
	} else {
		getNvidiaDeviceCount()
	}
}

build commands

export CGO_LDFLAGS="-Wl,-z,now"

go build main.go
./main
./main: symbol lookup error: ./main: undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV

./main fake
./main: symbol lookup error: ./main: undefined symbol: nvmlGpuInstanceGetComputeInstanceProfileInfoV

# now switch to lazy binding
export CGO_LDFLAGS="-Wl,-z,lazy"
go build main.go
./main
hello

./main fake
1

go clean --cache && rm -rf main
go build -work -x main.go

go build -x

cd /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml
TERM='dumb' CGO_LDFLAGS='"-Wl,-z,lazy" "-Wl,--unresolved-symbols=ignore-in-object-files" "-Wl,--unresolved-symbols=ignore-in-object-files"' /root/tools/go/pkg/tool/linux_amd64/cgo -objdir $WORK/b002/ -importpath github.com/NVIDIA/go-nvml/pkg/nvml -- -I $WORK/b002/ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 ./cgo_helpers.go ./const.go ./init.go ./nvml.go
cd $WORK/b002
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -I ./ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -o ./_x001.o -c _cgo_export.c
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -I ./ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -o ./_x002.o -c cgo_helpers.cgo2.c
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -I ./ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -o ./_x003.o -c const.cgo2.c
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -I ./ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -o ./_x004.o -c init.cgo2.c
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -I ./ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -o ./_x005.o -c nvml.cgo2.c
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -I ./ -g -O2 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -DNVML_NO_UNVERSIONED_FUNC_DEFS=1 -o ./_cgo_main.o -c _cgo_main.c
cd /root/projects/go-nvml
TERM='dumb' gcc -I /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml -fPIC -m64 -pthread -fmessage-length=0 -fdebug-prefix-map=$WORK/b002=/tmp/go-build -gno-record-gcc-switches -o $WORK/b002/_cgo_.o $WORK/b002/_cgo_main.o $WORK/b002/_x001.o $WORK/b002/_x002.o $WORK/b002/_x003.o $WORK/b002/_x004.o $WORK/b002/_x005.o -Wl,-z,lazy -Wl,--unresolved-symbols=ignore-in-object-files -Wl,--unresolved-symbols=ignore-in-object-files
TERM='dumb' /root/tools/go/pkg/tool/linux_amd64/cgo -dynpackage nvml -dynimport $WORK/b002/_cgo_.o -dynout $WORK/b002/_cgo_import.go
  1. /tmp/go-build2475505462/b002/nvml.cgo1.go
  2. /tmp/go-build2475505462/b002/nvml.cgo2.c
CGO_NO_SANITIZE_THREAD
void
_cgo_c813f6172e91_Cfunc_nvmlGpuInstanceGetComputeInstanceProfileInfoV(void *v)
{
	struct {
		nvmlGpuInstance_t p0;
		unsigned int p1;
		unsigned int p2;
		nvmlComputeInstanceProfileInfo_v2_t* p3;
		nvmlReturn_t r;
		char __pad28[4];
	} __attribute__((__packed__, __gcc_struct__)) *_cgo_a = v;
	char *_cgo_stktop = _cgo_topofstack();
	__typeof__(_cgo_a->r) _cgo_r;
	_cgo_tsan_acquire();
	_cgo_r = nvmlGpuInstanceGetComputeInstanceProfileInfoV(_cgo_a->p0, _cgo_a->p1, _cgo_a->p2, _cgo_a->p3);
	_cgo_tsan_release();
	_cgo_a = (void*)((char*)_cgo_a + (_cgo_topofstack() - _cgo_stktop));
	_cgo_a->r = _cgo_r;
	_cgo_msan_write(&_cgo_a->r, sizeof(_cgo_a->r));
}

ChatGPT 3.5

-Wl,-z,lazy, -Wl,-z,now

-Wl,-z,lazy: The -Wl,-z,lazy flag in the gcc command is a linker option used to instruct the linker to utilize lazy binding for dynamic libraries during the linking process.
When a program uses shared libraries (dynamic libraries), such as .so files in Linux, the linking process involves resolving symbols (functions or global variables) from these libraries. Lazy binding delays the resolution of these symbols until they are actually referenced during the program’s execution, rather than resolving all symbols at startup.

-Wl,-z,now: When you compile a program using gcc with the -Wl,-z,now flag, it influences how the dynamic linker behaves at runtime: the binary is marked so that symbols from shared libraries are resolved and bound immediately when the program is loaded into memory, rather than lazily on first use.
During the binary’s execution, immediate binding can reduce the overhead of symbol resolution at runtime, because all symbols have already been resolved and bound at load time.
In summary, the -Wl,-z,now flag affects how symbol resolution occurs when the binary is loaded and executed, potentially impacting startup performance by pre-resolving all symbols up front.

Some recent miscellany

infra decoupled from some kind of internal production system

The focus isn't on how infra helps with automatic recovery (that is only briefly mentioned); the focus is still on tuning the training.

famous uncorrectable ECC error

we just restart the run

try to make the run stable (numerically stable)

FP16

  • Lost GPU
  • CUDA errors
  • Job hanging
  • NCCL error
  • Job slowdown
  • High DRAM correctable errors, etc.
  • blob storage issues

when we are training these models, we kind of just stare at tensorboard all day

in general, a mixture of hardware issues and training-side numerical convergence issues

~30 days of changing hyperparameters to try to get through

56 days, 53-54 restarts; OPT-175B survived 143K steps

Andrej Karpathy

LLM

LLaMA-2-70B

fp16: 2 bytes per parameter, 70B parameters

2 bytes * 70B parameters = 140 * 10^9 bytes = 140 gigabytes (decimal units: bytes, KB, MB, GB)

140GB
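The arithmetic above as a tiny helper (weights only, decimal gigabytes, matching the 140GB figure):

```python
def model_memory_gb(n_params: float, bytes_per_param: int) -> float:
    # Weights only: parameters x bytes per parameter, in decimal GB (1e9 bytes).
    return n_params * bytes_per_param / 1e9

print(model_memory_gb(70e9, 2))  # -> 140.0
```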

tokenize

encoder: converts a string into integer token ids
decoder: converts integer token ids back into a string
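A toy character-level tokenizer showing the encode/decode contract (real tokenizers such as BPE work on subwords, but the interface is the same):

```python
# Build a vocabulary from the characters of a tiny corpus.
chars = sorted(set("hello world"))
stoi = {ch: i for i, ch in enumerate(chars)}
itos = {i: ch for ch, i in stoi.items()}

def encode(text: str):
    # string -> list of integer token ids
    return [stoi[ch] for ch in text]

def decode(ids):
    # list of integer token ids -> string
    return "".join(itos[i] for i in ids)

print(decode(encode("hello")))  # -> hello
```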

Terms

  • SXM: Server PCI Express Module, a high bandwidth socket solution for connecting Nvidia Compute Accelerators to a system
  • NVL: NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub.
  • PCIe: PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe or PCI-e,[1] is a high-speed serial computer expansion bus standard

from Wikipedia

H800 vs H100

  1. https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet

  2. NVIDIA H100 Tensor Core GPU

  3. No H800 specification found on the NVIDIA website; the numbers below come from resellers and a few Bilibili uploaders.

                       H800 SXM                           H100 SXM
FP64                   1 teraFLOPS                        34 teraFLOPS
FP64 Tensor Core       1 teraFLOPS                        67 teraFLOPS
FP32                   67 teraFLOPS                       67 teraFLOPS
TF32 Tensor Core       989 teraFLOPS                      989 teraFLOPS
BFLOAT16 Tensor Core   1,979 teraFLOPS                    1,979 teraFLOPS
FP16 Tensor Core       1,979 teraFLOPS                    1,979 teraFLOPS
FP8 Tensor Core        3,958 teraFLOPS                    3,958 teraFLOPS
INT8 Tensor Core       3,958 TOPS                         3,958 TOPS
GPU memory             80GB                               80GB
GPU memory bandwidth   3.35TB/s                           3.35TB/s
Interconnect           NVLink 400GB/s, PCIe Gen5 128GB/s  NVLink 900GB/s, PCIe Gen5 128GB/s
  • FP64 compute is restricted on the H800

Driver

https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs22/data-center/h100/PB-11133-001_v01.pdf

Software Specifications

Specification    Description
Driver support   Linux: R520 or later

CUDA

https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-arch-matrix

Architecture   CUDA Capabilities   First CUDA Toolkit Support
Hopper         9.0                 CUDA 11.8, CUDA 12.0

TensorFlow

https://www.tensorflow.org/install/source#tested_build_configurations

Version             Python version   Compiler       Build tools   cuDNN   CUDA
tensorflow-2.15.0   3.9-3.11         Clang 16.0.0   Bazel 6.1.0   8.8     12.2
tensorflow-2.14.0   3.9-3.11         Clang 16.0.0   Bazel 6.1.0   8.7     11.8
tensorflow-2.13.0   3.8-3.11         Clang 16.0.0   Bazel 5.3.0   8.6     11.8
tensorflow-2.12.0   3.8-3.11         GCC 9.3.1      Bazel 5.3.0   8.6     11.8
tensorflow-2.11.0   3.7-3.10         GCC 9.3.1      Bazel 5.3.0   8.1     11.2
tensorflow-2.6.0    3.6-3.9          GCC 7.3.1      Bazel 3.7.2   8.1     11.2

candidates on H800

  • >= tensorflow-2.12.0
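The H800 is Hopper (compute capability 9.0), first supported in CUDA 11.8, so the candidate set falls out of filtering the tested-build table by CUDA version. A sketch (version/CUDA pairs copied from the table above):

```python
# (TensorFlow version, CUDA) pairs from the tested-build table above.
tested_builds = [
    ("tensorflow-2.15.0", "12.2"),
    ("tensorflow-2.14.0", "11.8"),
    ("tensorflow-2.13.0", "11.8"),
    ("tensorflow-2.12.0", "11.8"),
    ("tensorflow-2.11.0", "11.2"),
    ("tensorflow-2.6.0", "11.2"),
]

def hopper_candidates(builds, min_cuda=(11, 8)):
    # Hopper (sm_90) needs CUDA >= 11.8.
    return [v for v, cuda in builds
            if tuple(int(x) for x in cuda.split(".")) >= min_cuda]

print(hopper_candidates(tested_builds))
# -> ['tensorflow-2.15.0', 'tensorflow-2.14.0', 'tensorflow-2.13.0', 'tensorflow-2.12.0']
```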

docker images

docker pull tensorflow/tensorflow:2.14.0-gpu
docker pull tensorflow/tensorflow:2.13.0-gpu
docker pull tensorflow/tensorflow:2.12.0-gpu

PyTorch

https://pytorch.org/get-started/previous-versions/

Version   CUDA
v1.13.1   11.6, 11.7
v2.0.0    11.7, 11.8
v2.0.1    11.7, 11.8
v2.1.0    11.8, 12.1
v2.1.1    11.8, 12.1

candidates on H800

  • >= v2.0.0, with cuda 11.8 support

docker images

docker pull pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
docker pull pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime

docker pull pytorch/pytorch:2.0.1-cuda11.7-cudnn8-devel
docker pull pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime

from modelarts.session import Session
from modelarts.estimatorV2 import Estimator
from modelarts.train_params import OutputData
from modelarts.train_params import InputData

session = Session(access_key='XXX', secret_key='YYY', project_id='ZZZ', region_name='cn-north-4')

# list jobs
# job_list = Estimator.get_job_list(session=session, offset=0, limit=10, sort_by="create_time", order="desc")
# print(job_list)

# create a basic training job
estimator = Estimator(session=session,
                      job_description='This is a basic training job',
                      user_image_url="deep-learning-demo/mpi:3.0.0-cuda10.2",  # container image of the main container
                      user_command="echo hello-world",  # startup command of the main container
                      outputs=[OutputData(obs_path="obs://zs-modelarts/pytorch/model/", name="model", local_path="/model", access_method="env")],
                      log_url="obs://zs-modelarts/pytorch/log/",  # OBS path where training job logs are dumped
                      train_instance_type="modelarts.p3.large.public.free",  # public resource pool
                      train_instance_count=1)  # number of training job nodes

job_instance = estimator.fit(job_name="job-0")

# get the job id from job_instance
print(job_instance.job_id)

# view the training job log
# estimator = Estimator(session=session, job_id="2bfc13b6-782e-45ad-ae90-476dfa97591a")
# info = estimator.get_job_log()
# print(info)

# view the training job metrics
# estimator = Estimator(session=session, job_id="2bfc13b6-782e-45ad-ae90-476dfa97591a")
# info = estimator.get_job_metrics()
# print(info)

# delete the training job
# Estimator.delete_job_by_id(session=session, job_id="2bfc13b6-782e-45ad-ae90-476dfa97591a")