clusterd
https://www.hiascend.com/document/detail/zh/mindcluster/70rc1/clustersched/dlug/mxdlug_007.html
There are the following kinds of configmaps:
- cmDevice: ns kube-system; cmName mindx-dl-deviceinfo-{NodeName}; reported by device-plugin
- cmNode: ns mindx-dl; cmName mindx-dl-nodeinfo-{NodeName}; reported by nodeD
- cmPingMesh: ns cluster-system; cmName pingmesh-config
- cmSuperPodDevice: ns cluster-system; cmName super-pod-{SuperPodId}; maintained by clusterD
- in particular, {RAS_NET_ROOT_PATH}/cluster/super-pod-{SuperPodId}/super-pod-{SuperPodId}.json is also maintained by clusterD
- cmPublicFault: selected via the mc-consumer-publicfault=true label
The cmDevice configmap mindx-dl-deviceinfo-{NodeName} is reported by device-plugin and contains the following:
- DeviceInfoCfg
- SwitchInfoCfg
The cmPublicFault configmap contains the following:
- PublicFault
pingmesh-config holds either the configuration of a global pingmesh task or a task configuration for a specific superPodId
The node annotations include the following:
- product-serial-number
- superPodID
- baseDeviceInfos
- serverType
- serverIndex
transparent huge page
THP
https://alexandrnikitin.github.io/blog/transparent-hugepages-measuring-the-performance-impact/
Larger pages mean fewer TLB entries are needed to cover the same memory; since a TLB miss triggers an expensive page-table walk, THP is an optimization.
THP requires the OS to allocate contiguous physical memory; if no contiguous region is available, the OS starts to compact, reclaim, or page out other pages. That process is expensive and can cause latency spikes (up to seconds).
cat /proc/buddyinfo
Each column represents the number of pages of a certain order which are available. In this case, there are 0 chunks of 2^0 * PAGE_SIZE available in ZONE_DMA, 4 chunks of 2^1 * PAGE_SIZE in ZONE_DMA, 101 chunks of 2^4 * PAGE_SIZE available in ZONE_NORMAL, etc.
https://andorian.blogspot.com/2014/03/making-sense-of-procbuddyinfo.html
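The per-order columns can be turned into byte counts mechanically. A minimal sketch in Python (the sample line and PAGE_SIZE are illustrative assumptions, not taken from a real host):

```python
PAGE_SIZE = 4096  # typical x86-64 base page size; check with `getconf PAGESIZE`

def parse_buddyinfo(line):
    """Split one /proc/buddyinfo line into (zone, per-order free-chunk counts)."""
    _, _, rest = line.partition("zone")
    zone, *nums = rest.split()
    return zone, [int(n) for n in nums]

def free_bytes(counts):
    """Column k counts free chunks of 2^k * PAGE_SIZE bytes."""
    return sum(c * (2 ** k) * PAGE_SIZE for k, c in enumerate(counts))

# Illustrative line in the /proc/buddyinfo format
line = "Node 0, zone   Normal      4      3    101      0      0      0      0      0      0      0      0"
zone, counts = parse_buddyinfo(line)
print(zone, free_bytes(counts))
```

Shrinking counts in the high-order columns are the signal that THP allocations are about to trigger compaction.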
model
https://wangcong.net/article/FPandBP.html
pathways
https://blog.research.google/2022/04/pathways-language-model-palm-scaling-to.html
a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators.
few-shot
TPU v4 Pods
Pipelining is typically used with DCN
word to vector: this vector represents the word's meaning and context within the given language
embedding layer, lookup table
Positional encoding
https://medium.com/@tech-gumptions/transformer-architecture-simplified-3fb501d461c8
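A minimal sketch of the two steps above: the embedding layer as a plain lookup table, plus sinusoidal positional encoding (the vocabulary, dimensions, and table values here are illustrative):

```python
import math

# Toy vocabulary and embedding table: the "lookup table" is just a
# list of vectors indexed by token id.
vocab = {"je": 0, "suis": 1, "<pad>": 2}
d_model = 4
embedding_table = [[0.1 * (i + 1)] * d_model for i in range(len(vocab))]

def embed(token_ids):
    """Embedding layer = plain table lookup, one vector per token id."""
    return [embedding_table[t] for t in token_ids]

def positional_encoding(pos, d_model):
    """Sinusoidal positional encoding from the original Transformer paper."""
    return [
        math.sin(pos / 10000 ** (i / d_model)) if i % 2 == 0
        else math.cos(pos / 10000 ** ((i - 1) / d_model))
        for i in range(d_model)
    ]

# Input to the first encoder layer: embedding + positional encoding.
tokens = [vocab["je"], vocab["suis"]]
x = [
    [e + p for e, p in zip(vec, positional_encoding(pos, d_model))]
    for pos, vec in enumerate(embed(tokens))
]
```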
This means that the output of a layer is added to the initial input, allowing the model to learn to only make small changes to the input
The decoder’s job is to produce the English sentence based on both the original French sentence and the bits of the English sentence it has generated so far.
Input Embedding: Just as with the Encoder, the input to the Decoder (which is the target sequence during training) is first embedded into continuous vectors.
It’s important to note that this look-ahead masking matters during training, when the full target sequence is fed to the decoder at once; during autoregressive inference the decoder only ever sees the tokens generated so far, so there are no future words to mask.
To summarize, the Decoder in the Transformer architecture processes its input through self-attention, cross-attention with the Encoder’s output, and position-wise Feed-Forward networks, repeatedly for each stacked block, culminating in a final output sequence after the softmax operation.
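The masked self-attention step described above can be sketched as a causal (look-ahead) mask in plain Python (illustrative, not from any library):

```python
# Causal mask for decoder self-attention: position i may attend to
# positions <= i only. 1 = allowed, 0 = masked.
def causal_mask(n):
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# In attention, masked positions get score -inf before the softmax,
# so their attention weights become 0.
def apply_mask(scores, mask):
    return [
        [s if m else float("-inf") for s, m in zip(row, mrow)]
        for row, mrow in zip(scores, mask)
    ]

mask = causal_mask(3)
# row 0 can only see position 0; row 2 sees all three positions
```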
nccl tests
nvml cgo
https://github.com/NVIDIA/go-nvml
The nvml.h file is a direct copy of nvml.h from the NVIDIA driver. Since the NVML API is guaranteed to be backwards compatible, we should strive to keep this always up to date with the latest.
https://github.com/xlab/c-for-go.git
golang cgo
https://www.rectcircle.cn/posts/go-static-compile-and-cgo
https://chai2010.cn/advanced-go-programming-book/ch2-cgo/ch2-05-internal.html
PoC env: Windows 11 + WSL2 Ubuntu 18.04
```
~/projects/go-nvml ❯ nvidia-smi
```
test code
```
package main
```
build commands
```
export CGO_LDFLAGS="-Wl,-z,now"
```
go build -x
```
cd /root/go/pkg/mod/github.com/!n!v!i!d!i!a/go-nvml@v0.12.0-1/pkg/nvml
```
- /tmp/go-build2475505462/b002/nvml.cgo1.go
- /tmp/go-build2475505462/b002/nvml.cgo2.c
```
CGO_NO_SANITIZE_THREAD
```
ChatGPT 3.5
-Wl,-z,lazy, -Wl,-z,now
-Wl,-z,lazy: The -Wl,-z,lazy flag in the gcc command is a linker option used to instruct the linker to utilize lazy binding for dynamic libraries during the linking process.
When a program uses shared libraries (dynamic libraries), such as .so files in Linux, the linking process involves resolving symbols (functions or global variables) from these libraries. Lazy binding delays the resolution of these symbols until they are actually referenced during the program’s execution, rather than resolving all symbols at startup.
-Wl,-z,now: When you compile a program using gcc with the -Wl,-z,now flag, it influences how the dynamic linker behaves at runtime, particularly when the program is executed and loaded into memory. It ensures that symbols from shared libraries are resolved and bound immediately when the binary is loaded, rather than lazily on first use.
During the binary’s execution, when shared libraries are loaded, immediate binding might help in reducing the overhead associated with symbol resolution at runtime because the symbols are already resolved and bound during the linking process.
In summary, the -Wl,-z,now flag influences the behavior of the linker while creating the binary, affecting how symbol resolution occurs when the binary is loaded and executed, potentially impacting the startup performance by pre-resolving symbols.
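The same lazy-vs-immediate choice is also exposed at runtime through dlopen(). A small Python sketch (assuming a Linux host where libm is findable; the library choice is illustrative):

```python
import ctypes
import ctypes.util
import os

# RTLD_LAZY defers symbol resolution until first call;
# RTLD_NOW resolves everything at load time (the dlopen analogue of -Wl,-z,now).
libm_path = ctypes.util.find_library("m")

lazy = ctypes.CDLL(libm_path, mode=os.RTLD_LAZY)
now = ctypes.CDLL(libm_path, mode=os.RTLD_NOW)  # load fails here if any symbol cannot be resolved

for lib in (lazy, now):
    lib.sqrt.restype = ctypes.c_double
    lib.sqrt.argtypes = [ctypes.c_double]
    assert lib.sqrt(9.0) == 3.0  # both handles behave identically once loaded
```

With RTLD_NOW the cost of resolving every symbol is paid up front at load time, which is the trade-off the quoted explanation describes.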
the spark of OPT-175B
Some recent miscellany
infra is decoupled from their internal production systems
The focus is not on how the infra helps with automatic recovery (that is only briefly mentioned); the focus is on tuning the training
famous uncorrectable ECC error
we just restart the run
try to make the run stable (numerically stable)
FP16
Lost GPU
CUDA errors
Job hanging
NCCL error
Job Slowdown
High DRAM correctable errors etc.
blob storage issues
when we are training these models, we kind of just stare at tensorboard all day
in general a mixture of hardware issues and numerical convergence issues
~30 days of changing hyperparameters to try to get through
56 days, 53-54 restarts; OPT-175B survived 143K steps
Andrej Karpathy
LLM
Llama-2-70B
fp16, 2bytes, 70B
2 bytes * 70B params = 140 * 1,000,000,000 bytes = 140,000,000,000 bytes = 140 gigabytes
140GB
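The arithmetic above as a sketch:

```python
def weight_bytes(n_params, bytes_per_param):
    """Memory to hold just the weights (no activations, no KV cache)."""
    return n_params * bytes_per_param

total = weight_bytes(70 * 10**9, 2)  # 70B params in fp16 (2 bytes each)
print(total)         # 140,000,000,000 bytes
print(total / 10**9) # 140.0 decimal gigabytes
```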
tokenize
encoder: converts a string into integer token ids
decoder: converts integer token ids back into a string
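A toy character-level tokenizer showing the encode/decode pair described above (real LLM tokenizers such as BPE work on subword units, but the interface is the same; the class and corpus here are illustrative):

```python
class CharTokenizer:
    """Character-level tokenizer: encode maps str -> ids, decode maps ids -> str."""

    def __init__(self, text):
        chars = sorted(set(text))               # vocabulary = unique chars in the corpus
        self.stoi = {c: i for i, c in enumerate(chars)}
        self.itos = {i: c for i, c in enumerate(chars)}

    def encode(self, s):
        return [self.stoi[c] for c in s]

    def decode(self, ids):
        return "".join(self.itos[i] for i in ids)

tok = CharTokenizer("hello world")
ids = tok.encode("hello")
assert tok.decode(ids) == "hello"  # encode/decode round-trip
```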
H800
Terms
- SXM: Server PCI Express Module, a high bandwidth socket solution for connecting Nvidia Compute Accelerators to a system
- NVL: NVLink is a wire-based serial multi-lane near-range communications link developed by Nvidia. Unlike PCI Express, a device can consist of multiple NVLinks, and devices use mesh networking to communicate instead of a central hub.
- PCIe: PCI Express (Peripheral Component Interconnect Express), officially abbreviated as PCIe or PCI-e,[1] is a high-speed serial computer expansion bus standard
from Wikipedia
H800 vs H100
https://resources.nvidia.com/en-us-tensor-core/nvidia-tensor-core-gpu-datasheet
NVIDIA H100 Tensor Core GPU
No official NVIDIA specification for the H800 was found; the numbers below come from resellers and a few Bilibili uploaders
|  | H800 SXM | H100 SXM |
|---|---|---|
| FP64 | 1 teraFLOPS | 34 teraFLOPS |
| FP64 Tensor Core | 1 teraFLOPS | 67 teraFLOPS |
| FP32 | 67 teraFLOPS | 67 teraFLOPS |
| TF32 Tensor Core | 989 teraFLOPS | 989 teraFLOPS |
| BFLOAT16 Tensor Core | 1,979 teraFLOPS | 1,979 teraFLOPS |
| FP16 Tensor Core | 1,979 teraFLOPS | 1,979 teraFLOPS |
| FP8 Tensor Core | 3,958 teraFLOPS | 3,958 teraFLOPS |
| INT8 Tensor Core | 3,958 TOPS | 3,958 TOPS |
| GPU memory | 80GB | 80GB |
| GPU memory bandwidth | 3.35TB/s | 3.35TB/s |
| Interconnect | NVLink 400GB/s PCIe Gen5: 128GB/s | NVLink 900GB/s PCIe Gen5: 128GB/s |
- H800 FP64 compute is capped
Driver
https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper
https://www.nvidia.com/content/dam/en-zz/Solutions/gtcs22/data-center/h100/PB-11133-001_v01.pdf
Software Specifications
| Specification | Description |
|---|---|
| Driver support | Linux: R520 or later |
CUDA
https://docs.nvidia.com/datacenter/tesla/drivers/index.html#cuda-arch-matrix
| Architecture | CUDA Capabilities | First CUDA Toolkit Support |
|---|---|---|
| Hopper | 9.0 | CUDA 11.8, CUDA 12.0 |
TensorFlow
https://www.tensorflow.org/install/source#tested_build_configurations
| Version | Python version | Compiler | Build tools | cuDNN | CUDA |
|---|---|---|---|---|---|
| tensorflow-2.15.0 | 3.9-3.11 | Clang 16.0.0 | Bazel 6.1.0 | 8.8 | 12.2 |
| tensorflow-2.14.0 | 3.9-3.11 | Clang 16.0.0 | Bazel 6.1.0 | 8.7 | 11.8 |
| tensorflow-2.13.0 | 3.8-3.11 | Clang 16.0.0 | Bazel 5.3.0 | 8.6 | 11.8 |
| tensorflow-2.12.0 | 3.8-3.11 | GCC 9.3.1 | Bazel 5.3.0 | 8.6 | 11.8 |
| tensorflow-2.11.0 | 3.7-3.10 | GCC 9.3.1 | Bazel 5.3.0 | 8.1 | 11.2 |
| tensorflow-2.6.0 | 3.6-3.9 | GCC 7.3.1 | Bazel 3.7.2 | 8.1 | 11.2 |
candidates on H800
- >= tensorflow-2.12.0
```
docker pull tensorflow/tensorflow:2.14.0-gpu
```
PyTorch
https://pytorch.org/get-started/previous-versions/
| Version | CUDA |
|---|---|
| v1.13.1 | 11.6, 11.7 |
| v2.0.0 | 11.7, 11.8 |
| v2.0.1 | 11.7, 11.8 |
| v2.1.0 | 11.8, 12.1 |
| v2.1.1 | 11.8, 12.1 |
candidates on H800
- >= v2.0.0, with cuda 11.8 support
```
docker pull pytorch/pytorch:2.1.0-cuda11.8-cudnn8-devel
```
modelarts python sdk job demo
```
from modelarts.session import Session
```
roce flow control
Personal understanding, noted down
"Lossless" means no packet loss: global pause, PFC, DCQCN and other progressively evolving flow-control/congestion-control protocols throttle the traffic source before packets would be dropped, thereby avoiding loss
- https://enterprise-support.nvidia.com/s/article/howto-configure-dcqcn--roce-cc--values-for-connectx-4--linux-x
- https://enterprise-support.nvidia.com/s/article/dcqcn-parameters
- https://enterprise-support.nvidia.com/s/article/DCQCN-CC-algorithm
ifconfig vs ethtool
https://enterprise-support.nvidia.com/s/article/ibdev2netdev
ibdev2netdev
Running the command above shows how it maps the adapter port to the net device.
For an InfiniBand link layer, the command generally yields an ib0 device, i.e. the IPoIB virtual NIC; for an Ethernet link layer, it generally yields an ens[xxx] NIC device.
Also note that the Tx and Rx shown by ifconfig ens[xxx] actually match the following counters from ethtool -S ens[xxx]:
https://enterprise-support.nvidia.com/s/article/understanding-mlx5-ethtool-counters
- rx_bytes: Representor only: bytes received, that were handled by the hypervisor. supported from kernel 4.18
- tx_bytes: Representor only: bytes transmitted, that were handled by the hypervisor. supported from kernel 4.18
In actual testing, these two counters did not increase noticeably while communicating over the RDMA NIC; instead, the ethtool counters rx_bytes_phy / tx_bytes_phy grew in line with the real traffic. So the value ifconfig reports is probably just one particular NIC counter, and that counter may not reflect the actual situation, making the ifconfig numbers misleading. Use ethtool -S ens[xxx] to inspect an RDMA NIC's statistics.
rx_bytes_phy, ConnectX-3 naming : rx_bytes
For example, with a CX3 NIC, the ifconfig installed on the host does read the "correct" value; on CX4/5/6 NICs the physical meaning of rx_bytes changed, becoming "Representor only: bytes received, that were handled by the hypervisor. supported from kernel 4.18".
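A sketch of sampling the phy counters to estimate real RDMA throughput. The `key: value` line format of `ethtool -S` output is an assumption here, and `throughput_bps` shells out to ethtool, so it only works on a host where the NIC is present:

```python
import re
import subprocess
import time

def parse_counter(ethtool_output, name):
    """Pull one counter out of `ethtool -S` text (lines look like '  rx_bytes_phy: 12345')."""
    m = re.search(rf"^\s*{re.escape(name)}:\s*(\d+)\s*$", ethtool_output, re.MULTILINE)
    return int(m.group(1)) if m else None

def read_counter(iface, name):
    out = subprocess.run(["ethtool", "-S", iface],
                         capture_output=True, text=True, check=True).stdout
    return parse_counter(out, name)

def throughput_bps(iface, name="rx_bytes_phy", interval=1.0):
    """Sample the counter twice; the delta per second approximates wire throughput."""
    a = read_counter(iface, name)
    time.sleep(interval)
    b = read_counter(iface, name)
    return (b - a) / interval
```

Sampling rx_bytes_phy rather than rx_bytes is exactly the point made above: on CX4/5/6 only the phy counters track actual wire traffic.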
Common switch-port query commands
https://support.huawei.com/enterprise/zh/doc/EDOC1100153180/e4418444
https://www.infoq.cn/article/o3rnxl2trb1gxemmxdoj
egress/ingress port
Check whether packet drops occur
```
display interface 100GE1/0/1
```
Packet drops in the upstream direction (Input)
```
display qos buffer ingress-statistics interface 100GE1/0/1
```
View ingress statistics
```
Interface Dropped Drop Rate Drop Time
```
Packet drops in the downstream direction (Output)
```
display qos queue statistics interface 100GE1/0/1
```
View queue statistics
View buffer usage of the interface's egress queues
```
display qos buffer egress-usage interface 100GE1/0/1
```
Shows the lossless queues
```
Egress Buffer Usage (KBytes) on single queue: (Current/Total)
```