
Docker Resource Limit

CPU

https://docs.docker.com/config/containers/resource_constraints/#cpu

  • --cpu-period: the CPU time slice (period) used by the CFS scheduler; defaults to 100ms
  • --cpu-quota: the CFS quota, i.e. the CPU time the docker container may use within each CPU period before being throttled (the sketch after this list reads both values from inside a container)
  • --cpuset-cpus: pin the docker container to specific CPU cores
  • --cpu-shares: Set this flag to a value greater or less than the default of 1024 to increase or reduce the container’s weight, and give it access to a greater or lesser proportion of the host machine’s CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need. In that way, this is a soft limit.
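Both period and quota can be inspected from inside a container. Below is a minimal Go sketch, assuming a cgroup v1 layout (cgroup v2 exposes a single cpu.max file instead); the paths are the conventional cgroupfs mount points, not something the Docker docs mandate:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup file.
func readInt(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// cgroup v1 paths as seen from inside the container (assumed layout)
	period, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err != nil {
		fmt.Println("no cgroup v1 cpu controller:", err)
		return
	}
	quota, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	if err != nil {
		fmt.Println("no cgroup v1 cpu controller:", err)
		return
	}
	if quota <= 0 {
		fmt.Println("no CPU limit set (quota = -1)")
		return
	}
	// e.g. quota=50000, period=100000 -> the container may use 0.5 CPU
	fmt.Printf("effective CPU limit: %.2f cores\n", float64(quota)/float64(period))
}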

Memory

https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory

  • --memory: The maximum amount of memory the container can use (cgroup limit)
  • --memory-swap: the amount of memory plus swap the container may use; setting it equal to --memory disables swap
  • --oom-kill-disable: do not let the kernel OOM-kill the container's processes when a memory limit is reached (only use together with --memory)

Some additional notes on OOM:

  1. When a process inside the container uses more memory than the limit, the kernel triggers the (cgroup) OOM killer, which kills the process with the highest oom_score
  2. As long as the container's PID 1 process has not exited, the container itself does not exit

OOM always targets a process, never the container.
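A minimal Go sketch (Linux only) that reads this per-process bookkeeping; the OOM killer picks the process with the highest oom_score within the offending cgroup:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Each process exposes its OOM score; oom_score_adj (-1000..1000)
	// biases the kernel's choice of victim.
	for _, name := range []string{"oom_score", "oom_score_adj"} {
		data, err := os.ReadFile("/proc/self/" + name)
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		fmt.Printf("%s: %s", name, data)
	}
}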

Docker Container OOMKilled status

  1. https://stackoverflow.com/questions/48618431/what-does-oom-kill-disable-do-for-a-docker-container
  2. https://github.com/moby/moby/issues/14440#issuecomment-119243820
  3. https://plumbr.io/blog/java/oomkillers-in-docker-are-more-complex-than-you-thought
  4. https://zhimin-wen.medium.com/memory-limit-of-pod-and-oom-killer-891ee1f1cad8
  5. https://faun.pub/understanding-docker-container-memory-limit-behavior-41add155236c
  6. https://github.com/moby/moby/issues/15621#issuecomment-181418985
  7. https://draveness.me/docker/
  8. https://github.com/moby/moby/issues/38352#issuecomment-446329512
  9. https://github.com/containerd/cgroups/issues/74
  10. https://github.com/kubernetes/kubernetes/issues/78973
  11. https://github.com/kubernetes/kubernetes/issues/50632

If a child process inside the container is OOM-killed, the OOMKilled flag is also set when the docker container exits; see this issue:

https://github.com/moby/moby/issues/15621#issuecomment-181418985

While the docker container has not yet exited, a container event is emitted:

https://docs.docker.com/engine/reference/commandline/events/

For how the docker container's OOMKilled flag is set, see this issue:

https://github.com/moby/moby/issues/38352#issuecomment-446329512

In terms of implementation:

  1. containerd subscribes to a series of events; if it receives a cgroup oom event it records OOMKilled = true
  2. containerd forwards the processed events to dockerd for further handling
  3. when handling the OOM event, dockerd records a container oom event
  4. when handling the Exit event, dockerd writes OOMKilled = true into the container's status (the sketch after this list queries that flag)
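A quick way to check the resulting flag from a program is to shell out to docker inspect; this sketch uses only the documented -f template flag:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: oomcheck <container-id>")
		return
	}
	// {{.State.OOMKilled}} is the status field dockerd fills in on Exit.
	out, err := exec.Command("docker", "inspect",
		"-f", "{{.State.OOMKilled}}", os.Args[1]).Output()
	if err != nil {
		fmt.Println("docker inspect failed:", err)
		return
	}
	fmt.Println("OOMKilled:", strings.TrimSpace(string(out)) == "true")
}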

K8S Resource Limit

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run

CPU (Docker Container config)

CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.

  • --cpu-shares: max({requests.cpu} * 1024, 2)

For example, with requests.cpu = 180, --cpu-shares = 180 * 1024 = 184320

  • --cpu-period: 100000 (the fixed 100ms CFS period, in microseconds)

  • --cpu-quota: limits.cpu (in millicores) * 100

https://stackoverflow.com/a/63352630

The resulting value is the total amount of CPU time in microseconds that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

The default quota period is 100ms. The minimum resolution of CPU quota is 1ms.

The period is the CPU time slice; the quota is the CPU time that may actually be consumed within each period. If a task constrained by the quota has not finished when the current period's quota is exhausted, it is throttled and resumes in the next period.

On a multi-CPU machine, note that the quota may be a multiple of the period: to limit a container to 0.5 CPU, set --cpu-quota=50000 (against the 100000µs period); if the host has 20 CPUs and the container should be limited to 10 CPUs, set --cpu-quota=10*100000=1000000.
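A sketch mirroring the kubelet's conversion logic as described above (the real helpers live in the kubelet's cm package and additionally clamp shares to a maximum):

package main

import "fmt"

const (
	quotaPeriodUs = 100000 // fixed 100ms CFS period, in microseconds
	minShares     = 2
)

// milliCPUToShares: requests.cpu -> --cpu-shares
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * 1024 / 1000
	if shares < minShares {
		return minShares
	}
	return shares
}

// milliCPUToQuota: limits.cpu -> --cpu-quota (microseconds per period)
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriodUs / 1000
}

func main() {
	// limits.cpu = 500m -> quota 50000us per 100000us period = 0.5 CPU
	fmt.Println(milliCPUToShares(500), milliCPUToQuota(500)) // 512 50000
	// limits.cpu = 10 -> quota 1000000us = 10 CPUs
	fmt.Println(milliCPUToShares(10000), milliCPUToQuota(10000)) // 10240 1000000
}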

Memory (Docker Container config)

  • --memory: int({limits.memory})
  • --memory-swap: int({limits.memory})

the container does not have access to swap

K8s OOM Watcher

https://github.com/kubernetes/kubernetes/blob/v1.22.1/pkg/kubelet/oom/oom_watcher_linux.go

  • /dev/kmsg

Start watches for system oom’s and records an event for every system oom encountered.

When the kubelet observes a system OOM on the node (as opposed to a cgroup OOM), it generates an event, which can be queried with kubectl:

kubectl get event --field-selector type=Warning,reason=SystemOOM
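The core idea of the watcher can be sketched in a few lines of Go: open /dev/kmsg (root required) and look for kernel OOM kill records. The real kubelet uses a dedicated kmsg-parsing library and records Kubernetes events instead of printing; the matched substrings here are just the usual kernel log phrasing, not a stable API:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		fmt.Println("open /dev/kmsg:", err)
		return
	}
	defer f.Close()

	// Each read on /dev/kmsg returns one log record.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "invoked oom-killer") ||
			strings.Contains(line, "Killed process") {
			fmt.Println("OOM record:", line)
		}
	}
}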

The following PR tried to attribute an OOM of a process inside a pod to that pod; it was not merged:

https://github.com/kubernetes/kubernetes/issues/100483

https://github.com/kubernetes/kubernetes/pull/100487

Download

https://rsync.samba.org/

Latest version: Rsync version 3.2.3 released

How rsync works

https://rsync.samba.org/how-rsync-works.html

Guide

https://download.samba.org/pub/rsync/rsync.html

  • --recursive: recurse into directories
  • --append: append data onto shorter files
  • --filter: add a file-filtering RULE
/usr/local/Cellar/rsync/3.2.3/bin/rsync --verbose --no-whole-file --recursive --append --include='*.log' --include='*/' --exclude='*' --prune-empty-dirs dir1/ dir2/

Note the special behavior of rsync between local directories:

https://superuser.com/questions/234273/why-doest-rsync-use-delta-transfer-for-local-files

--whole-file: This is the default when both the source and destination are specified as local paths, but only if no batch-writing option is in effect.

High Availability

https://unix.stackexchange.com/questions/48298/can-rsync-resume-after-being-interrupted

pod spec of volcano job

https://github.com/volcano-sh/volcano/blob/v1.3.0/pkg/controllers/job/job_controller_util.go

import (
	v1 "k8s.io/api/core/v1"
	...
)

// MakePodName append podname,jobname,taskName and index and returns the string.
func MakePodName(jobName string, taskName string, index int) string {
	return fmt.Sprintf(jobhelpers.PodNameFmt, jobName, taskName, index)
}

func createJobPod(job *batch.Job, template *v1.PodTemplateSpec, ix int) *v1.Pod {
	templateCopy := template.DeepCopy()

	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      jobhelpers.MakePodName(job.Name, template.Name, ix),
			Namespace: job.Namespace,
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(job, helpers.JobKind),
			},
			Labels:      templateCopy.Labels,
			Annotations: templateCopy.Annotations,
		},
		Spec: templateCopy.Spec,
	}

	...
}
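For reference, jobhelpers.PodNameFmt is the usual "%s-%s-%d" pattern (assumed here from the volcano source), so pod names come out as <job>-<task>-<index>:

package main

import "fmt"

// PodNameFmt mirrors jobhelpers.PodNameFmt (assumed to be "%s-%s-%d").
const PodNameFmt = "%s-%s-%d"

func MakePodName(jobName, taskName string, index int) string {
	return fmt.Sprintf(PodNameFmt, jobName, taskName, index)
}

func main() {
	// A job "mnist" with task "worker" produces pods mnist-worker-0, -1, ...
	fmt.Println(MakePodName("mnist", "worker", 0)) // mnist-worker-0
}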

sysctl of pod spec

https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "30000 50000"

find out the current ip_local_port_range

cat /proc/sys/net/ipv4/ip_local_port_range

https://www.thegeekdiary.com/how-to-reserve-a-port-range-for-a-third-party-application-in-centos-rhel/

Note: ip_local_port_range and ip_local_reserved_ports settings are independent and both are considered by the kernel when determining which ports are available for automatic port assignments.
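The file holds two whitespace-separated numbers; a small Go sketch that parses it:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		fmt.Println(err)
		return
	}
	// e.g. "32768\t60999"
	fields := strings.Fields(string(data))
	if len(fields) != 2 {
		fmt.Println("unexpected format:", string(data))
		return
	}
	lo, _ := strconv.Atoi(fields[0])
	hi, _ := strconv.Atoi(fields[1])
	fmt.Printf("ephemeral ports: %d-%d (%d ports)\n", lo, hi, hi-lo+1)
}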

Default behavior of signals in Go programs

https://pkg.go.dev/os/signal#hdr-Default_behavior_of_signals_in_Go_programs

https://pkg.go.dev/os/signal#hdr-Changing_the_behavior_of_signals_in_Go_programs

By default, a synchronous signal is converted into a run-time panic. A SIGHUP, SIGINT, or SIGTERM signal causes the program to exit.

Notify disables the default behavior for a given set of asynchronous signals and instead delivers them over one or more registered channels. Specifically, it applies to the signals SIGHUP, SIGINT, SIGQUIT, SIGABRT, and SIGTERM.

But don't forget that a race is possible; below, a bash script that launches the Go process demonstrates it.

test-signal

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	signalCh := make(chan os.Signal, 2)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)
	fmt.Printf("notify signals\n")

	go func() {
		sig := <-signalCh
		fmt.Printf("receive signal %v\n", sig)
	}()

	fmt.Printf("wait signal\n")
	time.Sleep(time.Minute)
}

test1.sh

./test-signal &
pid=$!

echo "test-signal pid: $pid"

kill $pid
wait $pid

exit_code=$?
echo "test-signal exit_code: $exit_code"

test2.sh

./test-signal &
pid=$!

echo "test-signal pid: $pid"

# important
sleep 1
#

kill $pid
wait $pid

exit_code=$?
echo "test-signal exit_code: $exit_code"

Output of test1.sh

test-signal pid: 4878
test1.sh: line 7: 4878 Terminated: 15 ./test-signal
test-signal exit_code: 143

Output of test2.sh

test-signal pid: 4880
notify signals
wait signal
receive signal terminated

Summary

  1. The default behavior of a Go program on the TERM signal is to exit, with exit code 143 (128 + 15, where 15 is TERM)
  2. signal.Notify changes the default handling of the TERM signal; but if the Go program receives TERM too soon after startup (before signal.Notify has run), it exits immediately with the default behavior; a mitigation sketch follows this list
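A mitigation sketch: register the channel as the very first statement of main so the window is as small as possible, and print a readiness marker the launcher can wait for (which is what the sleep in test2.sh approximates). This narrows the race but cannot remove the window before the Go runtime finishes initializing:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Register first, before any other work.
	signalCh := make(chan os.Signal, 1)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)

	// Readiness marker: a parent script can wait for this line
	// before sending signals.
	fmt.Println("ready")

	sig := <-signalCh
	fmt.Println("received", sig)
}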

https://v1.gorm.io/docs/

https://v1.gorm.io/docs/logger.html

Refer GORM’s default logger for how to customize it

https://github.com/jinzhu/gorm/blob/v1.9.16/logger.go

gorm v1 print log

func (s *DB) print(v ...interface{}) {
	s.logger.Print(v...)
}

func (s *DB) log(v ...interface{}) {
	if s != nil && s.logMode == detailedLogMode {
		s.print(append([]interface{}{"log", fileWithLineNum()}, v...)...)
	}
}

func (s *DB) slog(sql string, t time.Time, vars ...interface{}) {
	if s.logMode == detailedLogMode {
		s.print("sql", fileWithLineNum(), NowFunc().Sub(t), sql, vars, s.RowsAffected)
	}
}

gorm v1 print error

// AddError add error to the db
func (s *DB) AddError(err error) error {
	if err != nil {
		if err != ErrRecordNotFound {
			if s.logMode == defaultLogMode {
				go s.print("error", fileWithLineNum(), err)
			} else {
				s.log(err)
			}

			errors := Errors(s.GetErrors())
			errors = errors.Add(err)
			if len(errors) > 1 {
				err = errors
			}
		}

		s.Error = err
	}
	return err
}

gorm v1 print sql

// trace print sql log
func (scope *Scope) trace(t time.Time) {
	if len(scope.SQL) > 0 {
		scope.db.slog(scope.SQL, t, scope.SQLVars...)
	}
}

So when gorm v1's LogMode is enabled

// LogMode set log mode, `true` for detailed logs, `false` for no log, default, will only print error logs
func (s *DB) LogMode(enable bool) *DB {
	if enable {
		s.logMode = detailedLogMode
	} else {
		s.logMode = noLogMode
	}
	return s
}

execution enters the s.print log / s.print sql printing logic

https://www.soberkoder.com/go-gorm-logging/

If you need a custom gorm v1 logger, the following snippet is a useful reference

// GormLogger struct
type GormLogger struct{}

// Print - Log Formatter
func (*GormLogger) Print(v ...interface{}) {
	if v[0] == "sql" {
		log.WithFields(
			log.Fields{
				"module":        "gorm",
				"type":          "sql",
				"rows_returned": v[5],
				"src":           v[1],
				//"values": v[4],
				"duration": v[2],
			},
		).Info(v[3])
	} else {
		log.WithFields(log.Fields{"module": "gorm", "type": "log", "src": v[1]}).Print(v[2:]...)
	}
}

You can also implement client-side slow-SQL logging based on the duration, as sketched below.
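A sketch of both ideas, assuming the argument positions shown in the slog call above (v[2] is the duration, v[3] the SQL text) and a hypothetical 200ms threshold:

package main

import (
	"time"

	"github.com/jinzhu/gorm"
	_ "github.com/jinzhu/gorm/dialects/sqlite"
	log "github.com/sirupsen/logrus"
)

// SlowGormLogger extends the Print formatter with a client-side
// slow-SQL threshold.
type SlowGormLogger struct {
	Threshold time.Duration
}

func (l *SlowGormLogger) Print(v ...interface{}) {
	if len(v) > 3 && v[0] == "sql" {
		if d, ok := v[2].(time.Duration); ok && d > l.Threshold {
			log.WithFields(log.Fields{"module": "gorm", "duration": d}).
				Warn("slow sql: ", v[3])
			return
		}
	}
	log.Print(v...)
}

func main() {
	db, err := gorm.Open("sqlite3", "/tmp/demo.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Wire in the custom logger and enable detailed log mode.
	db.SetLogger(&SlowGormLogger{Threshold: 200 * time.Millisecond})
	db.LogMode(true)
}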

mlnx cx adapter card firmware

https://network.nvidia.com/support/firmware/connectx4ib/

  • 12.28.2006 – newer, current versions
  • 12.28.1002 – old
  • 12.27.4000 – older

Note that the "Additional Firmware Version Supported" of a given mlnx ofed is generally one of the few preceding firmware versions.

https://network.nvidia.com/support/firmware/connectx5ib/

  • 16.28.2006 – newer
  • 16.28.1002 – old

https://network.nvidia.com/support/firmware/connectx6dx/

  • 22.28.2006 – newer
  • 22.28.1002 – old

mlnx ofed

What is the relationship between the mlnx ofed installed in a container image and the mlnx ofed installed on the host?

Actually there is none; what matters is only the host's mlnx NIC model and its firmware version.

Take mlnx ofed LTS version 5.4-3.1.0.0 as an example: its Release Notes explicitly list the firmware versions that go with this ofed.

All OS variants point to the same Release Notes:

https://docs.nvidia.com/networking/display/MLNXOFEDv543100/Release+Notes

Supported NICs and their speeds

  • ConnectX-4
    • Infiniband: …
    • Ethernet: 100Gb, …
  • ConnectX-5
    • Infiniband: …
    • Ethernet: 100Gb, …
  • ConnectX-6 Dx
    • Ethernet: 100Gb, …

https://docs.nvidia.com/networking/display/MLNXOFEDv543100/General+Support#GeneralSupport-SupportedNICFirmwareVersions

Upgrading MLNX_OFED on a cluster requires upgrading all of its nodes to the newest version as well

This current version is tested with the following NVIDIA NIC firmware versions

Firmware versions listed are the minimum supported versions

NIC     Recommended Firmware Version    Additional Firmware Version Supported
cx4     12.28.2006                      12.28.2006
cx5     16.31.2006                      16.31.1014
cx6 dx  22.31.2006                      22.31.1014

These are the minimum firmware versions required by the mlnx ofed 5.4-3.1.0.0 driver; note that mlnx ofed only started describing minimum supported versions from version 5.4.

By comparison, the firmware versions required by the mlnx ofed LTS version 4.9-4.1.7.0 driver are as follows:

https://docs.nvidia.com/networking/display/MLNXOFEDv494170/General+Support+in+MLNX_OFED#GeneralSupportinMLNX_OFED-SupportedNICsFirmwareVersions

NIC     Recommended Firmware Version    Additional Firmware Version Supported
cx4     12.28.2006                      12.27.4000
cx5     16.28.2006                      16.27.2008
cx6 dx  22.28.2006                      NA

identifying adapter cards

https://network.nvidia.com/support/firmware/identification/

ibv_devinfo
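ibv_devinfo prints a fw_ver field per device; a small Go sketch that extracts it:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("ibv_devinfo").Output()
	if err != nil {
		fmt.Println("ibv_devinfo failed:", err)
		return
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// e.g. "fw_ver: 12.28.2006"
		if strings.HasPrefix(line, "fw_ver:") {
			fmt.Println("firmware:", strings.TrimSpace(strings.TrimPrefix(line, "fw_ver:")))
		}
	}
}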

Goal: build a container image containing the following software and run it with the Huawei Cloud ModelArts training service

  • ubuntu-18.04
  • cuda-10.2
  • python-3.7.13
  • pytorch-1.8.1

1. Prepare the context directory

mkdir -p context

1.1. Prepare the files

1.1.1. pip.conf

Use the pypi configuration provided by the Huawei open-source mirror site

https://mirrors.huaweicloud.com/home

The file contents are as follows

[global]
index-url = https://repo.huaweicloud.com/repository/pypi/simple
trusted-host = repo.huaweicloud.com
timeout = 120

1.1.2. torch*.whl

https://pytorch.org/get-started/previous-versions/#v181

Search https://download.pytorch.org/whl/torch_stable.html for the following wheels and download them

  • torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl
  • torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
  • torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl

1.1.3. Miniconda3

https://docs.conda.io/en/latest/miniconda.html

Miniconda3-py37_4.12.0-Linux-x86_64.sh

Download the miniconda3 installer from https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh

1.2. Contents of the context directory

Place the files above in the context directory

context
├── Miniconda3-py37_4.12.0-Linux-x86_64.sh
├── pip.conf
├── torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl
├── torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
└── torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl

2. Write the Dockerfile for the container image

Create an empty file named Dockerfile in the context directory and write the following content into it

# The image build host needs Internet access

# Base image, https://github.com/NVIDIA/nvidia-docker/wiki/CUDA
#
# https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds
# requires Docker Engine >= 17.05
#
# builder stage
FROM nvidia/cuda:10.2-runtime-ubuntu18.04 AS builder

# The default user of the base image is already root
# USER root

# Use the pypi configuration provided by the Huawei open-source mirror site
RUN mkdir -p /root/.pip/
COPY pip.conf /root/.pip/pip.conf

# Copy the files to install into /tmp of the base image
# (with multiple sources, the COPY destination must be a directory ending in /)
COPY Miniconda3-py37_4.12.0-Linux-x86_64.sh \
     torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl \
     torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl \
     torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl \
     /tmp/

# https://conda.io/projects/conda/en/latest/user-guide/install/linux.html#installing-on-linux
# Install Miniconda3 into /home/ma-user/miniconda3 of the base image
RUN bash /tmp/Miniconda3-py37_4.12.0-Linux-x86_64.sh -b -p /home/ma-user/miniconda3

# Install torch*.whl with the default Miniconda3 python environment
# (i.e. /home/ma-user/miniconda3/bin/pip)
RUN cd /tmp && \
    /home/ma-user/miniconda3/bin/pip install --no-cache-dir \
    /tmp/torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl \
    /tmp/torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl \
    /tmp/torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl

# Build the final image
FROM nvidia/cuda:10.2-runtime-ubuntu18.04

# Install vim / curl (again via the Huawei open-source mirror site)
RUN cp -a /etc/apt/sources.list /etc/apt/sources.list.bak && \
    sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y vim curl && \
    apt-get clean && \
    mv /etc/apt/sources.list.bak /etc/apt/sources.list

# Add user ma-user (uid = 1000, gid = 100)
# Note the base image already has a group with gid = 100, so ma-user can use it directly
RUN useradd -m -d /home/ma-user -s /bin/bash -g 100 -u 1000 ma-user

# Copy /home/ma-user/miniconda3 from the builder stage into the same path of the current image
COPY --chown=ma-user --from=builder /home/ma-user/miniconda3 /home/ma-user/miniconda3

# Preset environment variables of the image
# Make sure to set PYTHONUNBUFFERED=1 to avoid losing logs
ENV PATH=$PATH:/home/ma-user/miniconda3/bin \
    PYTHONUNBUFFERED=1

# Set the default user and working directory of the image
USER ma-user
WORKDIR /home/ma-user

3. Build the container image

The context directory now contains the following

context
├── Dockerfile
├── Miniconda3-py37_4.12.0-Linux-x86_64.sh
├── pip.conf
├── torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl
├── torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
└── torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl

Run the following commands to build the container image

# Be sure to switch into the context directory before running the build command
cd context

# Build the container image
docker build . -t swr.cn-north-4.myhuaweicloud.com/deep-learning-demo/pytorch:1.8.1-cuda10.2

After the image builds successfully, the corresponding image can be found with the following command

docker images | grep pytorch | grep 1.8.1-cuda10.2

4. pytorch verification code

https://pytorch.org/get-started/locally/#linux-verification

Verification sample code: pytorch-verification.py

import torch
import torch.nn as nn

x = torch.randn(5, 3)
print(x)

available_dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
y = torch.randn(5, 3).to(available_dev)
print(y)

5. boot command in modelarts training service

/home/ma-user/miniconda3/bin/python ${MA_JOB_DIR}/code/pytorch-verification.py

Sample log output of a CPU training job

tensor([[ 0.8945, -0.6946,  0.3807],
        [ 0.6665,  0.3133,  0.8285],
        [-0.5353, -0.1730, -0.5419],
        [ 0.4870,  0.5183,  0.2505],
        [ 0.2679, -0.4996,  0.7919]])
tensor([[ 0.9692,  0.4652,  0.5659],
        [ 2.2032,  1.4157, -0.1755],
        [-0.6296,  0.5466,  0.6994],
        [ 0.2353, -0.0089, -1.9546],
        [ 0.9319,  1.1781, -0.4587]])

Sample log output of a GPU training job

tensor([[-0.2874, -0.3475,  0.1848],
        [-0.1660, -0.5038, -0.5470],
        [ 0.1289, -0.2400,  2.0829],
        [ 1.6870, -0.0492,  0.1189],
        [ 0.4800, -0.3611, -0.9572]])
tensor([[-0.6710,  0.4095, -0.7370],
        [ 1.4353,  0.9093,  1.7551],
        [ 1.3477, -0.0499,  0.2404],
        [ 1.7489, -1.0203, -0.7875],
        [-1.2104,  0.4593,  1.1365]], device='cuda:0')