MegaThinking


Type size

https://golang.org/ref/spec#Size_and_alignment_guarantees

https://github.com/ardanlabs/gotraining-studyguide/blob/master/go/language/struct.go

type example struct {
	flag    bool
	counter int16
	pi      float32
}

Byte alignment factor: #pragma pack(n)

  • member alignment
  • struct alignment

Alignment rules

  1. For a variable x of any type: unsafe.Alignof(x) is at least 1.
  2. For a variable x of struct type: unsafe.Alignof(x) is the largest of all the values unsafe.Alignof(x.f) for each field f of x, but at least 1.
  3. For a variable x of array type: unsafe.Alignof(x) is the same as the alignment of a variable of the array’s element type.

layout

  • flag bool: offset 0, size 1 (1 byte of padding follows)
  • counter int16: offset 2, size 2
  • pi float32: offset 4, size 4

total size: 8 bytes
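The layout above can be checked empirically. As a sketch, Python's ctypes applies the equivalent C alignment rules to a struct with the same field types (the struct name here is illustrative):

```python
# Mirror the Go struct with C-compatible field types; ctypes applies the
# same alignment rules (each field aligned to its own size, the struct
# aligned to its largest field).
import ctypes

class Example(ctypes.Structure):
    _fields_ = [
        ("flag", ctypes.c_bool),      # offset 0, size 1 (+1 padding byte)
        ("counter", ctypes.c_int16),  # offset 2, size 2
        ("pi", ctypes.c_float),       # offset 4, size 4
    ]

print(ctypes.sizeof(Example))     # 8
print(ctypes.alignment(Example))  # 4 (largest field: float32)
```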

https://eddycjy.gitbook.io/golang/di-1-ke-za-tan/go-memory-align

//TODO list

Download

https://rsync.samba.org/

Latest release: rsync 3.2.3

How rsync works

https://rsync.samba.org/how-rsync-works.html
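A key idea in that document is the weak rolling checksum, which can slide one byte at a time instead of being recomputed per window. A minimal sketch in the spirit of rsync's checksum (illustrative formulas, not rsync's exact code):

```python
# Weak rolling checksum sketch: s1 is the byte sum, s2 weights bytes by
# distance from the window end; both fit in 16 bits, packed into one int.
M = 1 << 16

def weak_checksum(block: bytes) -> int:
    s1 = sum(block) % M
    s2 = sum((len(block) - i) * b for i, b in enumerate(block)) % M
    return (s2 << 16) | s1

def roll(checksum: int, old: int, new: int, blocklen: int) -> int:
    """Slide the window one byte: drop `old`, append `new`."""
    s1 = checksum & 0xFFFF
    s2 = checksum >> 16
    s1 = (s1 - old + new) % M
    s2 = (s2 - blocklen * old + s1) % M
    return (s2 << 16) | s1

data = b"hello rolling checksum"
n = 8
c = weak_checksum(data[0:n])
for i in range(1, len(data) - n + 1):
    c = roll(c, data[i - 1], data[i + n - 1], n)
    # rolling update matches a full recompute of the new window
    assert c == weak_checksum(data[i:i + n])
```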

Guide

https://download.samba.org/pub/rsync/rsync.html

  • --recursive: recurse into directories
  • --append: append data onto shorter files
  • --filter
/usr/local/Cellar/rsync/3.2.3/bin/rsync --verbose --no-whole-file --recursive --append --include='*.log' --include='*/' --exclude='*' --prune-empty-dirs dir1/ dir2/

Note that rsync treats local directories specially:

https://superuser.com/questions/234273/why-doest-rsync-use-delta-transfer-for-local-files

--whole-file: This is the default when both the source and destination are specified as local paths, but only if no batch-writing option is in effect.

High Availability

https://unix.stackexchange.com/questions/48298/can-rsync-resume-after-being-interrupted


AWS SageMaker

Impressions of AWS SageMaker's training container image design

A non-advantage "advantage": it appears to support only homogeneous resources, and the training resource allocation model is one container per node, which keeps the mental model simple.

Strengths

  • The container image contract is layered, with documentation for implementing each layer; Level 0 places the fewest constraints on the image and allows the most customization
    • Level 0: fully custom image; the image specifies an Entrypoint, and the Entrypoint command only needs to handle a train argument link
    • Level 1: adapt an existing image so it can use the SageMaker Toolkits for training and inference (i.e. turn an existing image into a SageMaker image) link
    • Level 2: use a prebuilt SageMaker container image
    • Level 3: extend a prebuilt SageMaker container image (add functionality on top of the prebuilt image)
  • The training launcher (Training Toolkits) is open source and installable via pip install sagemaker-training; the common deep learning frameworks have their own toolkits, all included in the training toolkit link
  • Container images can be built directly in a Notebook
  • Container images can be smoke-tested on a local machine (possibly limited to single-node training?)

Target scenario

AI developers: detailed, rich, technically oriented documentation, and highly hackable container images (few constraints)

Azure Machine Learning

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-with-custom-image

Azure integrates with conda through an "Environment" concept, which places the following constraints on container images:

  • Ubuntu 16.04 or greater.
  • Conda 4.5.# or greater.
  • Python 3.5+.

If you don't use an Environment, the constraints above don't apply.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-tensorflow#distributed-training

The resource allocation model also appears to be one container per node.

https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/ml-frameworks/tensorflow/distributed-tensorflow-with-horovod

TODO …

https://developer.nvidia.com/gpudirect

Environment

  • Kernel: 3.10.0-514.44.5.10.h254.x86_64 (uname -r)
  • Nvidia Driver: 440.33.01 (nvidia-smi)
  • MLNX OFED: 4.3-1.0.1.0 (ofed_info)
  • Mellanox/nv_peer_memory: Tag 1.1-0

With a containerized NVIDIA driver install, lsmod | grep nvidia finds the module, yet modinfo nvidia reports that the module cannot be found.

The nv_peer_memory build scripts need to be patched to work around this.

DIY nv_peer_memory build

Prepare an empty directory

mkdir -p /root/nv_peer_memory
cd /root/nv_peer_memory

NVIDIA Driver

https://us.download.nvidia.com/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run

# download NVIDIA-Linux-x86_64-440.33.01.run
curl -o NVIDIA-Linux-x86_64-440.33.01.run 'https://us.download.nvidia.com/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run'

# extract into the current directory
./NVIDIA-Linux-x86_64-440.33.01.run -x

nv_peer_memory

https://github.com/Mellanox/nv_peer_memory/tree/1.1-0

https://www.mellanox.com/products/GPUDirect-RDMA

# -L follows GitHub's redirect to the archive host
curl -L -o nv_peer_memory-1.1-0.tar.gz 'https://github.com/Mellanox/nv_peer_memory/archive/1.1-0.tar.gz'
tar xzf nv_peer_memory-1.1-0.tar.gz

DIY build

cd nv_peer_memory-1.1-0

In the Makefile, point nv_sources at the NVIDIA driver source location:

nv_sources=/root/nv_peer_memory/NVIDIA-Linux-x86_64-440.33.01/kernel

In create_nv.symvers.sh, point nvidia_mod at the NVIDIA driver .ko installed on the host, e.g.

nvidia_mod=/var/k8s/nvidia/drivers/nvidia.ko

Build

See the nv_peer_memory README.md

./build_module.sh

rpmbuild --rebuild /tmp/nvidia_peer_memory-1.1-0.src.rpm

Install the rpm

rpm -ivh /root/rpmbuild/RPMS/x86_64/nvidia_peer_memory-1.1-0.x86_64.rpm

Test

lsmod | grep nv_peer_mem

With NCCL_DEBUG=INFO, the NCCL log should show GDRDMA on the ring, e.g.

NCCL version 2.4.8+cuda10.1

NCCL INFO Ring 00 : 3 -> 10 [send] via NET/IB/0/GDRDMA

Tricks

  • create_nv.symvers.sh in the nv_peer_memory repo can be run standalone. Since modinfo nvidia fails with a "module not found" error when the NVIDIA driver was installed via container, find a machine where the driver was installed directly on the host and run bash -x create_nv.symvers.sh there to trace the script's execution and the values its variables take.

  • The following command prints the .ko file backing a module:

$ /sbin/modinfo -F filename -k 3.10.0-514.44.5.10.h142.x86_64 nvidia
/lib/modules/3.10.0-514.44.5.10.h142.x86_64/kernel/drivers/video/nvidia.ko

init_process_group

With a tcp:// init method, the rendezvous handler parses hostname, port, rank, and world_size and builds a TCPStore as the store; rank == 0 acts as the server side of the TCPStore rendezvous handler.

TCPStore

When isServer is True, a TCPStoreDaemon is started internally.

When waitWorkerReady is True, it polls every 10 ms until the expected workerNumber of workers has connected.
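The server/worker split above can be sketched with plain sockets. This is an illustrative rendezvous in the same shape, not PyTorch's actual TCPStore protocol: rank 0 runs a tiny TCP server and every rank blocks until all world_size ranks have joined (the real TCPStore instead polls every 10 ms):

```python
# Illustrative TCPStore-style rendezvous (NOT PyTorch's wire protocol):
# rank 0 hosts a TCP server; workers connect and are released together
# once world_size ranks have joined.
import socket
import threading

def server(port, world_size, started):
    srv = socket.socket()
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("127.0.0.1", port))
    srv.listen(world_size)
    started.set()
    # wait until every rank has connected ...
    conns = [srv.accept()[0] for _ in range(world_size)]
    for c in conns:  # ... then release them all
        c.sendall(b"go")
        c.close()
    srv.close()

def worker(port, rank, results):
    s = socket.create_connection(("127.0.0.1", port))
    s.sendall(str(rank).encode())
    results[rank] = s.recv(2)  # blocks until all ranks have joined
    s.close()

started = threading.Event()
results = {}
world_size, port = 3, 29500  # 29500: illustrative, PyTorch's usual default port
threads = [threading.Thread(target=server, args=(port, world_size, started))]
threads[0].start()
started.wait()
for r in range(world_size):
    t = threading.Thread(target=worker, args=(port, r, results))
    threads.append(t)
    t.start()
for t in threads:
    t.join()
print(results)  # every rank received b"go"
```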

https://github.com/linux-rdma/perftest/blob/master/src/write_bw.c

main

user_param.verb = WRITE;
user_param.tst = BW;

parser

-c, --connection=<RC/XRC/UC/DC> Connection type RC/XRC/UC/DC (default RC)
-s, --size=<size> Size of message to exchange (default 65536)

init_perftest_params

...

#define DEF_SIZE_BW (65536)
#define DEF_SIZE_LAT (2)
#define DEF_CACHE_LINE_SIZE (64)
#define DEF_PAGE_SIZE (4096)
#define DEF_FLOWS (1)

...

user_param->size = (user_param->tst == BW) ? DEF_SIZE_BW : DEF_SIZE_LAT;

user_param->connection_type = (user_param->connection_type == RawEth) ? RawEth : RC;

...

user_param->cache_line_size = get_cache_line_size();
user_param->cycle_buffer = sysconf(_SC_PAGESIZE);

if (user_param->cycle_buffer <= 0) {
	user_param->cycle_buffer = DEF_PAGE_SIZE;
}

...

user_param->flows = DEF_FLOWS;

get_cache_line_size()

	int size = 0;
#if !defined(__FreeBSD__)
	size = sysconf(_SC_LEVEL1_DCACHE_LINESIZE);
	if (size == 0) {
#if defined(__sparc__) && defined(__arch64__)
		char* file_name =
			"/sys/devices/system/cpu/cpu0/l2_cache_line_size";
#else
		char* file_name =
			"/sys/devices/system/cpu/cpu0/cache/index0/coherency_line_size";
#endif

		FILE *fp;
		char line[10];
		fp = fopen(file_name, "r");
		if (fp == NULL) {
			return DEF_CACHE_LINE_SIZE;
		}
		if (fgets(line, 10, fp) != NULL) {
			size = atoi(line);
			fclose(fp);
		}
	}
#endif
	if (size <= 0)
		size = DEF_CACHE_LINE_SIZE;

getconf LEVEL1_DCACHE_LINESIZE

main -> alloc_ctx

ctx->size = user_param->size;

num_of_qps_factor = (user_param->mr_per_qp) ? 1 : user_param->num_of_qps;

/* holds the size of maximum between msg size and cycle buffer,
 * aligned to cache line,
 * it is multiply by 2 for send and receive
 * with reference to number of flows and number of QPs */
ctx->buff_size = INC(BUFF_SIZE(ctx->size, ctx->cycle_buffer),
		ctx->cache_line_size) * 2 * num_of_qps_factor * user_param->flows;

65536 bytes = 64 KB

generally 16 pages (with a 4 KB page size)

root cause: ulimit -l (max locked memory, in KB) is only 16 by default inside the container
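The arithmetic can be replayed with the defaults quoted above. This sketch assumes the usual semantics of perftest's macros (BUFF_SIZE takes the max of the message size and the cycle buffer; INC rounds up to a cache-line multiple); treat it as an approximation of the C code, not a verbatim translation:

```python
# Recompute ctx->buff_size for write_bw's defaults.
size = 65536          # DEF_SIZE_BW
cycle_buffer = 4096   # sysconf(_SC_PAGESIZE), here DEF_PAGE_SIZE
cache_line = 64       # DEF_CACHE_LINE_SIZE
num_of_qps, flows = 1, 1

def inc(n, align):
    """Round n up to a multiple of align (assumed INC semantics)."""
    return n if n % align == 0 else n + align - n % align

buff_size = inc(max(size, cycle_buffer), cache_line) * 2 * num_of_qps * flows
print(buff_size)                  # 131072 bytes (send + receive halves)
print(buff_size // cycle_buffer)  # 32 pages to register/lock
```

So even the default message size already needs far more locked memory than a 16 KB ulimit -l allows.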

To reproduce a TCPStore test-case issue locally I modified some code, which means building PyTorch from source.

Environment

  • macOS 10.15.7
  • XCode 12.2 (12B45b)

Source

master latest commit (2020-11-14): f8248543a13b0144a6f5d0a549f72b1e470d88aa

commit f8248543a13b0144a6f5d0a549f72b1e470d88aa (github/master, github/gh/ljk53/194/base, github/HEAD, master)
Author: Rohan Varma <rvarm1@fb.com>
Date: Sat Nov 14 13:36:31 2020 -0800

Build

(1) https://github.com/pytorch/pytorch#from-source

(2) https://github.com/pytorch/pytorch/blob/master/CONTRIBUTING.md#c-development-tips

glog

brew install glog

conda

conda create -n pytorch-dev python=3.6

conda activate pytorch-dev

conda install numpy ninja pyyaml mkl mkl-include setuptools cmake cffi typing_extensions future six requests dataclasses

# Add these packages if torch.distributed is needed
conda install pkg-config libuv

build and install

uninstall

conda uninstall torch
pip uninstall torch

rm -rf build/

then reinstall

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}
MACOSX_DEPLOYMENT_TARGET=10.9 CC=clang CXX=clang++ MAX_JOBS=8 BUILD_CAFFE2=0 BUILD_CAFFE2_OPS=0 USE_GLOG=1 USE_DISTRIBUTED=1 USE_MKLDNN=0 USE_CUDA=0 USE_FBGEMM=0 USE_NNPACK=0 USE_QNNPACK=0 USE_XNNPACK=0 python setup.py develop

Quad-core Intel Core i7: ~45 min

Test

Python 3.6.12 |Anaconda, Inc.| (default, Sep  8 2020, 17:50:39)
[GCC Clang 10.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> torch.__version__
'1.8.0a0+f1a8a82'

TcpStore

python test/distributed/test_c10d.py

Python 3.6.12 |Anaconda, Inc.| (default, Sep 8 2020, 17:50:39)
[GCC Clang 10.0.0 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch.distributed as dist
>>> server_store = dist.TCPStore("127.0.0.1", 18668, 1, True)

or

./build/bin/TCPStoreTest

https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-termination

In short, when a pod is deleted:

  1. The kubelet asks the container runtime to send SIGTERM to process 1 of every container in the pod
  2. It waits for the grace period to expire (terminationGracePeriodSeconds, default 30s)
  3. If containers are still running after the grace period expires, the kubelet asks the container runtime to send SIGKILL to every process still running in the pod's containers

Handling SIGTERM correctly lets the workload exit gracefully (or simply faster); for example:

Suppose the pod command is

https://kubernetes.io/docs/tasks/inject-data-application/define-command-argument-container/#run-a-command-in-a-shell

command: ["/bin/bash"]
args: ["-c", "/home/rt/run.sh"]

or

command:
- "/bin/bash"
- "-c"
- "/home/rt/run.sh"

Here /bin/bash /home/rt/run.sh is PID 1.

In /home/rt/run.sh, SIGTERM can be handled like this to achieve a graceful exit:

function prog_exit {
	echo "receive SIGTERM signal"
	pkill python
}

trap prog_exit SIGTERM

# main function
python /home/rt/train.py &

# wait returns once the trapped SIGTERM has been handled
wait $!

ref: docker stop

https://docs.docker.com/engine/reference/commandline/stop/

The main process inside the container will receive SIGTERM, and after a grace period, SIGKILL.
