
log output format sample

INFO[2021-07-04 15:26:26]main.go:28 have a nice day                               zs=log
INFO[2021-07-04 15:26:26]main.go:29 zs gogogo zs=log

code sample

show timestamp

the meaning of [0000]

add common prefix

add filename and line number (this has a small overhead)

package main

import (
	"path"
	"runtime"
	"strconv"

	"github.com/sirupsen/logrus"
)

func main() {
	var log = logrus.New()

	formatter := &logrus.TextFormatter{
		FullTimestamp:   true,
		TimestampFormat: "2006-01-02 15:04:05",
		CallerPrettyfier: func(f *runtime.Frame) (string, string) {
			_, filename := path.Split(f.File)
			// do not log func name
			return "", filename + ":" + strconv.Itoa(f.Line)
		},
	}
	log.SetFormatter(formatter)
	log.SetReportCaller(true)

	contextLogger := log.WithField("zs", "log")

	contextLogger.Info("have a nice day")
	contextLogger.Infof("%s gogogo", "zs")
}

third-party formatter

https://github.com/sirupsen/logrus#formatters

log output format sample

[2021-07-04 15:50:26]  INFO log: have a nice day
[2021-07-04 15:50:26] INFO log: zs gogogo

code sample

package main

import (
	"github.com/sirupsen/logrus"
	prefixed "github.com/x-cray/logrus-prefixed-formatter"
)

func main() {
	var log = logrus.New()

	formatter := &prefixed.TextFormatter{
		FullTimestamp:   true,
		TimestampFormat: "2006-01-02 15:04:05",
	}
	log.Formatter = formatter

	contextLogger := log.WithField("prefix", "log")

	contextLogger.Info("have a nice day")
	contextLogger.Infof("%s gogogo", "zs")
}

As the previous code shows, with

contextLogger := log.WithField("prefix", "log")

the value of the prefix field, followed by a colon, is printed before the message.

Download

https://rsync.samba.org/

Latest version: Rsync version 3.2.3 released

How rsync works

https://rsync.samba.org/how-rsync-works.html

Guide

https://download.samba.org/pub/rsync/rsync.html

  • --recursive: recurse into directories
  • --append: append data onto shorter files
  • --filter
/usr/local/Cellar/rsync/3.2.3/bin/rsync --verbose --no-whole-file --recursive --append --include='*.log' --include='*/' --exclude='*' --prune-empty-dirs dir1/ dir2/
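
The include/exclude options in the command above work because rsync checks each name against the rules in order and the first matching rule decides. The sketch below is a much simplified model of that behavior: it matches base names only with path.Match, while real rsync patterns are richer. The rule type and the decide function are hypothetical names used for illustration.

```go
package main

import (
	"fmt"
	"path"
)

// rule is a simplified stand-in for one rsync filter rule.
type rule struct {
	include bool
	pattern string // a trailing "/" means "directories only"
}

// rules mirrors --include='*.log' --include='*/' --exclude='*'.
var rules = []rule{
	{true, "*.log"},
	{true, "*/"},
	{false, "*"},
}

// decide reports whether a base name would be transferred.
// The first matching rule wins; a name that matches no rule is included.
func decide(name string, isDir bool) bool {
	for _, r := range rules {
		pat := r.pattern
		if n := len(pat); n > 0 && pat[n-1] == '/' {
			if !isDir {
				continue // directory-only rule, name is a file
			}
			pat = pat[:n-1]
		}
		if ok, err := path.Match(pat, name); err == nil && ok {
			return r.include
		}
	}
	return true
}

func main() {
	fmt.Println(decide("app.log", false))   // file matching *.log
	fmt.Println(decide("notes.txt", false)) // file caught by the final exclude
	fmt.Println(decide("sub", true))        // directory kept so rsync can recurse
}
```

Without the '*/' include, the final exclude would also drop every directory and rsync would never descend to the .log files inside them; --prune-empty-dirs then removes directories that end up with nothing to transfer.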

Note the special behavior of rsync with local directories

https://superuser.com/questions/234273/why-doest-rsync-use-delta-transfer-for-local-files

--whole-file: This is the default when both the source and destination are specified as local paths, but only if no batch-writing option is in effect.

High Availability

https://unix.stackexchange.com/questions/48298/can-rsync-resume-after-being-interrupted

https://community.mellanox.com/s/article/in-between-ethernet-vlans-and-infiniband-pkeys

https://community.mellanox.com/s/article/howto-use-infiniband-pkey-membership-types-in-virtualization-environment--connectx-3--connectx-3-pro-x

https://community.mellanox.com/s/article/howto-configure-ipoib-networks-with-gateway-and-multiple-pkeys

https://community.mellanox.com/s/article/HowTo-Configure-SR-IOV-for-ConnectX-4-ConnectX-5-ConnectX-6-with-KVM-Ethernet

https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin

https://github.com/mellanox/k8s-rdma-shared-dev-plugin

https://docs.openshift.com/container-platform/4.6/networking/hardware_networks/add-pod.html#add-pod

IOV: I/O Virtualization

Single Root I/O Virtualization (SR-IOV) network

https://docs.openshift.com/container-platform/4.6/networking/hardware_networks/about-sriov.html

https://github.com/k8snetworkplumbingwg/sriov-cni

https://docs.mellanox.com/display/MLNXOFEDv461000/Kubernetes%20Using%20SR-IOV

https://community.mellanox.com/s/article/kubernetes-ipoib-sriov-networking-with-connectx4-connectx5

Type size

https://golang.org/ref/spec#Size_and_alignment_guarantees

https://github.com/ardanlabs/gotraining-studyguide/blob/master/go/language/struct.go

type example struct {
	flag    bool
	counter int16
	pi      float32
}

Byte alignment factor (cf. #pragma pack(n))

  • Member alignment
  • Struct alignment

Alignment rules

  1. For a variable x of any type: unsafe.Alignof(x) is at least 1.
  2. For a variable x of struct type: unsafe.Alignof(x) is the largest of all the values unsafe.Alignof(x.f) for each field f of x, but at least 1.
  3. For a variable x of array type: unsafe.Alignof(x) is the same as the alignment of a variable of the array’s element type.

layout (field type and byte offset)

  • bool: offset 0 (followed by 1 byte of padding)
  • int16: offset 2
  • float32: offset 4

Total size: 8 bytes

https://eddycjy.gitbook.io/golang/di-1-ke-za-tan/go-memory-align

//TODO list

deep residual learning framework

docker security

https://docs.docker.com/engine/security/

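Broadly speaking, Docker security rests on two things. The first is kernel namespaces: each container gets its own process, network, and other namespaces, so containers have little effect on one another.

The second is control groups (cgroups), which limit the resources a container can use: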

ensure that each container gets its fair share of memory, CPU, disk I/O

Put simply, taking CPU as an example, cgroups prevent one container from misusing the CPU (maliciously or through an unintentional bug) and leaving other containers unable to use it properly.

container root user

https://docs.docker.com/engine/security/userns-remap/

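Running processes as root inside a container is not recommended, largely because the uid and gid inside the container map to the host. For example, if a process inside the container escapes to the host, it has root privileges there as well.

That said, a container escape is a very serious security issue, and the Docker community fixes such problems as soon as they are found.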

AWS SageMaker

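Notes on the design of AWS SageMaker training container images

An advantage of sorts: it appears to support only homogeneous resources, and the training resource allocation model is one container per node, which is easy to understand.

Advantages

  • Container images come in several levels of functionality, and each level is documented with how to implement it. Level 0 places the fewest constraints on the image and allows the most customization.
    • Level 0: a fully custom container image. The image specifies an Entrypoint, and the Entrypoint command only needs to handle the train argument.
    • Level 1: adapt an existing container image so that it can use the SageMaker Toolkits for training and inference (that is, turn an existing image into a SageMaker image).
    • Level 2: use a prebuilt SageMaker container image.
    • Level 3: use an extended prebuilt SageMaker container image (a prebuilt image with added functionality).
  • The training launch scripts (Training Toolkits) are open source and can be installed directly with pip install sagemaker-training. Common deep learning frameworks have their own toolkits, all included in the training toolkits.
  • Container images can be built directly in a Notebook.
  • Basic container image functionality can be tested on a local machine (possibly single-machine training only?).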
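
To make the Level 0 contract concrete: the only requirement noted above is that the Entrypoint command handles the train argument. The sketch below shows one possible shape for such an entrypoint. The function names dispatch and runTraining are hypothetical, and the training step is a placeholder, not SageMaker code.

```go
package main

import (
	"fmt"
	"os"
)

// runTraining is a placeholder for the real training job.
func runTraining() error {
	fmt.Println("training started")
	return nil
}

// dispatch handles the arguments passed to the container entrypoint.
// Only "train" is handled here, matching the Level 0 description.
func dispatch(args []string) error {
	if len(args) == 0 {
		return fmt.Errorf("missing command, expected %q", "train")
	}
	switch args[0] {
	case "train":
		return runTraining()
	default:
		return fmt.Errorf("unsupported command %q", args[0])
	}
}

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: entrypoint train")
		return
	}
	if err := dispatch(os.Args[1:]); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```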

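Target scenario

AI developers: the documentation is detailed and rich (technically oriented), and the container images are highly customizable (few constraints).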

Azure Machine Learning

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-with-custom-image

Azure integrates with conda and has the concept of an Environment, which places the following constraints on container images:

  • Ubuntu 16.04 or greater.
  • Conda 4.5.# or greater.
  • Python 3.5+.

Of course, if you do not use an Environment, none of these constraints apply.

https://docs.microsoft.com/en-us/azure/machine-learning/how-to-train-tensorflow#distributed-training

The resource allocation model also appears to be one container per node.

https://github.com/Azure/MachineLearningNotebooks/tree/master/how-to-use-azureml/ml-frameworks/tensorflow/distributed-tensorflow-with-horovod

TODO …

https://developer.nvidia.com/gpudirect

Environment

  • Kernel: 3.10.0-514.44.5.10.h254.x86_64 (uname -r)
  • Nvidia Driver: 440.33.01 (nvidia-smi)
  • MLNX OFED: 4.3-1.0.1.0 (ofed_info)
  • Mellanox/nv_peer_memory: Tag 1.1-0

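When the NVIDIA driver is installed in a container, lsmod | grep nvidia finds the module, yet modinfo nvidia fails with a module-not-found error.

The build scripts in the nv_peer_memory repository need to be modified to work around this.

Building nv_peer_memory manually

Prepare an empty directory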

mkdir -p /root/nv_peer_memory
cd /root/nv_peer_memory

NVIDIA Driver

https://us.download.nvidia.com/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run

# Download `NVIDIA-Linux-x86_64-440.33.01.run`
curl -o NVIDIA-Linux-x86_64-440.33.01.run 'https://us.download.nvidia.com/tesla/440.33.01/NVIDIA-Linux-x86_64-440.33.01.run'

# Extract into the current directory (run via sh, since the downloaded file is not executable)
sh ./NVIDIA-Linux-x86_64-440.33.01.run -x

nv_peer_memory

https://github.com/Mellanox/nv_peer_memory/tree/1.1-0

https://www.mellanox.com/products/GPUDirect-RDMA

# -L follows the redirect that GitHub issues for archive downloads
curl -L -o nv_peer_memory-1.1-0.tar.gz 'https://github.com/Mellanox/nv_peer_memory/archive/1.1-0.tar.gz'
tar xzf nv_peer_memory-1.1-0.tar.gz

Manual build

cd nv_peer_memory-1.1-0

Set nv_sources in the Makefile to the location of the NVIDIA driver sources

nv_sources=/root/nv_peer_memory/NVIDIA-Linux-x86_64-440.33.01/kernel

Set nvidia_mod in create_nv.symvers.sh to the location of the NVIDIA driver .ko installed on the host, for example

nvidia_mod=/var/k8s/nvidia/drivers/nvidia.ko

Build

See the nv_peer_memory README.md
./build_module.sh

rpmbuild --rebuild /tmp/nvidia_peer_memory-1.1-0.src.rpm

Install the rpm

rpm -ivh /root/rpmbuild/RPMS/x86_64/nvidia_peer_memory-1.1-0.x86_64.rpm

Test

lsmod | grep nv_peer_mem

Run with NCCL_DEBUG=INFO, for example

NCCL version 2.4.8+cuda10.1
NCCL INFO Ring 00 : 3 -> 10 [send] via NET/IB/0/GDRDMA

Trick

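  • create_nv.symvers.sh in the nvidia_peer_memory source can be run on its own. Because modinfo nvidia fails with a module-not-found error when the NVIDIA driver is installed in a container, find a machine where the NVIDIA driver is installed directly on the host and run bash -x create_nv.symvers.sh there to trace its execution and see the values of the relevant variables.

  • The following command shows the location of the .ko file for a module

$ /sbin/modinfo -F filename -k 3.10.0-514.44.5.10.h142.x86_64 nvidia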
/lib/modules/3.10.0-514.44.5.10.h142.x86_64/kernel/drivers/video/nvidia.ko

init_process_group

store TCPStore

rank == 0 acts as the server of the TCPStore rendezvous handler

hostname

port

tcp://

rank

world_size

TCPStore

When isServer is True, a TCPStoreDaemon is started internally

When waitWorkerReady is True, it polls every 10 ms to check whether enough workers (workerNumber) have joined