
k8s 1.19

overview

https://mp.weixin.qq.com/s/0OLdyVwg4Nsw0Xvvg8if5w

This Alibaba Cloud Native article on pod creation efficiency is a good one; it analyzes docker pull acceleration techniques end to end.

https://qiankunli.github.io/2015/09/22/docker_image.html

docker pull phase

The figure above shows the stages of a docker pull:

  1. download
  2. decompress
  3. union the layer files into a rootfs (not shown in the figure)

docker pull acceleration therefore targets each of these stages.

Download acceleration

p2p powered docker registry

https://d7y.io/

Dragonfly is an intelligent P2P based image and file distribution system

Originally it was born to solve all kinds of distribution at very large scales, such as application distribution, cache distribution, log distribution, image distribution, and so on

https://github.com/uber/kraken

Kraken is a P2P-powered Docker registry that focuses on scalability and availability. It is designed for Docker image management, replication, and distribution in a hybrid cloud environment

all participants can reach a minimum of 80% max upload/download speed in theory (60% with current implementation), and performance doesn’t degrade much as the blob size and cluster size increase

Speeds up distributing docker images to nodes at scale; a good fit for releasing or updating containerized services.

docker registry mirror

https://docs.docker.com/registry/recipes/mirror/

Decompression acceleration

https://kubernetes.io/docs/setup/production-environment/container-runtimes/

https://kubernetes.io/docs/setup/production-environment/container-runtimes/#containerd

gzip/gunzip are single-threaded compression/decompression tools; consider pigz/unpigz for multi-threaded compression/decompression to take full advantage of multiple cores.

containerd has supported pigz since version 1.2: once the unpigz tool is installed on a node, containerd prefers it for decompression. This lets image decompression exploit the node's multiple cores.
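A minimal Go sketch of that preference, assuming only that unpigz behaves like a drop-in multi-threaded gunzip; this is an illustration, not containerd's actual decompression code:

package main

import (
    "compress/gzip"
    "io"
    "log"
    "os"
    "os/exec"
)

// newGzipReader decompresses r, preferring the external multi-threaded
// unpigz binary when it is on PATH and falling back to Go's
// single-threaded compress/gzip otherwise.
func newGzipReader(r io.Reader) (io.ReadCloser, error) {
    if path, err := exec.LookPath("unpigz"); err == nil {
        cmd := exec.Command(path, "-d", "-c") // decompress to stdout
        cmd.Stdin = r
        out, err := cmd.StdoutPipe()
        if err != nil {
            return nil, err
        }
        if err := cmd.Start(); err != nil {
            return nil, err
        }
        return out, nil // a real implementation would also Wait on Close
    }
    log.Println("unpigz not found, using in-process gzip")
    return gzip.NewReader(r)
}

func main() {
    rc, err := newGzipReader(os.Stdin) // e.g. go run main.go < layer.tar.gz > layer.tar
    if err != nil {
        log.Fatal(err)
    }
    defer rc.Close()
    io.Copy(os.Stdout, rc)
}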

k8s acceleration

docker images pull policy

https://kubernetes.io/docs/concepts/containers/images/#image-pull-policy

  1. Never
  2. IfNotPresent
  3. Always

Never: the kubelet never pulls; it starts the container straight from the local image. Shortest path.

IfNotPresent: the kubelet checks the local docker image list; on a hit it starts the container from the local image, on a miss it falls back to docker pull. Medium path.

Always: even if the image has not changed, there is an extra remote docker registry query; on a miss it falls back to docker pull. Longest path.

every time the kubelet launches a container, the kubelet queries the container image registry to resolve the name to an image digest. If the kubelet has a container image with that exact digest cached locally, the kubelet uses its cached image; otherwise, the kubelet pulls the image with the resolved digest, and uses that image to launch the container.
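A hedged sketch of the decision those three policies imply (illustrative only; the kubelet's real logic lives in pkg/kubelet/images):

package main

import "fmt"

type pullPolicy string

const (
    never        pullPolicy = "Never"
    ifNotPresent pullPolicy = "IfNotPresent"
    always       pullPolicy = "Always"
)

// shouldPull sketches whether the kubelet needs to contact the registry
// (to pull, or at least resolve the digest) before starting the container.
func shouldPull(policy pullPolicy, presentLocally bool) bool {
    switch policy {
    case never:
        return false // start from the local image or fail
    case ifNotPresent:
        return !presentLocally // pull only on a local miss
    default: // always
        return true // resolve the digest remotely every time
    }
}

func main() {
    fmt.Println(shouldPull(ifNotPresent, true)) // false: local hit, shortest remaining path
    fmt.Println(shouldPull(always, true))       // true: still pays the registry round trip
}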

docker images pre-pulled

https://kubernetes.io/docs/concepts/containers/images/#pre-pulled-images

schedule imagelocality

https://kubernetes.io/docs/reference/scheduling/config/#scheduling-plugins

  • imagelocality

Note that it only applies to containers, not to init containers. It takes effect in the score phase of the k8s scheduler and uses the image size as the base score: the larger the image, the higher the imagelocality scheduling weight. To avoid the node heating problem (with a pure imagelocality strategy, multiple replicas of a pod would likely be scheduled onto the same node while other nodes go under-utilized), the final imagelocality score is additionally scaled by the image's spread ratio.

// scaledImageScore returns an adaptively scaled score for the given state of an image.
// The size of the image is used as the base score, scaled by a factor which considers how much nodes the image has "spread" to.
// This heuristic aims to mitigate the undesirable "node heating problem", i.e., pods get assigned to the same or
// a few nodes due to image locality.
func scaledImageScore(imageState *framework.ImageStateSummary, totalNumNodes int) int64 {
    spread := float64(imageState.NumNodes) / float64(totalNumNodes)
    return int64(float64(imageState.Size) * spread)
}

If only a few nodes in the k8s cluster have the image cached, the spread ratio is low and so is the imagelocality score; conversely, if most cluster nodes have the image cached, the spread ratio is high and the imagelocality score is correspondingly high.

This shows that k8s was designed with service management in mind: the scheduling strategy above is friendly to service high availability.

// calculatePriority returns the priority of a node. Given the sumScores of requested images on the node, the node's
// priority is obtained by scaling the maximum priority value with a ratio proportional to the sumScores.
func calculatePriority(sumScores int64, numContainers int) int64 {
    maxThreshold := maxContainerThreshold * int64(numContainers)
    if sumScores < minThreshold {
        sumScores = minThreshold
    } else if sumScores > maxThreshold {
        sumScores = maxThreshold
    }

    return int64(framework.MaxNodeScore) * (sumScores - minThreshold) / (maxThreshold - minThreshold)
}

Finally, the corrected imagelocality score becomes the image's priority in scheduling. Note that every score below minThreshold maps to the same priority: in other words, "no node has the image cached" and "only a few nodes have the image cached" get the same scheduling priority, which is exactly what avoids the node heating problem.
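A small self-contained example of how the two functions above combine; the threshold constants here are assumptions for illustration, not the scheduler's exported values:

package main

import "fmt"

// Hypothetical thresholds for illustration only; the real plugin defines
// its own minThreshold / maxContainerThreshold constants.
const (
    mb                    int64 = 1024 * 1024
    minThreshold                = 23 * mb
    maxContainerThreshold       = 1000 * mb
    maxNodeScore          int64 = 100
)

func scaledImageScore(imageSize int64, numNodesWithImage, totalNumNodes int) int64 {
    spread := float64(numNodesWithImage) / float64(totalNumNodes)
    return int64(float64(imageSize) * spread)
}

func calculatePriority(sumScores int64, numContainers int) int64 {
    maxThreshold := maxContainerThreshold * int64(numContainers)
    if sumScores < minThreshold {
        sumScores = minThreshold
    } else if sumScores > maxThreshold {
        sumScores = maxThreshold
    }
    return maxNodeScore * (sumScores - minThreshold) / (maxThreshold - minThreshold)
}

func main() {
    image := 500 * mb
    // 2 of 100 nodes cache the image: low spread, the score clamps to minThreshold, priority 0.
    fmt.Println(calculatePriority(scaledImageScore(image, 2, 100), 1))
    // 80 of 100 nodes cache the image: high spread, noticeably higher priority.
    fmt.Println(calculatePriority(scaledImageScore(image, 80, 100), 1))
}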

node heating problem

https://oracle.github.io/weblogic-kubernetes-operator/faq/node-heating/

this often results in Kubernetes running many of the Pods for WebLogic Server instances on the same Node while other Nodes are not fairly utilized. This is commonly known as the “Node heating problem.”

k8s gc unused docker images

https://kubernetes.io/docs/concepts/architecture/garbage-collection/#containers-images

Kubernetes manages the lifecycle of all images through its image manager

The kubelet considers the following disk usage limits when making garbage collection decisions:

  • HighThresholdPercent
  • LowThresholdPercent

Disk usage above the configured HighThresholdPercent value triggers garbage collection, which deletes images in order based on the last time they were used, starting with the oldest first. The kubelet deletes images until disk usage reaches the LowThresholdPercent value.
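A hedged Go sketch of that policy; the names and structure are illustrative, not the kubelet's actual image GC code:

package main

import (
    "fmt"
    "sort"
    "time"
)

type image struct {
    name     string
    sizeByte int64
    lastUsed time.Time
}

// gcImages deletes least-recently-used images while disk usage is above
// highPct, stopping once usage drops to lowPct (percentages of capacity).
func gcImages(images []image, usedBytes, capacityBytes, highPct, lowPct int64) []string {
    if usedBytes*100 < highPct*capacityBytes {
        return nil // below HighThresholdPercent: nothing to do
    }
    sort.Slice(images, func(i, j int) bool { return images[i].lastUsed.Before(images[j].lastUsed) })
    var deleted []string
    for _, img := range images {
        if usedBytes*100 <= lowPct*capacityBytes {
            break // reached LowThresholdPercent
        }
        usedBytes -= img.sizeByte
        deleted = append(deleted, img.name)
    }
    return deleted
}

func main() {
    now := time.Now()
    imgs := []image{
        {"old", 30 << 30, now.Add(-48 * time.Hour)},
        {"new", 10 << 30, now.Add(-1 * time.Hour)},
    }
    // 90 GiB used of 100 GiB, high=85%, low=60%: only the oldest image is deleted.
    fmt.Println(gcImages(imgs, 90<<30, 100<<30, 85, 60)) // [old]
}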

docker image cache management

https://github.com/senthilrch/kube-fledged

kube-fledged is a kubernetes operator for creating and managing a cache of container images directly on the worker nodes of a kubernetes cluster

kube-fledged provides CRUD APIs to manage the lifecycle of the image cache

https://github.com/senthilrch/kube-fledged/blob/master/docs/design-proposal.md

On-demand loading of image files

https://mp.weixin.qq.com/s/0OLdyVwg4Nsw0Xvvg8if5w

Creating a container on a node today requires pulling all of the image data to the node before the container can start. Compare that with booting a virtual machine: even with a VM image of hundreds of GB, the VM usually boots in seconds, and the image size is barely noticeable.

《Slacker: Fast Distribution with Lazy Docker Containers》

The paper finds that pulling the image accounts for 76% of container start time, yet only 6.4% of that data is read during startup. Very little image data is actually needed to start, so it is worth loading the image on demand during startup and changing how images are consumed.

Instead of "the image can only start after all of its layers are downloaded", the container should load image data on demand at start time, much like booting a VM, transferring over the network only the data the startup phase needs.

That said, this is a big change to the existing architecture ...

https://developer.aliyun.com/article/742103

2020/01/08

summary

Container image startup acceleration techniques

Container image

  1. Image download acceleration
    1. docker registry mirror: a local mirror, shorter distance to travel
    2. p2p (Dragonfly, Kraken): distribution acceleration
  2. Image decompression acceleration
    1. container runtime: containerd with unpigz, multi-threaded decompression
  3. On-demand image loading
    1. Slacker: Fast Distribution with Lazy Docker Containers: Our analysis shows that pulling packages accounts for 76% of container start time, but only 6.4% of that data is read.

k8s

  1. imagePullPolicy
    1. Never: shortest path, relies on pre-pulled images
    2. IfNotPresent
    3. Always
  2. schedule imagelocality: a scheduling optimization; when most nodes in the cluster already cache the image, prefer scheduling the pod onto a node that has the cache
  3. docker images pre-pulled + docker image cache management: cluster-level docker image cache management

The k8s items are optimizations at the business-logic layer.

https://www.bilibili.com/video/BV1YA4y197G8?spm_id_from=333.337.search-card.all.click

https://www.usenix.org/system/files/conference/osdi14/osdi14-paper-li_mu.pdf

https://github.com/dmlc/ps-lite

Key implementation points of Mu Li's parameter server (ps-lite) paper, summarized below.

server high availability

  1. Multi-replica replication: every modification or write to a server is replicated to two other replicas before it replies ok. This adds latency and requires clients to retry on failure.
  2. Consistent hashing: the key-value datastore uses it to improve load balancing and recovery (see the sketch below)

https://memcached.org/ adopts a similar high-availability strategy.
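A minimal consistent-hashing sketch in Go, purely illustrative (ps-lite's actual key ranges and ring are implemented differently): keys hash onto a ring of virtual server nodes, so adding or removing a server only remaps a small fraction of keys.

package main

import (
    "fmt"
    "hash/fnv"
    "sort"
)

// ring maps hash points to server names; each server gets several
// virtual nodes so load spreads evenly.
type ring struct {
    points  []uint32
    servers map[uint32]string
}

func hash32(s string) uint32 {
    h := fnv.New32a()
    h.Write([]byte(s))
    return h.Sum32()
}

func newRing(servers []string, vnodes int) *ring {
    r := &ring{servers: map[uint32]string{}}
    for _, s := range servers {
        for i := 0; i < vnodes; i++ {
            p := hash32(fmt.Sprintf("%s#%d", s, i))
            r.points = append(r.points, p)
            r.servers[p] = s
        }
    }
    sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
    return r
}

// locate returns the server owning key: the first ring point clockwise
// from the key's hash.
func (r *ring) locate(key string) string {
    h := hash32(key)
    i := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= h })
    if i == len(r.points) {
        i = 0
    }
    return r.servers[r.points[i]]
}

func main() {
    r := newRing([]string{"server-0", "server-1", "server-2"}, 16)
    fmt.Println(r.locate("weight-shard-42"))
}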

server consistency

  1. Vector clock: record which weights (w) were sent at what time (t); vector clocks make it easy to implement various consistency models, given the potentially complex task dependency graph and the need for fast recovery (a toy sketch follows below)
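A toy vector-clock sketch, my own illustration rather than the paper's data structure: each node keeps one counter per peer, merges on receive, and a clock that is elementwise <= another (and strictly smaller somewhere) happened before it.

package main

import "fmt"

// VClock maps a node id to its logical counter.
type VClock map[string]uint64

// Tick advances this node's own counter before it sends state.
func (v VClock) Tick(node string) { v[node]++ }

// Merge takes the elementwise max, applied when remote state is received.
func (v VClock) Merge(o VClock) {
    for n, c := range o {
        if c > v[n] {
            v[n] = c
        }
    }
}

// Before reports whether v happened before o: v is elementwise <= o and
// o is strictly greater somewhere.
func (v VClock) Before(o VClock) bool {
    for n, c := range v {
        if c > o[n] {
            return false
        }
    }
    strictly := false
    for n, c := range o {
        if c > v[n] {
            strictly = true
        }
    }
    return strictly
}

func main() {
    server, worker := VClock{}, VClock{}
    server.Tick("server-0")            // server pushes weights at t=1
    worker.Merge(server)               // worker pulls them
    worker.Tick("worker-3")            // worker computes an update
    fmt.Println(server.Before(worker)) // true: the push happened before the update
}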

worker high availability

The paper notes:

  • In general a worker only computes over a portion of the data, and losing that portion usually does not hurt the final model much
  • If each worker holds a large amount of data, recovering a worker is expensive, arguably more trouble than recovering a server

So worker high availability is left to the algorithm designer, who may well prefer algorithms that keep training going even when a worker dies.

summary

The parameter server pulls together existing techniques from several fields; it is a domain-specific memcached/redis (both of which are kv datastores). Of course ps is designed for machine learning: by tuning the ML algorithms to fit the ps API, it solves the practical (industrial-scale) problem of large-scale training.

The novelty of the proposed system lies in the synergy achieved by picking the right systems techniques, adapting them to the machine learning algorithms, and modifying the machine learning algorithms to be more systems-friendly. In particular, we can relax a number of otherwise hard systems constraints since the associated machine learning algorithms are quite tolerant to perturbations. The consequence is the first general purpose ML system capable of scaling to industrial scale sizes.

  • conda 4.12.0
  • macOS 10.15.7

The examples below show output from my personal PC environment.

Conda makes environments first-class citizens, making it easy to create independent environments even for C libraries. Conda is written entirely in Python

# conda environments:
#
base                     /Users/huangzhesi/miniconda
conda-dev             *  /Users/huangzhesi/miniconda/envs/conda-dev
pandas                   /Users/huangzhesi/miniconda/envs/pandas
pytorch-1.8              /Users/huangzhesi/miniconda/envs/pytorch-1.8
pytorch-dev              /Users/huangzhesi/miniconda/envs/pytorch-dev
tensorflow-1.x           /Users/huangzhesi/miniconda/envs/tensorflow-1.x

Let's explore what conda actually does for us after we type the conda activate conda-dev command.

Conclusions first:

  1. It puts the conda env's bin directory on the PATH environment variable (replacing the old conda env's entry, e.g. the base conda env)
  2. It does not modify the LD_LIBRARY_PATH environment variable, for the reasons below

https://docs.conda.io/projects/conda-build/en/latest/resources/use-shared-libraries.html#shared-libraries-in-macos-and-linux

https://conda.io/projects/conda-build/en/latest/concepts/recipe.html#prefix-replacement

https://github.com/conda/conda/issues/308#issuecomment-36058087

the problem with activate setting LD_LIBRARY_PATH (even when conda packages themselves don’t need it) is that it might break other things on the users system.

Source code

https://github.com/conda/conda/tree/4.12.0

The conda activate command is implemented in conda/activate.py.

The call sequence is:

  1. activate
  2. build_activate
  3. _build_activate_stack

It ultimately returns a structure:

return {
    'unset_vars': unset_vars,
    'set_vars': set_vars,
    'export_vars': export_vars,
    'deactivate_scripts': deactivate_scripts,
    'activate_scripts': activate_scripts,
}

So the conda activate command actually unsets / sets / exports vars to achieve the effect of activating an environment.

Activation flow

Look up the conda env path

def conda_prefix(self):
    return abspath(sys.prefix)

root_prefix case = conda_prefix = /Users/huangzhesi/miniconda

prefix magic file = {conda_prefix}/conda-meta/history

# path is the prefix magic file
if isfile(path):
    try:
        fh = open(path, 'a+')

Test whether the history file is readable and writable:

ls -alh /Users/huangzhesi/miniconda/conda-meta | grep history
-rw-r--r--  1 huangzhesi  staff   8.2K 12 19 21:44 history

(1) If the history file is readable and writable, the context envs dirs are searched in this order:

  1. /Users/huangzhesi/miniconda/envs
  2. ~/.conda/envs

(2) If the history file is not readable/writable, the context envs dirs are searched in this order:

  1. ~/.conda/envs
  2. /Users/huangzhesi/miniconda/envs

Look up the env to be activated from the context envs dirs:

# name is the conda activate {name}
for envs_dir in envs_dirs:
    if not isdir(envs_dir):
        continue
    prefix = join(envs_dir, name)
    if isdir(prefix):
        return abspath(prefix)

At this point the prefix is determined: prefix = locate_prefix_by_name(env_name_or_prefix)

prefix = /Users/huangzhesi/miniconda/envs/conda-dev

  • CONDA_SHLVL=1
  • CONDA_PREFIX=/Users/huangzhesi/miniconda

Replace old_conda_prefix, e.g. the base conda env:

new_path = self.pathsep_join(
    self._replace_prefix_in_path(old_conda_prefix, prefix))
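Language aside, the effect of _replace_prefix_in_path is roughly the following; a rough Go sketch under that assumption (conda's real implementation handles many more edge cases):

package main

import (
    "fmt"
    "path/filepath"
    "strings"
)

// replacePrefixInPath swaps PATH entries under oldPrefix for the same
// entries under newPrefix, leaving everything else untouched.
func replacePrefixInPath(pathVar, oldPrefix, newPrefix string) string {
    parts := filepath.SplitList(pathVar)
    for i, p := range parts {
        if strings.HasPrefix(p, oldPrefix) {
            parts[i] = newPrefix + strings.TrimPrefix(p, oldPrefix)
        }
    }
    return strings.Join(parts, string(filepath.ListSeparator))
}

func main() {
    path := "/Users/huangzhesi/miniconda/bin:/usr/bin:/bin"
    // base env's bin entry is replaced by the conda-dev env's bin entry
    fmt.Println(replacePrefixInPath(path,
        "/Users/huangzhesi/miniconda",
        "/Users/huangzhesi/miniconda/envs/conda-dev"))
}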

Environment variables that need to be set:

env_vars_to_export = OrderedDict((
    ('path', new_path),
    ('conda_prefix', prefix),
    ('conda_shlvl', new_conda_shlvl),
    ('conda_default_env', conda_default_env),
    ('conda_prompt_modifier', conda_prompt_modifier)))

set ld_library_path inside python

https://stackoverflow.com/questions/6543847/setting-ld-library-path-from-inside-python

https://stackoverflow.com/questions/856116/changing-ld-library-path-at-runtime-for-ctypes

Fairly hacky and not elegant ...

If a non-conda package installed in a conda env depends on shared libraries, there is no great option; just set the LD_LIBRARY_PATH environment variable by hand.

device plugin init and list-watch

init

When the device plugin starts:

func (m *NvidiaDevicePlugin) initialize() {
    m.cachedDevices = m.Devices()
    m.server = grpc.NewServer([]grpc.ServerOption{}...)
    m.health = make(chan *Device)
    m.stop = make(chan interface{})
}

It calls m.Devices() to collect the list of GPU devices on the current node.

list-watch

It returns GPU device details; note that an unhealthy device gets its Health field set to Unhealthy:

for {
    select {
    case <-m.stop:
        return nil
    case d := <-m.health:
        // FIXME: there is no way to recover from the Unhealthy state.
        d.Health = pluginapi.Unhealthy
        log.Printf("'%s' device marked unhealthy: %s", m.resourceName, d.ID)
        s.Send(&pluginapi.ListAndWatchResponse{Devices: m.apiDevices()})
    }
}

device plugin health check

The health check implementation is also fairly direct:

go m.CheckHealth(m.stop, m.cachedDevices, m.health)

Each discovered device is registered with an eventSet through the nvml Go library API; a device that does not support this API is marked Unhealthy right away.

After registration succeeds, a for loop waits for events:

// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
// Application errors: the GPU should still be healthy
applicationErrorXids := []uint64{
    13, // Graphics Engine Exception
    31, // GPU memory page fault
    43, // GPU stopped processing
    45, // Preemptive cleanup, due to previous errors
    68, // Video processor exception
}

skippedXids := make(map[uint64]bool)
for _, id := range applicationErrorXids {
    skippedXids[id] = true
}

for {
    select {
    case <-stop:
        return
    default:
    }

    e, err := nvml.WaitForEvent(eventSet, 5000)
    if err != nil && e.Etype != nvml.XidCriticalError {
        continue
    }

    if skippedXids[e.Edata] {
        continue
    }

    if e.UUID == nil || len(*e.UUID) == 0 {
        // All devices are unhealthy
        log.Printf("XidCriticalError: Xid=%d, All devices will go unhealthy.", e.Edata)
        for _, d := range devices {
            unhealthy <- d
        }
        continue
    }

    for _, d := range devices {
        // Please see https://github.com/NVIDIA/gpu-monitoring-tools/blob/148415f505c96052cb3b7fdf443b34ac853139ec/bindings/go/nvml/nvml.h#L1424
        // for the rationale why gi and ci can be set as such when the UUID is a full GPU UUID and not a MIG device UUID.
        gpu, gi, ci, err := nvml.ParseMigDeviceUUID(d.ID)
        if err != nil {
            gpu = d.ID
            gi = 0xFFFFFFFF
            ci = 0xFFFFFFFF
        }

        if gpu == *e.UUID && gi == *e.GpuInstanceId && ci == *e.ComputeInstanceId {
            log.Printf("XidCriticalError: Xid=%d on Device=%s, the device will go unhealthy.", e.Edata, d.ID)
            unhealthy <- d
        }
    }
}

Note that the GPU device plugin skips certain Xids, because those Xids are clearly not hardware faults.

NVIDIA Health & Diagnostic

https://docs.nvidia.com/deploy/index.html

xid

https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4

The Xid message is an error report from the NVIDIA driver that is printed to the operating system’s kernel log or event log. Xid messages indicate that a general GPU error occurred, most often due to the driver programming the GPU incorrectly or to corruption of the commands sent to the GPU. The messages can be indicative of a hardware problem, an NVIDIA software problem, or a user application problem.

Under Linux, the Xid error messages are placed in the location /var/log/messages. Grep for “NVRM: Xid” to find all the Xid messages.

NVVS (NVIDIA Validation Suite)

https://docs.nvidia.com/deploy/nvvs-user-guide/index.html

Easily integrate into Cluster Scheduler and Cluster Management applications

k8s device

type ListAndWatchResponse struct {
    Devices []*Device `protobuf:"bytes,1,rep,name=devices,...`

    ...
}

// E.g:
// struct Device {
//     ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e",
//     Health: "Healthy",
//     Topology:
//       Node:
//         ID: 1

Combined with the Health information, the k8s scheduler can then ignore Unhealthy GPU devices.
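For illustration, a hedged sketch of reporting one healthy and one unhealthy device, assuming the deviceplugin v1beta1 Go package published under k8s.io/kubelet (the second GPU UUID is a made-up placeholder); a real plugin would Send this response on its ListAndWatch stream:

package main

import (
    "fmt"

    pluginapi "k8s.io/kubelet/pkg/apis/deviceplugin/v1beta1"
)

func main() {
    // One healthy and one unhealthy GPU, as the kubelet would receive them
    // in a ListAndWatchResponse; unhealthy devices are excluded from the
    // node's allocatable device count.
    resp := &pluginapi.ListAndWatchResponse{
        Devices: []*pluginapi.Device{
            {ID: "GPU-fef8089b-4820-abfc-e83e-94318197576e", Health: pluginapi.Healthy},
            {ID: "GPU-00000000-dead-beef-0000-000000000000", Health: pluginapi.Unhealthy},
        },
    }
    fmt.Println(resp)
}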

Related issue: https://github.com/kubernetes/kubernetes/issues/72486. When an init container requests device resources, it appears to lock them, so the app container cannot obtain enough devices.

Goal of this discussion: explore how k8s implements GPU mounting for init containers.

The code below is from the k8s release-1.19 branch.

kubelet sync pod

pkg/kubelet/kubelet_pods.go

Rough call sequence:

  1. syncPod
  2. SyncPod
  3. startContainer
  4. generateContainerConfig
  5. GenerateRunContainerOptions
  6. GetResources
  7. GetDeviceRunContainerOptions
  8. Allocate

// syncPod is the transaction script for the sync of a single pod.

SyncPod

// Step 6: start the init container.
if container := podContainerChanges.NextInitContainerToStart; container != nil {
    // Start the next init container.
    if err := start("init container", containerStartSpec(container)); err != nil {
        return
    }

    // Successfully started the container; clear the entry in the failure
    klog.V(4).Infof("Completed init container %q for pod %q", container.Name, format.Pod(pod))
}

// Step 7: start containers in podContainerChanges.ContainersToStart.
for _, idx := range podContainerChanges.ContainersToStart {
    start("container", containerStartSpec(&pod.Spec.Containers[idx]))
}

When a pod's state changes, the rough call sequence is:

  1. dispatchWork
  2. UpdatePod
  3. managePodLoop (goroutine)

// Creating a new pod worker either means this is a new pod, or that the kubelet just restarted.

managePodLoop keeps reading from the podUpdates channel and calls syncPod:

func (p *podWorkers) managePodLoop(podUpdates <-chan UpdatePodOptions) {
    var lastSyncTime time.Time
    for update := range podUpdates {
        err := func() error {
            podUID := update.Pod.UID
            // This is a blocking call that would return only if the cache
            // has an entry for the pod that is newer than minRuntimeCache
            // Time. This ensures the worker doesn't start syncing until
            // after the cache is at least newer than the finished time of
            // the previous sync.
            status, err := p.podCache.GetNewerThan(podUID, lastSyncTime)
            if err != nil {
                // This is the legacy event thrown by manage pod loop
                // all other events are now dispatched from syncPodFn
                p.recorder.Eventf(update.Pod, v1.EventTypeWarning, events.FailedSync, "error determining status: %v", err)
                return err
            }
            err = p.syncPodFn(syncPodOptions{
                mirrorPod:      update.MirrorPod,
                pod:            update.Pod,
                podStatus:      status,
                killPodOptions: update.KillPodOptions,
                updateType:     update.UpdateType,
            })
            lastSyncTime = time.Now()
            return err
        }()
        // notify the call-back function if the operation succeeded or not
        if update.OnCompleteFunc != nil {
            update.OnCompleteFunc(err)
        }
        if err != nil {
            // IMPORTANT: we do not log errors here, the syncPodFn is responsible for logging errors
            klog.Errorf("Error syncing pod %s (%q), skipping: %v", update.Pod.UID, format.Pod(update.Pod), err)
        }
        p.wrapUp(update.Pod.UID, err)
    }
}

Putting this together: the kubelet can process change events for multiple pods concurrently (each pod's syncPod runs in its own goroutine), but different events for a single pod are processed serially inside that pod's loop.
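A stripped-down sketch of that per-pod worker pattern, as my own illustration rather than the kubelet's podWorkers type: one goroutine and one channel per pod UID, so pods are synced in parallel while updates to the same pod are serialized.

package main

import (
    "fmt"
    "sync"
    "time"
)

type update struct{ podUID, event string }

type workers struct {
    mu      sync.Mutex
    updates map[string]chan update
}

// dispatch hands an update to the pod's worker goroutine, creating the
// goroutine and its channel on first use. Different pods run in parallel;
// updates for one pod are consumed one at a time.
func (w *workers) dispatch(u update) {
    w.mu.Lock()
    ch, ok := w.updates[u.podUID]
    if !ok {
        ch = make(chan update, 1)
        w.updates[u.podUID] = ch
        go func() {
            for u := range ch {
                fmt.Println("sync", u.podUID, u.event)
                time.Sleep(10 * time.Millisecond) // pretend to sync the pod
            }
        }()
    }
    w.mu.Unlock()
    ch <- u
}

func main() {
    w := &workers{updates: map[string]chan update{}}
    w.dispatch(update{"pod-a", "ADD"})
    w.dispatch(update{"pod-b", "ADD"})    // handled concurrently with pod-a
    w.dispatch(update{"pod-a", "UPDATE"}) // waits until pod-a's ADD has been synced
    time.Sleep(100 * time.Millisecond)    // let the workers drain before exiting
}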

kubelet admit pod

So how does device resource allocation avoid conflicts between different pods?

When admitting a pod, the kubelet calls the deviceManager Allocate API to allocate device resources.

The kubelet handles pod additions roughly in this order:

  1. syncLoopIteration
  2. kubetypes.ADD
  3. HandlePodAdditions
  4. canAdmitPod

It then loops over the pods one by one, calling canAdmitPod for each.

In other words, the kubelet handles pod additions without concurrency, one at a time.

The resourceAllocator admit handler; note the allocation order: init containers first, then containers:

func (m *resourceAllocator) Admit(attrs *lifecycle.PodAdmitAttributes) lifecycle.PodAdmitResult {
    pod := attrs.Pod

    for _, container := range append(pod.Spec.InitContainers, pod.Spec.Containers...) {
        err := m.deviceManager.Allocate(pod, &container)
        if err != nil {
            return lifecycle.PodAdmitResult{
                Message: fmt.Sprintf("Allocate failed due to %v, which is unexpected", err),
                Reason:  "UnexpectedAdmissionError",
                Admit:   false,
            }
        }

        if m.cpuManager != nil {
            err = m.cpuManager.Allocate(pod, &container)
            if err != nil {
                return lifecycle.PodAdmitResult{
                    Message: fmt.Sprintf("Allocate failed due to %v, which is unexpected", err),
                    Reason:  "UnexpectedAdmissionError",
                    Admit:   false,
                }
            }
        }
    }

    return lifecycle.PodAdmitResult{Admit: true}
}

Drilling down into deviceManager Allocate:

// Allocate is the call that you can use to allocate a set of devices
// from the registered device plugins.
func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
    if _, ok := m.devicesToReuse[string(pod.UID)]; !ok {
        m.devicesToReuse[string(pod.UID)] = make(map[string]sets.String)
    }
    // If pod entries to m.devicesToReuse other than the current pod exist, delete them.
    for podUID := range m.devicesToReuse {
        if podUID != string(pod.UID) {
            delete(m.devicesToReuse, podUID)
        }
    }
    // Allocate resources for init containers first as we know the caller always loops
    // through init containers before looping through app containers. Should the caller
    // ever change those semantics, this logic will need to be amended.
    for _, initContainer := range pod.Spec.InitContainers {
        if container.Name == initContainer.Name {
            if err := m.allocateContainerResources(pod, container, m.devicesToReuse[string(pod.UID)]); err != nil {
                return err
            }
            m.podDevices.addContainerAllocatedResources(string(pod.UID), container.Name, m.devicesToReuse[string(pod.UID)])
            return nil
        }
    }
    if err := m.allocateContainerResources(pod, container, m.devicesToReuse[string(pod.UID)]); err != nil {
        return err
    }
    m.podDevices.removeContainerAllocatedResources(string(pod.UID), container.Name, m.devicesToReuse[string(pod.UID)])
    return nil
}

Note that devices are allocated for the init container first, and the allocated devices are then added to devicesToReuse via addContainerAllocatedResources. If the next loop iteration allocates for an app container, devicesToReuse is consumed first; once that allocation is done, removeContainerAllocatedResources subtracts the newly allocated devices from devicesToReuse.
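A toy illustration of that reuse flow with hypothetical helper names (the real manager works on sets.String plus podDevices bookkeeping): the init container's devices go into a reuse pool, the app container drains that pool first, and whatever it takes is removed from the pool.

package main

import "fmt"

type set map[string]bool

// allocate hands out `want` devices, preferring the reuse pool and then
// falling back to the free pool.
func allocate(want int, reuse, free set) set {
    got := set{}
    take := func(pool set) {
        for id := range pool {
            if len(got) == want {
                return
            }
            got[id] = true
            delete(pool, id)
        }
    }
    take(reuse)
    take(free)
    return got
}

func main() {
    free := set{"GPU-0": true, "GPU-1": true, "GPU-2": true, "GPU-3": true}
    reuse := set{}

    initDevs := allocate(2, reuse, free) // init container gets 2 GPUs
    for id := range initDevs {
        reuse[id] = true // its devices become reusable
    }
    appDevs := allocate(2, reuse, free) // app container drains the reuse pool first
    fmt.Println(initDevs, appDevs)      // the two sets hold the same GPU IDs
}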

devicesToAllocate

// Allocates from reusableDevices list first.
if allocateRemainingFrom(reusableDevices) {
    return allocated, nil
}

summary

Related issue: https://github.com/kubernetes/kubernetes/issues/72486#issuecomment-482554372

Returning to that issue: given the allocation logic above, the bug where an init container's device request blocked the app container from getting devices has been fixed.

The implementation also tells us that, for a pod yaml like the one in the issue:

apiVersion: v1
kind: Pod
metadata:
  name: busybox
spec:
  containers:
  - name: busybox
    image: busybox
    args:
    - sleep
    - "10"
    resources:
      requests:
        alpha.kubernetes.io/nvidia-gpu: 4
        cpu: "2"
        memory: 4Gi
      limits:
        alpha.kubernetes.io/nvidia-gpu: 4
        cpu: "2"
        memory: 4Gi
  initContainers:
  - name: init-myservice
    image: busybox
    command: ['sh', '-c', 'sleep 200']
    resources:
      requests:
        alpha.kubernetes.io/nvidia-gpu: 4
        cpu: "2"
        memory: 4Gi
      limits:
        alpha.kubernetes.io/nvidia-gpu: 4
        cpu: "2"
        memory: 4Gi
  restartPolicy: Never

From the k8s implementation logic, the devices the init container requests will in fact be the same devices the app container gets, because:

  1. Device allocation (admit) proceeds pod by pod, so there is no concurrent allocation across pods
  2. Within a pod, devices are allocated in order: init containers first, then app containers
  3. Devices already allocated to init containers are recorded as devicesToReuse
  4. When app containers are allocated afterwards, devicesToReuse is used first

There is, however, one workaround case inside syncPod:

func (m *ManagerImpl) GetDeviceRunContainerOptions(pod *v1.Pod, container *v1.Container) (*DeviceRunContainerOptions, error) {

    ...

    for k := range container.Resources.Limits {
        ...

        // This is a device plugin resource yet we don't have cached
        // resource state. This is likely due to a race during node
        // restart. We re-issue allocate request to cover this race.
        if m.podDevices.containerDevices(podUID, contName, resource) == nil {
            needsReAllocate = true
        }
    }

    if needsReAllocate {
        klog.V(2).Infof("needs re-allocate device plugin resources for pod %s, container %s", podUID, container.Name)
        if err := m.Allocate(pod, container); err != nil {
            return nil, err
        }
    }

    ...

}

A commit from 2019/11/10:

Checks whether we have cached runtime state before starting a container that requests any device plugin resource. If not, re-issue Allocate grpc calls. This allows us to handle the edge case that a pod got assigned to a node even before it populates its extended resource capacity.

The comment explains that this happens when a node restarts and a pod gets assigned to the node before its extended resource capacity has been populated.

Back to the deviceManager Allocate method:

// Allocate is the call that you can use to allocate a set of devices
// from the registered device plugins.
func (m *ManagerImpl) Allocate(pod *v1.Pod, container *v1.Container) error {
    if _, ok := m.devicesToReuse[string(pod.UID)]; !ok {
        m.devicesToReuse[string(pod.UID)] = make(map[string]sets.String)
    }
    // If pod entries to m.devicesToReuse other than the current pod exist, delete them.
    for podUID := range m.devicesToReuse {
        if podUID != string(pod.UID) {
            delete(m.devicesToReuse, podUID)
        }
    }

    ...

}

Notice that it uses a plain map, which is not safe for concurrent use. So if the workaround code above were ever triggered by more than one pod at a time, there would be a data race. Since it has been in the tree this long without being fixed, I take that as evidence the workaround never runs for multiple pods concurrently ... :) And this is not just guesswork: the discussion on the PR that merged this code also confirms that admission is serialized.

https://github.com/kubernetes/kubernetes/pull/87759

https://github.com/kubernetes/kubernetes/pull/87759#pullrequestreview-364185345

The reviewers actually noticed how odd this implementation is, but left it as is: it predates the PR, and the refactor did not change the original logic.

https://github.com/kubernetes/kubernetes/pull/87759#pullrequestreview-353195106

The design idea is simply that the resources of the init container keep being handed on to the app containers.

I’d need to look closer at this, but is the idea to:

  1. Unconditionally allocate CPUs to the container from the pool of available CPUs

  2. Check if the container we just allocated to is an init container

  3. if it IS an init container, reset the pool of available CPUs to re-include the CPUs just assigned to the init container (but keep them assigned to the init container in the process).

  4. If it is NOT an init container, just return (leaving the CPUs removed from the pool of available CPUs).

https://github.com/kubernetes/kubernetes/pull/87759#discussion_r383888297

This would only work if Pod admission is serialized. @derekwaynecarr can you confirm that this is the case?

In short, it was eventually confirmed to work.
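To make the concern concrete, a tiny standalone demo (not kubelet code) of what would go wrong if two pods ever hit that unsynchronized map at the same time; run it with the race detector, or watch for the runtime's concurrent map writes fault:

package main

import "sync"

func main() {
    // Same shape as the manager's devicesToReuse map, but written from two
    // goroutines at once, which the kubelet's serialized admission never does.
    devicesToReuse := map[string]map[string]bool{}

    var wg sync.WaitGroup
    for _, podUID := range []string{"pod-a", "pod-b"} {
        podUID := podUID
        wg.Add(1)
        go func() {
            defer wg.Done()
            for i := 0; i < 1000; i++ {
                devicesToReuse[podUID] = map[string]bool{}
                for uid := range devicesToReuse {
                    if uid != podUID {
                        delete(devicesToReuse, uid)
                    }
                }
            }
        }()
    }
    wg.Wait() // typically faults with "concurrent map writes", and always fails under -race
}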

https://developer.aliyun.com/article/784148

https://go.dev/ref/mem

Note that a read r may observe the value written by a write w that happens concurrently with r. Even if this occurs, it does not imply that reads happening after r will observe writes that happened before w.

var a, b int

func f() {
    a = 1
    b = 2
}

func g() {
    print(b)
    print(a)
}

func main() {
    go f()
    g()
}

it can happen that g prints 2 and then 0.

A send on a channel happens before the corresponding receive from that channel completes.

var c = make(chan int, 10)
var a string

func f() {
    a = "hello, world"
    c <- 0 // send on c
}

func main() {
    go f()
    <-c
    print(a)
}

is guaranteed to print “hello, world”. The write to a happens before the send on c, which happens before the corresponding receive on c completes, which happens before the print.

The closing of a channel happens before a receive that returns a zero value because the channel is closed.

In the previous example, replacing c <- 0 with close(c) yields a program with the same guaranteed behavior.

A receive from an unbuffered channel happens before the send on that channel completes.

var c = make(chan int)
var a string

func f() {
    a = "hello, world"
    <-c
}

func main() {
    go f()
    c <- 0
    print(a)
}

is also guaranteed to print “hello, world”. The write to a happens before the receive on c, which happens before the corresponding send on c completes, which happens before the print.

If the channel were buffered (e.g., c = make(chan int, 1)) then the program would not be guaranteed to print “hello, world”. (It might print the empty string, crash, or do something else.)

The kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.

This program starts a goroutine for every entry in the work list, but the goroutines coordinate using the limit channel to ensure that at most three are running work functions at a time.

var limit = make(chan int, 3)

func main() {
    for _, w := range work {
        go func(w func()) {
            limit <- 1
            w()
            <-limit
        }(w)
    }
    select{}
}

Configure RoCE

https://community.mellanox.com/s/article/howto-configure-roce-on-connectx-4

https://community.mellanox.com/s/article/understanding-show-gids-script

Use ibv_query_gid and ibv_find_gid_index functions defined in libibverbs to get the desired GID index.

Based on the material above, RoCE first of all requires NIC support, e.g. a Mellanox ConnectX-4.

Taking a Mellanox NIC as an example:

  1. Find the network device that the mlnx device's GID maps to

cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/1

  2. Check the RoCE type of GID 1

cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/1

  3. Check the address of GID 1

cat /sys/class/infiniband/mlx5_0/ports/1/gids/1

Interface   GID Index   RoCE version   GID Address
ens785f0    1           RoCEv2         fe80:0000:0000:0000:e61d:2dff:fef2:a488

Once the GID to use is determined, ib_send_bw can be pointed at that GID to run RoCE communication.

Also note:

https://community.mellanox.com/s/article/howto-configure-roce-on-connectx-4

A VLAN interface added on top of the network device that the mlnx device maps to also supports RoCE.

RoCE in container

NCCL RoCE failed in container

NCCL WARN Call to ibv_modify_qp failed with error No such device

// IB setup
ibv_context* ctx = ncclIbDevs[lComm->dev].context;
uint8_t ib_port = ncclIbDevs[lComm->dev].port;
struct ibv_port_attr portAttr;
NCCLCHECK(wrap_ibv_query_port(ctx, ib_port, &portAttr));
union ibv_gid gid;
NCCLCHECK(wrap_ibv_query_gid(ctx, ib_port, ncclParamIbGidIndex(), &gid));

// QP Creation
NCCLCHECK(ncclIbInitVerbs(ctx, &rComm->verbs));
NCCLCHECK(ncclIbCreateQp(ib_port, &rComm->verbs, IBV_ACCESS_REMOTE_WRITE, &rComm->qp));

// Adjust the MTU
remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);

// Setup QP
struct ibv_qp* qp = rComm->qp;
NCCLCHECK(ncclIbRtrQp(qp, &remQpInfo));
NCCLCHECK(ncclIbRtsQp(qp));

ncclIbRtrQp

ncclResult_t ncclIbRtrQp(ibv_qp* qp, struct ncclIbQpInfo* info) {
  struct ibv_qp_attr qpAttr;
  memset(&qpAttr, 0, sizeof(struct ibv_qp_attr));
  qpAttr.qp_state = IBV_QPS_RTR;
  qpAttr.path_mtu = info->mtu;
  qpAttr.dest_qp_num = info->qpn;
  qpAttr.rq_psn = 0;
  qpAttr.max_dest_rd_atomic = 1;
  qpAttr.min_rnr_timer = 12;
  if (info->lid == 0) {
    qpAttr.ah_attr.is_global = 1;
    qpAttr.ah_attr.grh.dgid.global.subnet_prefix = info->spn;
    qpAttr.ah_attr.grh.dgid.global.interface_id = info->iid;
    qpAttr.ah_attr.grh.flow_label = 0;
    qpAttr.ah_attr.grh.sgid_index = ncclParamIbGidIndex();
    qpAttr.ah_attr.grh.hop_limit = 255;
    qpAttr.ah_attr.grh.traffic_class = ncclParamIbTc();
  } else {
    qpAttr.ah_attr.is_global = 0;
    qpAttr.ah_attr.dlid = info->lid;
  }
  qpAttr.ah_attr.sl = ncclParamIbSl();
  qpAttr.ah_attr.src_path_bits = 0;
  qpAttr.ah_attr.port_num = info->ib_port;
  NCCLCHECK(wrap_ibv_modify_qp(qp, &qpAttr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER));
  return ncclSuccess;
}

My guess is that inside the container the mlnx device is discovered, but its corresponding network device (e.g. ens785f0 in the demo above) is not, so no usable GID can be found for RoCE communication.

ib_write_bw failed in container

Failed to modify QP 100 to RTR

ib_write_bw fails as well, and judging from the error message it fails at the same step as NCCL does in ncclIbRtrQp.

multus-cni

https://github.com/k8snetworkplumbingwg/multus-cni

In theory, multus-cni is needed to add the RoCE network device into the container as a macvlan interface.

https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin/issues/18

instead of calico, you should use macvlan cni where those virtual devices are child of enp175s0. RoCE can make use of those netdevices.

Other users are using multus plugin, which allows you to have multiple netdev interfaces in a Pod. Such as first managed default veth interface via your existing plugin, and second macvlan or sriov interface via 2nd cni.
This way you get both of both world for performance and functionality.

According to the multus-cni quick start docs, and assuming multus turns out to be compatible with the cluster's current default CNI plugin, an extra CRD resource needs to be created for the macvlan RoCE network device (if the host has multiple RoCE network devices, create one CRD resource per device):

cat <<EOF | kubectl create -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: macvlan-conf
spec:
  config: '{
      "cniVersion": "0.3.0",
      "type": "macvlan",
      "master": "eth0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.1.0/24",
        "rangeStart": "192.168.1.200",
        "rangeEnd": "192.168.1.216",
        "routes": [
          { "dst": "0.0.0.0/0" }
        ],
        "gateway": "192.168.1.1"
      }
    }'
EOF

This of course assumes the macvlan CNI plugin is already installed in the k8s cluster.

type: This tells CNI which binary to call on disk. Each CNI plugin is a binary that’s called. Typically, these binaries are stored in /opt/cni/bin on each node, and CNI executes this binary. In this case we’ve specified the loopback binary (which create a loopback-type network interface). If this is your first time installing Multus, you might want to verify that the plugins that are in the “type” field are actually on disk in the /opt/cni/bin directory.

https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/

https://www.cni.dev/plugins/current/main/macvlan/

https://docs.docker.com/network/macvlan/

Some applications, especially legacy applications or applications which monitor network traffic, expect to be directly connected to the physical network. In this type of situation, you can use the macvlan network driver to assign a MAC address to each container’s virtual network interface, making it appear to be a physical network interface directly connected to the physical network.

https://docs.docker.com/network/network-tutorial-macvlan/

init_process_group

store TCPStore

rank == 0 acts as the server of the TCPStore rendezvous handler

  • hostname
  • port
  • tcp://
  • rank
  • world_size

TCPStore

When isServer is True, a TCPStoreDaemon is started internally.

When waitWorkerReady is True, it polls every 10 ms to check whether enough workers (workerNumber) have connected.