
Docker Resource Limit

CPU

https://docs.docker.com/config/containers/resource_constraints/#cpu

  • --cpu-period: the CPU time slice (period) used by the CFS scheduler; defaults to 100ms
  • --cpu-quota: the CFS quota, i.e. the CPU time the docker container may use within each CPU period before being throttled (the sketch after this list reads both values from inside a container)
  • --cpuset-cpus: pin the docker container to specific CPU cores
  • --cpu-shares: Set this flag to a value greater or less than the default of 1024 to increase or reduce the container’s weight, and give it access to a greater or lesser proportion of the host machine’s CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need. In that way, this is a soft limit.
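Both period and quota can be inspected from inside a container. Below is a minimal Go sketch, assuming a cgroup v1 layout (cgroup v2 exposes a single cpu.max file instead); the paths are the conventional cgroupfs mount points, not something the Docker docs mandate:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// readInt reads a single integer value from a cgroup file.
func readInt(path string) (int64, error) {
	data, err := os.ReadFile(path)
	if err != nil {
		return 0, err
	}
	return strconv.ParseInt(strings.TrimSpace(string(data)), 10, 64)
}

func main() {
	// cgroup v1 paths as seen from inside the container (assumed layout)
	period, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_period_us")
	if err != nil {
		fmt.Println("no cgroup v1 cpu controller:", err)
		return
	}
	quota, err := readInt("/sys/fs/cgroup/cpu/cpu.cfs_quota_us")
	if err != nil {
		fmt.Println("no cgroup v1 cpu controller:", err)
		return
	}
	if quota <= 0 {
		fmt.Println("no CPU limit set (quota = -1)")
		return
	}
	// e.g. quota=50000, period=100000 -> the container may use 0.5 CPU
	fmt.Printf("effective CPU limit: %.2f cores\n", float64(quota)/float64(period))
}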

Memory

https://docs.docker.com/config/containers/resource_constraints/#limit-a-containers-access-to-memory

  • --memory: The maximum amount of memory the container can use (cgroup limit)
  • --memory-swap: the amount of memory plus swap the container may use; setting it equal to --memory disables swap
  • --oom-kill-disable: do not let the kernel OOM-kill the container's processes when a memory limit is reached (only use together with --memory)

Some additional notes on OOM:

  1. When a process inside the container uses more memory than the limit, the kernel triggers the (cgroup) OOM killer, which kills the process with the highest oom_score
  2. As long as the container's PID 1 process has not exited, the container itself does not exit

OOM always targets a process, never the container.
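A minimal Go sketch (Linux only) that reads this per-process bookkeeping; the OOM killer picks the process with the highest oom_score within the offending cgroup:

package main

import (
	"fmt"
	"os"
)

func main() {
	// Each process exposes its OOM score; oom_score_adj (-1000..1000)
	// biases the kernel's choice of victim.
	for _, name := range []string{"oom_score", "oom_score_adj"} {
		data, err := os.ReadFile("/proc/self/" + name)
		if err != nil {
			fmt.Println(name, "error:", err)
			continue
		}
		fmt.Printf("%s: %s", name, data)
	}
}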

Docker Container OOMKilled status

  1. https://stackoverflow.com/questions/48618431/what-does-oom-kill-disable-do-for-a-docker-container
  2. https://github.com/moby/moby/issues/14440#issuecomment-119243820
  3. https://plumbr.io/blog/java/oomkillers-in-docker-are-more-complex-than-you-thought
  4. https://zhimin-wen.medium.com/memory-limit-of-pod-and-oom-killer-891ee1f1cad8
  5. https://faun.pub/understanding-docker-container-memory-limit-behavior-41add155236c
  6. https://github.com/moby/moby/issues/15621#issuecomment-181418985
  7. https://draveness.me/docker/
  8. https://github.com/moby/moby/issues/38352#issuecomment-446329512
  9. https://github.com/containerd/cgroups/issues/74
  10. https://github.com/kubernetes/kubernetes/issues/78973
  11. https://github.com/kubernetes/kubernetes/issues/50632

If a child process inside the container is OOM-killed, the OOMKilled flag is also set when the docker container exits; see this issue:

https://github.com/moby/moby/issues/15621#issuecomment-181418985

While the docker container has not yet exited, a container event is emitted:

https://docs.docker.com/engine/reference/commandline/events/

For how the docker container's OOMKilled flag is set, see this issue:

https://github.com/moby/moby/issues/38352#issuecomment-446329512

In terms of implementation:

  1. containerd subscribes to a series of events; if it receives a cgroup oom event it records OOMKilled = true
  2. containerd forwards the processed events to dockerd for further handling
  3. when handling the OOM event, dockerd records a container oom event
  4. when handling the Exit event, dockerd writes OOMKilled = true into the container's status (the sketch after this list queries that flag)
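A quick way to check the resulting flag from a program is to shell out to docker inspect; this sketch uses only the documented -f template flag:

package main

import (
	"fmt"
	"os"
	"os/exec"
	"strings"
)

func main() {
	if len(os.Args) < 2 {
		fmt.Println("usage: oomcheck <container-id>")
		return
	}
	// {{.State.OOMKilled}} is the status field dockerd fills in on Exit.
	out, err := exec.Command("docker", "inspect",
		"-f", "{{.State.OOMKilled}}", os.Args[1]).Output()
	if err != nil {
		fmt.Println("docker inspect failed:", err)
		return
	}
	fmt.Println("OOMKilled:", strings.TrimSpace(string(out)) == "true")
}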

K8S Resource Limit

https://kubernetes.io/docs/concepts/configuration/manage-resources-containers/#how-pods-with-resource-limits-are-run

CPU (Docker Container config)

CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.

  • --cpu-shares: max({requests.cpu} * 1024, 2)

For example, with requests.cpu = 180, --cpu-shares = 180 * 1024 = 184320

  • --cpu-period: 100000 (the fixed 100ms CFS period, in microseconds)

  • --cpu-quota: limits.cpu (in millicores) * 100

https://stackoverflow.com/a/63352630

The resulting value is the total amount of CPU time in microseconds that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.

The default quota period is 100ms. The minimum resolution of CPU quota is 1ms.

The period is the CPU time slice; the quota is the CPU time that may actually be consumed within each period. If a task constrained by the quota has not finished when the current period's quota is exhausted, it is throttled and resumes in the next period.

On a multi-CPU machine, note that the quota may be a multiple of the period: to limit a container to 0.5 CPU, set --cpu-quota=50000 (against the 100000µs period); if the host has 20 CPUs and the container should be limited to 10 CPUs, set --cpu-quota=10*100000=1000000.
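A sketch mirroring the kubelet's conversion logic as described above (the real helpers live in the kubelet's cm package and additionally clamp shares to a maximum):

package main

import "fmt"

const (
	quotaPeriodUs = 100000 // fixed 100ms CFS period, in microseconds
	minShares     = 2
)

// milliCPUToShares: requests.cpu -> --cpu-shares
func milliCPUToShares(milliCPU int64) int64 {
	shares := milliCPU * 1024 / 1000
	if shares < minShares {
		return minShares
	}
	return shares
}

// milliCPUToQuota: limits.cpu -> --cpu-quota (microseconds per period)
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriodUs / 1000
}

func main() {
	// limits.cpu = 500m -> quota 50000us per 100000us period = 0.5 CPU
	fmt.Println(milliCPUToShares(500), milliCPUToQuota(500)) // 512 50000
	// limits.cpu = 10 -> quota 1000000us = 10 CPUs
	fmt.Println(milliCPUToShares(10000), milliCPUToQuota(10000)) // 10240 1000000
}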

Memory (Docker Container config)

  • --memory: int({limits.memory})
  • --memory-swap: int({limits.memory})

the container does not have access to swap

K8s OOM Watcher

https://github.com/kubernetes/kubernetes/blob/v1.22.1/pkg/kubelet/oom/oom_watcher_linux.go

  • /dev/kmsg

Start watches for system oom’s and records an event for every system oom encountered.

When the kubelet observes a system OOM on the node (as opposed to a cgroup OOM), it generates an event, which can be queried with kubectl:

kubectl get event --field-selector type=Warning,reason=SystemOOM
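The core idea of the watcher can be sketched in a few lines of Go: open /dev/kmsg (root required) and look for kernel OOM kill records. The real kubelet uses a dedicated kmsg-parsing library and records Kubernetes events instead of printing; the matched substrings here are just the usual kernel log phrasing, not a stable API:

package main

import (
	"bufio"
	"fmt"
	"os"
	"strings"
)

func main() {
	f, err := os.Open("/dev/kmsg")
	if err != nil {
		fmt.Println("open /dev/kmsg:", err)
		return
	}
	defer f.Close()

	// Each read on /dev/kmsg returns one log record.
	scanner := bufio.NewScanner(f)
	for scanner.Scan() {
		line := scanner.Text()
		if strings.Contains(line, "invoked oom-killer") ||
			strings.Contains(line, "Killed process") {
			fmt.Println("OOM record:", line)
		}
	}
}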

The following PR tried to attribute an OOM of a process inside a pod to that pod; it was not merged:

https://github.com/kubernetes/kubernetes/issues/100483

https://github.com/kubernetes/kubernetes/pull/100487

Download

https://rsync.samba.org/

Latest version: Rsync version 3.2.3 released

How rsync works

https://rsync.samba.org/how-rsync-works.html

Guide

https://download.samba.org/pub/rsync/rsync.html

  • --recursive: recurse into directories
  • --append: append data onto shorter files
  • --filter: add a file-filtering RULE
/usr/local/Cellar/rsync/3.2.3/bin/rsync --verbose --no-whole-file --recursive --append --include='*.log' --include='*/' --exclude='*' --prune-empty-dirs dir1/ dir2/

Note the special behavior of rsync between local directories:

https://superuser.com/questions/234273/why-doest-rsync-use-delta-transfer-for-local-files

--whole-file: This is the default when both the source and destination are specified as local paths, but only if no batch-writing option is in effect.

High Availability

https://unix.stackexchange.com/questions/48298/can-rsync-resume-after-being-interrupted

pod spec of volcano job

https://github.com/volcano-sh/volcano/blob/v1.3.0/pkg/controllers/job/job_controller_util.go

import (
	v1 "k8s.io/api/core/v1"
	...
)

// MakePodName append podname,jobname,taskName and index and returns the string.
func MakePodName(jobName string, taskName string, index int) string {
	return fmt.Sprintf(jobhelpers.PodNameFmt, jobName, taskName, index)
}

func createJobPod(job *batch.Job, template *v1.PodTemplateSpec, ix int) *v1.Pod {
	templateCopy := template.DeepCopy()

	pod := &v1.Pod{
		ObjectMeta: metav1.ObjectMeta{
			Name:      jobhelpers.MakePodName(job.Name, template.Name, ix),
			Namespace: job.Namespace,
			OwnerReferences: []metav1.OwnerReference{
				*metav1.NewControllerRef(job, helpers.JobKind),
			},
			Labels:      templateCopy.Labels,
			Annotations: templateCopy.Annotations,
		},
		Spec: templateCopy.Spec,
	}

	...
}
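For reference, jobhelpers.PodNameFmt is the usual "%s-%s-%d" pattern (assumed here from the volcano source), so pod names come out as <job>-<task>-<index>:

package main

import "fmt"

// PodNameFmt mirrors jobhelpers.PodNameFmt (assumed to be "%s-%s-%d").
const PodNameFmt = "%s-%s-%d"

func MakePodName(jobName, taskName string, index int) string {
	return fmt.Sprintf(PodNameFmt, jobName, taskName, index)
}

func main() {
	// A job "mnist" with task "worker" produces pods mnist-worker-0, -1, ...
	fmt.Println(MakePodName("mnist", "worker", 0)) // mnist-worker-0
}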

sysctl of pod spec

https://kubernetes.io/docs/tasks/administer-cluster/sysctl-cluster/

apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: net.ipv4.ip_local_port_range
      value: "30000 50000"

find out the current ip_local_port_range

cat /proc/sys/net/ipv4/ip_local_port_range

https://www.thegeekdiary.com/how-to-reserve-a-port-range-for-a-third-party-application-in-centos-rhel/

Note: ip_local_port_range and ip_local_reserved_ports settings are independent and both are considered by the kernel when determining which ports are available for automatic port assignments.
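The file holds two whitespace-separated numbers; a small Go sketch that parses it:

package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

func main() {
	data, err := os.ReadFile("/proc/sys/net/ipv4/ip_local_port_range")
	if err != nil {
		fmt.Println(err)
		return
	}
	// e.g. "32768\t60999"
	fields := strings.Fields(string(data))
	if len(fields) != 2 {
		fmt.Println("unexpected format:", string(data))
		return
	}
	lo, _ := strconv.Atoi(fields[0])
	hi, _ := strconv.Atoi(fields[1])
	fmt.Printf("ephemeral ports: %d-%d (%d ports)\n", lo, hi, hi-lo+1)
}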

Default behavior of signals in Go programs

https://pkg.go.dev/os/signal#hdr-Default_behavior_of_signals_in_Go_programs

https://pkg.go.dev/os/signal#hdr-Changing_the_behavior_of_signals_in_Go_programs

By default, a synchronous signal is converted into a run-time panic. A SIGHUP, SIGINT, or SIGTERM signal causes the program to exit.

Notify disables the default behavior for a given set of asynchronous signals and instead delivers them over one or more registered channels. Specifically, it applies to the signals SIGHUP, SIGINT, SIGQUIT, SIGABRT, and SIGTERM.

But don't forget that a race is possible; below, a bash script that launches the Go process demonstrates it.

test-signal

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	signalCh := make(chan os.Signal, 2)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)
	fmt.Printf("notify signals\n")

	go func() {
		sig := <-signalCh
		fmt.Printf("receive signal %v\n", sig)
	}()

	fmt.Printf("wait signal\n")
	time.Sleep(time.Minute)
}

test1.sh

./test-signal &
pid=$!

echo "test-signal pid: $pid"

kill $pid
wait $pid

exit_code=$?
echo "test-signal exit_code: $exit_code"

test2.sh

./test-signal &
pid=$!

echo "test-signal pid: $pid"

# important
sleep 1
#

kill $pid
wait $pid

exit_code=$?
echo "test-signal exit_code: $exit_code"

Output of test1.sh

test-signal pid: 4878
test1.sh: line 7: 4878 Terminated: 15 ./test-signal
test-signal exit_code: 143

Output of test2.sh

test-signal pid: 4880
notify signals
wait signal
receive signal terminated

Summary

  1. The default behavior of a Go program on the TERM signal is to exit, with exit code 143 (128 + 15, where 15 is TERM)
  2. signal.Notify changes the default handling of the TERM signal; but if the Go program receives TERM too soon after startup (before signal.Notify has run), it exits immediately with the default behavior; a mitigation sketch follows this list
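A mitigation sketch: register the channel as the very first statement of main so the window is as small as possible, and print a readiness marker the launcher can wait for (which is what the sleep in test2.sh approximates). This narrows the race but cannot remove the window before the Go runtime finishes initializing:

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
)

func main() {
	// Register first, before any other work.
	signalCh := make(chan os.Signal, 1)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)

	// Readiness marker: a parent script can wait for this line
	// before sending signals.
	fmt.Println("ready")

	sig := <-signalCh
	fmt.Println("received", sig)
}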

https://v1.gorm.io/docs/

https://v1.gorm.io/docs/logger.html

Refer GORM’s default logger for how to customize it

https://github.com/jinzhu/gorm/blob/v1.9.16/logger.go

gorm v1 print log

func (s *DB) print(v ...interface{}) {
	s.logger.Print(v...)
}

func (s *DB) log(v ...interface{}) {
	if s != nil && s.logMode == detailedLogMode {
		s.print(append([]interface{}{"log", fileWithLineNum()}, v...)...)
	}
}

func (s *DB) slog(sql string, t time.Time, vars ...interface{}) {
	if s.logMode == detailedLogMode {
		s.print("sql", fileWithLineNum(), NowFunc().Sub(t), sql, vars, s.RowsAffected)
	}
}

gorm v1 print error

// AddError add error to the db
func (s *DB) AddError(err error) error {
	if err != nil {
		if err != ErrRecordNotFound {
			if s.logMode == defaultLogMode {
				go s.print("error", fileWithLineNum(), err)
			} else {
				s.log(err)
			}

			errors := Errors(s.GetErrors())
			errors = errors.Add(err)
			if len(errors) > 1 {
				err = errors
			}
		}

		s.Error = err
	}
	return err
}

gorm v1 print sql

// trace print sql log
func (scope *Scope) trace(t time.Time) {
	if len(scope.SQL) > 0 {
		scope.db.slog(scope.SQL, t, scope.SQLVars...)
	}
}

So when gorm v1's LogMode is enabled

// LogMode set log mode, `true` for detailed logs, `false` for no log, default, will only print error logs
func (s *DB) LogMode(enable bool) *DB {
	if enable {
		s.logMode = detailedLogMode
	} else {
		s.logMode = noLogMode
	}
	return s
}

execution enters the s.print log / s.print sql printing logic

https://www.soberkoder.com/go-gorm-logging/

If you need a custom gorm v1 logger, the following snippet is a useful reference

// GormLogger struct
type GormLogger struct{}

// Print - Log Formatter
func (*GormLogger) Print(v ...interface{}) {
	if v[0] == "sql" {
		log.WithFields(
			log.Fields{
				"module":        "gorm",
				"type":          "sql",
				"rows_returned": v[5],
				"src":           v[1],
				//"values": v[4],
				"duration": v[2],
			},
		).Info(v[3])
	} else {
		log.WithFields(log.Fields{"module": "gorm", "type": "log", "src": v[1]}).Print(v[2:]...)
	}
}

You can also implement client-side slow-SQL logging based on the duration, as sketched below.
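A sketch of both ideas, assuming the argument positions shown in the slog call above (v[2] is the duration, v[3] the SQL text) and a hypothetical 200ms threshold:

package main

import (
	"time"

	"github.com/jinzhu/gorm"
	_ "github.com/jinzhu/gorm/dialects/sqlite"
	log "github.com/sirupsen/logrus"
)

// SlowGormLogger extends the Print formatter with a client-side
// slow-SQL threshold.
type SlowGormLogger struct {
	Threshold time.Duration
}

func (l *SlowGormLogger) Print(v ...interface{}) {
	if len(v) > 3 && v[0] == "sql" {
		if d, ok := v[2].(time.Duration); ok && d > l.Threshold {
			log.WithFields(log.Fields{"module": "gorm", "duration": d}).
				Warn("slow sql: ", v[3])
			return
		}
	}
	log.Print(v...)
}

func main() {
	db, err := gorm.Open("sqlite3", "/tmp/demo.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Wire in the custom logger and enable detailed log mode.
	db.SetLogger(&SlowGormLogger{Threshold: 200 * time.Millisecond})
	db.LogMode(true)
}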

mlnx cx adapter card firmware

https://network.nvidia.com/support/firmware/connectx4ib/

  • 12.28.2006 – newer, current versions
  • 12.28.1002 – old
  • 12.27.4000 – older

Note that the "Additional Firmware Version Supported" of a given mlnx ofed is generally one of the few preceding firmware versions.

https://network.nvidia.com/support/firmware/connectx5ib/

  • 16.28.2006 – newer
  • 16.28.1002 – old

https://network.nvidia.com/support/firmware/connectx6dx/

  • 22.28.2006 – newer
  • 22.28.1002 – old

mlnx ofed

What is the relationship between the mlnx ofed installed in a container image and the mlnx ofed installed on the host?

Actually there is none; what matters is only the host's mlnx NIC model and its firmware version.

Take mlnx ofed LTS version 5.4-3.1.0.0 as an example: its Release Notes explicitly list the firmware versions that go with this ofed.

All OS variants point to the same Release Notes:

https://docs.nvidia.com/networking/display/MLNXOFEDv543100/Release+Notes

Supported NICs and their speeds

  • ConnectX-4
    • Infiniband: …
    • Ethernet: 100Gb, …
  • ConnectX-5
    • Infiniband: …
    • Ethernet: 100Gb, …
  • ConnectX-6 Dx
    • Ethernet: 100Gb, …

https://docs.nvidia.com/networking/display/MLNXOFEDv543100/General+Support#GeneralSupport-SupportedNICFirmwareVersions

Upgrading MLNX_OFED on a cluster requires upgrading all of its nodes to the newest version as well

This current version is tested with the following NVIDIA NIC firmware versions

Firmware versions listed are the minimum supported versions

NIC     Recommended Firmware Version    Additional Firmware Version Supported
cx4     12.28.2006                      12.28.2006
cx5     16.31.2006                      16.31.1014
cx6 dx  22.31.2006                      22.31.1014

These are the minimum firmware versions required by the mlnx ofed 5.4-3.1.0.0 driver; note that mlnx ofed only started describing minimum supported versions from version 5.4.

By comparison, the firmware versions required by the mlnx ofed LTS version 4.9-4.1.7.0 driver are as follows:

https://docs.nvidia.com/networking/display/MLNXOFEDv494170/General+Support+in+MLNX_OFED#GeneralSupportinMLNX_OFED-SupportedNICsFirmwareVersions

NIC     Recommended Firmware Version    Additional Firmware Version Supported
cx4     12.28.2006                      12.27.4000
cx5     16.28.2006                      16.27.2008
cx6 dx  22.28.2006                      NA

identifying adapter cards

https://network.nvidia.com/support/firmware/identification/

ibv_devinfo
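ibv_devinfo prints a fw_ver field per device; a small Go sketch that extracts it:

package main

import (
	"bufio"
	"bytes"
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	out, err := exec.Command("ibv_devinfo").Output()
	if err != nil {
		fmt.Println("ibv_devinfo failed:", err)
		return
	}
	scanner := bufio.NewScanner(bytes.NewReader(out))
	for scanner.Scan() {
		line := strings.TrimSpace(scanner.Text())
		// e.g. "fw_ver: 12.28.2006"
		if strings.HasPrefix(line, "fw_ver:") {
			fmt.Println("firmware:", strings.TrimSpace(strings.TrimPrefix(line, "fw_ver:")))
		}
	}
}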

Goal: build a container image containing the following software and run it with the Huawei Cloud ModelArts training service

  • ubuntu-18.04
  • cuda-10.2
  • python-3.7.13
  • pytorch-1.8.1

1. Prepare the context directory

mkdir -p context

1.1. Prepare the files

1.1.1. pip.conf

Use the pypi configuration provided by the Huawei open-source mirror site

https://mirrors.huaweicloud.com/home

The file contents are as follows

[global]
index-url = https://repo.huaweicloud.com/repository/pypi/simple
trusted-host = repo.huaweicloud.com
timeout = 120

1.1.2. torch*.whl

https://pytorch.org/get-started/previous-versions/#v181

Search https://download.pytorch.org/whl/torch_stable.html for the following wheels and download them

  • torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl
  • torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
  • torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl

1.1.3. Miniconda3

https://docs.conda.io/en/latest/miniconda.html

Miniconda3-py37_4.12.0-Linux-x86_64.sh

Download the miniconda3 installer from https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh

1.2. Contents of the context directory

Place the files above in the context directory

context
├── Miniconda3-py37_4.12.0-Linux-x86_64.sh
├── pip.conf
├── torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl
├── torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
└── torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl

2. Write the Dockerfile for the container image

Create an empty file named Dockerfile in the context directory and write the following content into it

# The image build host needs Internet access

# Base image, https://github.com/NVIDIA/nvidia-docker/wiki/CUDA
#
# https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds
# requires Docker Engine >= 17.05
#
# builder stage
FROM nvidia/cuda:10.2-runtime-ubuntu18.04 AS builder

# The default user of the base image is already root
# USER root

# Use the pypi configuration provided by the Huawei open-source mirror site
RUN mkdir -p /root/.pip/
COPY pip.conf /root/.pip/pip.conf

# Copy the files to install into /tmp of the base image
# (with multiple sources, the COPY destination must be a directory ending in /)
COPY Miniconda3-py37_4.12.0-Linux-x86_64.sh \
     torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl \
     torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl \
     torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl \
     /tmp/

# https://conda.io/projects/conda/en/latest/user-guide/install/linux.html#installing-on-linux
# Install Miniconda3 into /home/ma-user/miniconda3 of the base image
RUN bash /tmp/Miniconda3-py37_4.12.0-Linux-x86_64.sh -b -p /home/ma-user/miniconda3

# Install torch*.whl with the default Miniconda3 python environment
# (i.e. /home/ma-user/miniconda3/bin/pip)
RUN cd /tmp && \
    /home/ma-user/miniconda3/bin/pip install --no-cache-dir \
    /tmp/torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl \
    /tmp/torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl \
    /tmp/torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl

# Build the final image
FROM nvidia/cuda:10.2-runtime-ubuntu18.04

# Install vim / curl (again via the Huawei open-source mirror site)
RUN cp -a /etc/apt/sources.list /etc/apt/sources.list.bak && \
    sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y vim curl && \
    apt-get clean && \
    mv /etc/apt/sources.list.bak /etc/apt/sources.list

# Add user ma-user (uid = 1000, gid = 100)
# Note the base image already has a group with gid = 100, so ma-user can use it directly
RUN useradd -m -d /home/ma-user -s /bin/bash -g 100 -u 1000 ma-user

# Copy /home/ma-user/miniconda3 from the builder stage into the same path of the current image
COPY --chown=ma-user --from=builder /home/ma-user/miniconda3 /home/ma-user/miniconda3

# Preset environment variables of the image
# Make sure to set PYTHONUNBUFFERED=1 to avoid losing logs
ENV PATH=$PATH:/home/ma-user/miniconda3/bin \
    PYTHONUNBUFFERED=1

# Set the default user and working directory of the image
USER ma-user
WORKDIR /home/ma-user

3. Build the container image

The context directory now contains the following

context
├── Dockerfile
├── Miniconda3-py37_4.12.0-Linux-x86_64.sh
├── pip.conf
├── torch-1.8.1+cu102-cp37-cp37m-linux_x86_64.whl
├── torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
└── torchvision-0.9.1+cu102-cp37-cp37m-linux_x86_64.whl

Run the following commands to build the container image

# Be sure to switch into the context directory before running the build command
cd context

# Build the container image
docker build . -t swr.cn-north-4.myhuaweicloud.com/deep-learning-demo/pytorch:1.8.1-cuda10.2

After the image builds successfully, the corresponding image can be found with the following command

docker images | grep pytorch | grep 1.8.1-cuda10.2

4. pytorch verification code

https://pytorch.org/get-started/locally/#linux-verification

Verification sample code: pytorch-verification.py

import torch
import torch.nn as nn

x = torch.randn(5, 3)
print(x)

available_dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
y = torch.randn(5, 3).to(available_dev)
print(y)

5. boot command in modelarts training service

/home/ma-user/miniconda3/bin/python ${MA_JOB_DIR}/code/pytorch-verification.py

Sample log output of a CPU training job

tensor([[ 0.8945, -0.6946,  0.3807],
        [ 0.6665,  0.3133,  0.8285],
        [-0.5353, -0.1730, -0.5419],
        [ 0.4870,  0.5183,  0.2505],
        [ 0.2679, -0.4996,  0.7919]])
tensor([[ 0.9692,  0.4652,  0.5659],
        [ 2.2032,  1.4157, -0.1755],
        [-0.6296,  0.5466,  0.6994],
        [ 0.2353, -0.0089, -1.9546],
        [ 0.9319,  1.1781, -0.4587]])

Sample log output of a GPU training job

tensor([[-0.2874, -0.3475,  0.1848],
        [-0.1660, -0.5038, -0.5470],
        [ 0.1289, -0.2400,  2.0829],
        [ 1.6870, -0.0492,  0.1189],
        [ 0.4800, -0.3611, -0.9572]])
tensor([[-0.6710,  0.4095, -0.7370],
        [ 1.4353,  0.9093,  1.7551],
        [ 1.3477, -0.0499,  0.2404],
        [ 1.7489, -1.0203, -0.7875],
        [-1.2104,  0.4593,  1.1365]], device='cuda:0')