Note that a read r may observe the value written by a write w that happens concurrently with r. Even if this occurs, it does not imply that reads happening after r will observe writes that happened before w.
```go
var a, b int

func f() {
	a = 1
	b = 2
}

func g() {
	print(b)
	print(a)
}

func main() {
	go f()
	g()
}
```
With f and g running in separate goroutines as above, it can happen that g prints 2 and then 0.
A send on a channel happens before the corresponding receive from that channel completes.
```go
var c = make(chan int, 10)
var a string

func f() {
	a = "hello, world"
	c <- 0 // send on c
}

func main() {
	go f()
	<-c
	print(a)
}
```
This program is guaranteed to print “hello, world”. The write to a happens before the send on c, which happens before the corresponding receive on c completes, which happens before the print.
The closing of a channel happens before a receive that returns a zero value because the channel is closed.
In the previous example, replacing c <- 0 with close(c) yields a program with the same guaranteed behavior.
A receive from an unbuffered channel happens before the send on that channel completes.
```go
var c = make(chan int)
var a string

func f() {
	a = "hello, world"
	<-c
}

func main() {
	go f()
	c <- 0
	print(a)
}
```
This program is also guaranteed to print “hello, world”. The write to a happens before the receive on c, which happens before the corresponding send on c completes, which happens before the print.
If the channel were buffered (e.g., c = make(chan int, 1)) then the program would not be guaranteed to print “hello, world”. (It might print the empty string, crash, or do something else.)
The kth receive on a channel with capacity C happens before the k+Cth send from that channel completes.
This program starts a goroutine for every entry in the work list, but the goroutines coordinate using the limit channel to ensure that at most three are running work functions at a time.
```go
var limit = make(chan int, 3)

func main() {
	for _, w := range work {
		go func(w func()) {
			limit <- 1
			w()
			<-limit
		}(w)
	}
	select {}
}
```
Instead of calico, you should use the macvlan CNI, where those virtual devices are children of enp175s0. RoCE can make use of those netdevices.
Other users are using the multus plugin, which allows you to have multiple netdev interfaces in a Pod: for example, the first managed default veth interface via your existing plugin, and a second macvlan or sriov interface via a second CNI. This way you get the best of both worlds for performance and functionality.
According to the multus-cni quick start documentation, assuming that in practice multus is compatible with the CNI plugin currently used as the cluster default, you additionally need to create a CRD resource configuration for the macvlan RoCE network device (if the host has multiple RoCE network devices, create one CRD resource per device, each configuration pointing at one of the RoCE netdevs), along the lines of the sketch below.
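What exactly goes into that CRD depends on the cluster; the following is a minimal sketch adapted from the multus-cni quick start, where the master interface (enp175s0, taken from the quote above), the subnet, and the address range are assumptions, not values from this post:

```bash
# One NetworkAttachmentDefinition per RoCE netdev; adjust master/subnet/range per device.
cat <<EOF | kubectl apply -f -
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-macvlan-0
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "enp175s0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.1.0/24",
        "rangeStart": "192.168.1.200",
        "rangeEnd": "192.168.1.216"
      }
    }'
EOF
```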
type: This tells CNI which binary to call on disk. Each CNI plugin is a binary that’s called. Typically, these binaries are stored in /opt/cni/bin on each node, and CNI executes this binary. In this case we’ve specified the loopback binary (which creates a loopback-type network interface). If this is your first time installing Multus, you might want to verify that the plugins that are in the “type” field are actually on disk in the /opt/cni/bin directory.
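A quick way to do that check on a node, assuming the default CNI binary directory:

```bash
# List the CNI plugin binaries installed on this node; the names referenced in the
# "type" fields (e.g. loopback, macvlan) should appear here.
ls /opt/cni/bin
```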
Some applications, especially legacy applications or applications which monitor network traffic, expect to be directly connected to the physical network. In this type of situation, you can use the macvlan network driver to assign a MAC address to each container’s virtual network interface, making it appear to be a physical network interface directly connected to the physical network.
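To get a feel for what macvlan does, the same idea can be exercised directly with Docker's macvlan network driver, independent of Kubernetes; a minimal sketch, with the parent interface and subnet assumed:

```bash
# Create a macvlan network whose parent is the physical (RoCE-capable) interface.
docker network create -d macvlan \
  --subnet=192.168.10.0/24 \
  --gateway=192.168.10.1 \
  -o parent=enp175s0 roce-macvlan

# A container attached to this network gets its own MAC address on the physical network.
docker run --rm -it --network roce-macvlan alpine ip addr
```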
sql/driver defines the interfaces a DB driver should implement, and it spells out how ErrBadConn is to be handled:
The Connector.Connect and Driver.Open methods should never return ErrBadConn.
ErrBadConn should only be returned from Validator, SessionResetter, or a query method if the connection is already in an invalid (e.g. closed) state.
var ErrBadConn = errors.New("driver: bad connection")
ErrBadConn should be returned by a driver to signal to the sql package that a driver.Conn is in a bad state (such as the server having earlier closed the connection) and the sql package should retry on a new connection.
To prevent duplicate operations, ErrBadConn should NOT be returned if there’s a possibility that the database server might have performed the operation. Even if the server sends back an error, you shouldn’t return ErrBadConn.
```go
// maxBadConnRetries is the number of maximum retries if the driver returns
// driver.ErrBadConn to signal a broken connection before forcing a new
// connection to be opened.
const maxBadConnRetries = 2

// QueryContext executes a query that returns rows, typically a SELECT.
// The args are for any placeholder parameters in the query.
func (db *DB) QueryContext(ctx context.Context, query string, args ...interface{}) (*Rows, error) {
	var rows *Rows
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		rows, err = db.query(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.query(ctx, query, args, alwaysNewConn)
	}
	return rows, err
}

// Query executes a query that returns rows, typically a SELECT.
// The args are for any placeholder parameters in the query.
func (db *DB) Query(query string, args ...interface{}) (*Rows, error) {
	return db.QueryContext(context.Background(), query, args...)
}

// ExecContext executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) ExecContext(ctx context.Context, query string, args ...interface{}) (Result, error) {
	var res Result
	var err error
	for i := 0; i < maxBadConnRetries; i++ {
		res, err = db.exec(ctx, query, args, cachedOrNewConn)
		if err != driver.ErrBadConn {
			break
		}
	}
	if err == driver.ErrBadConn {
		return db.exec(ctx, query, args, alwaysNewConn)
	}
	return res, err
}

// Exec executes a query without returning any rows.
// The args are for any placeholder parameters in the query.
func (db *DB) Exec(query string, args ...interface{}) (Result, error) {
	return db.ExecContext(context.Background(), query, args...)
}
```
In summary, when ErrBadConn is returned, the sql package retries at most 2 times using a cached or new connection; once those retries are exhausted, it makes one final attempt on a forced new connection.
--cpu-quota: the CFS quota, i.e. the amount of CPU time the docker container may use within each CPU period before it is throttled.
--cpuset-cpus: pins the docker container to specific CPU cores.
--cpu-shares: Set this flag to a value greater or less than the default of 1024 to increase or reduce the container’s weight, and give it access to a greater or lesser proportion of the host machine’s CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need. In that way, this is a soft limit.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
The resulting value is the total amount of CPU time in microseconds that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.
The default quota period is 100ms. The minimum resolution of CPU quota is 1ms.
CPU time is divided into periods; the quota is the CPU time that can actually be consumed within each period. If a quota-limited task exhausts its quota before finishing its work in the current period, it is throttled (suspended) and resumes execution in the next period.
On multi-CPU machines, note that the quota may be a multiple of the period. With the default 100ms period (--cpu-period=100000, in microseconds), limiting a container to 0.5 CPU means --cpu-quota=50000; if the host has 20 CPUs and you want to limit the container to 10 CPUs, then --cpu-quota=10*100000=1000000.
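Putting the three flags together, a rough sketch (the image and values are only examples):

```bash
# 0.5 CPU: at most 50000µs of CPU time per 100000µs CFS period.
docker run -d --cpu-period=100000 --cpu-quota=50000 nginx

# Pin the container to cores 0 and 1, and halve its relative weight under contention.
docker run -d --cpuset-cpus="0,1" --cpu-shares=512 nginx
```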
```go
// MakePodName joins jobName, taskName and index into the pod name and returns the string.
func MakePodName(jobName string, taskName string, index int) string {
	return fmt.Sprintf(jobhelpers.PodNameFmt, jobName, taskName, index)
}
```
Note: ip_local_port_range and ip_local_reserved_ports settings are independent and both are considered by the kernel when determining which ports are available for automatic port assignments.
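For example, to keep the usual ephemeral range but stop the kernel from handing out a port that an application wants to bind explicitly (the port number below is an assumption, not from this post):

```bash
sysctl -w net.ipv4.ip_local_port_range="32768 60999"
sysctl -w net.ipv4.ip_local_reserved_ports="50051"

# The application can still bind() to 50051 explicitly; verify both settings with:
sysctl net.ipv4.ip_local_port_range net.ipv4.ip_local_reserved_ports
```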
```mermaid
graph LR
    InitContainer --> TrainingContainer
    InitContainer --> SidecarContainer
```
The InitContainer and SidecarContainer act like system containers and are transparent to the TrainingContainer.
The user's training job (process) runs in the TrainingContainer.
Environment initialization, such as downloading data, can be done in the InitContainer, while uploading results can be handled by the SidecarContainer.
However, this introduces an engineering problem: file read permissions. The simplest solution is to run the InitContainer / SidecarContainer / TrainingContainer as the same user (uid).
A Context is safe for simultaneous use by multiple goroutines. Code can pass a single Context to any number of goroutines and cancel that Context to signal all of them.
```go
var rootCmd = &cobra.Command{
	Use: "long run cli",
	Run: func(cmd *cobra.Command, args []string) {
		cli := run.New()
		err := cli.LongRun(cmd.Context())
		if err != nil {
			fmt.Printf("cli run err: %v\n", err)
			if exitError, ok := err.(*exec.ExitError); ok {
				fmt.Printf("exit code: %d\n", exitError.ExitCode())
			}
		}
	},
}
```
Wait until the child process specified by each process ID pid or job specification jobspec exits and return the exit status of the last command waited for.
Note the wording: “return the exit status of the last command waited for”.
So in the code above, what wait actually ends up with is the exit code of the tee command.
In the shell, the only straightforward way to get the exit status of an individual command in a pipeline seems to be ${PIPESTATUS[0]}.
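A small illustration, reusing the file names from the snippet below:

```bash
# PIPESTATUS holds the exit status of every command in the most recent pipeline.
python test.py 2>&1 | tee -a training.log
status=${PIPESTATUS[0]}   # exit code of python, not of tee
echo "python exited with ${status}"
```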
Get the pid of a pipeline command running in the background
Going further, is there a way to get the pid of someCommand itself? Let's try the following modification:
```bash
someCommand="python test.py"

{
    ${someCommand} 2>&1 &
    pid_someCommand=$!
    wait ${pid_someCommand}
    exit $?
} | tee -a training.log &
```