--cpu-quota: the CFS quota, i.e. the CPU time a docker container may use within each CPU period before it is throttled
--cpuset-cpus: pin the docker container to specific CPU cores
--cpu-shares: Set this flag to a value greater or less than the default of 1024 to increase or reduce the container’s weight, and give it access to a greater or lesser proportion of the host machine’s CPU cycles. This is only enforced when CPU cycles are constrained. When plenty of CPU cycles are available, all containers use as much CPU as they need. In that way, this is a soft limit.
CPU is always requested as an absolute quantity, never as a relative quantity; 0.1 is the same amount of CPU on a single-core, dual-core, or 48-core machine.
The resulting value is the total amount of CPU time in microseconds that a container can use every 100ms. A container cannot use more than its share of CPU time during this interval.
The default quota period is 100ms. The minimum resolution of CPU quota is 1ms.
CPU time is sliced into periods, and the quota is the CPU time that may actually be consumed within each period. If a quota-limited task has exhausted its quota but not finished its work in the current period, the task is suspended (throttled) and resumes in the next period.
On multi-CPU machines, note that the quota may be a multiple of the period. With the default 100ms period (the flag values are in microseconds, so period = 100000), limiting a container to 0.5 CPU means --cpu-quota=50000; if the host has 20 CPUs and we limit the container to 10 CPUs, then --cpu-quota=10*100000=1000000.
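As a quick sanity check of the arithmetic above, a minimal sketch (100000µs is the Docker default period; both flags take microseconds):

```shell
# Derive --cpu-quota from a target CPU count; quota and period are in microseconds.
period=100000                  # --cpu-period default: 100ms = 100000µs
cpus=10                        # target: allow the container 10 CPUs' worth of time
quota=$((cpus * period))       # 10 * 100000 = 1000000
echo "docker run --cpu-period=${period} --cpu-quota=${quota} ..."
# → docker run --cpu-period=100000 --cpu-quota=1000000 ...
```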
graph LR
InitContainer --> TrainingContainer
InitContainer --> SidecarContainer
The InitContainer and SidecarContainer act like system containers and are transparent to the TrainingContainer.
The user's TrainingJob (process) runs in the TrainingContainer.
Environment initialization, such as downloading data, can be done in the InitContainer, and uploading can be done in the SidecarContainer.
However, this raises an engineering problem: file read permissions. The cleanest solution is to run the InitContainer / SidecarContainer / TrainingContainer as the same user (uid).
// MakePodName joins jobName, taskName and index into a pod name and returns the string.
func MakePodName(jobName string, taskName string, index int) string {
	return fmt.Sprintf(jobhelpers.PodNameFmt, jobName, taskName, index)
}
Note: ip_local_port_range and ip_local_reserved_ports settings are independent and both are considered by the kernel when determining which ports are available for automatic port assignments.
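A small sketch of how to inspect the two knobs on a Linux box (the procfs paths are the standard locations; the write needs root, so it is shown commented out):

```shell
# The kernel picks ephemeral ports from ip_local_port_range, skipping any
# port listed in ip_local_reserved_ports.
cat /proc/sys/net/ipv4/ip_local_port_range
cat /proc/sys/net/ipv4/ip_local_reserved_ports
# Reserve ports so automatic assignment never hands them out (needs root):
# sysctl -w net.ipv4.ip_local_reserved_ports="8080,50000-50010"
```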
A Context is safe for simultaneous use by multiple goroutines. Code can pass a single Context to any number of goroutines and cancel that Context to signal all of them.
var rootCmd = &cobra.Command{
	Use: "long run cli",
	Run: func(cmd *cobra.Command, args []string) {
		cli := run.New()
		err := cli.LongRun(cmd.Context())
		if err != nil {
			fmt.Printf("cli run err: %v\n", err)
			if exitError, ok := err.(*exec.ExitError); ok {
				fmt.Printf("exit code: %d\n", exitError.ExitCode())
			}
		}
	},
}
Wait until the child process specified by each process ID pid or job specification jobspec exits and return the exit status of the last command waited for.
Note the wording: return the exit status of the last command waited for.
So in the code above, what wait actually obtains is the exit code of the tee command.
In the shell, the only simple way to get the status of an earlier stage of a pipeline seems to be ${PIPESTATUS[0]} (bash).
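A minimal illustration (bash-specific; false stands in for the real first stage):

```shell
# bash-specific: PIPESTATUS[i] holds the exit status of the i-th pipeline stage.
false | tee /dev/null
first=${PIPESTATUS[0]}      # capture immediately: every command overwrites PIPESTATUS
echo "first stage exited with: ${first}"    # → first stage exited with: 1
```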
get the pid of a pipeline command running in the background
Going further, is there a way to obtain the pid of someCommand? Try the following modification:
someCommand="python test.py"
{
    ${someCommand} 2>&1 &
    pid_someCommand=$!
    wait ${pid_someCommand}
    exit $?
} | tee -a training.log &
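One caveat with the modification above: because the brace group is one stage of a pipeline, it runs in a subshell, so pid_someCommand is invisible to the parent script. A sketch (my own workaround, not from the original) that exports the pid through a temp file:

```shell
#!/usr/bin/env bash
someCommand="sleep 1"                 # stand-in for "python test.py"
pidfile=$(mktemp)

{
    ${someCommand} 2>&1 &
    echo $! > "${pidfile}"            # export the child's pid to the parent
    wait $!
    exit $?
} | tee -a training.log &

sleep 0.2                             # crude: give the subshell time to write the file
echo "someCommand pid: $(cat "${pidfile}")"
wait                                  # wait for the tee pipeline to finish
rm -f "${pidfile}" training.log
```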