GPU device plugin
device plugin init and list-watch
init
At device plugin startup:
func (m *NvidiaDevicePlugin) initialize() {
m.Devices() is called to obtain the list of GPU devices on the current node.
list-watch
list-watch returns the details of the GPU devices. Note that an unhealthy device has its health field set to Unhealthy.
for {
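The loop described above can be sketched roughly as follows. This is a minimal illustration, not the plugin's actual code: the Device type is a local stand-in that mirrors only the ID and Health fields of the device plugin API, and the send callback is a hypothetical replacement for the gRPC stream.

```go
package main

import "fmt"

// Device is a local stand-in mirroring the ID and Health fields of the
// device plugin API's Device message (assumption: only these two matter here).
type Device struct {
	ID     string
	Health string
}

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// snapshot copies the current device state so that later health changes
// do not alter responses that were already sent.
func snapshot(devs []*Device) []Device {
	out := make([]Device, 0, len(devs))
	for _, d := range devs {
		out = append(out, *d)
	}
	return out
}

// listAndWatch sends the full device list once, then resends it each time
// a device arrives on the health channel. send is a hypothetical stand-in
// for the stream's Send method.
func listAndWatch(devs []*Device, health <-chan *Device, stop <-chan struct{}, send func([]Device)) {
	send(snapshot(devs))
	for {
		select {
		case <-stop:
			return
		case d := <-health:
			d.Health = Unhealthy
			send(snapshot(devs))
		}
	}
}

// demo runs the loop with two made-up devices and reports one as failed.
func demo() [][]Device {
	devs := []*Device{{ID: "GPU-0", Health: Healthy}, {ID: "GPU-1", Health: Healthy}}
	health := make(chan *Device)
	stop := make(chan struct{})
	done := make(chan struct{})
	var sent [][]Device

	go func() {
		listAndWatch(devs, health, stop, func(r []Device) { sent = append(sent, r) })
		close(done)
	}()

	health <- devs[1] // report GPU-1 as failed
	close(stop)
	<-done
	return sent
}

func main() {
	for i, resp := range demo() {
		fmt.Println("response", i, resp)
	}
}
```

The point of the design is that the receiver always gets the complete device list, so a single changed Health value is enough to communicate a failure.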
device plugin health check
The health check implementation is also fairly straightforward.
go m.CheckHealth(m.stop, m.cachedDevices, m.health)
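It uses the nvml Go library API to register every discovered device with an eventSet. A device that does not support this API is marked Unhealthy directly.
Once registration succeeds, a for loop starts and waits for events.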
// http://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
Note that the GPU device plugin ignores certain Xids, because these Xids are explicitly not hardware faults.
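The health check flow described in this section (register devices, mark unsupported ones Unhealthy, loop over events, skip certain Xids) might look roughly like the sketch below. It deliberately avoids the real nvml bindings: xidEvent, supportsEvents, and the channels are hypothetical stand-ins, and the ignored Xid values (31, 43, 45) are an assumption used for illustration, so consult the plugin source for the actual list.

```go
package main

import "fmt"

// xidEvent is a hypothetical stand-in for a critical-error event from the
// driver: the affected device and the reported Xid code.
type xidEvent struct {
	DeviceID string
	Xid      uint64
}

// ignoredXids holds Xids treated as application errors rather than
// hardware faults. Assumption: these values are illustrative only.
var ignoredXids = map[uint64]bool{31: true, 43: true, 45: true}

// checkHealth reports devices that cannot be registered for events as
// unhealthy right away, then reports the device of every non-ignored Xid.
// It closes the unhealthy channel once the event channel is drained.
func checkHealth(devices []string, supportsEvents func(string) bool,
	events <-chan xidEvent, unhealthy chan<- string) {
	registered := map[string]bool{}
	for _, id := range devices {
		if !supportsEvents(id) {
			unhealthy <- id // no event support: mark Unhealthy directly
			continue
		}
		registered[id] = true
	}
	for e := range events {
		if ignoredXids[e.Xid] || !registered[e.DeviceID] {
			continue // not a hardware fault, or not a device we watch
		}
		unhealthy <- e.DeviceID
	}
	close(unhealthy)
}

// demo feeds three made-up events through checkHealth.
func demo() []string {
	events := make(chan xidEvent, 3)
	events <- xidEvent{DeviceID: "GPU-0", Xid: 31} // ignored Xid
	events <- xidEvent{DeviceID: "GPU-0", Xid: 79} // treated as a fault
	events <- xidEvent{DeviceID: "GPU-1", Xid: 43} // ignored Xid
	close(events)

	unhealthy := make(chan string, 8)
	checkHealth([]string{"GPU-0", "GPU-1", "GPU-2"},
		func(id string) bool { return id != "GPU-2" }, // GPU-2 lacks event support
		events, unhealthy)

	var out []string
	for id := range unhealthy {
		out = append(out, id)
	}
	return out
}

func main() {
	fmt.Println("unhealthy devices:", demo())
}
```

In this sketch GPU-1 stays healthy even though an event arrived for it, because the only Xid reported for it is on the ignore list.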
NVIDIA Health & Diagnostic
https://docs.nvidia.com/deploy/index.html
xid
https://docs.nvidia.com/deploy/xid-errors/index.html#topic_4
The Xid message is an error report from the NVIDIA driver that is printed to the operating system’s kernel log or event log. Xid messages indicate that a general GPU error occurred, most often due to the driver programming the GPU incorrectly or to corruption of the commands sent to the GPU. The messages can be indicative of a hardware problem, an NVIDIA software problem, or a user application problem.
Under Linux, the Xid error messages are placed in the location /var/log/messages. Grep for “NVRM: Xid” to find all the Xid messages.
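As a rough programmatic equivalent of grepping for "NVRM: Xid", the sketch below scans log text and extracts the device identifier and Xid number from each matching line. The line layout assumed by the regular expression (`NVRM: Xid (<device>): <number>`) and the sample log lines are assumptions for illustration; real messages may differ between driver versions.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strconv"
	"strings"
)

// xidLine matches the assumed layout "NVRM: Xid (<device>): <number>".
var xidLine = regexp.MustCompile(`NVRM: Xid \(([^)]*)\): (\d+)`)

// xidRecord is one parsed Xid message.
type xidRecord struct {
	Device string
	Code   int
}

// findXids returns one record per log line that contains an Xid message;
// lines that do not match, or whose number cannot be parsed, are skipped.
func findXids(log string) []xidRecord {
	var out []xidRecord
	sc := bufio.NewScanner(strings.NewReader(log))
	for sc.Scan() {
		m := xidLine.FindStringSubmatch(sc.Text())
		if m == nil {
			continue
		}
		code, err := strconv.Atoi(m[2])
		if err != nil {
			continue
		}
		out = append(out, xidRecord{Device: m[1], Code: code})
	}
	return out
}

// sampleLog is made-up input, not real driver output.
const sampleLog = `kernel: NVRM: Xid (PCI:0000:03:00): 31, pid=1234, sample text
kernel: usb 1-1: new high-speed USB device
kernel: NVRM: Xid (PCI:0000:04:00): 79, sample text`

func main() {
	for _, r := range findXids(sampleLog) {
		fmt.Printf("device=%s xid=%d\n", r.Device, r.Code)
	}
}
```

To run it against a real system, the contents of /var/log/messages would be passed to findXids in place of sampleLog.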
NVVS (NVIDIA Validation Suite)
https://docs.nvidia.com/deploy/nvvs-user-guide/index.html
Easily integrate into Cluster Scheduler and Cluster Management applications
k8s device
type ListAndWatchResponse struct {
Using this Health information, k8s can exclude Unhealthy GPU devices from scheduling.
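On the receiving side, using the Health field amounts to filtering the device list. The following is a minimal sketch under stated assumptions: Device and ListAndWatchResponse are local types that mirror only the fields discussed here, and healthyIDs is a hypothetical helper, not a Kubernetes function.

```go
package main

import "fmt"

// Device and ListAndWatchResponse are local mirrors of the device plugin
// API messages, reduced to the fields this sketch needs.
type Device struct {
	ID     string
	Health string
}

type ListAndWatchResponse struct {
	Devices []*Device
}

const (
	Healthy   = "Healthy"
	Unhealthy = "Unhealthy"
)

// healthyIDs (hypothetical helper) returns the IDs of devices that can
// still be handed out, skipping nil entries and anything not Healthy.
func healthyIDs(resp *ListAndWatchResponse) []string {
	var ids []string
	for _, d := range resp.Devices {
		if d != nil && d.Health == Healthy {
			ids = append(ids, d.ID)
		}
	}
	return ids
}

// sampleResponse builds a made-up response with one failed GPU.
func sampleResponse() *ListAndWatchResponse {
	return &ListAndWatchResponse{Devices: []*Device{
		{ID: "GPU-0", Health: Healthy},
		{ID: "GPU-1", Health: Unhealthy},
		{ID: "GPU-2", Health: Healthy},
	}}
}

func main() {
	fmt.Println("allocatable GPUs:", healthyIDs(sampleResponse()))
}
```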