InfiniBand / Ethernet
Configure RoCE
https://community.mellanox.com/s/article/howto-configure-roce-on-connectx-4
https://community.mellanox.com/s/article/understanding-show-gids-script
Use the `ibv_query_gid` and `ibv_find_gid_index` functions defined in libibverbs to get the desired GID index.
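Before writing any code against libibverbs, the GID table can also be dumped from the command line with `ibv_devinfo` from libibverbs-utils. A minimal sketch, guarded because the tool (or any RDMA device) may be absent on the current host:

```shell
# Dump device names and GID table entries if ibv_devinfo is available;
# -v is what makes ibv_devinfo print the GID[n] lines.
if command -v ibv_devinfo >/dev/null 2>&1; then
  ibv_devinfo -v 2>/dev/null | grep -E 'hca_id|GID\[' || true
else
  echo "ibv_devinfo not installed"
fi
```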
From the material above, RoCE first requires NIC support, e.g. a Mellanox ConnectX-4.
Taking a Mellanox NIC as an example:
- Find the network device that a GID of the mlnx device maps to:
  cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/ndevs/1
- Check the RoCE type of GID index 1:
  cat /sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/1
- Check the address of GID index 1:
  cat /sys/class/infiniband/mlx5_0/ports/1/gids/1
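The three lookups above can be combined into a small script that enumerates every populated GID, similar to what the show_gids script from the second article does. A minimal sketch, assuming the mlx5 sysfs layout shown above; it prints nothing (and exits 0) on a machine without RDMA devices:

```shell
# Enumerate every populated GID of every RDMA device and print its
# index, address, RoCE type, and the netdev it maps to.
for dev in /sys/class/infiniband/*; do
  [ -d "$dev" ] || continue
  for port in "$dev"/ports/*; do
    for gid in "$port"/gids/*; do
      [ -e "$gid" ] || continue
      addr=$(cat "$gid" 2>/dev/null) || continue
      # Skip unpopulated (all-zero) GID entries
      [ "$addr" = "0000:0000:0000:0000:0000:0000:0000:0000" ] && continue
      idx=$(basename "$gid")
      ndev=$(cat "$port/gid_attrs/ndevs/$idx" 2>/dev/null)
      type=$(cat "$port/gid_attrs/types/$idx" 2>/dev/null)
      printf '%s port %s index %s: %s (%s, %s)\n' \
        "$(basename "$dev")" "$(basename "$port")" "$idx" "$addr" "$type" "$ndev"
    done
  done
done
```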
| Interface | GID Index | RoCE version | GID Address |
| --- | --- | --- | --- |
| ens785f0 | 1 | RoCEv2 | fe80:0000:0000:0000:e61d:2dff:fef2:a488 |
Once the GID to use is determined, `ib_send_bw` can be pointed at that GID index (via its `-x` flag) to run RoCE traffic.
Also note that, per
https://community.mellanox.com/s/article/howto-configure-roce-on-connectx-4
VLAN interfaces added on top of the network device that the mlnx device maps to also support RoCE.
RoCE in containers
NCCL RoCE failed in container
NCCL WARN Call to ibv_modify_qp failed with error No such device
The failing call is in NCCL's IB setup code, in `ncclIbRtrQp`:

```cpp
// IB setup
ncclResult_t ncclIbRtrQp(ibv_qp* qp, struct ncclIbQpInfo* info) {
```
The guess is that, inside the container, the mlnx device itself is discovered, but the network device it maps to (e.g. ens785f0 in the demo above) is not, so no usable GID can be found for RoCE communication.
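One way to confirm this guess from inside the container is to check, for every GID entry, whether the netdev recorded in sysfs actually exists in the container's network namespace. A hedged sketch, assuming the same mlx5 sysfs layout as in the demo above:

```shell
# For every GID entry, check whether the netdev it maps to is visible in
# this network namespace. With only the default CNI interfaces present
# (e.g. a calico veth), the backing netdev is missing and no GID is
# usable for RoCE.
for ndev_file in /sys/class/infiniband/*/ports/*/gid_attrs/ndevs/*; do
  [ -e "$ndev_file" ] || continue
  ndev=$(cat "$ndev_file" 2>/dev/null) || continue
  [ -n "$ndev" ] || continue
  if [ -d "/sys/class/net/$ndev" ]; then
    echo "OK: $ndev_file -> $ndev"
  else
    echo "MISSING: $ndev_file -> $ndev (not in this netns)"
  fi
done
```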
ib_write_bw failed in container
Failed to modify QP 100 to RTR
Running `ib_write_bw` inside the container also fails; judging by the error message, it fails at the same step as NCCL's `ncclIbRtrQp` (modifying the QP to the RTR state).
multus-cni
https://github.com/k8snetworkplumbingwg/multus-cni
In theory, multus-cni can be used to add a RoCE network device into the container as a macvlan interface.
https://github.com/Mellanox/k8s-rdma-sriov-dev-plugin/issues/18
> instead of calico, you should use macvlan cni where those virtual devices are child of enp175s0. RoCE can make use of those netdevices.
> Other users are using multus plugin, which allows you to have multiple netdev interfaces in a Pod. Such as first managed default veth interface via your existing plugin, and second macvlan or sriov interface via 2nd cni.
> This way you get both of both world for performance and functionality.
Per the multus-cni quick start, assuming multus proves compatible in practice with the cluster's current default CNI plugin, an additional CRD resource must be created for the macvlan RoCE network device (if the host has multiple RoCE network devices, create one CRD resource per device).
cat <<EOF | kubectl create -f -
This assumes, of course, that the macvlan CNI plugin is already installed in the k8s cluster.
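Following the multus-cni quick start, such a CRD is a NetworkAttachmentDefinition whose `config` field embeds a macvlan CNI config. A hypothetical example for the ens785f0 device from the demo above; the resource name, subnet, and host-local IPAM choice are placeholders, not from the source:

```yaml
apiVersion: "k8s.cni.cncf.io/v1"
kind: NetworkAttachmentDefinition
metadata:
  name: roce-macvlan-ens785f0   # placeholder name; one per RoCE netdev
spec:
  config: '{
      "cniVersion": "0.3.1",
      "type": "macvlan",
      "master": "ens785f0",
      "mode": "bridge",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.2.0/24"
      }
    }'
```

A Pod then requests this network via the `k8s.v1.cni.cncf.io/networks: roce-macvlan-ens785f0` annotation, and multus attaches the macvlan interface alongside the default CNI interface.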
> type: This tells CNI which binary to call on disk. Each CNI plugin is a binary that’s called. Typically, these binaries are stored in /opt/cni/bin on each node, and CNI executes this binary. In this case we’ve specified the loopback binary (which create a loopback-type network interface). If this is your first time installing Multus, you might want to verify that the plugins that are in the “type” field are actually on disk in the /opt/cni/bin directory.
https://kubernetes.io/docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/
https://www.cni.dev/plugins/current/main/macvlan/
https://docs.docker.com/network/macvlan/
> Some applications, especially legacy applications or applications which monitor network traffic, expect to be directly connected to the physical network. In this type of situation, you can use the macvlan network driver to assign a MAC address to each container’s virtual network interface, making it appear to be a physical network interface directly connected to the physical network.