随风飘散的记忆

clusterd

Posted on 2025-06-01, updated on 2025-06-02

https://www.hiascend.com/document/detail/zh/mindcluster/70rc1/clustersched/dlug/mxdlug_007.html

There are the following kinds of ConfigMap:

cmDevice: ns, kube-system; cmName, mindx-dl-deviceinfo-{NodeName}; which is reported by device-plugin
cmNode: ns, mindx-dl; cmName, mindx-dl-nodeinfo-{NodeName}; which is reported by nodeD
cmPingMesh: ns, cluster-system; cmName, pingmesh-config;
cmSuperPodDevice: ns, cluster-system; cmName, super-pod-{SuperPodId}; maintained by clusterD
- in particular, {RAS_NET_ROOT_PATH}/cluster/super-pod-{SuperPodId}/super-pod-{SuperPodId}.json; maintained by clusterD
cmPublicFault: mc-consumer-publicfault=true label;
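The naming scheme above can be sketched as a few small helpers. This is only an illustration of the namespace/name patterns listed in these notes; the `ConfigMapRef` type and the function names are hypothetical and are not part of clusterD.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfigMapRef:
    """Namespace and name of a ConfigMap (hypothetical helper type)."""
    namespace: str
    name: str


def device_info_cm(node_name: str) -> ConfigMapRef:
    # cmDevice: reported by device-plugin, one per node
    return ConfigMapRef("kube-system", f"mindx-dl-deviceinfo-{node_name}")


def node_info_cm(node_name: str) -> ConfigMapRef:
    # cmNode: reported by nodeD, one per node
    return ConfigMapRef("mindx-dl", f"mindx-dl-nodeinfo-{node_name}")


def pingmesh_cm() -> ConfigMapRef:
    # cmPingMesh: a single ConfigMap with a fixed name
    return ConfigMapRef("cluster-system", "pingmesh-config")


def super_pod_cm(super_pod_id: str) -> ConfigMapRef:
    # cmSuperPodDevice: maintained by clusterD, one per super pod
    return ConfigMapRef("cluster-system", f"super-pod-{super_pod_id}")


# "worker-1" is a made-up node name used only for the demo
print(device_info_cm("worker-1"))
print(node_info_cm("worker-1"))
```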

The cmDevice ConfigMap mindx-dl-deviceinfo-{NodeName}, reported by device-plugin, contains the following information:

DeviceInfoCfg
SwitchInfoCfg

The cmPublicFault ConfigMap contains the following information:

PublicFault

pingmesh-config holds either the global pingmesh task configuration or the task configuration for a specific superPodID, in the following format:

{
    "activate": "on",
    "task_interval": 5
}
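A minimal sketch of reading such a configuration entry. The two keys come from the example above; the validation rules (that `activate` is either "on" or "off", and that `task_interval` is a positive integer) are assumptions made for illustration, not behaviour taken from clusterD.

```python
import json

RAW = '{"activate": "on", "task_interval": 5}'


def parse_pingmesh_config(text: str) -> dict:
    """Parse one pingmesh task configuration and apply assumed sanity checks."""
    cfg = json.loads(text)
    # Assumption: "off" is the counterpart of "on"
    if cfg.get("activate") not in ("on", "off"):
        raise ValueError("activate must be 'on' or 'off'")
    interval = cfg.get("task_interval")
    # Assumption: the interval must be a positive integer
    if not isinstance(interval, int) or interval <= 0:
        raise ValueError("task_interval must be a positive integer")
    return cfg


config = parse_pingmesh_config(RAW)
print(config["activate"], config["task_interval"])
```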

The node annotations contain the following information:

product-serial-number
superPodID
baseDeviceInfos
serverType
serverIndex

model

Posted on 2024-02-24, updated on 2024-03-31

https://wangcong.net/article/FPandBP.html

pathways

https://blog.research.google/2022/04/pathways-language-model-palm-scaling-to.html

a single model that could generalize across domains and tasks while being highly efficient. An important milestone toward realizing this vision was to develop the new Pathways system to orchestrate distributed computation for accelerators.

few-shot

TPU v4 Pods

Pipelining is typically used with DCN

word to vector, This vector represents the word’s meaning and context within the given language

embedding layer, lookup table

Positional encoding
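As an illustration, here is a small sketch of the sinusoidal positional encoding. The notes only name the concept, so choosing the sinusoidal variant (sine on even dimensions, cosine on odd dimensions, base 10000) is an assumption; learned positional embeddings are an equally common alternative.

```python
import math


def positional_encoding(seq_len: int, d_model: int) -> list:
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Dimensions 2k and 2k+1 share the same frequency
            angle = pos / (10000 ** ((2 * (i // 2)) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


pe = positional_encoding(seq_len=4, d_model=8)
# Each row is added element-wise to the embedding of the token at that position
print(pe[0])
```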

https://medium.com/@tech-gumptions/transformer-architecture-simplified-3fb501d461c8

This means that the output of a layer is added to the initial input, allowing the model to learn to only make small changes to the input
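A tiny sketch of that residual connection. The `sublayer` function here is a hypothetical stand-in for an attention or feed-forward block; only the `x + sublayer(x)` pattern is the point.

```python
def sublayer(x):
    # Hypothetical stand-in for attention / feed-forward: a small correction
    return [0.1 * v for v in x]


def residual(x, fn):
    """Add the sublayer output back onto its input."""
    return [a + b for a, b in zip(x, fn(x))]


x = [1.0, 2.0, 3.0]
y = residual(x, sublayer)
print(y)
```

Because the input is passed through unchanged, the sublayer only has to learn the difference between input and desired output.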

The decoder’s job is to produce the English sentence based on both the original French sentence and the bits of the English sentence it has generated so far.

Input Embedding: Just as with the Encoder, the input to the Decoder (which is the target sequence during training) is first embedded into continuous vectors.

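It’s important to note that this masking is what keeps training consistent with inference. During training the full target sequence is available, so the mask hides future words; during inference the decoder generates one word at a time, so each position can only attend to the words produced so far.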
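A minimal sketch of a causal (look-ahead) mask applied to attention scores, in plain Python. The function names are illustrative and the scores are made-up values.

```python
import math


def causal_mask(n: int) -> list:
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]


def masked_softmax(scores: list, mask: list) -> list:
    """Row-wise softmax in which masked-out positions get zero weight."""
    result = []
    for row, mask_row in zip(scores, mask):
        exps = [math.exp(s) if allowed else 0.0 for s, allowed in zip(row, mask_row)]
        total = sum(exps)
        result.append([e / total for e in exps])
    return result


# Uniform made-up scores for a 3-token target sequence
scores = [[0.0] * 3 for _ in range(3)]
weights = masked_softmax(scores, causal_mask(3))
for row in weights:
    print(row)
```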

To summarize, the Decoder in the Transformer architecture processes its input through self-attention, cross-attention with the Encoder’s output, and position-wise Feed-Forward networks, repeatedly for each stacked block, culminating in a final output sequence after the softmax operation.

nccl tests

Posted on 2024-02-01, updated on 2024-01-31

modelarts python sdk job demo

Posted on 2023-05-09, updated on 2023-05-08

from modelarts.session import Session
from modelarts.estimatorV2 import Estimator
from modelarts.train_params import OutputData
from modelarts.train_params import InputData

session = Session(access_key='XXX',secret_key='YYY', project_id='ZZZ', region_name='cn-north-4')

# list job
# job_list = Estimator.get_job_list(session=session, offset=0, limit=10, sort_by="create_time", order="desc")
# print(job_list)

# create a basic training job
estimator = Estimator(session=session,
                      job_description='This is a basic training job',
                      user_image_url="deep-learning-demo/mpi:3.0.0-cuda10.2", # container image URL of the main container
                      user_command="echo hello-world",  # start command of the main container
                      outputs=[OutputData(obs_path="obs://zs-modelarts/pytorch/model/", name="model", local_path="/model", access_method="env")],
                      log_url="obs://zs-modelarts/pytorch/log/", # OBS path where the training job logs are dumped
                      train_instance_type="modelarts.p3.large.public.free", # public resource pool
                      train_instance_count=1 # number of training job nodes
                      )

job_instance = estimator.fit(job_name="job-0")

# get job id in job_instance
print(job_instance.job_id)

# view the training job log
# estimator = Estimator(session=session, job_id="2bfc13b6-782e-45ad-ae90-476dfa97591a")
# info = estimator.get_job_log()
# print(info)

# view the training job metrics
# estimator = Estimator(session=session, job_id="2bfc13b6-782e-45ad-ae90-476dfa97591a")
# info = estimator.get_job_metrics()
# print(info)

# delete the training job
# Estimator.delete_job_by_id(session=session, job_id="2bfc13b6-782e-45ad-ae90-476dfa97591a")

roce flow control

Posted on 2023-05-07

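Notes on my personal understanding

"Lossless" means no packet loss. Flow-control and congestion-control mechanisms that have evolved over time (global pause, PFC, DCQCN, and so on) make the source slow down before packets would be dropped, so that loss is avoided.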

ifconfig vs ethtool

https://enterprise-support.nvidia.com/s/article/ibdev2netdev

ibdev2netdev

Running the command above maps each adapter port to its net device.

For an InfiniBand link layer, the command usually reports an ib0 device, i.e. the IPoIB virtual NIC; for an Ethernet link layer, it usually reports an ens[xxx] NIC device.

Also note that the Tx and Rx values shown by ifconfig ens[xxx] actually match the following counters in ethtool -S ens[xxx]:

https://enterprise-support.nvidia.com/s/article/understanding-mlx5-ethtool-counters

rx_bytes: Representor only: bytes received, that were handled by the hypervisor. supported from kernel 4.18
tx_bytes: Representor only: bytes transmitted, that were handled by the hypervisor. supported from kernel 4.18

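In practice, when traffic goes over the RDMA NIC, these two counters barely increase, whereas the ethtool counters rx_bytes_phy / tx_bytes_phy grow in line with the actual traffic. So the value that ifconfig reads (perhaps in older versions, or from the kernel?) is probably just one of the NIC's counters, and that counter does not reflect the real traffic, which makes the ifconfig numbers misleading. Use ethtool -S ens[xxx] to inspect the statistics of an RDMA NIC.

rx_bytes_phy, ConnectX-3 naming : rx_bytes
For example, on a cx3 NIC the ifconfig installed on the host does read the "correct" value; on cx4/5/6 NICs the meaning of rx_bytes changed, and it now records "Representor only: bytes received, that were handled by the hypervisor. supported from kernel 4.18".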
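A small sketch for pulling the physical-port counters out of `ethtool -S` output. The parser only assumes the usual `name: value` line layout; the sample text and its numbers are made up for the demo, and `read_ethtool_stats` is a hypothetical wrapper that is defined but not called here.

```python
import subprocess


def parse_ethtool_stats(text: str) -> dict:
    """Turn 'name: value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().partition(":")
        value = value.strip()
        # Skip headers such as "NIC statistics:" and any non-numeric values
        if sep and value.lstrip("-").isdigit():
            stats[name.strip()] = int(value)
    return stats


def read_ethtool_stats(ifname: str) -> dict:
    """Run ethtool -S on a real interface (requires ethtool on the host)."""
    out = subprocess.run(["ethtool", "-S", ifname],
                         capture_output=True, text=True, check=True)
    return parse_ethtool_stats(out.stdout)


# Made-up sample output, for illustration only
SAMPLE = """NIC statistics:
     rx_bytes: 1024
     tx_bytes: 2048
     rx_bytes_phy: 987654321
     tx_bytes_phy: 123456789
"""

stats = parse_ethtool_stats(SAMPLE)
print(stats["rx_bytes_phy"], stats["tx_bytes_phy"])
```

Sampling the `*_phy` counters twice and dividing the difference by the elapsed time gives an approximate throughput for RDMA traffic.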

Common query commands for switch ports

https://support.huawei.com/enterprise/zh/doc/EDOC1100153180/e4418444

https://www.infoq.cn/article/o3rnxl2trb1gxemmxdoj

egress/ingress port

Check whether packets are being dropped

display interface 100GE1/0/1

Input:                                                                      
  Unicast:            11657620879,   Multicast:                     695     
  Broadcast:                    0,   Jumbo:                           0     
  Discard:                      0,   Frames:                          0     
  Pause:                        0                                           

  Total Error:                  0                                           
  CRC:                          0,   Giants:                          0     
  Jabbers:                      0,   Fragments:                       0     
  Runts:                        0,   DropEvents:                      0     
  Alignments:                   0,   Symbols:                         0     
  Ignoreds:                     0                                           

Output:                                                                     
  Unicast:              536390526,   Multicast:                     695     
  Broadcast:                    0,   Jumbo:                           0     
  Discard:                      0,   Buffers Purged:                  0     
  Pause:                 18913700

Packet loss in the upstream direction (Input)

display qos buffer ingress-statistics interface 100GE1/0/1

View the ingress statistics

Interface                   Dropped        Drop Rate   Drop Time                
                     (Packets/Bytes)        (pps/bps)                           
----------------------------------------------------------------                
100GE1/0/1                         0                0          -                
                                   0                0                           
----------------------------------------------------------------

Packet loss in the downstream direction (Output)

display qos queue statistics interface 100GE1/0/1

View the queue statistics

----------------------------------------------------------------------------------------------                                           
    4         0                   6             0                   0             0          -                                                                 
      100000000                1092             0                   0             0                                                                            
----------------------------------------------------------------------------------------------

View the buffer usage of the egress queues on an interface

display qos buffer egress-usage interface 100GE1/0/1

This shows which queues are lossless

Egress Buffer Usage (KBytes) on single queue: (Current/Total)                   
*: Dynamic threshold                                                            
------------------------------------------------------------                    
Interface       Queue   Type        Guaranteed        Shared                    

------------------------------------------------------------                    
100GE1/0/1          0   Lossy              0/1          0/5*                    
                    1   Lossy              0/1          0/5*                    
                    2   Lossy              0/1          0/5*                    
                    3   Lossless           0/1       0/10156                    
                    4   Lossy              0/1          0/5*                    
                    5   Lossy              0/1          0/5*                    
                    6   Lossy              0/1          0/5*                    
                    7   Lossy              0/1          0/5*                    
------------------------------------------------------------                    
Lossless Service Pool (cells):  0/0                                             
Lossy    Service Pool (cells):  0/151136                                        
------------------------------------------------------------

reading nccl

Posted on 2023-05-03, updated on 2023-06-11

The analysis uses the following NCCL version as an example:

pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime
pytorch 1.13, cuda 11.6.2, nccl 2.14.3

python -c "import torch;print(torch.cuda.nccl.version())"

Background on RDMA and networking

https://www.doc.ic.ac.uk/~jgiceva/teaching/ssc18-rdma.pdf, rdma tutorial
https://www.openfabrics.org/images/eventpresos/workshops2013/IBUG/2013_UserDay_Thur_1400_Bob-Russell-programming-concepts.pdf, ofa rdma program, excellent, highly recommended
https://blog.zhaw.ch/icclab/infiniband-an-introduction-simple-ib-verbs-program-with-rdma-write/, very good
https://insujang.github.io/2020-02-09/introduction-to-programming-infiniband/, qp state transition flow
https://www.rdmamojo.com/2012/05/05/qp-state-machine/, qp states in detail
https://arthurchiao.art/blog/linux-net-stack-implementation-rx-zh, Linux rx principles and kernel implementation
https://support.huawei.com/enterprise/zh/doc/EDOC1100197616/3dfff4ec, inspecting mlnx NICs in an HPC cluster
https://support.huawei.com/enterprise/zh/doc/EDOC1100197616/37b637af, inspecting RoCE traffic information on HPC cluster switches

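Abbreviations

CA: channel adapter, i.e. an rdma (infiniband) NIC; HCA, host channel adapter
RoCE: rdma over Converged Ethernet; a cost-effective way to implement rdma (for cloud vendors it is generally enough to swap in RoCE-capable NICs; the transport/switching equipment does not need to be replaced to support rdma at the application layer)
mlnx ofed: the ib driver
rdma: remote direct memory access
rdma program
- qp: queue pair, consisting of a send queue and a recv queue; Rtr, ready to receive; Rts, ready to send (states of the qp state machine); a qp can be thought of as the client in rdma
- cq: completion queue, used to retrieve the results of send/recv
- mr: memory region, the memory that rdma can operate on; rkey, the key for remote access to the mr; lkey, the key for local access to the mr
- pd: protection domain, used to associate mrs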
- wr: work request
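The qp state machine mentioned in the list can be sketched as a toy model. It covers only the RESET, INIT, RTR, RTS bring-up path; real qps have further states (for example SQD, SQE, ERR) that are left out. `ToyQp` is a hypothetical class and not part of any verbs binding.

```python
from enum import Enum


class QpState(Enum):
    RESET = "RESET"
    INIT = "INIT"
    RTR = "RTR"   # ready to receive
    RTS = "RTS"   # ready to send


# Forward path used when bringing up a connection (simplified)
NEXT_STATE = {
    QpState.RESET: QpState.INIT,
    QpState.INIT: QpState.RTR,
    QpState.RTR: QpState.RTS,
}


class ToyQp:
    def __init__(self):
        self.state = QpState.RESET

    def modify(self, target: QpState) -> None:
        """Mimic ibv_modify_qp for the bring-up path only."""
        if NEXT_STATE.get(self.state) is not target:
            raise ValueError(f"illegal transition {self.state.name} -> {target.name}")
        self.state = target

    def can_post_recv(self) -> bool:
        # Receive work requests may be posted from INIT onwards
        return self.state in (QpState.INIT, QpState.RTR, QpState.RTS)

    def can_post_send(self) -> bool:
        # In this simplified model, sends are only allowed in RTS
        return self.state is QpState.RTS


qp = ToyQp()
for target in (QpState.INIT, QpState.RTR, QpState.RTS):
    qp.modify(target)
print(qp.state.name, qp.can_post_send())
```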

nccl net

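In net.h, ncclNet is effectively the network interface definition; for example, it defines interfaces such as ncclNetIsend.

These interfaces have concrete implementations such as netSocketIsend and netIBIsend, found in the following two objects: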

ncclNetSocket
ncclNetIb

Overview of the net flow

socket server

listen

send

connect
send check
send
test

recv

accept
recv check
recv
test

nccl net ib

https://github.com/Mellanox/nv_peer_memory
https://download.nvidia.com/XFree86/Linux-x86_64/470.42.01/README/nvidia-peermem.html, This module, originally maintained by Mellanox on GitHub, is now included with the NVIDIA Linux GPU driver

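The advantage of net ib over net socket is that net ib implements rdma through the ibverbs api (provided the host has an rdma-capable NIC). With rdma, the NIC reads and writes remote memory (HOST MEM) directly, achieving the so-called OS bypass / CPU offload effect and greatly improving the efficiency of moving data over the network.
If the nv_peer_mem module is installed, net ib can read and write remote GPU memory (DEV MEM) directly through the NIC. Newer gpu drivers (for example 470) ship with the nvidia-peermem module, which can replace nv_peer_mem to provide GPU Direct RDMA.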

Overview of the net ib flow

ib connect, send check / ib accept, recv check

In this phase, the sender and receiver use a socket to initialize rdma communication, for example initializing the qps.

ib connect/ib accept

recv/send

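In this phase, the receiver rdma-writes the memory address where it wants to receive data into the sender, using the fifo data structure; the sender then rdma-writes the data to the memory address the receiver indicated in the fifo.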

ib recv/ib send

socket server

listen: listen on a port of the bootstrap NIC and start the socket server

send

connect: create the pd/cq (comm->verbs), mr (comm->fifo), and qp (comm->qps); finally send qpInfo (fifo addr, fifo rkey, qp, and so on) to the receiver over the socket
send check: receive the receiver's qpInfo over the socket; use it to drive the state transitions of the sender's (local) qp, which can be seen as connecting the sender qp to the receiver qp (Rtr, Rts); send 1 (ready) to the receiver over the socket; the send comm is now ready
send: ibv_post_send posts a wr to the qp's send queue (the cq reports whether it completed); for the ncclIbSendFifo slot in the send fifo, if the slot is ready, the sender takes the MEM address the receiver wrote into ncclIbSendFifo and writes the data directly to that address
test: ibv_poll_cq retrieves completed wrs from the cq to determine whether the send request has finished

recv

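accept: receive the sender's qpInfo (the sender's fifo addr, fifo rkey, qp, and so on); create the pd/cq (rComm->verbs), mr (rComm->remFifo.elems), and qp (rComm->qps); use the sender's qpInfo to drive the state transitions of the receiver's (local) qp; finally send qpInfo (fifo addr, fifo rkey, qp, and so on) to the sender over the socket
recv check: receive 1 (ready); the recv comm is now ready
recv: ibv_post_recv posts a wr to the qp's recv queue (the cq reports whether it completed); ncclIbPostFifo rdma-writes the fifo information (the MEM address where the receiver will accept data, the rkey, and so on) into the sender's fifo
test: ibv_poll_cq retrieves completed wrs from the cq to determine whether the receive request has finished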
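The fifo handshake in the send/recv steps above can be modelled with a toy simulation. It is a deliberately simplified sketch: plain Python assignments stand in for rdma writes, there is one buffer per slot, and completions are not modelled. The class names and the "idx equals head + 1 means ready" convention are assumptions chosen for illustration, not a transcription of net_ib.cc.

```python
from dataclasses import dataclass


@dataclass
class FifoSlot:
    addr: int = 0   # where the receiver wants the data
    size: int = 0   # how many bytes the receiver can accept
    rkey: int = 0   # key authorising remote access to that memory
    idx: int = 0    # 0 means "not posted yet" in this toy model


class Sender:
    def __init__(self, slots: int = 4):
        self.fifo = [FifoSlot() for _ in range(slots)]
        self.head = 0

    def try_send(self, receiver, data: bytes) -> bool:
        slot = self.fifo[self.head % len(self.fifo)]
        if slot.idx != self.head + 1:
            return False  # the receiver has not posted a matching recv yet
        if len(data) > slot.size or slot.rkey != receiver.rkey:
            raise ValueError("slot does not authorise this write")
        # Stand-in for the rdma write into the receiver's memory
        receiver.memory[slot.addr:slot.addr + len(data)] = data
        self.head += 1
        return True


class Receiver:
    def __init__(self, mem_size: int, rkey: int):
        self.memory = bytearray(mem_size)
        self.rkey = rkey
        self.tail = 0

    def post_recv(self, sender: Sender, addr: int, size: int) -> None:
        # Stand-in for rdma-writing the fifo entry into the sender's fifo
        slot = sender.fifo[self.tail % len(sender.fifo)]
        self.tail += 1
        slot.addr, slot.size, slot.rkey, slot.idx = addr, size, self.rkey, self.tail


rx = Receiver(mem_size=64, rkey=0x1234)
tx = Sender()
first_try = tx.try_send(rx, b"hello")    # nothing posted yet
rx.post_recv(tx, addr=8, size=16)
second_try = tx.try_send(rx, b"hello")   # slot is ready now
print(first_try, second_try, bytes(rx.memory[8:13]))
```

The point of the design is that the receiver decides where data lands, and the sender never writes until the receiver has published a destination.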

The nccl recv flow

Memory handed to rdma must first be registered as a memory region (mr); the registration happens in the recvProxyConnect method.

NCCLCHECK(ncclNetRegMr(comm, resources->netRecvComm, resources->buffers[p], resources->buffSizes[p], NCCL_NET_MAP_DEV_MEM(map, buffs[p]) ? NCCL_PTR_CUDA : NCCL_PTR_HOST, &resources->mhandles[p]));

The recvProxyProgress method then runs; where network communication is involved, data is ultimately received through ncclNetIrecv, whose net ib implementation is ncclIbIrecv.

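The nccl flow

From reading the code, the overall nccl communication mechanism is roughly:

the nccl group starts
the nccl proxy service starts
the communication operator entry is added to the ncclTask queue
the group submits the communication operator entry to the proxy
the proxy uses the concrete implementation for that operator to transfer the parameters

The C++ code is fairly hard to read ... the above is a rough understanding for now.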

nccl ib-related environment variables

https://docs.nvidia.com/deeplearning/nccl/archives/nccl_2143/user-guide/docs/env.html
https://www.rdmamojo.com/2013/01/12/ibv_modify_qp/

RoCE

NCCL_SOCKET_IFNAME: selects the ib bootstrap NIC (socket)
NCCL_IB_HCA: selects the ib NIC
NCCL_IB_RETRY_CNT: qp.retry_cnt, A 3 bits value of the total number of times that the QP will try to resend the packets before reporting an error because the remote side doesn’t answer in the primary path
NCCL_IB_TIMEOUT: qp.timeout, The minimum timeout that a QP waits for ACK/NACK from remote QP before retransmitting the packet.
NCCL_IB_TC: qp.ah_attr.grh.traffic_class, traffic_class, Using this value, the originator of the packets specifies the required delivery priority for handling them by the routers
NCCL_IB_GID_INDEX: qp.ah_attr.grh.sgid_index, An index in the port’s GID table that will be used to identify the originator of the packet
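A sketch of setting these variables from Python before the training process initialises NCCL, together with the timeout arithmetic. The interface name, HCA name, GID index, and traffic class are placeholder values chosen for illustration and must be adapted to the actual host; 18 and 7 are the defaults that appear in the v2.14.3 net_ib.cc listing. The timeout formula (4.096 us multiplied by 2 to the power of timeout, with 0 meaning wait forever) is the one given in the ibv_modify_qp documentation linked above.

```python
import os


def qp_ack_timeout_seconds(timeout: int) -> float:
    """Minimum ACK wait before retransmit for a given qp.timeout value."""
    if timeout == 0:
        return float("inf")  # special value: wait indefinitely
    return 4.096e-6 * (2 ** timeout)


nccl_env = {
    "NCCL_SOCKET_IFNAME": "eth0",   # placeholder bootstrap interface
    "NCCL_IB_HCA": "mlx5_0",        # placeholder RDMA device
    "NCCL_IB_GID_INDEX": "3",       # placeholder GID table index (RoCE)
    "NCCL_IB_TC": "0",              # placeholder traffic class (RoCE)
    "NCCL_IB_TIMEOUT": "18",        # default in the listing
    "NCCL_IB_RETRY_CNT": "7",       # default in the listing
}
# Must run before the process creates its NCCL communicator
os.environ.update(nccl_env)

timeout = int(os.environ["NCCL_IB_TIMEOUT"])
print(f"ACK timeout is about {qp_ack_timeout_seconds(timeout):.3f} s per attempt")
```

With the default of 18 the per-attempt wait is roughly 1.07 seconds; each increment of NCCL_IB_TIMEOUT doubles it.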

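Annotated net_ib.cc

Annotations on the net_ib.cc code. I have tried to understand it as best I can; mistakes are inevitable, and I am still learning.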
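One detail of the listing that is easy to miss is how ncclIbInit derives the port speed: ncclIbSpeed and ncclIbWidth look up the lowest set bit of the active_speed and active_width bitmasks in two small tables and multiply the results. The following is a Python transcription of that logic for experimentation; the function names are my own, and reading the result as Mb/s is my interpretation of the table values.

```python
# Tables copied from the listing (ibvWidths / ibvSpeeds)
IBV_WIDTHS = [1, 4, 8, 12, 2]
IBV_SPEEDS = [2500, 5000, 10000, 10000, 14000, 25000, 50000]


def first_bit_set(val: int, max_index: int) -> int:
    """Index of the lowest set bit, capped at max_index."""
    i = 0
    while i < max_index and (val & (1 << i)) == 0:
        i += 1
    return i


def ib_width(width_mask: int) -> int:
    return IBV_WIDTHS[first_bit_set(width_mask, len(IBV_WIDTHS) - 1)]


def ib_speed(speed_mask: int) -> int:
    return IBV_SPEEDS[first_bit_set(speed_mask, len(IBV_SPEEDS) - 1)]


def port_speed(active_speed: int, active_width: int) -> int:
    """Per-lane speed multiplied by lane count, as in ncclIbInit."""
    return ib_speed(active_speed) * ib_width(active_width)


# active_speed bit 5 (value 32) with active_width bit 1 (value 2, i.e. 4 lanes)
print(port_speed(32, 2))
```

Unknown bit patterns fall through to the last table entry because the search is capped at the final index.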

https://github.com/NVIDIA/nccl/blob/v2.14.3-1/src/transport/net_ib.cc

/*************************************************************************
 * Copyright (c) 2016-2022, NVIDIA CORPORATION. All rights reserved.
 *
 * See LICENSE.txt for license information
 ************************************************************************/

#include "nccl.h"
#include "core.h"
#include "socket.h"
#include "net.h"
#include "graph.h"
#include "utils.h"
#include "param.h"

#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <poll.h>
#include <sys/types.h>
#include <unistd.h>
#define ENABLE_TIMER 0
#include "timer.h"

#include "ibvwrap.h"

#define MAXNAMESIZE 64
static char ncclIbIfName[MAX_IF_NAME_SIZE+1];
static union ncclSocketAddress ncclIbIfAddr;

// Memory region in rdma (infiniband): pinned memory that the rdma NIC can operate on (read/write) directly
struct ncclIbMr {
  uintptr_t addr;
  int pages;
  int refs;
  ibv_mr *mr;
};

// ncclIbMrCache
// Keeps track of the registered ib mrs
struct ncclIbMrCache {
  struct ncclIbMr *slots;
  int capacity, population;
};

static int ncclNIbDevs = -1;

// ib device information, aggregated into one struct
struct alignas(64) ncclIbDev {
  pthread_mutex_t lock;
  int device;
  uint64_t guid;
  uint8_t port;
  uint8_t link;
  int speed;
  ibv_context* context;
  int pdRefs;
  ibv_pd* pd;
  char devName[MAXNAMESIZE];
  char* pciPath;
  int realPort;
  int maxQp;
  struct ncclIbMrCache mrCache;
};

#define MAX_IB_PORT 15
struct userIbDev {
  char devName[MAXNAMESIZE];
  uint16_t port_en;
};

#define MAX_IB_DEVS 16
struct ncclIbDev ncclIbDevs[MAX_IB_DEVS];
struct userIbDev userIbDevs[MAX_IB_DEVS];
pthread_mutex_t ncclIbLock = PTHREAD_MUTEX_INITIALIZER;
static int ncclIbRelaxedOrderingEnabled = 0;

NCCL_PARAM(IbGidIndex, "IB_GID_INDEX", 0);
NCCL_PARAM(IbTimeout, "IB_TIMEOUT", 18);
NCCL_PARAM(IbRetryCnt, "IB_RETRY_CNT", 7);
NCCL_PARAM(IbPkey, "IB_PKEY", 0);
NCCL_PARAM(IbUseInline, "IB_USE_INLINE", 0);
NCCL_PARAM(IbSl, "IB_SL", 0);
NCCL_PARAM(IbTc, "IB_TC", 0);
NCCL_PARAM(IbArThreshold, "IB_AR_THRESHOLD", 8192);
NCCL_PARAM(IbPciRelaxedOrdering, "IB_PCI_RELAXED_ORDERING", 2);

pthread_t ncclIbAsyncThread;
// https://www.rdmamojo.com/2012/08/11/ibv_get_async_event/
// A common error reported here is: NET/IB : Got async event : port xxx
// It usually means the rdma NIC is down
static void* ncclIbAsyncThreadMain(void* args) {
  struct ibv_context* context = (struct ibv_context*)args;
  while (1) {
    struct ibv_async_event event;
    if (ncclSuccess != wrap_ibv_get_async_event(context, &event)) { break; }
    char *str;
    if (ncclSuccess != wrap_ibv_event_type_str(&str, event.event_type)) { break; }
    if (event.event_type != IBV_EVENT_COMM_EST)
      WARN("NET/IB : Got async event : %s", str);
    if (ncclSuccess != wrap_ibv_ack_async_event(&event)) { break; }
  }
  return NULL;
}

NCCL_PARAM(IbDisable, "IB_DISABLE", 0);

static ncclResult_t ncclIbGetPciPath(char* devName, char** path, int* realPort) {
  char devicePath[PATH_MAX];
  snprintf(devicePath, PATH_MAX, "/sys/class/infiniband/%s/device", devName);
  char* p = realpath(devicePath, NULL);
  if (p == NULL) {
    WARN("Could not find real path of %s (%s)", devName, devicePath);
  } else {
    // Merge multi-port NICs into the same PCI device
    p[strlen(p)-1] = '0';
    // Also merge virtual functions (VF) into the same device
    p[strlen(p)-3] = '0';
    // And keep the real port aside (the ibv port is always 1 on recent cards)
    *realPort = 0;
    for (int d=0; d<ncclNIbDevs; d++) {
      if (strcmp(p, ncclIbDevs[d].pciPath) == 0) (*realPort)++;
    }
  }
  *path = p;
  return ncclSuccess;
}

static int ibvWidths[] = { 1, 4, 8, 12, 2 };
static int ibvSpeeds[] = { 2500, 5000, 10000, 10000, 14000, 25000, 50000 };
static int firstBitSet(int val, int max) {
  int i = 0;
  while (i<max && ((val & (1<<i)) == 0)) i++;
  return i;
}
static int ncclIbWidth(int width) {
  return ibvWidths[firstBitSet(width, sizeof(ibvWidths)/sizeof(int)-1)];
}
static int ncclIbSpeed(int speed) {
  return ibvSpeeds[firstBitSet(speed, sizeof(ibvSpeeds)/sizeof(int)-1)];
}

// Determine whether RELAXED_ORDERING is enabled and possible
static int ncclIbRelaxedOrderingCapable(void) {
  int roMode = ncclParamIbPciRelaxedOrdering();
  ncclResult_t r = ncclInternalError;
  if (roMode == 1 || roMode == 2) {
    // Query IBVERBS_1.8 API - needed for IBV_ACCESS_RELAXED_ORDERING support
    r = wrap_ibv_reg_mr_iova2(NULL, NULL, NULL, 0, 0, 0);
  }
  return r == ncclInternalError ? 0 : 1;
}

ncclResult_t ncclIbInit(ncclDebugLogger_t logFunction) {
  if (ncclParamIbDisable()) return ncclInternalError;
  static int shownIbHcaEnv = 0;
  if(wrap_ibv_symbols() != ncclSuccess) { return ncclInternalError; }

  if (ncclNIbDevs == -1) {
    pthread_mutex_lock(&ncclIbLock);
    wrap_ibv_fork_init();
    if (ncclNIbDevs == -1) {
      ncclNIbDevs = 0;
      if (ncclFindInterfaces(ncclIbIfName, &ncclIbIfAddr, MAX_IF_NAME_SIZE, 1) != 1) {
        WARN("NET/IB : No IP interface found.");
        return ncclInternalError;
      }

      // Detect IB cards
      int nIbDevs;
      struct ibv_device** devices;

      // Check if user defined which IB device:port to use
      char* userIbEnv = getenv("NCCL_IB_HCA");
      if (userIbEnv != NULL && shownIbHcaEnv++ == 0) INFO(NCCL_NET|NCCL_ENV, "NCCL_IB_HCA set to %s", userIbEnv);
      struct netIf userIfs[MAX_IB_DEVS];
      bool searchNot = userIbEnv && userIbEnv[0] == '^';
      if (searchNot) userIbEnv++;
      bool searchExact = userIbEnv && userIbEnv[0] == '=';
      if (searchExact) userIbEnv++;
      int nUserIfs = parseStringList(userIbEnv, userIfs, MAX_IB_DEVS);

      if (ncclSuccess != wrap_ibv_get_device_list(&devices, &nIbDevs)) return ncclInternalError;

      for (int d=0; d<nIbDevs && ncclNIbDevs<MAX_IB_DEVS; d++) {
        struct ibv_context * context;

        // Open the ib device
        if (ncclSuccess != wrap_ibv_open_device(&context, devices[d]) || context == NULL) {
          WARN("NET/IB : Unable to open device %s", devices[d]->name);
          continue;
        }
        int nPorts = 0;
        struct ibv_device_attr devAttr;
        memset(&devAttr, 0, sizeof(devAttr));
        
        // Query the ib device attributes
        if (ncclSuccess != wrap_ibv_query_device(context, &devAttr)) {
          WARN("NET/IB : Unable to query device %s", devices[d]->name);
          if (ncclSuccess != wrap_ibv_close_device(context)) { return ncclInternalError; }
          continue;
        }

        // Query the attributes of each port on the ib device
        for (int port = 1; port <= devAttr.phys_port_cnt; port++) {
          struct ibv_port_attr portAttr;
          if (ncclSuccess != wrap_ibv_query_port(context, port, &portAttr)) {
            WARN("NET/IB : Unable to query port %d", port);
            continue;
          }

          if (portAttr.state != IBV_PORT_ACTIVE) continue; // ports that are not active are skipped
          if (portAttr.link_layer != IBV_LINK_LAYER_INFINIBAND
              && portAttr.link_layer != IBV_LINK_LAYER_ETHERNET) continue;

          // check against user specified HCAs/ports
          if (! (matchIfList(devices[d]->name, port, userIfs, nUserIfs, searchExact) ^ searchNot)) {
            continue;
          }

          // NET/IB: [0] mlx5_2:0/RoCE
          TRACE(NCCL_INIT|NCCL_NET,"NET/IB: [%d] %s:%d/%s ", d, devices[d]->name, port,
              portAttr.link_layer == IBV_LINK_LAYER_INFINIBAND ? "IB" : "RoCE");

          pthread_mutex_init(&ncclIbDevs[ncclNIbDevs].lock, NULL);
          ncclIbDevs[ncclNIbDevs].device = d;
          ncclIbDevs[ncclNIbDevs].guid = devAttr.sys_image_guid;
          ncclIbDevs[ncclNIbDevs].port = port;
          ncclIbDevs[ncclNIbDevs].link = portAttr.link_layer;
          ncclIbDevs[ncclNIbDevs].speed = ncclIbSpeed(portAttr.active_speed) * ncclIbWidth(portAttr.active_width);
          ncclIbDevs[ncclNIbDevs].context = context;
          ncclIbDevs[ncclNIbDevs].pdRefs = 0;
          ncclIbDevs[ncclNIbDevs].pd = NULL;
          strncpy(ncclIbDevs[ncclNIbDevs].devName, devices[d]->name, MAXNAMESIZE);
          NCCLCHECK(ncclIbGetPciPath(ncclIbDevs[ncclNIbDevs].devName, &ncclIbDevs[ncclNIbDevs].pciPath, &ncclIbDevs[ncclNIbDevs].realPort));
          ncclIbDevs[ncclNIbDevs].maxQp = devAttr.max_qp;
          ncclIbDevs[ncclNIbDevs].mrCache.capacity = 0;
          ncclIbDevs[ncclNIbDevs].mrCache.population = 0;
          ncclIbDevs[ncclNIbDevs].mrCache.slots = NULL;

          // Once the ib device information is initialized,
          // start the ncclIbAsyncThreadMain thread to listen for ib device events
          pthread_create(&ncclIbAsyncThread, NULL, ncclIbAsyncThreadMain, context);
          ncclSetThreadName(ncclIbAsyncThread, "NCCL IbAsync %2d", ncclNIbDevs);
          pthread_detach(ncclIbAsyncThread); // will not be pthread_join()'d
          ncclNIbDevs++;
          nPorts++;
        }
        if (nPorts == 0 && ncclSuccess != wrap_ibv_close_device(context)) { return ncclInternalError; }
      }
      if (nIbDevs && (ncclSuccess != wrap_ibv_free_device_list(devices))) { return ncclInternalError; };
    }
    if (ncclNIbDevs == 0) {
      INFO(NCCL_INIT|NCCL_NET, "NET/IB : No device found.");
    } else {
      char line[1024];
      line[0] = '\0';
      // Determine whether RELAXED_ORDERING is enabled and possible
      ncclIbRelaxedOrderingEnabled = ncclIbRelaxedOrderingCapable();
      for (int d=0; d<ncclNIbDevs; d++) {
        snprintf(line+strlen(line), 1023-strlen(line), " [%d]%s:%d/%s", d, ncclIbDevs[d].devName,
            ncclIbDevs[d].port, ncclIbDevs[d].link == IBV_LINK_LAYER_INFINIBAND ? "IB" : "RoCE");
      }
      line[1023] = '\0';
      char addrline[SOCKET_NAME_MAXLEN+1];

      // OOB (out-of-band): as I understand it, the NIC used to help initialize ib communication
      INFO(NCCL_INIT|NCCL_NET, "NET/IB : Using%s %s; OOB %s:%s", line, ncclIbRelaxedOrderingEnabled ? "[RO]" : "",
           ncclIbIfName, ncclSocketToString(&ncclIbIfAddr, addrline));
    }
    pthread_mutex_unlock(&ncclIbLock);
  }
  return ncclSuccess;
}

ncclResult_t ncclIbDevices(int* ndev) {
  *ndev = ncclNIbDevs;
  return ncclSuccess;
}

// Detect whether GDR can work on a given NIC with the current CUDA device
// Returns :
// ncclSuccess : GDR works
// ncclSystemError : no module or module loaded but not supported by GPU
ncclResult_t ncclIbGdrSupport(int ibDev) {
  static int moduleLoaded = -1;
  if (moduleLoaded == -1) {
    // Either nv_mem or nvidia-peermem can enable GDR

    // Check for the nv_peer_mem module being loaded
    moduleLoaded = ((access("/sys/kernel/mm/memory_peers/nv_mem/version", F_OK) == -1) &&
                    // Also support the new nvidia-peermem module
                    (access("/sys/kernel/mm/memory_peers/nvidia-peermem/version", F_OK) == -1)) ? 0 : 1;
  }
  if (moduleLoaded == 0) return ncclSystemError;
  return ncclSuccess;
}

// Detect whether DMA-BUF support is present in the kernel
// Returns :
// ncclSuccess : DMA-BUF support is available
// ncclSystemError : DMA-BUF is not supported by the kernel
ncclResult_t ncclIbDmaBufSupport(int dev) {
  static int dmaBufSupported = -1;
  if (dmaBufSupported == -1) {
    ncclResult_t res;
    struct ibv_pd* pd;
    struct ibv_context* ctx;
    ctx = ncclIbDevs[dev].context;
    NCCLCHECKGOTO(wrap_ibv_alloc_pd(&pd, ctx), res, failure);
    // Test kernel DMA-BUF support with a dummy call (fd=-1)
    (void) wrap_direct_ibv_reg_dmabuf_mr(pd, 0ULL/*offset*/, 0ULL/*len*/, 0ULL/*iova*/, -1/*fd*/, 0/*flags*/);
    // ibv_reg_dmabuf_mr() will fail with EOPNOTSUPP/EPROTONOSUPPORT if not supported (EBADF otherwise)
    dmaBufSupported = (errno != EOPNOTSUPP && errno != EPROTONOSUPPORT) ? 1 : 0;
    NCCLCHECKGOTO(wrap_ibv_dealloc_pd(pd), res, failure);
  }
  if (dmaBufSupported == 0) return ncclSystemError;
  return ncclSuccess;
failure:
  dmaBufSupported = 0;
  return ncclSystemError;
}

static ncclResult_t GetSocketAddr(union ncclSocketAddress* addr) {
  memcpy(addr, &ncclIbIfAddr, sizeof(*addr));
  return ncclSuccess;
}

#define NCCL_NET_IB_MAX_RECVS 8

ncclResult_t ncclIbGetProperties(int dev, ncclNetProperties_t* props) {
  props->name = ncclIbDevs[dev].devName;
  props->pciPath = ncclIbDevs[dev].pciPath;
  props->guid = ncclIbDevs[dev].guid;
  props->ptrSupport = NCCL_PTR_HOST;
  if (ncclIbGdrSupport(dev) == ncclSuccess) {
    props->ptrSupport |= NCCL_PTR_CUDA; // GDR support via nv_peermem
  }
  if (ncclIbDmaBufSupport(dev) == ncclSuccess) {
    props->ptrSupport |= NCCL_PTR_DMABUF; // GDR support via DMA-BUF
  }
  props->speed = ncclIbDevs[dev].speed;
  props->latency = 0; // Not set
  props->port = ncclIbDevs[dev].port + ncclIbDevs[dev].realPort;
  props->maxComms = ncclIbDevs[dev].maxQp;
  props->maxRecvs = NCCL_NET_IB_MAX_RECVS;
  return ncclSuccess;
}

// We need to support NCCL_NET_MAX_REQUESTS for each concurrent receive
#define MAX_REQUESTS (NCCL_NET_MAX_REQUESTS*NCCL_NET_IB_MAX_RECVS)
static_assert(MAX_REQUESTS <= 256, "request id are encoded in wr_id and we need up to 8 requests ids per completion");

#define NCCL_IB_MAX_QPS 128

// Information needed to set up rdma (infiniband), e.g. the queue pair numbers
struct ncclIbQpInfo {
  uint32_t lid;
  uint8_t ib_port;
  uint8_t link_layer;
  uint32_t qpn[NCCL_IB_MAX_QPS];

  // For RoCE
  uint64_t spn;
  uint64_t iid;
  enum ibv_mtu mtu;

  // FIFO RDMA info
  uint32_t fifoRkey;
  uint64_t fifoAddr;
};

enum ncclIbCommState {
  ncclIbCommStateStart = 0,
  ncclIbCommStateConnect = 1,
  ncclIbCommStateAccept = 3,
  ncclIbCommStateSend = 4,
  ncclIbCommStateRecv = 5,
  ncclIbCommStateConnected = 6,
};

struct ncclIbCommStage {
  enum ncclIbCommState state;
  int offset;
  void* buffer;
  void* comm;
};

struct ncclIbHandle {
  union ncclSocketAddress connectAddr; // Filled by the target
  struct ncclIbCommStage stage; // Used by the other side when connecting
};

#define NCCL_NET_IB_REQ_UNUSED 0
#define NCCL_NET_IB_REQ_SEND 1
#define NCCL_NET_IB_REQ_RECV 2
#define NCCL_NET_IB_REQ_FLUSH 3

// request
struct ncclIbRequest {
  struct ncclIbVerbs* verbs;
  int type;
  int events;
  union ncclSocketAddress *addr;
  int nreqs;
  union {
    struct {
      int size;
      void* data;
      uint32_t lkey;
      int offset;
    } send;
    struct {
      int sizes[NCCL_NET_IB_MAX_RECVS];
    } recv;
  };
};

struct ncclIbVerbs {
  // device index
  int dev;

  // protection domain in rdma (infiniband)
  struct ibv_pd* pd; // duplicate of ncclIbDevs[dev].pd

  // completion queue in rdma (infiniband)
  struct ibv_cq* cq;

  uint64_t pad[1];

  // ib request
  struct ncclIbRequest reqs[MAX_REQUESTS];
};

struct ncclIbListenComm {
  int dev;
  struct ncclSocket sock;
  struct ncclIbCommStage stage;
};

// ncclIbSendFifo
struct ncclIbSendFifo {
  uint64_t addr; // remote mem addr
  int      size; // data size
  uint32_t rkey; // remote key, used for rdma

  uint32_t nreqs;
  uint32_t tag;
  uint64_t idx;
};


struct ncclIbSendComm {
  struct ncclIbVerbs verbs;
  struct ncclIbSendFifo fifo[MAX_REQUESTS][NCCL_NET_IB_MAX_RECVS];
  uint64_t fifoHead;
  struct ncclIbRequest* fifoReqs[MAX_REQUESTS][NCCL_NET_IB_MAX_RECVS];
  struct ibv_send_wr wrs[NCCL_NET_IB_MAX_RECVS+1];
  struct ibv_sge sges[NCCL_NET_IB_MAX_RECVS];
  struct ncclSocket sock;

  int ready;
  struct ibv_qp* qps[NCCL_IB_MAX_QPS];
  int nqps;
  struct ibv_mr* fifoMr;
};
// The SendFifo needs to be 32-byte aligned and each element needs
// to be a 32-byte multiple, so that an entry does not get split and
// written out of order when IB Relaxed Ordering is enabled
static_assert((offsetof(struct ncclIbSendComm, fifo) % 32) == 0, "ncclIbSendComm fifo must be 32-byte aligned");
static_assert((sizeof(struct ncclIbSendFifo) % 32) == 0, "ncclIbSendFifo element size must be 32-byte multiples");

struct ncclIbGpuFlush {
  int enabled;
  int hostMem;
  struct ibv_mr* hostMr;
  struct ibv_sge sge;
  struct ibv_qp* qp;
};

// Remote fifo queue; its items are ncclIbSendFifo
struct ncclIbRemFifo {
  
  struct ncclIbSendFifo elems[MAX_REQUESTS][NCCL_NET_IB_MAX_RECVS];

  // tail of the queue
  uint64_t fifoTail;

  uint64_t addr;
  uint32_t rkey;
  uint32_t flags;
  struct ibv_mr* mr;
  struct ibv_sge sge;
};

struct ncclIbRecvComm {
  struct ncclIbVerbs verbs;
  struct ncclIbRemFifo remFifo;
  struct ncclSocket sock;
  int ready;
  struct ibv_qp* qps[NCCL_IB_MAX_QPS];
  int nqps;
  struct ncclIbGpuFlush gpuFlush;
};
static_assert((offsetof(struct ncclIbRecvComm, remFifo) % 32) == 0, "ncclIbSendComm fifo must be 32-byte aligned");

NCCL_PARAM(IbQpsPerConn, "IB_QPS_PER_CONNECTION", 1);

ncclResult_t ncclIbInitVerbs(int dev, struct ibv_context* ctx, struct ncclIbVerbs* verbs) {
  verbs->dev = dev;

  pthread_mutex_lock(&ncclIbDevs[dev].lock);
  if (0 == ncclIbDevs[dev].pdRefs++) {
    ncclResult_t res;
    // Allocate the protection domain (pd)
    NCCLCHECKGOTO(wrap_ibv_alloc_pd(&ncclIbDevs[dev].pd, ctx), res, failure);
    if (0) {
    failure:
      pthread_mutex_unlock(&ncclIbDevs[dev].lock);
      return res;
    }
  }
  verbs->pd = ncclIbDevs[dev].pd;
  pthread_mutex_unlock(&ncclIbDevs[dev].lock);

  // Create the completion queue
  // Recv requests can generate 2 completions (one for the post FIFO, one for the Recv).
  NCCLCHECK(wrap_ibv_create_cq(&verbs->cq, ctx, 2*MAX_REQUESTS*ncclParamIbQpsPerConn(), NULL, NULL, 0));
  return ncclSuccess;
}

ncclResult_t ncclIbDestroyVerbs(struct ncclIbVerbs* verbs) {
  ncclResult_t res;
  NCCLCHECK(wrap_ibv_destroy_cq(verbs->cq));

  pthread_mutex_lock(&ncclIbDevs[verbs->dev].lock);
  if (0 == --ncclIbDevs[verbs->dev].pdRefs) {
    NCCLCHECKGOTO(wrap_ibv_dealloc_pd(ncclIbDevs[verbs->dev].pd), res, returning);
  }
  res = ncclSuccess;
returning:
  pthread_mutex_unlock(&ncclIbDevs[verbs->dev].lock);
  return res;
}

// Create a QP, used to send and receive data.
// The CQ is attached to the QP through send_cq/recv_cq.
ncclResult_t ncclIbCreateQp(uint8_t ib_port, struct ncclIbVerbs* verbs, int access_flags, struct ibv_qp** qp) {
  struct ibv_qp_init_attr qpInitAttr;
  memset(&qpInitAttr, 0, sizeof(struct ibv_qp_init_attr));
  qpInitAttr.send_cq = verbs->cq;
  qpInitAttr.recv_cq = verbs->cq;

  // Note: NCCL uses the RC (Reliable Connection) transport here
  qpInitAttr.qp_type = IBV_QPT_RC;

  // We might send 2 messages per send (RDMA and RDMA_WITH_IMM)
  qpInitAttr.cap.max_send_wr = 2*MAX_REQUESTS;
  qpInitAttr.cap.max_recv_wr = MAX_REQUESTS;
  qpInitAttr.cap.max_send_sge = 1;
  qpInitAttr.cap.max_recv_sge = 1;
  qpInitAttr.cap.max_inline_data = ncclParamIbUseInline() ? sizeof(struct ncclIbSendFifo) : 0;

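  // Create the QP inside the PD (protection domain).
  // The PD scopes QPs and MRs; the CQ is created on the device context and is tied to the QP via send_cq/recv_cq above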
  NCCLCHECK(wrap_ibv_create_qp(qp, verbs->pd, &qpInitAttr));

  struct ibv_qp_attr qpAttr;
  memset(&qpAttr, 0, sizeof(struct ibv_qp_attr));
  // Note: the QP is moved to the INIT state here
  qpAttr.qp_state = IBV_QPS_INIT;
  qpAttr.pkey_index = ncclParamIbPkey();
  qpAttr.port_num = ib_port;
  qpAttr.qp_access_flags = access_flags;

  // See the QP state transition diagram (RESET -> INIT -> RTR -> RTS)
  NCCLCHECK(wrap_ibv_modify_qp(*qp, &qpAttr, IBV_QP_STATE | IBV_QP_PKEY_INDEX | IBV_QP_PORT | IBV_QP_ACCESS_FLAGS));
  return ncclSuccess;
}

// QP state transition: INIT -> RTR (ready to receive)
// Environment variables that matter when the QP receives data:
// NCCL_IB_GID_INDEX, RoCE only
// NCCL_IB_TC, RoCE only
// NCCL_IB_SL
ncclResult_t ncclIbRtrQp(struct ibv_qp* qp, uint32_t qpn, struct ncclIbQpInfo* info) {
  struct ibv_qp_attr qpAttr;
  memset(&qpAttr, 0, sizeof(struct ibv_qp_attr));
  qpAttr.qp_state = IBV_QPS_RTR;
  qpAttr.path_mtu = info->mtu;
  qpAttr.dest_qp_num = qpn;
  qpAttr.rq_psn = 0;
  qpAttr.max_dest_rd_atomic = 1;
  qpAttr.min_rnr_timer = 12;

  // On RoCE, the following two environment variables take effect:
  // NCCL_IB_GID_INDEX
  // NCCL_IB_TC
  if (info->link_layer == IBV_LINK_LAYER_ETHERNET) {
    qpAttr.ah_attr.is_global = 1;
    qpAttr.ah_attr.grh.dgid.global.subnet_prefix = info->spn;
    qpAttr.ah_attr.grh.dgid.global.interface_id = info->iid;
    qpAttr.ah_attr.grh.flow_label = 0;
    qpAttr.ah_attr.grh.sgid_index = ncclParamIbGidIndex();
    qpAttr.ah_attr.grh.hop_limit = 255;
    qpAttr.ah_attr.grh.traffic_class = ncclParamIbTc();
  } else {
    qpAttr.ah_attr.is_global = 0;
    qpAttr.ah_attr.dlid = info->lid;
  }

  // NCCL_IB_SL applies to both InfiniBand and Ethernet (RoCE)
  qpAttr.ah_attr.sl = ncclParamIbSl();
  qpAttr.ah_attr.src_path_bits = 0;
  qpAttr.ah_attr.port_num = info->ib_port;

  NCCLCHECK(wrap_ibv_modify_qp(qp, &qpAttr, IBV_QP_STATE | IBV_QP_AV | IBV_QP_PATH_MTU | IBV_QP_DEST_QPN | IBV_QP_RQ_PSN | IBV_QP_MAX_DEST_RD_ATOMIC | IBV_QP_MIN_RNR_TIMER));
  return ncclSuccess;
}

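// QP state transition: RTR -> RTS (ready to send)
// Environment variables that matter when the QP sends data:
// NCCL_IB_TIMEOUT: note that the default is 18, i.e. a timeout of 4.096 usec * 2^18 = 1073742 usec (1.07 sec). On a RoCE network that drops packets, tuning this value has a large effect on communication performance; for example, 25 gives 4.096 usec * 2^25, roughly 137000000 usec (137 sec, more than 2 minutes).
// NCCL_IB_RETRY_CNT: note that the default is 7. "A 3 bits value of the total number of times that the QP will try to resend the packets before reporting an error because the remote side doesn't answer in the primary path"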
ncclResult_t ncclIbRtsQp(struct ibv_qp* qp) {
  struct ibv_qp_attr qpAttr;
  memset(&qpAttr, 0, sizeof(struct ibv_qp_attr));
  qpAttr.qp_state = IBV_QPS_RTS;
  qpAttr.timeout = ncclParamIbTimeout();
  qpAttr.retry_cnt = ncclParamIbRetryCnt();
  qpAttr.rnr_retry = 7; // rnr retry infinite
  qpAttr.sq_psn = 0;
  qpAttr.max_rd_atomic = 1;
  NCCLCHECK(wrap_ibv_modify_qp(qp, &qpAttr, IBV_QP_STATE | IBV_QP_TIMEOUT | IBV_QP_RETRY_CNT | IBV_QP_RNR_RETRY | IBV_QP_SQ_PSN | IBV_QP_MAX_QP_RD_ATOMIC));
  return ncclSuccess;
}
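
// Illustrative sketch (not part of NCCL): how the 5-bit QP `timeout` value
// discussed above maps to wall-clock time. The local ACK timeout is
// 4.096 usec * 2^timeout, and a value of 0 means the timer is disabled.
// `exampleIbAckTimeoutUsec` is a hypothetical helper name used only here.
#include <cstdint>

static inline double exampleIbAckTimeoutUsec(uint8_t timeout) {
  timeout &= 0x1f;                 // only the low 5 bits are meaningful
  if (timeout == 0) return 0.0;    // 0 = no timeout; reported as 0 in this sketch
  return 4.096 * (double)(1ULL << timeout);
}
// exampleIbAckTimeoutUsec(18) is about 1.07e6 usec (1.07 sec);
// exampleIbAckTimeoutUsec(25) is about 1.37e8 usec (137 sec).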

// Socket communication is used to help IB complete its initialization,
// so a socket server has to be started
ncclResult_t ncclIbListen(int dev, void* opaqueHandle, void** listenComm) {
  struct ncclIbListenComm* comm;
  NCCLCHECK(ncclCalloc(&comm, 1));
  struct ncclIbHandle* handle = (struct ncclIbHandle*) opaqueHandle;
  static_assert(sizeof(struct ncclIbHandle) < NCCL_NET_HANDLE_MAXSIZE, "ncclIbHandle size too large");
  memset(handle, 0, sizeof(struct ncclIbHandle));
  comm->dev = dev;
  comm->sock.asyncFlag = 1; /* nonblocking socket is required by network communication. */
  NCCLCHECK(GetSocketAddr(&comm->sock.addr));
  NCCLCHECK(ncclSocketListen(&comm->sock));
  memcpy(&handle->connectAddr, &comm->sock.addr, sizeof(union ncclSocketAddress));
  *listenComm = comm;
  return ncclSuccess;
}

// IB connect
// Send the local IB information to the peer over the socket
ncclResult_t ncclIbConnect(int dev, void* opaqueHandle, void** sendComm) {
  struct ncclIbHandle* handle = (struct ncclIbHandle*) opaqueHandle;
  enum ncclSocketState conState;
  struct ncclIbCommStage* stage = &handle->stage;
  struct ncclIbSendComm* comm = (struct ncclIbSendComm*)stage->comm;
  *sendComm = NULL;

  if (stage->state == ncclIbCommStateConnect) goto ib_connect_check;
  if (stage->state == ncclIbCommStateSend) goto ib_send;
  if (stage->state != ncclIbCommStateStart) {
    WARN("Error: trying to connect already connected sendComm");
    return ncclInternalError;
  }

  NCCLCHECK(ncclIbMalloc((void**)&comm, sizeof(struct ncclIbSendComm)));
  NCCLCHECK(ncclSocketInit(&comm->sock, &handle->connectAddr, NULL, 1));
  stage->comm = comm;
  stage->state = ncclIbCommStateConnect;
  NCCLCHECK(ncclSocketConnect(&comm->sock));

ib_connect_check:
  /* since ncclSocketConnect is async, we must check if connection is complete */
  NCCLCHECK(ncclGetSocketState(&comm->sock, &conState));
  if (conState == ncclSocketConnecting) {
    /* expect user to call again */
    return ncclSuccess;
  } else if (conState == ncclSocketError) {
    return ncclRemoteError;
  }

  // IB Setup
  struct ibv_context* ctx;
  ctx = ncclIbDevs[dev].context;
  NCCLCHECK(ncclIbInitVerbs(dev, ctx, &comm->verbs));
  uint8_t ib_port;
  ib_port = ncclIbDevs[dev].port;
  comm->nqps = ncclParamIbQpsPerConn();

  // One QP is created by default (NCCL_IB_QPS_PER_CONNECTION defaults to 1)
  for (int q=0; q<comm->nqps; q++) {
    NCCLCHECK(ncclIbCreateQp(ib_port, &comm->verbs, IBV_ACCESS_REMOTE_WRITE, comm->qps+q));
  }

  // Send my QP Info to receiver through the socket. Hope this won't block.
  struct ibv_port_attr portAttr;
  NCCLCHECK(wrap_ibv_query_port(ctx, ib_port, &portAttr));
  struct ncclIbQpInfo qpInfo;
  qpInfo.ib_port = ib_port;
  for (int q=0; q<comm->nqps; q++) qpInfo.qpn[q] = comm->qps[q]->qp_num;
  qpInfo.mtu = portAttr.active_mtu;

  // Prepare my fifo
  NCCLCHECK(wrap_ibv_reg_mr(&comm->fifoMr, comm->verbs.pd, comm->fifo, sizeof(struct ncclIbSendFifo)*MAX_REQUESTS*NCCL_NET_IB_MAX_RECVS, IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_REMOTE_READ));
  qpInfo.fifoRkey = comm->fifoMr->rkey;
  qpInfo.fifoAddr = (uint64_t)comm->fifo;

  // RoCE support
  qpInfo.lid = portAttr.lid;
  qpInfo.link_layer = portAttr.link_layer;
  if (qpInfo.link_layer == IBV_LINK_LAYER_INFINIBAND) { // IB
    for (int q=0; q<comm->nqps; q++)
      INFO(NCCL_NET,"NET/IB: Dev %d Port %d qpn %d mtu %d LID %d", dev, ib_port, qpInfo.qpn[q], qpInfo.mtu, qpInfo.lid);
  } else { // RoCE
    union ibv_gid gid;
    NCCLCHECK(wrap_ibv_query_gid(ctx, ib_port, ncclParamIbGidIndex(), &gid));
    qpInfo.spn = gid.global.subnet_prefix;
    qpInfo.iid = gid.global.interface_id;
    for (int q=0; q<comm->nqps; q++)
      INFO(NCCL_NET,"NET/IB: Dev %d Port %d qpn %d mtu %d GID %ld (%lX/%lX)", dev, ib_port, qpInfo.qpn[q], qpInfo.mtu, ncclParamIbGidIndex(), qpInfo.spn, qpInfo.iid);
  }

  stage->state = ncclIbCommStateSend;
  stage->offset = 0;
  NCCLCHECK(ncclIbMalloc((void**)&stage->buffer, sizeof(qpInfo)));
  memcpy(stage->buffer, &qpInfo, sizeof(qpInfo));

ib_send:
  NCCLCHECK(ncclSocketProgress(NCCL_SOCKET_SEND, &comm->sock, stage->buffer, sizeof(qpInfo), &stage->offset));
  if (stage->offset != sizeof(qpInfo))
    return ncclSuccess;

  free(stage->buffer);
  stage->state = ncclIbCommStateConnected;
  *sendComm = comm;
  return ncclSuccess;
}

NCCL_PARAM(IbGdrFlushDisable, "GDR_FLUSH_DISABLE", 0);

// Receive the peer's IB information
ncclResult_t ncclIbAccept(void* listenComm, void** recvComm) {
  struct ncclIbListenComm* lComm = (struct ncclIbListenComm*)listenComm;
  struct ncclIbCommStage* stage = &lComm->stage;
  struct ncclIbRecvComm* rComm = (struct ncclIbRecvComm*)stage->comm;
  *recvComm = NULL;

  if (stage->state == ncclIbCommStateAccept) goto ib_accept;
  if (stage->state == ncclIbCommStateRecv) goto ib_recv;
  if (stage->state == ncclIbCommStateSend) goto ib_send;
  if (stage->state != ncclIbCommStateStart) {
    WARN("Listencomm in unknown state %d\n", stage->state);
    return ncclInternalError;
  }

  NCCLCHECK(ncclIbMalloc((void**)&rComm, sizeof(struct ncclIbRecvComm)));
  stage->comm = rComm;
  stage->state = ncclIbCommStateAccept;
  NCCLCHECK(ncclSocketInit(&rComm->sock, NULL, lComm->sock.abortFlag, 1));

ib_accept:
  NCCLCHECK(ncclSocketAccept(&rComm->sock, &lComm->sock));
  if (rComm->sock.fd == -1)
    return ncclSuccess;

  struct ncclIbQpInfo remQpInfo;
  stage->state = ncclIbCommStateRecv;
  stage->offset = 0;
  NCCLCHECK(ncclIbMalloc((void**)&stage->buffer, sizeof(remQpInfo)));
ib_recv:
  NCCLCHECK(ncclSocketProgress(NCCL_SOCKET_RECV, &rComm->sock, stage->buffer, sizeof(remQpInfo), &stage->offset));
  if (stage->offset != sizeof(remQpInfo))
    return ncclSuccess;

  /* copy back the received info */
  memcpy(&remQpInfo, stage->buffer, sizeof(struct ncclIbQpInfo));

  // IB setup
  struct ibv_context* ctx;
  uint8_t ib_port;
  ctx = ncclIbDevs[lComm->dev].context;
  ib_port = ncclIbDevs[lComm->dev].port;
  struct ibv_port_attr portAttr;
  NCCLCHECK(wrap_ibv_query_port(ctx, ib_port, &portAttr));
  union ibv_gid gid;
  NCCLCHECK(wrap_ibv_query_gid(ctx, ib_port, ncclParamIbGidIndex(), &gid));

  // QP Creation
  NCCLCHECK(ncclIbInitVerbs(lComm->dev, ctx, &rComm->verbs));
  rComm->nqps = ncclParamIbQpsPerConn();
  for (int q=0; q<rComm->nqps; q++) {
    NCCLCHECK(ncclIbCreateQp(ib_port, &rComm->verbs, IBV_ACCESS_REMOTE_WRITE, rComm->qps+q));
  }

  // Adjust the MTU
  remQpInfo.mtu = (enum ibv_mtu)std::min(remQpInfo.mtu, portAttr.active_mtu);

  // Setup QP
  for (int q=0; q<rComm->nqps; q++) {
    struct ibv_qp* qp = rComm->qps[q];
    NCCLCHECK(ncclIbRtrQp(qp, remQpInfo.qpn[q], &remQpInfo));
    NCCLCHECK(ncclIbRtsQp(qp));
  }

  // Retain remote fifo info and prepare my RDMA ops
  rComm->remFifo.rkey = remQpInfo.fifoRkey;
  rComm->remFifo.addr = remQpInfo.fifoAddr;
  NCCLCHECK(wrap_ibv_reg_mr(&rComm->remFifo.mr, rComm->verbs.pd, &rComm->remFifo.elems, sizeof(struct ncclIbSendFifo)*MAX_REQUESTS*NCCL_NET_IB_MAX_RECVS, IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_READ));
  rComm->remFifo.sge.lkey = rComm->remFifo.mr->lkey;
  if (ncclParamIbUseInline()) rComm->remFifo.flags = IBV_SEND_INLINE;

  // If GDR flush is enabled, create a dedicated QP for the GDR flush traffic
  // Allocate Flush dummy buffer for GPU Direct RDMA
  rComm->gpuFlush.enabled = (ncclIbGdrSupport(lComm->dev) == 0) && (ncclParamIbGdrFlushDisable() == 0) ? 1 : 0;
  if (rComm->gpuFlush.enabled) {
    NCCLCHECK(wrap_ibv_reg_mr(&rComm->gpuFlush.hostMr, rComm->verbs.pd, &rComm->gpuFlush.hostMem, sizeof(int), IBV_ACCESS_LOCAL_WRITE));
    rComm->gpuFlush.sge.addr = (uint64_t)&rComm->gpuFlush.hostMem;
    rComm->gpuFlush.sge.length = 1;
    rComm->gpuFlush.sge.lkey = rComm->gpuFlush.hostMr->lkey;
    NCCLCHECK(ncclIbCreateQp(ib_port, &rComm->verbs, IBV_ACCESS_LOCAL_WRITE | IBV_ACCESS_REMOTE_READ, &rComm->gpuFlush.qp));
    struct ncclIbQpInfo localQpInfo;
    localQpInfo.lid=portAttr.lid;
    localQpInfo.link_layer=portAttr.link_layer;
    localQpInfo.ib_port=ib_port;
    localQpInfo.spn=gid.global.subnet_prefix;
    localQpInfo.iid=gid.global.interface_id;
    localQpInfo.mtu=portAttr.active_mtu;
    NCCLCHECK(ncclIbRtrQp(rComm->gpuFlush.qp, rComm->gpuFlush.qp->qp_num, &localQpInfo));
    NCCLCHECK(ncclIbRtsQp(rComm->gpuFlush.qp));
  }

  // Fill Handle
  struct ncclIbQpInfo qpInfo;
  qpInfo.lid=portAttr.lid;
  qpInfo.link_layer=portAttr.link_layer;
  qpInfo.ib_port=ib_port;
  for (int q=0; q<rComm->nqps; q++) qpInfo.qpn[q]=rComm->qps[q]->qp_num;
  qpInfo.spn=gid.global.subnet_prefix;
  qpInfo.iid=gid.global.interface_id;
  qpInfo.mtu=remQpInfo.mtu;

  stage->state = ncclIbCommStateSend;
  stage->offset = 0;
  if (stage->buffer) free(stage->buffer);
  NCCLCHECK(ncclIbMalloc((void**)&stage->buffer, sizeof(struct ncclIbQpInfo)));
  memcpy(stage->buffer, &qpInfo, sizeof(struct ncclIbQpInfo));
ib_send:
  NCCLCHECK(ncclSocketProgress(NCCL_SOCKET_SEND, &rComm->sock, stage->buffer, sizeof(struct ncclIbQpInfo), &stage->offset));
  if (stage->offset < sizeof(struct ncclIbQpInfo)) return ncclSuccess;

  free(stage->buffer);
  *recvComm = rComm;

  /* reset lComm stage */
  stage->state = ncclIbCommStateStart;
  stage->offset = 0;
  stage->comm = NULL;
  stage->buffer = NULL;
  return ncclSuccess;
}

// Get a request to use for communication
ncclResult_t ncclIbGetRequest(struct ncclIbVerbs* verbs, struct ncclIbRequest** req) {
  for (int i=0; i<MAX_REQUESTS; i++) {
    struct ncclIbRequest* r = verbs->reqs+i;
    if (r->type == NCCL_NET_IB_REQ_UNUSED) {
      r->verbs = verbs;
      r->events = 1;
      r->addr = NULL;
      *req = r;
      return ncclSuccess;
    }
  }
  WARN("NET/IB : unable to allocate requests");
  *req = NULL;
  return ncclInternalError;
}

// Return the request to the pool
ncclResult_t ncclIbFreeRequest(struct ncclIbRequest* r) {
  r->type = NCCL_NET_IB_REQ_UNUSED;
  return ncclSuccess;
}

// Make sure the QPs are set up before any IB communication
ncclResult_t ncclSendCheck(struct ncclIbSendComm* comm) {
  struct ncclIbQpInfo remQpInfo;

  // Receive the remote QP info (remQpInfo stands for remote QP info)
  // Do not block on this receive, return if not ready.
  int bytes = 0;
  NCCLCHECK(ncclSocketProgress(NCCL_SOCKET_RECV, &comm->sock, &remQpInfo, sizeof(remQpInfo), &bytes));
  if (bytes == 0) return ncclSuccess; // Try again later

  NCCLCHECK(ncclSocketWait(NCCL_SOCKET_RECV, &comm->sock, &remQpInfo, sizeof(remQpInfo), &bytes));

  // Update the local QP attributes and drive the QP state transitions,
  // i.e. fill in the address of the peer QP that the local QP will talk to
  for (int q=0; q<comm->nqps; q++) {
    struct ibv_qp* qp = comm->qps[q];
    // So if the RoCE NIC network is unreachable at this point (the RoCE NIC's IP address cannot be pinged), one of the common errors shows up: ibv_modify_qp failed with error Connection timed out
    NCCLCHECK(ncclIbRtrQp(qp, remQpInfo.qpn[q], &remQpInfo));
    NCCLCHECK(ncclIbRtsQp(qp));
  }
  comm->ready = 1;

  // Notify the peer over the socket that the QPs are ready
  // Block until this is done. It *should* not block indefinitely.
  NCCLCHECK(ncclSocketSend(&comm->sock, &comm->ready, sizeof(int)));

  return ncclSuccess;
}

// Receive the peer's ready status,
// i.e. wait for the peer's comm->ready
ncclResult_t ncclRecvCheck(struct ncclIbRecvComm* comm) {
  // Do not block on this receive, return if not ready.
  int bytes = 0;
  NCCLCHECK(ncclSocketProgress(NCCL_SOCKET_RECV, &comm->sock, &comm->ready, sizeof(int), &bytes));
  if (bytes == 0) return ncclSuccess; // Try again later

  NCCLCHECK(ncclSocketWait(NCCL_SOCKET_RECV, &comm->sock, &comm->ready, sizeof(int), &bytes));
  return ncclSuccess;
}

// The placement of this forward declaration is a bit abrupt ...
ncclResult_t ncclIbTest(void* request, int* done, int* size);

/* DMA-BUF support */
ncclResult_t ncclIbRegMrDmaBuf(void* comm, void* data, size_t size, int type, uint64_t offset, int fd, void** mhandle) {
  static_assert(offsetof(struct ncclIbSendComm, verbs) == offsetof(struct ncclIbRecvComm, verbs), "Send and recv comms must have verbs at the same offset");
  assert(size > 0);

  static __thread uintptr_t pageSize = 0;
  if (pageSize == 0) pageSize = sysconf(_SC_PAGESIZE);

  struct ncclIbVerbs* verbs = (struct ncclIbVerbs*)comm;
  struct ncclIbMrCache* cache = &ncclIbDevs[verbs->dev].mrCache;
  uintptr_t addr = (uintptr_t)data & -pageSize;
  size_t pages = ((uintptr_t)data + size - addr + pageSize-1)/pageSize;
  ncclResult_t res;
  pthread_mutex_lock(&ncclIbDevs[verbs->dev].lock);
  for (int slot=0; /*true*/; slot++) {
    if (slot == cache->population) { // didn't find in cache
      if (cache->population == cache->capacity) { // must grow cache
        cache->capacity = cache->capacity < 32 ? 32 : 2*cache->capacity;
        NCCLCHECKGOTO(ncclRealloc(&cache->slots, cache->population, cache->capacity), res, returning);
      }
      // Deregister / register
      struct ibv_mr* mr;
      unsigned int flags = IBV_ACCESS_LOCAL_WRITE|IBV_ACCESS_REMOTE_WRITE|IBV_ACCESS_REMOTE_READ;
      if (ncclIbRelaxedOrderingEnabled) flags |= IBV_ACCESS_RELAXED_ORDERING;
      if (fd != -1) {
        /* DMA-BUF support */
        NCCLCHECKGOTO(wrap_ibv_reg_dmabuf_mr(&mr, verbs->pd, offset, pages*pageSize, addr, fd, flags), res, returning);
      } else {
        if (ncclIbRelaxedOrderingEnabled) {
          // Use IBVERBS_1.8 API - needed for IBV_ACCESS_RELAXED_ORDERING support
          NCCLCHECKGOTO(wrap_ibv_reg_mr_iova2(&mr, verbs->pd, (void*)addr, pages*pageSize, addr, flags), res, returning);
        }
        else {
          NCCLCHECKGOTO(wrap_ibv_reg_mr(&mr, verbs->pd, (void*)addr, pages*pageSize, flags), res, returning);
        }
      }
      TRACE(NCCL_INIT,"regAddr %llx size %lld rkey %x fd %d", (unsigned long long)addr, (long long)pages*pageSize, mr->rkey, fd);
      cache->population += 1;
      cache->slots[slot].addr = addr;
      cache->slots[slot].pages = pages;
      cache->slots[slot].refs = 1;
      cache->slots[slot].mr = mr;
      *mhandle = (void*)mr;
      res = ncclSuccess;
      goto returning;
    }
    else if (cache->slots[slot].addr == addr && cache->slots[slot].pages == pages) {
      cache->slots[slot].refs += 1;
      *mhandle = (void*)cache->slots[slot].mr;
      res = ncclSuccess;
      goto returning;
    }
  }
returning:
  pthread_mutex_unlock(&ncclIbDevs[verbs->dev].lock);
  return res;
}
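
// Illustrative sketch (not part of NCCL): the page rounding used by
// ncclIbRegMrDmaBuf above, pulled out as a pure function. The struct and
// function names are hypothetical; pageSize is assumed to be a power of two.
#include <cstddef>
#include <cstdint>

struct ExamplePageRange {
  uintptr_t addr;  // start address rounded down to a page boundary
  size_t pages;    // number of pages covering [data, data+size)
};

static inline ExamplePageRange examplePageRange(uintptr_t data, size_t size, uintptr_t pageSize) {
  ExamplePageRange r;
  r.addr = data & -pageSize;  // same mask trick as above: clears the low bits
  r.pages = (data + size - r.addr + pageSize - 1) / pageSize;
  return r;
}
// Example: data=0x1234, size=0x2000, pageSize=4096 gives addr=0x1000, pages=3.
// Two buffers that round to the same (addr, pages) pair share one cache slot.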

// pin (lock) memory
// A common error here: the default memlock limit inside a docker container is small, which makes ibv_reg_mr fail
// https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html#infiniband
ncclResult_t ncclIbRegMr(void* comm, void* data, int size, int type, void** mhandle) {
  return ncclIbRegMrDmaBuf(comm, data, (size_t)size, type, 0ULL, -1, mhandle);
}

ncclResult_t ncclIbDeregMr(void* comm, void* mhandle) {
  struct ncclIbVerbs* verbs = (struct ncclIbVerbs*)comm;
  struct ncclIbMrCache* cache = &ncclIbDevs[verbs->dev].mrCache;
  ncclResult_t res;
  pthread_mutex_lock(&ncclIbDevs[verbs->dev].lock);
  for (int i=0; i < cache->population; i++) {
    if (mhandle == cache->slots[i].mr) {
      if (0 == --cache->slots[i].refs) {
        memmove(&cache->slots[i], &cache->slots[--cache->population], sizeof(struct ncclIbMr));
        if (cache->population == 0) {
          free(cache->slots);
          cache->slots = NULL;
          cache->capacity = 0;
        }
        NCCLCHECKGOTO(wrap_ibv_dereg_mr((struct ibv_mr*)mhandle), res, returning);
      }
      res = ncclSuccess;
      goto returning;
    }
  }
  WARN("NET/IB: could not find mr %p inside cache of %d entries", mhandle, cache->population);
  res = ncclInternalError;
returning:
  pthread_mutex_unlock(&ncclIbDevs[verbs->dev].lock);
  return res;
}

ncclResult_t ncclIbMultiSend(struct ncclIbSendComm* comm, int slot) {
  struct ncclIbRequest** reqs = comm->fifoReqs[slot];
  volatile struct ncclIbSendFifo* slots = comm->fifo[slot];
  int nreqs = slots[0].nreqs;
  if (nreqs > NCCL_NET_IB_MAX_RECVS) return ncclInternalError;

  uint64_t wr_id = 0ULL;

  for (int r=0; r<nreqs; r++) {
    struct ibv_send_wr* wr = comm->wrs+r;
    memset(wr, 0, sizeof(struct ibv_send_wr));

    struct ibv_sge* sge = comm->sges+r;
    sge->addr=(uintptr_t)reqs[r]->send.data;
    sge->lkey=reqs[r]->send.lkey;

    wr->opcode = IBV_WR_RDMA_WRITE;
    wr->send_flags = 0;
    wr->wr.rdma.remote_addr = slots[r].addr;
    wr->wr.rdma.rkey = slots[r].rkey;
    wr->next = wr+1;
    wr_id += (reqs[r] - comm->verbs.reqs) << (r*8);
  }

  // Write size as immediate data. In the case of multi-send, only write
  // 0 or 1 as size to indicate whether there was data sent or received.
  uint32_t immData = 0;
  if (nreqs == 1) {
    immData = reqs[0]->send.size;
  } else {
    if (nreqs > 32) {
      WARN("Cannot store sizes of %d requests in a 32-bits field", nreqs);
      return ncclInternalError;
    }
    for (int r=0; r<nreqs; r++) {
      immData |= (reqs[r]->send.size ? 1 : 0) << r;
    }
  }

  struct ibv_send_wr* lastWr = comm->wrs+nreqs-1;
  if (nreqs > 1 || reqs[0]->send.size > ncclParamIbArThreshold()) {
    // When using adaptive routing, send the bulk of the data first as an
    // RDMA_WRITE, then a 0-byte RDMA_WRITE_WITH_IMM to trigger a remote
    // completion.
    lastWr++;
    memset(lastWr, 0, sizeof(struct ibv_send_wr));
  }
  lastWr->wr_id = wr_id;
  lastWr->opcode = IBV_WR_RDMA_WRITE_WITH_IMM;
  lastWr->imm_data = immData;
  lastWr->next = NULL;
  lastWr->send_flags = IBV_SEND_SIGNALED;

  // Multi-QP: make sure IB writes are multiples of 128B so that LL and LL128 protocols still work
  const int align = 128;
  for (int q=0; q<comm->nqps; q++) {
    for (int r=0; r<nreqs; r++) {
      int chunkSize = DIVUP(DIVUP(reqs[r]->send.size, comm->nqps), align) * align;
      int length = std::min(reqs[r]->send.size-reqs[r]->send.offset, chunkSize);
      if (length <= 0) {
        comm->wrs[r].sg_list = NULL;
        comm->wrs[r].num_sge = 0;
      } else {
        comm->sges[r].length = length;
        comm->wrs[r].sg_list = comm->sges+r;
        comm->wrs[r].num_sge = 1;
      }
    }
    struct ibv_send_wr* bad_wr;
    NCCLCHECK(wrap_ibv_post_send(comm->qps[q], comm->wrs, &bad_wr));

    for (int r=0; r<nreqs; r++) {
      int chunkSize = DIVUP(DIVUP(reqs[r]->send.size, comm->nqps), align) * align;
      reqs[r]->send.offset += chunkSize;
      comm->sges[r].addr += chunkSize;
      comm->wrs[r].wr.rdma.remote_addr += chunkSize;
    }
  }

  return ncclSuccess;
}
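
// Illustrative sketch (not part of NCCL): how the loop above splits one send
// across nqps QPs in 128-byte-aligned chunks. Function names are hypothetical.
#include <algorithm>

static inline int exampleChunkSize(int size, int nqps, int align = 128) {
  int perQp = (size + nqps - 1) / nqps;          // DIVUP(size, nqps)
  return ((perQp + align - 1) / align) * align;  // round up to a multiple of align
}

// Bytes carried by QP index q (0-based). A result of 0 corresponds to the
// branch above that posts the work request with num_sge = 0.
static inline int exampleQpLength(int size, int nqps, int q) {
  int chunk = exampleChunkSize(size, nqps);
  return std::max(0, std::min(size - q * chunk, chunk));
}
// Example: size=1000, nqps=4 gives chunk=256 and lengths 256, 256, 256, 232.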

ncclResult_t ncclIbIsend(void* sendComm, void* data, int size, int tag, void* mhandle, void** request) {
  struct ncclIbSendComm* comm = (struct ncclIbSendComm*)sendComm;

  // If not ready yet, set up the QPs
  if (comm->ready == 0) NCCLCHECK(ncclSendCheck(comm));
  if (comm->ready == 0) { *request = NULL; return ncclSuccess; }

  struct ibv_mr* mr = (struct ibv_mr*)mhandle;

  // Wait for the receiver to have posted the corresponding receive
  int nreqs = 0;
  volatile struct ncclIbSendFifo* slots;

  int slot = (comm->fifoHead)%MAX_REQUESTS;
  struct ncclIbRequest** reqs = comm->fifoReqs[slot];
  slots = comm->fifo[slot];
  int idx = comm->fifoHead+1;
  if (slots[0].idx != idx) { *request = NULL; return ncclSuccess; }
  nreqs = slots[0].nreqs;
  // Wait until all data has arrived
  for (int r=1; r<nreqs; r++) while(slots[r].idx != idx);

  __sync_synchronize(); // order the nreqsPtr load against tag/rkey/addr loads below
  for (int r=0; r<nreqs; r++) {
    if (reqs[r] != NULL || slots[r].tag != tag) continue;

    // Sanity checks to catch user collective call count/size mismatches
    if (size > slots[r].size) {
      char line[SOCKET_NAME_MAXLEN+1];
      WARN("NET/IB : req %d/%d tag %x peer %s collective mismatch error, local size %d remote size %d",
           r, nreqs, tag, ncclSocketToString(&comm->sock.addr, line), size, slots[r].size);
      return ncclInvalidUsage;
    } // plus any potential programming errors
    else if (slots[r].size < 0 || slots[r].addr == 0 || slots[r].rkey == 0) {
     char line[SOCKET_NAME_MAXLEN+1];
     WARN("NET/IB : req %d/%d tag %x peer %s posted incorrect receive info: size %d addr %lx rkey %x",
          r, nreqs, tag, ncclSocketToString(&comm->sock.addr, line), slots[r].size, slots[r].addr, slots[r].rkey);
      return ncclInternalError;
    }

    // Send the data

    struct ncclIbRequest* req;
    NCCLCHECK(ncclIbGetRequest(&comm->verbs, &req));
    req->type = NCCL_NET_IB_REQ_SEND;
    req->addr = &comm->sock.addr;
    req->verbs = &comm->verbs;
    req->nreqs = nreqs;
    req->send.size = size;
    req->send.data = data;
    req->send.lkey = mr->lkey;
    req->send.offset = 0;
    req->addr = &comm->sock.addr;
    req->events = comm->nqps;
    *request = reqs[r] = req;

    // If this is a multi-recv, send only when all requests have matched.
    for (int r=0; r<nreqs; r++) {
      if (reqs[r] == NULL) return ncclSuccess;
    }

    TIME_START(0);
    NCCLCHECK(ncclIbMultiSend(comm, slot));

    // Clear slots[0]->nreqs, as well as other fields to help debugging and sanity checks
    memset((void*)slots, 0, sizeof(struct ncclIbSendFifo));
    memset(reqs, 0, NCCL_NET_IB_MAX_RECVS*sizeof(struct ncclIbRequest*));
    comm->fifoHead++;
    TIME_STOP(0);
    return ncclSuccess;
  }

  *request = NULL;
  return ncclSuccess;
}

// Post the fifo entry; the fifo carries ncclIbSendFifo items, each holding the rkey and the memory address that is waiting for data to be written
ncclResult_t ncclIbPostFifo(struct ncclIbRecvComm* comm, int n, void** data, int* sizes, int* tags, void** mhandles, struct ncclIbRequest* req) {
  struct ibv_send_wr wr;
  memset(&wr, 0, sizeof(wr));

  int slot = comm->remFifo.fifoTail%MAX_REQUESTS;
  struct ncclIbSendFifo* localElem = comm->remFifo.elems[slot];

  for (int i=0; i<n; i++) {
    localElem[i].addr = (uint64_t)data[i];
    struct ibv_mr* mr = (struct ibv_mr*)mhandles[i];
    localElem[i].rkey = mr->rkey;
    localElem[i].nreqs = n;
    localElem[i].size = sizes[i]; // Sanity/Debugging
    localElem[i].tag = tags[i];
    localElem[i].idx = comm->remFifo.fifoTail+1;
  }

  wr.wr.rdma.remote_addr = comm->remFifo.addr + slot*NCCL_NET_IB_MAX_RECVS*sizeof(struct ncclIbSendFifo);
  wr.wr.rdma.rkey = comm->remFifo.rkey;
  comm->remFifo.sge.addr = (uint64_t)localElem;
  comm->remFifo.sge.length = n*sizeof(struct ncclIbSendFifo);
  wr.sg_list = &comm->remFifo.sge;
  wr.num_sge = 1;
  wr.opcode = IBV_WR_RDMA_WRITE;
  wr.send_flags = comm->remFifo.flags; // IBV_SEND_INLINE

  // We need to occasionally post a request with the IBV_SEND_SIGNALED flag, otherwise
  // the send queue will never empty.
  //
  // From https://www.rdmamojo.com/2014/06/30/working-unsignaled-completions/
  // "How to use Unsignaled Completion?" / "Gotchas and Pitfalls"
  // All posted Send Requested, Signaled and Unsignaled, are considered outstanding until
  // a Work Completion that they, or Send Requests that were posted after them, was polled
  // from the Completion Queue associated with the Send Queue. This means if one works with
  // a Queue Pair that was configured to work with Unsignaled Completions, he must make
  // sure that occasionally (before the Send Queue is full with outstanding Send Requests)
  // a Send Request that generate Work Completion will be posted.
  //
  // Not following this rule may lead to a case that the Send Queue is full with Send
  // Requests that won't generate Work Completion:
  //
  //  - The Send Queue is full, so no new Send Requests can be posted to it
  //    (once the send queue is full, a send request that generates a work completion can no longer be posted either)
  //  - The Send Queue can't be emptied, since no Work Completion can be generated anymore
  //    (the reason is that no Work Completion, that can generate Work Completion that
  //    polling it will empty the Send Queue, can be posted)
  //  - The status of all posted Send Request is considered unknown
  //

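  // All posted send requests, signaled or unsignaled, are considered outstanding. When do they count as complete? Once a work completion
  // for them, or for a send request posted after them, has been polled from the completion queue associated with the send queue.
  
  // That is a bit convoluted. The core idea matches a case in this material: https://www.openfabrics.org/images/eventpresos/workshops2013/IBUG/2013_UserDay_Thur_1400_Bob-Russell-programming-concepts.pdf
  // In its RDMA ping-pong example, the write does not need to signal completion: the client issues a read right after the write and can simply wait on the read's completion.
  // If the earlier write failed, waiting on the read's completion returns the corresponding error as well.
  //
  // In other words, with unsignaled completions, the way to confirm that sends have completed is to occasionally post a send request
  // that generates a work completion, and to do so before the send queue fills up with outstanding send requests.
  //
  // Otherwise the send queue may end up holding only send requests that never generate a work completion. Concretely:
  // 1. The send queue is full, so no new request can be posted.
  // 2. The send queue can never be drained, because no request generates a work completion. This is closely related to 3 and says essentially the same thing.
  // 3. The status of every posted request is unknown.
  //

  // So in NCCL's implementation here, the post for slot 0 is signaled, while the posts for all other slots are unsignaled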
  if (slot == 0) {
    wr.send_flags |= IBV_SEND_SIGNALED;
    wr.wr_id = req - comm->verbs.reqs;
    req->events++;
  }

  struct ibv_send_wr* bad_wr;
  NCCLCHECK(wrap_ibv_post_send(comm->qps[0], &wr, &bad_wr));
  comm->remFifo.fifoTail++;

  return ncclSuccess;
}
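
// Illustrative sketch (not part of NCCL): the slot arithmetic used by
// ncclIbPostFifo above. The two constants are assumptions for this example
// only; the real values come from MAX_REQUESTS and NCCL_NET_IB_MAX_RECVS.
#include <cstddef>
#include <cstdint>

static const int kExampleMaxRequests = 8;  // assumed stand-in for MAX_REQUESTS
static const int kExampleMaxRecvs = 8;     // assumed stand-in for NCCL_NET_IB_MAX_RECVS

struct ExampleFifoPost {
  int slot;             // ring slot, identical on sender and receiver
  uint64_t remoteAddr;  // where the entries land inside the sender's fifo
  bool signaled;        // whether this post requests a work completion
};

static inline ExampleFifoPost exampleFifoPost(uint64_t fifoTail, uint64_t remBase, size_t elemSize) {
  ExampleFifoPost p;
  p.slot = (int)(fifoTail % kExampleMaxRequests);
  p.remoteAddr = remBase + (uint64_t)p.slot * kExampleMaxRecvs * elemSize;
  p.signaled = (p.slot == 0);  // one signaled post per trip around the ring
  return p;
}
// Example: with a 32-byte element and fifoTail=9, slot=1, the remote offset is
// 1*8*32 = 256 bytes, and the post is unsignaled.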

ncclResult_t ncclIbIrecv(void* recvComm, int n, void** data, int* sizes, int* tags, void** mhandles, void** request) {
  struct ncclIbRecvComm* comm = (struct ncclIbRecvComm*)recvComm;
  if (comm->ready == 0) NCCLCHECK(ncclRecvCheck(comm));
  if (comm->ready == 0) { *request = NULL; return ncclSuccess; }
  if (n > NCCL_NET_IB_MAX_RECVS) return ncclInternalError;

  struct ncclIbRequest* req;
  NCCLCHECK(ncclIbGetRequest(&comm->verbs, &req));
  req->type = NCCL_NET_IB_REQ_RECV;
  req->addr = &comm->sock.addr;
  req->nreqs = n;
  for (int i=0; i<n; i++) req->recv.sizes[i] = 0;

  struct ibv_recv_wr wr;
  memset(&wr, 0, sizeof(wr));
  wr.wr_id = req - comm->verbs.reqs;

  wr.sg_list = NULL;
  wr.num_sge = 0;

  TIME_START(1);
  for (int q=0; q<comm->nqps; q++) {
    struct ibv_qp* qp = comm->qps[q];
    struct ibv_recv_wr* bad_wr;
    NCCLCHECK(wrap_ibv_post_recv(qp, &wr, &bad_wr));
  }
  TIME_STOP(1);
  req->events = comm->nqps;

  *request = req;

  // Post to FIFO to notify sender
  TIME_START(2);
  NCCLCHECK(ncclIbPostFifo(comm, n, data, sizes, tags, mhandles, req));
  TIME_STOP(2);
  return ncclSuccess;
}

// As the code above shows, the IB flush logic is used for GDR flush and is enabled by default
ncclResult_t ncclIbIflush(void* recvComm, int n, void** data, int* sizes, void** mhandles, void** request) {
  struct ncclIbRecvComm* comm = (struct ncclIbRecvComm*)recvComm;
  int last = -1;
  for (int i=0; i<n; i++) if (sizes[i]) last = i;
  if (comm->gpuFlush.enabled == 0 || last == -1) return ncclSuccess;

  // Only flush once using the last non-zero receive
  struct ncclIbRequest* req;
  NCCLCHECK(ncclIbGetRequest(&comm->verbs, &req));
  req->type = NCCL_NET_IB_REQ_FLUSH;
  req->addr = &comm->sock.addr;
  struct ibv_mr* mr = (struct ibv_mr*)mhandles[last];

  struct ibv_send_wr wr;
  memset(&wr, 0, sizeof(wr));
  wr.wr_id = req - comm->verbs.reqs;

  wr.wr.rdma.remote_addr = (uint64_t)data[last];
  wr.wr.rdma.rkey = mr->rkey;
  wr.sg_list = &comm->gpuFlush.sge;
  wr.num_sge = 1;
  wr.opcode = IBV_WR_RDMA_READ;
  wr.send_flags = IBV_SEND_SIGNALED;

  TIME_START(4);
  struct ibv_send_wr* bad_wr;
  NCCLCHECK(wrap_ibv_post_send(comm->gpuFlush.qp, &wr, &bad_wr));
  TIME_STOP(4);

  *request = req;
  return ncclSuccess;
}

// Check whether a send or receive has completed
// A common error here is a network failure
// NCCL WARN NET/IB : Got completion with error 12, opcode 0, len 0, vendor err 129
// IBV_WC_RETRY_EXC_ERR (12) - Transport Retry Counter Exceeded: The local transport timeout retry counter was exceeded while trying to send this message. This means that the remote side didn't send any Ack or Nack. If this happens when sending the first message, usually this mean that the connection attributes are wrong or the remote side isn't in a state that it can respond to messages. If this happens after sending the first message, usually it means that the remote QP isn't available anymore. Relevant for RC QPs.
// https://github.com/NVIDIA/nccl/issues/426
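// If error 12 is reported right when training starts, it is most likely a QP parameter misconfiguration. If it is reported in the middle of training, it could be a code bug, but if we trust NCCL's code quality a network problem is far more likely: for example, on a RoCE v2 network the peer NIC or a switch drops packets, no ACK ever comes back, and the retries eventually time out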
ncclResult_t ncclIbTest(void* request, int* done, int* sizes) {
  struct ncclIbRequest *r = (struct ncclIbRequest*)request;
  *done = 0;

  while (1) {
    if (r->events == 0) {
      *done = 1;
      if (sizes && r->type == NCCL_NET_IB_REQ_RECV) {
        for (int i=0; i<r->nreqs; i++) sizes[i] = r->recv.sizes[i];
      }
      NCCLCHECK(ncclIbFreeRequest(r));
      return ncclSuccess;
    }

    int wrDone = 0;
    struct ibv_wc wcs[4];
    TIME_START(3);
    NCCLCHECK(wrap_ibv_poll_cq(r->verbs->cq, 4, wcs, &wrDone));
    if (wrDone == 0) { TIME_CANCEL(3); } else { TIME_STOP(3); }
    if (wrDone == 0) return ncclSuccess;

    for (int w=0; w<wrDone; w++) {
      struct ibv_wc *wc = wcs+w;

      // https://www.rdmamojo.com/2013/02/15/ibv_poll_cq/
      // Not all wc attributes are always valid. If the completion status is other than IBV_WC_SUCCESS, only the following attributes are valid:
      // wr_id
      // status
      // qp_num
      // vendor_err

      if (wc->status != IBV_WC_SUCCESS) {
        char line[SOCKET_NAME_MAXLEN+1];
        WARN("NET/IB : Got completion from peer %s with error %d, opcode %d, len %d, vendor err %d",
             ncclSocketToString(r->addr, line), wc->status, wc->opcode, wc->byte_len, wc->vendor_err); // also printing wc->qp_num here would make tracing easier
        return ncclRemoteError;
      }

      struct ncclIbRequest* req = r->verbs->reqs+(wc->wr_id & 0xff);      
      if (req->type == NCCL_NET_IB_REQ_SEND) {
        for (int i=0; i<req->nreqs; i++) {
          struct ncclIbRequest* sendReq = r->verbs->reqs+((wc->wr_id >> (i*8)) & 0xff);
          if ((sendReq->events <= 0)) return ncclInternalError;
          sendReq->events--;
        }
      } else {
        if (req && wc->opcode == IBV_WC_RECV_RDMA_WITH_IMM) {
          if (req->type != NCCL_NET_IB_REQ_RECV) return ncclInternalError;
          if (req->nreqs > 1) {
            // In the case of a multi recv, we only set sizes to 0 or 1.
            for (int i=0; i<req->nreqs; i++) {
              req->recv.sizes[i] = (wc->imm_data >> i) & 0x1;
            }
          } else {
            req->recv.sizes[0] += wc->imm_data;
          }
        }
        req->events--;
      }
    }
  }
}

ncclResult_t ncclIbCloseSend(void* sendComm) {
  struct ncclIbSendComm* comm = (struct ncclIbSendComm*)sendComm;
  if (comm) {
    close(comm->sock.fd);
    for (int q=0; q<comm->nqps; q++)
      if (comm->qps[q] != NULL) NCCLCHECK(wrap_ibv_destroy_qp(comm->qps[q]));
    if (comm->fifoMr != NULL) NCCLCHECK(wrap_ibv_dereg_mr(comm->fifoMr));
    NCCLCHECK(ncclIbDestroyVerbs(&comm->verbs));
    free(comm);
  }
  TIME_PRINT("IB");
  return ncclSuccess;
}

ncclResult_t ncclIbCloseRecv(void* recvComm) {
  struct ncclIbRecvComm* comm = (struct ncclIbRecvComm*)recvComm;
  if (comm) {
    close(comm->sock.fd);
    for (int q=0; q<comm->nqps; q++)
      if (comm->qps[q] != NULL) NCCLCHECK(wrap_ibv_destroy_qp(comm->qps[q]));
    if (comm->gpuFlush.enabled) {
      if (comm->gpuFlush.qp != NULL) NCCLCHECK(wrap_ibv_destroy_qp(comm->gpuFlush.qp));
      if (comm->gpuFlush.hostMr != NULL) NCCLCHECK(wrap_ibv_dereg_mr(comm->gpuFlush.hostMr));
    }
    if (comm->remFifo.mr != NULL) NCCLCHECK(wrap_ibv_dereg_mr(comm->remFifo.mr));
    NCCLCHECK(ncclIbDestroyVerbs(&comm->verbs));
    free(comm);
  }
  return ncclSuccess;
}

ncclResult_t ncclIbCloseListen(void* listenComm) {
  struct ncclIbListenComm* comm = (struct ncclIbListenComm*)listenComm;
  if (comm) {
    close(comm->sock.fd);
    free(comm);
  }
  return ncclSuccess;
}

ncclNet_t ncclNetIb = {
  "IB",
  ncclIbInit,
  ncclIbDevices,
  ncclIbGetProperties,
  ncclIbListen,
  ncclIbConnect,
  ncclIbAccept,
  ncclIbRegMr,
  ncclIbRegMrDmaBuf,
  ncclIbDeregMr,
  ncclIbIsend,
  ncclIbIrecv,
  ncclIbIflush,
  ncclIbTest,
  ncclIbCloseSend,
  ncclIbCloseRecv,
  ncclIbCloseListen
};
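In ncclIbTest above, `wc->wr_id & 0xff` selects the first request of a completion, and `(wc->wr_id >> (i*8)) & 0xff` walks the remaining ones, which suggests that one 64-bit wr_id carries up to eight one-byte request indices. The sketch below illustrates that layout in Go. `packWrID` and `unpackWrID` are hypothetical helpers inferred from the decode side only; the code that builds the wr_id is not shown in this listing, so the packing half is an assumption.

```go
package main

import "fmt"

// maxPacked is how many one-byte request indices fit in a 64-bit wr_id.
const maxPacked = 8

// packWrID stores request index i in byte i of the result.
// Hypothetical helper: inferred from the decode logic, not copied from NCCL.
func packWrID(indices []uint8) (uint64, error) {
	if len(indices) > maxPacked {
		return 0, fmt.Errorf("at most %d indices fit in a wr_id, got %d", maxPacked, len(indices))
	}
	var id uint64
	for i, idx := range indices {
		id |= uint64(idx) << (uint(i) * 8)
	}
	return id, nil
}

// unpackWrID mirrors (wc->wr_id >> (i*8)) & 0xff for i in [0, n).
func unpackWrID(id uint64, n int) []uint8 {
	out := make([]uint8, 0, n)
	for i := 0; i < n && i < maxPacked; i++ {
		out = append(out, uint8((id>>(uint(i)*8))&0xff))
	}
	return out
}

func main() {
	id, err := packWrID([]uint8{3, 17, 42})
	if err != nil {
		panic(err)
	}
	fmt.Printf("wr_id=%#x indices=%v\n", id, unpackWrID(id, 3))
}
```

One byte per index also explains the `& 0xff` mask: it only works if a comm never holds more than 256 in-flight request slots.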

pytorch 1.13 and nccl

Posted 2023-05-03, updated 2023-05-07


windows 11
wsl2
- ubuntu 18.04
- nvidia driver 531.68
- cuda 11.6.2

pytorch 1.13.1 docker image

docker image

docker pull pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime

view nccl version of pytorch

docker run -ti --rm pytorch/pytorch:1.13.1-cuda11.6-cudnn8-runtime bash

python -c "import torch;print(torch.cuda.nccl.version())"

pytorch 1.13.1

https://github.com/pytorch/pytorch/tree/v1.13.1

https://github.com/pytorch/pytorch/tree/v1.13.1#from-source

https://github.com/pytorch/pytorch/blob/v1.13.1/CONTRIBUTING.md#tips-and-debugging

https://zrss.github.io/archives/5a3d0ab7.html

conda create -n pytorch-dev python=3.8

conda activate pytorch-dev

conda install astunparse numpy ninja pyyaml setuptools cmake cffi typing_extensions future six requests dataclasses
conda install -c pytorch magma-cuda116
conda install mkl mkl-include

export CMAKE_PREFIX_PATH=${CONDA_PREFIX:-"$(dirname $(which conda))/../"}

CUDACXX=/usr/local/cuda/bin/nvcc MAX_JOBS=8 python setup.py develop

If the following log appears during setup

Building wheel torch-1.13.0a0+git49444c3
-- Building version 1.13.0a0+git49444c3
Could not find any of CMakeLists.txt, Makefile, setup.py, LICENSE, LICENSE.md, LICENSE.txt in /root/projects/pytorch/third_party/ios-cmake
Did you run 'git submodule update --init --recursive --jobs 0'?

re-initialize the submodules and try again

git submodule deinit -f .
git clean -xdf
python setup.py clean
git submodule update --init --recursive --jobs 0

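If the following log appears during setup, reduce the number of jobs (MAX_JOBS=8 in the command above) and try again; a compiler reported as Killed usually means the build ran out of memory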

FAILED: third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o
/usr/bin/c++ -DFBGEMM_STATIC -I/root/projects/pytorch/third_party/cpuinfo/include -I/root/projects/pytorch/third_party/fbgemm/third_party/asmjit/src -I/root/projects/pytorch/third_party/fbgemm/include -I/root/projects/pytorch/third_party/fbgemm -I/root/projects/pytorch/cmake/../third_party/benchmark/include -isystem /root/projects/pytorch/cmake/../third_party/googletest/googlemock/include -isystem /root/projects/pytorch/cmake/../third_party/googletest/googletest/include -isystem /root/projects/pytorch/third_party/protobuf/src -isystem /root/tools/miniconda3/envs/pytorch-dev/include -isystem /root/projects/pytorch/third_party/gemmlowp -isystem /root/projects/pytorch/third_party/neon2sse -isystem /root/projects/pytorch/third_party/XNNPACK/include -Wno-deprecated -fvisibility-inlines-hidden -DUSE_PTHREADPOOL -fopenmp -Wall -Wextra -Werror -Wno-deprecated-declarations -O3 -DNDEBUG -fPIC -fvisibility=hidden -m64 -mavx2 -mf16c -mfma -std=c++14 -Wno-uninitialized -MD -MT third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o -MF third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o.d -o third_party/fbgemm/CMakeFiles/fbgemm_avx2.dir/src/FbgemmI8DepthwiseAvx2.cc.o -c /root/projects/pytorch/third_party/fbgemm/src/FbgemmI8DepthwiseAvx2.cc
c++: internal compiler error: Killed (program cc1plus)

nccl v2.14.3-1

https://github.com/NVIDIA/nccl/tree/v2.14.3-1

make (generate header)

cd nccl
make -j 8 src.build

If the following log appears during make, reduce the number of parallel jobs (-j 8 in the command above) and try again

g++: internal compiler error: Killed (program cc1plus)

Copy the generated build/include/nccl.h file into the src directory

vscode nccl in wsl

config wsl for vscode

https://code.visualstudio.com/docs/cpp/config-wsl

install cuda toolkit in wsl

https://docs.nvidia.com/cuda/wsl-user-guide/index.html#cuda-support-for-wsl-2

https://developer.nvidia.com/cuda-11-6-2-download-archive?target_os=Linux&target_arch=x86_64&Distribution=WSL-Ubuntu&target_version=2.0&target_type=deb_local

add includePath for vscode

/usr/local/cuda/targets/x86_64-linux/include/

wsl2 docker

Posted 2023-05-03, updated 2023-05-07


Install docker engine 18.09.9 from static binaries, following https://docs.docker.com/engine/install/binaries/#install-static-binaries
Enable systemd support in wsl2, following https://learn.microsoft.com/en-us/windows/wsl/wsl-config#systemd-support
Configure systemd for docker engine, following https://docs.docker.com/engine/install/linux-postinstall/#configure-docker-to-start-on-boot-with-systemd; the systemd unit files come from https://github.com/moby/moby/tree/master/contrib/init/systemd
Configure the docker daemon proxy, following https://docs.docker.com/config/daemon/systemd/#httphttps-proxy

After completing the configuration above, docker info was very slow, and docker run on a container image failed with the following error

docker: Error response from daemon: all SubConns are in TransientFailure, latest connection error: connection error: desc = "transport: Error while dialing dial unix:///run/containerd/containerd.sock: timeout": unavailable.

According to https://github.com/sous-chefs/docker/issues/1062, the systemd configuration on the master branch appears to have introduced an incompatible change, so docker engine 18.09.9 cannot fully start with it. After switching the systemd configuration to https://github.com/moby/moby/tree/v18.09.9/contrib/init/systemd, docker info and docker run worked normally again

golang handle signals

Posted 2022-12-25


https://pkg.go.dev/os/signal#hdr-Default_behavior_of_signals_in_Go_programs

https://pkg.go.dev/os/signal#hdr-Changing_the_behavior_of_signals_in_Go_programs

By default, a synchronous signal is converted into a run-time panic. A SIGHUP, SIGINT, or SIGTERM signal causes the program to exit.

Notify disables the default behavior for a given set of asynchronous signals and instead delivers them over one or more registered channels. Specifically, it applies to the signals SIGHUP, SIGINT, SIGQUIT, SIGABRT, and SIGTERM.

But don't forget that a race is possible. The example below launches the golang process from bash scripts to demonstrate it

test-signal

package main

import (
	"fmt"
	"os"
	"os/signal"
	"syscall"
	"time"
)

func main() {
	signalCh := make(chan os.Signal, 2)
	signal.Notify(signalCh, syscall.SIGINT, syscall.SIGTERM)
	fmt.Printf("notify signals\n")

	go func() {
		sig := <- signalCh
		fmt.Printf("receive signal %v\n", sig)
	}()

	fmt.Printf("wait signal\n")
	time.Sleep(time.Minute)
}

test1.sh

./test-signal &
pid=$!

echo "test-signal pid: $pid"

kill $pid
wait $pid

exit_code=$?
echo "test-signal exit_code: $exit_code"

test2.sh

./test-signal &
pid=$!

echo "test-signal pid: $pid"

# important
sleep 1
#

kill $pid
wait $pid

exit_code=$?
echo "test-signal exit_code: $exit_code"

Output of test1.sh

test-signal pid: 4878
test1.sh: line 7:  4878 Terminated: 15          ./test-signal
test-signal exit_code: 143

Output of test2.sh

test-signal pid: 4880
notify signals
wait signal
receive signal terminated

Summary

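The default behavior of a golang program on receiving the TERM signal is to exit; the shell then reports exit code 143 (128 + 15, where 15 is TERM)
signal.Notify changes this default behavior; however, if the golang program receives the TERM signal too soon after starting (before signal.Notify has finished executing), the program still exits directly (the default behavior)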

gorm v1 logger

Posted 2022-12-17


https://v1.gorm.io/docs/

https://v1.gorm.io/docs/logger.html

Refer GORM’s default logger for how to customize it

https://github.com/jinzhu/gorm/blob/v1.9.16/logger.go

gorm v1 print log

func (s *DB) print(v ...interface{}) {
	s.logger.Print(v...)
}

func (s *DB) log(v ...interface{}) {
	if s != nil && s.logMode == detailedLogMode {
		s.print(append([]interface{}{"log", fileWithLineNum()}, v...)...)
	}
}

func (s *DB) slog(sql string, t time.Time, vars ...interface{}) {
	if s.logMode == detailedLogMode {
		s.print("sql", fileWithLineNum(), NowFunc().Sub(t), sql, vars, s.RowsAffected)
	}
}

gorm v1 print error

// AddError add error to the db
func (s *DB) AddError(err error) error {
	if err != nil {
		if err != ErrRecordNotFound {
			if s.logMode == defaultLogMode {
				go s.print("error", fileWithLineNum(), err)
			} else {
				s.log(err)
			}

			errors := Errors(s.GetErrors())
			errors = errors.Add(err)
			if len(errors) > 1 {
				err = errors
			}
		}

		s.Error = err
	}
	return err
}

gorm v1 print sql

// trace print sql log
func (scope *Scope) trace(t time.Time) {
	if len(scope.SQL) > 0 {
		scope.db.slog(scope.SQL, t, scope.SQLVars...)
	}
}

So when gorm v1 LogMode is enabled

// LogMode set log mode, `true` for detailed logs, `false` for no log, default, will only print error logs
func (s *DB) LogMode(enable bool) *DB {
	if enable {
		s.logMode = detailedLogMode
	} else {
		s.logMode = noLogMode
	}
	return s
}

execution goes through the s.print paths for both log messages and sql messages

https://www.soberkoder.com/go-gorm-logging/

To customize the gorm v1 logger, the following snippet can serve as a reference

// GormLogger struct
type GormLogger struct{}

// Print - Log Formatter
func (*GormLogger) Print(v ...interface{}) {
  if v[0] == "sql" {
    log.WithFields(
      log.Fields{
        "module":        "gorm",
        "type":          "sql",
        "rows_returned": v[5],
        "src":           v[1],
        //"values":        v[4],
        "duration":      v[2],
      },
    ).Info(v[3])
  } else {
    log.WithFields(log.Fields{"module": "gorm", "type": "log", "src": v[1]}).Print(v[2:]...)
  }
}

In addition, the duration can be used to implement client-side slow SQL logging