安装依赖
如果ib网卡不识别,或着识别后配置后无法启用,安装ib驱动
yum install -y infiniband-diags
yum install -y opensm
systemctl start opensm
systemctl enable opensm
后期编译mvapich时需要ib库,
yum install -y libibverbs
yum install -y libibverbs-devel
yum install -y libibmad-devel
计算节点也要安装这些库,如果计算节点没装,管理节点安装了编译的程序,在管理节点编译安装没有问题,提交作业就会报错,如
/home/users/cndaqiang/soft/gnu4-mvapich/R-TDAP/Obj/tdap: error while loading shared libraries: libibmad.so.5: cannot open shared object file: No such file or directory
/home/users/cndaqiang/soft/gnu4-mvapich/R-TDAP/Obj/tdap: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
srun: error: hpcc045: tasks 0-35: Exited with exit code 127
网络配置
同Linux普通Eth网卡配置
vi /etc/sysconfig/network-scripts/ifcfg-ib0
CONNECTED_MODE=no
TYPE=InfiniBand
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ib0
UUID=3491f03d-656b-45a3-b66f-fd3c6b6d6968
DEVICE=ib0
ONBOOT=yes
IPADDR=172.16.100.7
PREFIX=24
如果没有该文件,新建时查询UUID方法
yum -y install NetworkManager
service NetworkManager start
[cndaqiang@master ~]$ nmcli con
NAME UUID TYPE DEVICE
eno1 5e7bcf02-9afa-4bfc-9712-6bdf6126ae58 ethernet eno1
ib0 3491f03d-656b-45a3-b66f-fd3c6b6d6968 infiniband ib0
virbr0 a945b5c5-ec6d-4ed5-9a97-d7aa725f8f21 bridge virbr0
重启网络
systemctl restart network
两台集群IB直连
可以配置,和上面相同,只要配置好ip,IB线直接插在两个机器的IB口即可
但是一台机器关机,另一台机器的IB网卡就断开了,在IB网卡上ip就没法用了,基于该ip配置的服务全部崩,直到另一台机器上线才能恢复
还是采用IB交换机可玩性更高
IB相关配置
查询信息
ibnodes
ibstatus
报错
[cndaqiang@master ~]$ ibnodes
ibwarn: [5809] mad_rpc_open_port: can't open UMAD port ((null):0)
src/ibnetdisc.c:788; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
ibwarn: [5814] mad_rpc_open_port: can't open UMAD port ((null):0)
src/ibnetdisc.c:788; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
切换root身份即可
[cndaqiang@master ~]$ sudo su
[sudo] password for cndaqiang:
[root@master cndaqiang]# ibnodes
Ca : 0xb8599f0300d0134e ports 1 "client01 mlx5_0"
Ca : 0xb8599f0300d0135e ports 1 "master mlx5_0"
本文首发于我的博客@cndaqiang.
本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!