Configuring an InfiniBand (IB) Network on CentOS

2020-10-27
cndaqiang

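Installing dependencies

If the IB adapter is not detected, or is detected but cannot be brought up after configuration, install the IB diagnostic tools and the OpenSM subnet manager, then start and enable the opensm service: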

yum install -y infiniband-diags
yum install -y opensm
systemctl start opensm
systemctl enable opensm

The IB libraries are needed later when compiling MVAPICH:

yum install -y libibverbs
yum install -y libibverbs-devel
yum install -y libibmad-devel

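The compute nodes need these libraries too. If they are installed only on the management node, a program compiles and installs there without problems, but submitted jobs fail on the compute nodes with errors such as: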

/home/users/cndaqiang/soft/gnu4-mvapich/R-TDAP/Obj/tdap: error while loading shared libraries: libibmad.so.5: cannot open shared object file: No such file or directory
/home/users/cndaqiang/soft/gnu4-mvapich/R-TDAP/Obj/tdap: error while loading shared libraries: libgfortran.so.3: cannot open shared object file: No such file or directory
srun: error: hpcc045: tasks 0-35: Exited with exit code 127
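A quick way to catch this before submitting a job is to run ldd against the binary on a compute node and list whatever the loader cannot resolve. The sketch below is only an illustration: missing_libs is a hypothetical helper name, and it assumes ldd marks unresolved libraries with "=> not found", as glibc's ldd does.

```shell
# Hypothetical helper: read ldd output on stdin and print the names of
# shared libraries reported as "not found".
missing_libs() {
    awk '/=> not found/ { print $1 }'
}

# Usage on a compute node (the binary path is a placeholder):
#   ldd /path/to/tdap | missing_libs
```

An empty result means every shared library the binary needs was found on that node.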

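Network configuration

Configure the IB interface the same way as an ordinary Linux Ethernet interface, by editing its interface file: vi /etc/sysconfig/network-scripts/ifcfg-ib0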

CONNECTED_MODE=no
TYPE=InfiniBand
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=static
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ib0
UUID=3491f03d-656b-45a3-b66f-fd3c6b6d6968
DEVICE=ib0
ONBOOT=yes
IPADDR=172.16.100.7
PREFIX=24
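When several nodes need the same file with only the UUID and address changed, a small generator avoids copy-and-paste mistakes. This is a minimal sketch, not the author's method: render_ifcfg_ib0 is a hypothetical function, and it emits only a subset of the keys shown above (the IPv6 and proxy settings are left out on the assumption that the defaults are acceptable).

```shell
# Hypothetical helper: print a minimal ifcfg-ib0 for a static IPv4 address.
# Arguments: connection UUID, IPv4 address, prefix length.
render_ifcfg_ib0() {
    cat <<EOF
TYPE=InfiniBand
DEVICE=ib0
NAME=ib0
UUID=$1
ONBOOT=yes
BOOTPROTO=static
IPADDR=$2
PREFIX=$3
CONNECTED_MODE=no
EOF
}

# Usage: preview the file using the values from this post
#   render_ifcfg_ib0 3491f03d-656b-45a3-b66f-fd3c6b6d6968 172.16.100.7 24
```

Review the output before redirecting it into /etc/sysconfig/network-scripts/ifcfg-ib0 as root.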

If the file does not exist and you have to create it, look up the UUID as follows:

yum -y install NetworkManager
service NetworkManager start
[cndaqiang@master ~]$ nmcli con 
NAME    UUID                                  TYPE        DEVICE 
eno1    5e7bcf02-9afa-4bfc-9712-6bdf6126ae58  ethernet    eno1   
ib0     3491f03d-656b-45a3-b66f-fd3c6b6d6968  infiniband  ib0    
virbr0  a945b5c5-ec6d-4ed5-9a97-d7aa725f8f21  bridge      virbr0
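To pull out just the UUID for one connection instead of reading the table by eye, nmcli's terse mode (-t) with an explicit field list (-f NAME,UUID) prints colon-separated lines that are easy to filter. The helper name uuid_for_connection below is hypothetical, and the sketch assumes the connection name itself contains no colon.

```shell
# Hypothetical helper: read "NAME:UUID" lines on stdin and print the UUID
# of the connection whose name matches the first argument.
uuid_for_connection() {
    awk -F: -v name="$1" '$1 == name { print $2 }'
}

# Usage:
#   nmcli -t -f NAME,UUID connection show | uuid_for_connection ib0
```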

Restart the network

systemctl restart network
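After the restart it is worth confirming that ib0 actually came up with the expected address. The sketch below assumes the installed iproute2 supports the brief output mode (ip -br), whose lines have the form "name state address ..."; iface_state is a hypothetical helper name.

```shell
# Hypothetical helper: read one line of `ip -br addr` output on stdin and
# print the link state followed by the first address.
iface_state() {
    awk '{ print $2, $3 }'
}

# Usage:
#   ip -br addr show dev ib0 | iface_state
```

With the configuration from this post, a working interface would report the state UP together with 172.16.100.7/24.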

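Connecting two machines directly over IB

This works and is configured exactly as above: set the IP addresses and plug the IB cable straight into the IB ports of the two machines.
However, when one machine is powered off, the IB interface on the other goes down and its IP address becomes unusable. Every service bound to that IP fails until the other machine comes back online.

Using an IB switch is the more flexible option.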

IB-related commands

Querying information

ibnodes
ibstatus
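For scripts or health checks, the ibstatus output can be reduced to one line per port. This is a sketch under an assumption about the output layout: a header line of the form "Infiniband device 'mlx5_0' port 1 status:" followed by an indented "state:" line such as "state: 4: ACTIVE". ib_port_states is a hypothetical helper name.

```shell
# Hypothetical helper: read ibstatus output on stdin and print
# "device port state" for every port, e.g. "mlx5_0 1 ACTIVE".
ib_port_states() {
    awk -F"'" '
        # Header line: the device name is quoted, the port number follows it.
        /^Infiniband device/ { dev = $2; split($3, a, " "); port = a[2] }
        # Logical state line (does not match "phys state:"); keep the last word.
        /^[ \t]*state:/      { n = split($0, b, " "); print dev, port, b[n] }
    '
}

# Usage:
#   ibstatus | ib_port_states
```

A port that is cabled but has no subnet manager on the fabric typically stays in INIT rather than ACTIVE, which points back to checking that opensm is running.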

Error

[cndaqiang@master ~]$ ibnodes
ibwarn: [5809] mad_rpc_open_port: can't open UMAD port ((null):0)
src/ibnetdisc.c:788; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed
ibwarn: [5814] mad_rpc_open_port: can't open UMAD port ((null):0)
src/ibnetdisc.c:788; can't open MAD port ((null):0)
/usr/sbin/ibnetdiscover: iberror: failed: discover failed

Running the command as root fixes it:

[cndaqiang@master ~]$ sudo su
[sudo] password for cndaqiang:
[root@master cndaqiang]# ibnodes
Ca	: 0xb8599f0300d0134e ports 1 "client01 mlx5_0"
Ca	: 0xb8599f0300d0135e ports 1 "master mlx5_0"
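The node description in the last column is the quickest way to see which hosts are visible on the fabric. A minimal sketch that extracts it from output in the format shown above; ib_node_names is a hypothetical helper name.

```shell
# Hypothetical helper: read ibnodes output on stdin and print the quoted
# node description from each line (for example "master mlx5_0").
ib_node_names() {
    awk -F'"' 'NF >= 3 { print $2 }'
}

# Usage (as root, see above):
#   ibnodes | ib_node_names
```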

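This post was first published on my blog @cndaqiang.
Unless otherwise stated, all posts on this blog are licensed under CC BY-SA 4.0. Please credit the source when reposting.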


