基于pbs作业系统的集群,支持NIS,NFS服务。
本文最初发在Centos 7 集群搭建
在本文的基础上继续搭建slurm作业系统Centos7集群上搭建slurm作业管理系统
创建用户
useradd -d 用户家目录 -M 用户名
passwd 用户名
# 添加到sudo组
usermod -aG wheel 用户名
挂载安装光盘作为本地yum仓库
没有外网需要安装程序时,可通过挂载光驱/iso文件作为本地仓库
挂载安装光盘
挂载iso文件
mkdir /media/CentOS-7-x86_64-Everything-1908
mount -o loop /opt/source/CentOS-7-x86_64-Everything-1908.iso /media/CentOS-7-x86_64-Everything-1908
or 挂载光驱
mkdir /media/diskOS
mount /dev/cdrom /media/diskOS #挂载光盘
配置光盘为yum源
- 挂载iso/光驱
- 配置yum
cd /etc/yum.repos.d/ #取消网络yum mv CentOS-Base.repo CentOS-Base.repo.bak #配置光盘镜像 vi CentOS-Media.repo
修改
[c7-media] name=CentOS-$releasever - Media #把挂载目录添加到baseurl [c7-media] name=CentOS-$releasever - Media baseurl=file:///media/CentOS-7-x86_64-Everything-1908 #此处添加新的挂载地址 file:///media/cdrom/ file:///media/cdrecorder/ gpgcheck=1 enabled=1 #此处设置1打开 gpgkey=file:///etc/pki/rpm-gpg/RPM-GPG-KEY-CentOS-7
- 安装测试
[root@node8 source]# yum install python 已加载插件:fastestmirror, langpacks Loading mirror speeds from cached hostfile * c7-media: c7-media | 3.6 kB 00:00 #<此处看出已经替换为了光盘的源 正在解决依赖关系 --> 正在检查事务 ---> 软件包 python.x86_64.0.2.7.5-68.el7 将被 升级
EPEL仓库
很多程序的安装在本身的仓库中是没有的,要使用epel进行安装 需要CentOS-Base.repo存在,光盘中没有epel
yum install epel-release
网络配置
改主机名
#永久
hostnamectl --static set-hostname newname
修改更新/etc/hosts
网络
开机启用网卡
vi /etc/sysconfig/network-scripts/ifcfg-eth0
#设置ONBOOT=yes则开机启动
#同时也是网卡配置文件
临时开启
ifup eth0
关闭
ifdown eth0
网卡配置文件规则
vi /etc/sysconfig/network-scripts/ifcfg-ens34 #不同的网卡 ifcfg-跟的后缀不同
TYPE=Ethernet
PROXY_METHOD=none
BROWSER_ONLY=no
BOOTPROTO=dhcp
DEFROUTE=yes
IPV4_FAILURE_FATAL=no
IPV6INIT=yes
IPV6_AUTOCONF=yes
IPV6_DEFROUTE=yes
IPV6_FAILURE_FATAL=no
IPV6_ADDR_GEN_MODE=stable-privacy
NAME=ens34
UUID=bf1e0d23-8b7a-40e0-83e7-1dcbb072206c
DEVICE=ens34
ONBOOT=yes
NIS 账户管理服务器
第十四章、账号控管: NIS 服务器 Centos7- NIS环境搭建 在Centos7上安装NIS Network Information Services (NIS server) 最早应该是称为 Sun Yellow Pages (简称 yp) 一台服务器(管理节点)保存用户账户密码,所有服务器用户都保存在NIS服务器上
- yp-tools :提供 NIS 相关的查寻指令功能
- ypbind :提供 NIS Client 端的设定软件
- ypserv :提供 NIS Server 端的设定软件
- rpcbind :就是 RPC 一定需要的数据啊!
关闭Selinux和防火墙
vim /etc/selinux/config
设置
SELINUX=disabled
关闭防火墙
systemctl stop firewalld
systemctl disable firewalld
reboot
服务器
yum -y install ypserv ypbind yp-tools
设置域
echo "NISDOMAIN=CC19" >> /etc/sysconfig/network
启动服务
systemctl start ypserv
systemctl start rpcbind
systemctl start yppasswdd
#开机启动
systemctl enable ypserv
systemctl enable rpcbind
systemctl enable yppasswdd
创建用户
[root@mom01 ~]# sudo adduser --home /home/c1 c1
[root@mom01 ~]# passwd c1
Changing password for user c1.
New password:
Retype new password:
passwd: all authentication tokens updated successfully.
更新用户信息数据库
/usr/lib64/yp/ypinit -m
# 新增账户时,需要更新数据库
make -C /var/yp
客户端
yum -y install ypbind yp-tools
设置域
echo "NISDOMAIN=CC19" >> /etc/sysconfig/network
设置用户认证信息读取顺序
vim /etc/nsswitch.conf
在存在的认证方式前添加nis
passwd: files sss nis
shadow: files sss nis
group: files sss nis
#initgroups: files sss
#hosts: db files nisplus nis dns
hosts: files nis dns myhostname
设置ns客户端
vim /etc/yp.conf
添加domain CC19 broadcast
,指定服务器地址也行(即,如果后面的systemctl start ypbind 卡死,设置为
domain CC19 server 服务器hostname或ip`)
vim /etc/sysconfig/authconfig
修改USENIS=yes
vim /etc/pam.d/system-auth
添加nispassword sufficient pam_unix.so sha512 shadow nis nullok try_first_pass use_authtok
启动
systemctl start rpcbind
systemctl start ypbind
systemctl enable rpcbind
systemctl enable ypbind
client测试
[root@boy01 ~]# yptest
Test 1: domainname
Configured domainname is "CC19"
Test 2: ypbind
Used NIS server: mom01
Test 3: yp_match
WARNING: No such key in map (Map passwd.byname, key nobody)
Test 4: yp_first
c1 c1:$6$v6nN75rW$20nFRAqDHqGVOyXFxaRUtG5ZeiySw7.hZkg0HfvD2kb/g/5daMfRCMAxuEG0nelTjZ//VaSKx0E8CRClhxzcP.:1001:1001::/home/c1:/bin/bash
Test 5: yp_next
cndaqang cndaqang:$6$aBRr9Jo4YWMFHhFQ$Wk.5DGrfsxipEVQPM2NaKJw.3sLdGlGUKGGxZzGFeYfiFqtHEI.B68akDTeNZN77SMkGPYZmQ1Z2iyUfYRYd4/:1000:1000:cndaqang:/home/cndaqang:/bin/bash
Test 6: yp_master
mom01
Test 7: yp_order
1568563871
Test 8: yp_maplist
netid.byname
mail.aliases
protocols.byname
protocols.bynumber
services.byservicename
services.byname
rpc.bynumber
rpc.byname
hosts.byaddr
hosts.byname
group.bygid
group.byname
passwd.byuid
passwd.byname
ypservers
Test 9: yp_all
c1 c1:$6$v6nN75rW$20nFRAqDHqGVOyXFxaRUtG5ZeiySw7.hZkg0HfvD2kb/g/5daMfRCMAxuEG0nelTjZ//VaSKx0E8CRClhxzaP.:1001:1001::/home/c1:/bin/bash
cndaqang cndaqang:$6$aBRr9Jo4YWMFHhFQ$Wk.5DGrfsxipEVQPM2NaKJw.3sLdGlGUKGGxZzGFeYfiFqtHEI.B68akDTeNZN77SMkGPYZmQ1Z2iyUfYRYd5/:1000:1000:cndaqang:/home/cndaqang:/bin/bash
1 tests failed
查看nis服务器
[root@boy01 ~]# ypwhich
mom01
开启防火墙的设置
因为计算节点boyxx和管理节点mom之间还有局域网, 可以把此部分的防火墙端口打开,关闭连接外网的端口
查看NIS服务器端口占用
[root@mom01 ~]# rpcinfo -p localhost
program vers proto port service
100000 4 tcp 111 portmapper
100000 3 tcp 111 portmapper
100000 2 tcp 111 portmapper
100000 4 udp 111 portmapper
100000 3 udp 111 portmapper
100000 2 udp 111 portmapper
100004 2 udp 992 ypserv
100004 1 udp 992 ypserv
100004 2 tcp 995 ypserv
100004 1 tcp 995 ypserv
100009 1 udp 1007 yppasswdd
由于linux端口范围 1024以下是系统保留的,从1024-65535是用户使用的,子网11.11.11.0/24,关闭默认防火墙,使用iptables 可以
- 打开内网网卡的防火墙
iptables -A INPUT -i ens33 -j ACCEPT
- 允许内网ip通过
iptables -I INPUT -s 11.11.11.0/24 -j ACCEPT
NFS文件系统共享服务器
CentOS 7 下 yum 安装和配置 NFS NFS 是 Network File System 的缩写,即网络文件系统。即计算节点可以执行管理节点的程序,查看用户的文件
服务器端
yum -y install nfs-utils
systemctl enable rpcbind
systemctl enable nfs
systemctl start rpcbind
systemctl start nfs
#查看端口
rpcinfo -p localhost | grep nfs
100003 3 tcp 2049 nfs
100003 4 tcp 2049 nfs
100227 3 tcp 2049 nfs_acl
100003 3 udp 2049 nfs
100003 4 udp 2049 nfs
100227 3 udp 2049 nfs_acl
设置共享目录,此处设置计算程序目录/opt
,用户数据目录/home
chmod 755 /home
chmod 755 /opt
添加到配置文件vi /etc/exports
# shareDir ip(rw,no_root_squash,no_all_squash,sync)
# ip 192.168.0.0/24: 客户端 IP 范围,* 代表所有,即没有限制。
# rw: 权限设置,可读可写。
# sync: 同步共享目录。
# no_root_squash: 可以使用 root 授权。
# no_all_squash: 可以使用普通用户授权。
/home *(rw,no_root_squash,sync)
/opt *(rw,no_root_squash,sync)
重启服务
systemctl restart nfs
查看共享情况
[root@mom01 home]# showmount -e localhost
Export list for localhost:
/opt *
/home *
防火墙已设置对集群内网打开
客户端
yum -y install nfs-utils
systemctl enable rpcbind
systemctl start rpcbind
客户端不需要打开防火墙,因为客户端时发出请求方,网络能连接到服务端即可。
客户端也不需要开启 NFS 服务,因为不共享目录。
查看管理节点nfs情况
[root@boy01 ~]# showmount -e mom01
Export list for mom01:
/opt *
/home *
挂载
[root@boy01 ~]# mount mom01:/opt /opt
[root@boy01 ~]# mount mom01:/home /home
[root@boy01 ~]# ls /home/
c1 cndaqang test
开机挂载,添加mount命令到开机脚本,或添加到/etc/fstab
mom01:/home /home nfs defaults 0 0
mom01:/opt /opt nfs defaults 0 0
防火墙iptables
CentOS7安装iptables防火墙 Linux 下的(防火墙)iptables
#install
yum -y install iptables iptables-services
#remove default firewalld
systemctl stop firewalld
systemctl disable firewalld
systemctl mask firewalld
查看iptables现有规则iptables -nvL
[root@mom01 ~]# iptables -nvL
Chain INPUT (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
100 7320 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
10 832 ACCEPT all -- lo * 0.0.0.0/0 0.0.0.0/0
Chain FORWARD (policy DROP 0 packets, 0 bytes)
pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 76 packets, 9016 bytes)
pkts bytes target prot opt in out source destination
删除规则iptables -D
[root@mom01 ~]# iptables -nvL --line-number
Chain INPUT (policy DROP 2 packets, 148 bytes)
num pkts bytes target prot opt in out source destination
1 184 13504 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
2 16 1444 ACCEPT all -- lo * 0.0.0.0/0 0.0.0.0/0
3 0 0 ACCEPT all -- ens33 * 0.0.0.0/0 0.0.0.0/0
Chain FORWARD (policy DROP 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 21 packets, 2928 bytes)
num pkts bytes target prot opt in out source destination
[root@mom01 ~]# iptables -D INPUT 3
[root@mom01 ~]# iptables -nvL --line-number
Chain INPUT (policy DROP 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
1 203 14796 ACCEPT tcp -- * * 0.0.0.0/0 0.0.0.0/0 tcp dpt:22
2 16 1444 ACCEPT all -- lo * 0.0.0.0/0 0.0.0.0/0
Chain FORWARD (policy DROP 0 packets, 0 bytes)
num pkts bytes target prot opt in out source destination
Chain OUTPUT (policy ACCEPT 4 packets, 464 bytes)
num pkts bytes target prot opt in out source destination
增加规则
#允许所有
iptables -P INPUT ACCEPT
#清空所有默认规则**若未`iptables -P INPUT ACCEPT`, 会清空默认的ssh**
iptables -F
#清空所有自定义规则
iptables -X
#所有计数器归0
iptables -Z
#允许来自于lo接口的数据包(本地访问),若无此,很多服务比如ypserv无法启动
iptables -A INPUT -i lo -j ACCEPT
#-A代表缀加,-I 插入带开头, -i网卡 -j操作
#允许来自于内网网卡ens33的数据包(集群访问)
iptables -A INPUT -i ens33 -j ACCEPT
#允许11.11.11.1访问所有端口
iptables -I INPUT -s 11.11.11.1 -j ACCEPT
#允许11.11.11.0/24网段ip访问所有端口(集群操作)
iptables -I INPUT -s 11.11.11.0/24 -j ACCEPT
#开放22端口
iptables -A INPUT -p tcp --dport 22 -j ACCEPT
#开放21端口(FTP)
iptables -A INPUT -p tcp --dport 21 -j ACCEPT
#开放80端口(HTTP)
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
#开放443端口(HTTPS)
iptables -A INPUT -p tcp --dport 443 -j ACCEPT
#允许ping
iptables -A INPUT -p icmp --icmp-type 8 -j ACCEPT
##允许接受本机请求之后的返回数据 RELATED,是为FTP设置的
#iptables -A INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
#其他入站一律丢弃
iptables -P INPUT DROP
#所有出站一律绿灯
iptables -P OUTPUT ACCEPT
#所有转发一律丢弃
iptables -P FORWARD DROP
保存生效
执行iptables
后立即生效,重启有效
systemctl enable iptables
service iptables save
pbs作业管理系统
管理节点mom01
程序都安装在/opt
, 管理节点和计算节点共享/opt
目录
注意在/etc/hosts
中HOSTNAME那一条,HOSTNAME一定要在其他域名前面
yum -y install libxml2-devel openssl-devel gcc gcc-c++ boost-devel libtool
./configure --prefix=/opt/CC19/torque6 --with-default-server=$HOSTNAME
#--prefix= 程序安装目录
#配置文件目录默认TORQUE_HOME=/var/spool/torque,可以指定为--with-server-home=/opt/CC19/torque6/share
make -j8
make install
#PATH
export TORQUE_HOME=/var/spool/torque
export TORQUE_ROOT=/opt/CC19/torque6
export PATH=$TORQUE_ROOT/bin:$TORQUE_ROOT/sbin:$PATH
#并把上述PATH添加到`/etc/profile`
#添加到系统服务
cp contrib/init.d/{pbs_{server,sched,mom},trqauthd} /etc/init.d/
systemctl daemon-reload
make packages
libtool --finish /opt/CC19/torque6/lib
初始化
./torque.setup root
hostname: mom01
Currently no servers active. Default server will be listed as active server. Error 15133
Active server name: mom01 pbs_server port is: 15001
trqauthd daemonized - port /tmp/trqauthd-unix
trqauthd successfully started
initializing TORQUE (admin: root)
You have selected to start pbs_server in create mode.
If the server database exists it will be overwritten.
do you wish to continue y/(n)?y
配置文件(计算节点)
vi $TORQUE_HOME/server_priv/nodes
# TORQUE_HOME具体位置由前面指定
#填入计算节点 np=核数 队列名, 如
mom01 np=4 regular
boy01 np=8 regular
配置regular队列
[root@mom01 torque-6.1.1.1]# qmgr
Max open servers: 9
Qmgr: creat queue regular queue_type=execution
Qmgr: set server default_queue=regular
Qmgr: set queue regular started=true
Qmgr: set queue regular enabled=true
Qmgr: set server scheduling=true
添加到开机自起
for i in trqauthd pbs_server pbs_sched ; do systemctl enable $i.service ; done
启动pbs服务
for i in trqauthd pbs_server pbs_sched ; do systemctl restart $i.service ; done
#检查是否启动完成
for i in trqauthd pbs_server pbs_sched ; do systemctl status $i.service ; done
查看节点信息(注:此处还未配置计算节点mom01, boy01,所以是down状态)
[root@mom01 torque-6.1.1.1]# qnodes -l all
mom01 down
boy01 down
计算节点mom01,boy01
注:已配置NIS,NFS
计算节点mom01(单机pbs)
#修改配置文件
[root@mom01 torque-6.1.1.1]# vi $TORQUE_HOME/mom_priv/config
#加入下面两行
$pbsserver mom01
$logevent 255
#启动服务
#开机启动
systemctl enable pbs_mom.service
[root@mom01 torque-6.1.1.1]# systemctl restart pbs_mom.service
[root@mom01 torque-6.1.1.1]# systemctl status pbs_mom.service
● pbs_mom.service - TORQUE pbs_mom daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_mom.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-09-17 13:21:36 CST; 6s ago
Process: 75096 ExecStop=/bin/bash -c for i in {1..5}; do kill -0 $MAINPID &>/dev/null || exit 0; /opt/CC19/torque6/sbin/momctl -s && exit; sleep 1; done (code=exited, status=0/SUCCESS)
Main PID: 75102 (pbs_mom)
Tasks: 11
CGroup: /system.slice/pbs_mom.service
└─75102 /opt/CC19/torque6/sbin/pbs_mom -F -d /var/spool/torque
Sep 17 13:21:36 mom01 systemd[1]: Started TORQUE pbs_mom daemon.
Sep 17 13:21:36 mom01 systemd[1]: Starting TORQUE pbs_mom daemon...
回到管理节点查看:mom01已上线
[root@mom01 torque-6.1.1.1]# qnodes -l all
mom01 free
boy01 down
计算节点boy01(集群)
登陆计算节点boy01,注:已配置NIS,NFS
cd /opt/sourcecode/torque-6.1.1.1/
./torque-package-clients-linux-x86_64.sh --install
./torque-package-mom-linux-x86_64.sh --install
#修改配置文件
[root@boy01 torque-6.1.1.1]# vi $TORQUE_HOME/mom_priv/config
#加入下面两行
$pbsserver mom01
$logevent 255
#启动服务
[root@boy01 torque-6.1.1.1]# systemctl restart pbs_mom.service
[root@boy01 torque-6.1.1.1]# systemctl status pbs_mom.service
回到管理节点查看
[root@mom01 torque-6.1.1.1]# qnodes -l all
mom01 free
boy01 free
增加计算节点,修改管理节点和计算节点配置文件重启即可
交作业
[c1@mom01 ~]$ rm result
[c1@mom01 ~]$ cat run.pbs
#PBS -N test
#PBS -l nodes=1:ppn=4
#PBS -l walltime=1:00:00:00
#PBS -q regular
#PBS -V
#PBS -S /bin/bash
#PBS -o test.out
#
cd $PBS_O_WORKDIR
echo "hello"
echo $HOSTNAME >> result
ping -c 4 11.11.11.100
sleep 60
[c1@mom01 ~]$ qsub run.pbs
3.mom01
[c1@mom01 ~]$ qsub run.pbs
4.mom01
[c1@mom01 ~]$ qstat
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
3.mom01 test c1 0 R regular
4.mom01 test c1 0 R regular
[c1@mom01 ~]$ qstat -f 3 | grep host
exec_host = mom01/0-3
submit_host = mom01
[c1@mom01 ~]$ qstat -f 4 | grep host
exec_host = boy01/0-3
submit_host = mom01
#查看结果
[c1@mom01 ~]$ cat result
mom01
boy01
pbs出错时 重启pbs服务,查看pbs_xxx status,找错误原因
配置文件写错
[root@mom01 torque-6.1.1.1]# qnodes
Unable to communicate with mom01(11.11.11.100)
Cannot connect to specified server host 'mom01'.
qnodes: cannot connect to server mom01, error=111 (Connection refused)
找到错误原因,解决,启动服务
[root@mom01 torque-6.1.1.1]# service pbs_server status
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Mon 2019-09-16 23:19:02 CST; 2s ago
Process: 30495 ExecStart=/opt/CC19/torque6/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, status=3)
Main PID: 30495 (code=exited, status=3)
Sep 16 23:19:02 mom01 systemd[1]: Started TORQUE pbs_server daemon.
Sep 16 23:19:02 mom01 systemd[1]: Starting TORQUE pbs_server daemon...
Sep 16 23:19:02 mom01 PBS_Server[30495]: LOG_ERROR::parse_node_name, invalid character in token "$logevent" on line 2
Sep 16 23:19:02 mom01 PBS_Server[30495]: LOG_ERROR::PBS_Server, pbsd_init failed
Sep 16 23:19:02 mom01 systemd[1]: pbs_server.service: main process exited, code=exited, status=3/NOTIMPLEMENTED
Sep 16 23:19:02 mom01 systemd[1]: Unit pbs_server.service entered failed state.
Sep 16 23:19:02 mom01 systemd[1]: pbs_server.service failed.
#错误原因是:/opt/CC19/torque6/server_priv/nodes 配置文件写错了
#删除配置文件中的 $logevent 255即可
[root@mom01 torque-6.1.1.1]# vi /opt/CC19/torque6/server_priv/nodes
[root@mom01 torque-6.1.1.1]# service pbs_server start
Starting pbs_server (via systemctl): [ OK ]
[root@mom01 torque-6.1.1.1]# service pbs_server status
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; disabled; vendor preset: disabled)
Active: active (running) since Mon 2019-09-16 23:19:26 CST; 2s ago
Main PID: 30565 (pbs_server)
Tasks: 10
CGroup: /system.slice/pbs_server.service
└─30565 /opt/CC19/torque6/sbin/pbs_server -F -d /opt/CC19/torque6
Sep 16 23:19:26 mom01 systemd[1]: Started TORQUE pbs_server daemon.
Sep 16 23:19:26 mom01 systemd[1]: Starting TORQUE pbs_server daemon...
Sep 16 23:19:26 mom01 pbs_server[30565]: Assertion failed, bad pointer in link: file "req_select.c", line 401
Sep 16 23:19:26 mom01 pbs_server[30565]: Assertion failed, bad pointer in link: file "req_select.c", line 401
[root@mom01 torque-6.1.1.1]# qnodes
boy01
state = down
power_state = Running
np = 8
properties = regular
ntype = cluster
mom_service_port = 15002
mom_manager_port = 15003
端口占用导致
[root@mom01 torque-6.1.1.1]# service pbs_server status
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-09-17 09:52:41 CST; 4min 29s ago
Process: 3987 ExecStart=/opt/CC19/torque6/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, statu
s=3)
Main PID: 3987 (code=exited, status=3)
Sep 17 09:52:41 mom01 systemd[1]: Started TORQUE pbs_server daemon.
Sep 17 09:52:41 mom01 systemd[1]: Starting TORQUE pbs_server daemon...
Sep 17 09:52:41 mom01 pbs_server[3987]: pbs_server port already bound: Address already in use
Sep 17 09:52:41 mom01 systemd[1]: pbs_server.service: main process exited, code=exited, status=3/...ENTED
Sep 17 09:52:41 mom01 systemd[1]: Unit pbs_server.service entered failed state.
Sep 17 09:52:41 mom01 systemd[1]: pbs_server.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
原因pbs_server port already bound
#查看端口
[root@mom01 torque-6.1.1.1]# netstat -ntulp
Active Internet connections (only servers)
Proto Recv-Q Send-Q Local Address Foreign Address State PID/Program name
tcp 0 0 0.0.0.0:15001 0.0.0.0:* LISTEN 3745/pbs_server
#发现存在端口占据,杀掉
[root@mom01 torque-6.1.1.1]# kill -9 3745
[root@mom01 torque-6.1.1.1]# service pbs_server status
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; disabled; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-09-17 09:52:41 CST; 5min ago
Process: 3987 ExecStart=/opt/CC19/torque6/sbin/pbs_server -F -d $PBS_HOME $PBS_ARGS (code=exited, statu
s=3)
Main PID: 3987 (code=exited, status=3)
Sep 17 09:52:41 mom01 systemd[1]: Started TORQUE pbs_server daemon.
Sep 17 09:52:41 mom01 systemd[1]: Starting TORQUE pbs_server daemon...
Sep 17 09:52:41 mom01 pbs_server[3987]: pbs_server port already bound: Address already in use
Sep 17 09:52:41 mom01 systemd[1]: pbs_server.service: main process exited, code=exited, status=3/...ENTEDSep 17 09:52:41 mom
01 systemd[1]: Unit pbs_server.service entered failed state.
Sep 17 09:52:41 mom01 systemd[1]: pbs_server.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
[root@mom01 torque-6.1.1.1]# service pbs_server start
Starting pbs_server (via systemctl): [ OK ]
[root@mom01 torque-6.1.1.1]# service pbs_server status
● pbs_server.service - TORQUE pbs_server daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_server.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-09-17 09:58:43 CST; 1s ago
Sep 17 09:52:41 mom01 systemd[1]: Started TORQUE pbs_server daemon. Sep 17 09:52:41 mom01 systemd[1]: Starting TORQUE pbs_server daemon...
添加计算节点
将管理节点添加到计算节点
systemctl daemon-reload 主节点,查看pbs_mom结果不同
[root@mom01 torque-6.1.1.1]# service pbs_mom status
pbs_mom running but subsys not locked [FAILED]
[root@mom01 torque-6.1.1.1]# systemctl status pbs_mom.service
● pbs_mom.service - TORQUE pbs_mom daemon
Loaded: loaded (/usr/lib/systemd/system/pbs_mom.service; disabled; vendor preset: disabled)
Active: active (running) since Tue 2019-09-17 10:11:11 CST; 1min 13s ago
Process: 5350 ExecStop=/bin/bash -c for i in {1..5}; do kill -0 $MAINPID &>/dev/null || exit 0; /opt/CC19/torque6/sbin/momctl -s
&& exit; sleep 1; done (code=exited, status=0/SUCCESS)
Main PID: 5369 (pbs_mom)
Tasks: 11
CGroup: /system.slice/pbs_mom.service
└─5369 /opt/CC19/torque6/sbin/pbs_mom -F -d /opt/CC19/torque6
Sep 17 10:11:11 mom01 systemd[1]: Started TORQUE pbs_mom daemon.
Sep 17 10:11:11 mom01 systemd[1]: Starting TORQUE pbs_mom daemon...
qnodes查看已经上线了
[root@mom01 torque-6.1.1.1]# qnodes -l all
boy01 down
mom01 free
共享文件与配置使得mom.lock冲突
[root@boy01 ~]# systemctl status pbs_mom.service
● pbs_mom.service - SYSV: pbs_mom is part of a batch scheduler
Loaded: loaded (/etc/rc.d/init.d/pbs_mom; bad; vendor preset: disabled)
Active: failed (Result: exit-code) since Tue 2019-09-17 10:17:43 CST; 40s ago
Docs: man:systemd-sysv-generator(8)
Process: 2657 ExecStart=/etc/rc.d/init.d/pbs_mom start (code=exited, status=1/FAILURE)
Sep 17 10:17:42 boy01 systemd[1]: Starting SYSV: pbs_mom is part of a batch scheduler...
Sep 17 10:17:43 boy01 pbs_mom[2673]: LOG_ERROR::Resource temporarily unavailable (11) in pbs_mom, cannot lock '/opt/CC19/torque6/mom_priv/mom.lock...om running
Sep 17 10:17:43 boy01 pbs_mom[2657]: Starting TORQUE Mom: cannot lock '/opt/CC19/torque6/mom_priv/mom.lock' - another mom running
Sep 17 10:17:43 boy01 pbs_mom[2657]: [FAILED]
Sep 17 10:17:43 boy01 systemd[1]: pbs_mom.service: control process exited, code=exited status=1
Sep 17 10:17:43 boy01 systemd[1]: Failed to start SYSV: pbs_mom is part of a batch scheduler.
Sep 17 10:17:43 boy01 systemd[1]: Unit pbs_mom.service entered failed state.
Sep 17 10:17:43 boy01 systemd[1]: pbs_mom.service failed.
Hint: Some lines were ellipsized, use -l to show in full.
错误原因**Starting TORQUE Mom: cannot lock ‘/opt/CC19/torque6/mom_priv/mom.lock’ - another mom running ** 修改配置
pbs命令
更多命令参考单机centos编译安装使用PBS torque
交换作业顺序
[chendq@node8 core165]$ mqstat
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
12.node8.localdomain test chendq 0 Q regular
13.node8.localdomain test chendq 0 Q regular
14.node8.localdomain test chendq 0 Q regular
15.node8.localdomain tdap chendq 0 Q regular
[chendq@node8 core165]$ qorder 15.node8.localdomain 12.node8.localdomain
[chendq@node8 core165]$ mqstat
Job ID Name User Time Use S Queue
------------------------- ---------------- --------------- -------- - -----
12.node8.localdomain test chendq 0 Q regular
13.node8.localdomain test chendq 0 Q regular
14.node8.localdomain test chendq 0 Q regular
15.node8.localdomain tdap chendq 14:30:44 R regular
备份
备份磁盘
dd if=/dev/sdb of=mbr.bin
另开窗口,查看进度
watch -n 5 killall -USR1 dd
克隆磁盘
dd if=/dev/sdb of=/dev/sdc
dd if=/dev/sdb3 of=/dev/sdc3
其他问题
NFS共享HOME目录后,添加了authorized_key,仍需要输入密码
因为计算节点没有关闭SELinux
[root@client01 cndaqiang]# vi /etc/selinux/config
#设置 SELINUX=disable
#立即生效
[root@client01 cndaqiang]# setenforce 0
本文首发于我的博客@cndaqiang.
本博客所有文章除特别声明外,均采用 CC BY-SA 4.0 协议 ,转载请注明出处!