
Single-Machine Slurm Installation on Ubuntu 18.04 / Mint 19

2020-02-24
cndaqiang

References

Installing Slurm Workload Manager & Job Scheduler on Ubuntu 18.04

For CentOS, see: Setting up the Slurm job management system on a CentOS 7 cluster

Installation

sudo apt install slurm-wlm slurm-wlm-doc -y 

Check the version and node information

cndaqiang@girl:~$ slurmd -V
slurm-wlm 17.11.2
cndaqiang@girl:~$ slurmd -C
NodeName=girl CPUs=4 Boards=1 SocketsPerBoard=1 CoresPerSocket=2 ThreadsPerCore=2 RealMemory=15939
UpTime=0-08:01:18
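
The first line of this output is a ready-made node description for the COMPUTE NODES section of slurm.conf (I simplify it further below). As a small sketch, it can be saved for later; the file name is just my own choice:

# Save the detected hardware description for use in slurm.conf later
slurmd -C | head -n 1 > ~/node-hardware.txt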

Configuration

Edit the configuration file

rm /etc/slurm-llnl/slurm.conf 
vi /etc/slurm-llnl/slurm.conf 

Based on https://www.linuxwave.info/2019/10/installing-slurm-workload-manager-job.html:

Open /usr/share/doc/slurm-wlm-doc/html/configurator.easy.html in a browser, fill in the various settings and paths, generate the configuration, and copy it into /etc/slurm-llnl/slurm.conf. Note:

Some of the settings I changed from the defaults:
- Make sure ControlMachine and NodeName are set to the system's hostname (see the check after this list)
- State Preservation: set StateSaveLocation to /var/spool/slurm-llnl
- Process tracking: use Pgid instead of Cgroup
- Process ID logging: set this to /var/run/slurm-llnl/slurmctld.pid and /var/run/slurm-llnl/slurmd.pid
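
For the first item, a quick check that the hostname really matches what goes into ControlMachine and NodeName (girl in my case):

hostname -s   # should print the same name used for ControlMachine and NodeName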

My configuration file:

# slurm.conf file generated by configurator easy.html.
# Put this file on all nodes of your cluster.
# See the slurm.conf man page for more information.
#
ControlMachine=girl
#ControlAddr=
#
#MailProg=/bin/mail
MpiDefault=none
#MpiParams=ports=#-#
ProctrackType=proctrack/pgid
ReturnToService=1
SlurmctldPidFile=/var/run/slurm-llnl/slurmctld.pid
#SlurmctldPort=6817
SlurmdPidFile=/var/run/slurm-llnl/slurmd.pid
#SlurmdPort=6818
SlurmdSpoolDir=/var/spool/slurmd
SlurmUser=slurm
#SlurmdUser=root
StateSaveLocation=/var/spool/slurm-llnl
SwitchType=switch/none
TaskPlugin=task/none
#
#
# TIMERS
#KillWait=30
#MinJobAge=300
#SlurmctldTimeout=120
#SlurmdTimeout=300
#
#
# SCHEDULING
FastSchedule=1
SchedulerType=sched/backfill
SelectType=select/linear
#SelectTypeParameters=
#
#
# LOGGING AND ACCOUNTING
AccountingStorageType=accounting_storage/none
ClusterName=cluster
#JobAcctGatherFrequency=30
JobAcctGatherType=jobacct_gather/none
#SlurmctldDebug=3
#SlurmctldLogFile=
#SlurmdDebug=3
#SlurmdLogFile=
#
#
# COMPUTE NODES
NodeName=girl CPUs=4 State=UNKNOWN
PartitionName=debug Nodes=girl Default=YES MaxTime=INFINITE State=UP

After editing /etc/slurm-llnl/slurm.conf, create the directories referenced in the configuration and set their ownership accordingly:

rm -rf  /var/spool/slurm-llnl
mkdir /var/spool/slurm-llnl
chown -R slurm.slurm /var/spool/slurm-llnl
rm -rf /var/run/slurm-llnl/
mkdir /var/run/slurm-llnl/
chown -R slurm.slurm /var/run/slurm-llnl/
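
Note that on Ubuntu /var/run lives on a tmpfs, so the PID directory created above disappears after a reboot. A sketch of one way to recreate it automatically, assuming systemd-tmpfiles is in use (the file /etc/tmpfiles.d/slurm.conf is my own choice, not part of the original guide):

# Recreate /var/run/slurm-llnl (mode 0755, owned by slurm) at every boot
echo 'd /var/run/slurm-llnl 0755 slurm slurm -' | sudo tee /etc/tmpfiles.d/slurm.conf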

Start the services

Start slurmd and slurmctld and enable them on boot:

systemctl start slurmd
systemctl enable slurmd
systemctl start slurmctld
systemctl enable slurmctld
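
To check that both daemons came up and the node is usable (standard Slurm commands; the node name girl comes from the configuration above):

systemctl status slurmd slurmctld
sinfo
# If the node is reported as down or drained, it can usually be returned to service with
scontrol update nodename=girl state=resume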

Commands to restart the daemons after changing the configuration:

systemctl restart  slurmctld
systemctl restart  slurmd

Usage

Submitting jobs

Job script:

#!/bin/bash
#
#SBATCH --job-name=manytdap
#SBATCH --output=tdap.out
#SBATCH -N 1
#SBATCH --ntasks-per-node=4
#SBATCH --time=00:30:00
#SBATCH -p debug



EXEC=/home/cndaqiang/work/siesta/siesta-4.1-b1/Obj/siesta

mpirun $EXEC input.fdf | tee result
# Note:
# By default the number of cores mpirun uses is ntasks-per-node * N
# It can also be set explicitly with -np
#mpirun -np 4  $EXEC -i input.in | tee result
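
Instead of hard-coding the core count, the mpirun line can also take it from Slurm's environment, so the script keeps working when the #SBATCH resources change (a sketch using SLURM_NTASKS, one of the variables listed further down in this post):

# Let the allocation decide the MPI task count instead of hard-coding it
mpirun -np $SLURM_NTASKS $EXEC input.fdf | tee result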

Submit the job and check the queue:

cndaqiang@girl:~/work/slurm$ sbatch run-slurm.sh
Submitted batch job 21
cndaqiang@girl:~/work/slurm$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                21     debug manytdap cndaqian  R       0:01      1 girl
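
A submitted job can be removed from the queue with scancel (job ID 21 is the one from the example above):

scancel 21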

Submit to a specific node

sbatch -w hpcc148 run-wannier.sh

Controlling Slurm with PBS commands

sudo apt install slurm-wlm-torque

After installing this package, PBS commands such as qstat and qsub are available.
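
For example (run-pbs.sh is just a placeholder script name; the wrappers translate the PBS options into Slurm ones):

qstat
qsub run-pbs.sh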

More settings

Allow multiple jobs on a single node

Thanks to myth for this solution.
Change the SchedulerType / SelectType settings in /etc/slurm-llnl/slurm.conf as follows:

#SchedulerType=sched/backfill
#SelectType=select/linear
SelectType=select/cons_res
SelectTypeParameters=CR_CPU

Restart Slurm:

systemctl restart  slurmctld
systemctl restart  slurmd

See Sharing Consumable Resources. With this change my four-core laptop can run four jobs at once:

(python37) cndaqiang@girl:~/work/slurm$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
               125     debug manytdap cndaqian PD       0:00      1 (Resources)
               121     debug manytdap cndaqian  R       0:08      1 girl
               122     debug manytdap cndaqian  R       0:05      1 girl
               123     debug manytdap cndaqian  R       0:05      1 girl
               124     debug manytdap cndaqian  R       0:05      1 girl

Exclusive use of a node

When Slurm is managed by the administrator and an ordinary user wants exclusive use of a node, add the following to the job script:

#SBATCH --exclusive
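
The same flag can also be given on the sbatch command line instead of inside the script:

sbatch --exclusive run-slurm.sh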

Other Slurm usage

Modifying job attributes after submission

See the SLURM usage reference.

scontrol update jobid=JOBID ...
# For example, the following attributes can be changed:
account=<account>                      mintmpdisknode=<megabytes>             reqnodelist=<nodes>
conn-type=<type>                       name>                                  reqsockets=<count>
contiguous=<yes|no>                    name=<name>                            reqthreads=<count>
dependency=<dependency_list>           nice[=delta]                           requeue=<0|1>
eligibletime=yyyy-mm-dd                nodelist=<nodes>                       reservationname=<name>
excnodelist=<nodes>                    numcpus=<min_count[-max_count]>        rotate=<yes|no>
features=<features>                    numnodes=<min_count[-max_count]>       shared=<yes|no>
geometry=<geo>                         numtasks=<count>                       starttime=yyyy-mm-dd
gres=<list>                            or                                     switches=<count>[@<max-time-to-wait>]
licenses=<name>                        partition=<name>                       timelimit=[d-]h:m:s
mincpusnode=<count>                    priority=<number>                      userid=<UID>
minmemorycpu=<megabytes>               qos=<name>                             wckey=<key>
minmemorynode=<megabytes>              reqcores=<count>

Examples:

# Change the time limit
scontrol update jobid=33932 timelimit=6:00:00
# Change the partition
scontrol update jobid=33932 partition=<name>
# Change several jobs at once
scontrol update jobid=580,581 partition=long

Once a job is running, an ordinary user cannot change its time limit, but root can, and root can even set a limit longer than the partition's maximum.
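
Two related scontrol subcommands that are useful together with update are hold and release, which keep a pending job from starting and let it go again (job ID reused from the example above):

scontrol hold 33932
scontrol release 33932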

View job information

scontrol show job  67139
scontrol show job  671399 | grep WorkDir
   WorkDir=/public/home/cndaqiang/work/tdap/blocksize/origin/Si32/Si32-12

Example output for a job that requested one node with two cores:

(python37) cndaqiang@girl:~/work/slurm$ scontrol show job 127
JobId=127 JobName=manytdap
   UserId=cndaqiang(1000) GroupId=cndaqiang(1000) MCS_label=N/A
   Priority=4294901749 Nice=0 Account=(null) QOS=(null)
   JobState=RUNNING Reason=None Dependency=(null)
   Requeue=1 Restarts=0 BatchFlag=1 Reboot=0 ExitCode=0:0
   RunTime=00:00:05 TimeLimit=00:30:00 TimeMin=N/A
   SubmitTime=2020-06-13T18:13:00 EligibleTime=2020-06-13T18:13:00
   StartTime=2020-06-13T18:13:01 EndTime=2020-06-13T18:43:01 Deadline=N/A
   PreemptTime=None SuspendTime=None SecsPreSuspend=0
   LastSchedEval=2020-06-13T18:13:01
   Partition=debug AllocNode:Sid=girl:30999
   ReqNodeList=(null) ExcNodeList=(null)
   NodeList=girl
   BatchHost=girl
   NumNodes=1 NumCPUs=2 NumTasks=2 CPUs/Task=1 ReqB:S:C:T=0:0:*:*
   TRES=cpu=2,node=1,billing=2
   Socks/Node=* NtasksPerN:B:S:C=2:0:*:* CoreSpec=*
   MinCPUsNode=2 MinMemoryNode=0 MinTmpDiskNode=0
   Features=(null) DelayBoot=00:00:00
   Gres=(null) Reservation=(null)
   OverSubscribe=OK Contiguous=0 Licenses=(null) Network=(null)
   Command=/home/cndaqiang/work/slurm/run-slurm.sh
   WorkDir=/home/cndaqiang/work/slurm
   StdErr=/home/cndaqiang/work/slurm/run-slurm.sh.e127
   StdIn=/dev/null
   StdOut=/home/cndaqiang/work/slurm/run-slurm.sh.o127
   Power=

Slurm environment variables

After a job starts, Slurm adds a number of environment variables; inside the job script you can list them with the following command:

env | grep SLURM
SLURM_CHECKPOINT_IMAGE_DIR=/var/slurm/checkpoint
SLURM_NODELIST=comput[110,123,130]
SLURM_JOB_NAME=slurm
SLURMD_NODENAME=comput110
SLURM_TOPOLOGY_ADDR=comput110
SLURM_NTASKS_PER_NODE=36
SLURM_PRIO_PROCESS=0
SLURM_NODE_ALIASES=(null)
SLURM_JOB_QOS=normal
SLURM_TOPOLOGY_ADDR_PATTERN=node
SLURM_NNODES=3
SLURM_JOBID=850925
SLURM_NTASKS=108
SLURM_TASKS_PER_NODE=36(x3)
SLURM_WORKING_CLUSTER=cluster_admin1:10.10.10.249:6817:8448
SLURM_JOB_ID=850925
SLURM_JOB_USER=cndaqiang
SLURM_JOB_UID=1011
SLURM_NODEID=0
SLURM_SUBMIT_DIR=/public/home/cndaqiang/work/slurm
SLURM_TASK_PID=176862
SLURM_NPROCS=108
SLURM_HOME=/opt/gridview/slurm
SLURM_CPUS_ON_NODE=36
SLURM_PROCID=0
SLURM_JOB_NODELIST=comput[110,123,130]
SLURM_LOCALID=0
SLURM_JOB_GID=100
SLURM_JOB_CPUS_PER_NODE=36(x3)
SLURM_CLUSTER_NAME=cluster_admin1
SLURM_GTIDS=0
SLURM_SUBMIT_HOST=login2
SLURM_JOB_PARTITION=regular
SLURM_JOB_ACCOUNT=sf10
SLURM_JOB_NUM_NODES=3
SLURM_MEM_PER_NODE=181673
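
These variables are handy inside the job script itself; a small sketch (variable names taken from the listing above):

# Move to the directory the job was submitted from and record where it runs
cd $SLURM_SUBMIT_DIR
echo "Job $SLURM_JOB_ID running on $SLURM_JOB_NODELIST with $SLURM_NTASKS tasks"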

sacctmgr show

Slurm installations that use an accounting database also support the following commands:

(python37) [cndaqiang@login3 ~]$ sacctmgr show 
No valid entity in list command
Input line must include "Account", "Association", "Cluster", "Configuration",
"Event", "Federation", "Problem", "QOS", "Resource", "Reservation",
"RunAwayJobs", "Stats", "Transaction", "TRES", "User", or "WCKey"
(python37) [cndaqiang@login3 ~]$ sacctmgr show Resource
      Name     Server     Type  Count % Allocated ServerType 
---------- ---------- -------- ------ ----------- ---------- 
(python37) [cndaqiang@login3 ~]$ sacctmgr show Stats
sacctmgr: RC:2002 Your user doesn't have privilege to perform this action
(python37) [cndaqiang@login3 ~]$ sacctmgr show qos              
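
The other entities in the list above can be queried the same way; for example (the output depends on how the site has set up accounting, and the user name here is mine):

sacctmgr show user cndaqiang
sacctmgr show association user=cndaqiang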

Update the configuration without restarting the services

scontrol reconfig

Request a specific node in the job script

#SBATCH -w comput6

Raising a job's priority

If the cluster does not limit how many jobs each user can run at the same time, one user who submits a large batch of jobs can leave everyone else's jobs waiting in the queue. In that case the other users' jobs can be moved forward by raising their priority:

#Check the highest priority among the batch-submitted jobs
[root@mgmt cndaqiang]# scontrol show job=8914 | grep Priority | grep QOS
   Priority=4294894516 Nice=0 Account=XXX QOS=normal
#Check the priority of the later user's job
[root@mgmt cndaqiang]# scontrol show job=8930 | grep Priority | grep QOS
   Priority=4294894500 Nice=0 Account=YYY QOS=normal
#Raise its priority
[root@mgmt cndaqiang]# scontrol update job=8930 Priority=4294894532 
[root@mgmt cndaqiang]# scontrol show job=8930 | grep Priority  | grep QOS
   Priority=4294894532 Nice=0 Account=YYY QOS=normal
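
If you only need to move one of your own jobs ahead of your other queued jobs, scontrol also has a top subcommand (whether ordinary users may use it depends on the cluster configuration):

scontrol top 8930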

This article was first published on my blog @cndaqiang.
Unless otherwise stated, all posts on this blog are licensed under CC BY-SA 4.0; please credit the source when reposting.


