HPC cluster
It is assumed here that you have already deployed a generic cluster of nodes (see the dedicated tutorial).
Vocabulary
The job scheduler is the conductor of the computational resources of the cluster. We will use the open source Slurm job scheduler here.
A job is a small script that contains instructions on how to execute the calculation program, together with information for the job scheduler (required job duration, how many resources are needed, etc.). When a user asks the job scheduler to execute a job, which is called submitting a job, the job enters the jobs queue. The job scheduler is then in charge of finding free computational resources matching the needs of the job, then launching the job and monitoring it during its execution. Note that the job scheduler manages all jobs to ensure maximum usage of the computational resources, which is why it will sometimes keep a job on hold in the queue for a long time, waiting for free resources. In return, after the user has submitted a job, the job scheduler provides a job ID that allows following the job state in the queue and during its execution.
Build packages
Let's now install the cluster job scheduler, Slurm.
First, we need to build packages. Grab Munge and Slurm sources. Munge will be used to handle authentication between Slurm daemons.
Note: beware, links may change over time, especially the Slurm one from SchedMD. You may need to update the URLs.
wget https://github.com/dun/munge/releases/download/munge-0.5.14/munge-0.5.14.tar.xz
dnf install bzip2-devel openssl-devel zlib-devel -y
wget https://github.com/dun.gpg
wget https://github.com/dun/munge/releases/download/munge-0.5.14/munge-0.5.14.tar.xz.asc
rpmbuild -ta munge-0.5.14.tar.xz
Now install munge, as it is needed to build Slurm:
cp /root/rpmbuild/RPMS/x86_64/munge-* /var/www/html/repositories/AlmaLinux/9/x86_64/extra/
createrepo /var/www/html/repositories/AlmaLinux/9/x86_64/extra/
dnf clean all
dnf install munge munge-libs munge-devel
Now build the Slurm packages:
wget https://download.schedmd.com/slurm/slurm-20.11.7.tar.bz2
dnf install munge munge-libs munge-devel
dnf install pam-devel readline-devel perl-ExtUtils-MakeMaker
dnf install mariadb mariadb-devel
rpmbuild -ta slurm-20.11.7.tar.bz2
cp /root/rpmbuild/RPMS/x86_64/slurm* /var/www/html/repositories/AlmaLinux/9/x86_64/extra/
createrepo /var/www/html/repositories/AlmaLinux/9/x86_64/extra/
dnf clean all
The Slurm controller daemon is called slurmctld, while the daemon running on compute nodes is called slurmd. On the "submitter" node, no daemon except munge is required.
Tip: if anything goes wrong with Slurm, proceed as follows:
- Ensure time is exactly the same on all nodes. If time differs, munge-based authentication will fail.
- Ensure the munge daemon is started, and that the munge key is the same on all hosts (compare md5sum, for example).
- Stop the slurmctld and slurmd daemons, and start them manually in two different shells in debug + verbose mode: run slurmctld -D -vvvvvvv in shell 1 on the controller server, and slurmd -D -vvvvvvv in shell 2 on a compute node. The output should help you understand the issue.
Controller
Install the needed munge packages.
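For example, assuming the internal repository containing the munge packages built above is configured on this node:
dnf install munge munge-libs -y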
And generate a munge key.
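One simple way, as a sketch (the mungekey tool shipped with MUNGE 0.5.14 can also be used), is to generate 1024 random bytes and restrict the file permissions:
dd if=/dev/urandom of=/etc/munge/munge.key bs=1 count=1024
chown munge:munge /etc/munge/munge.key
chmod 400 /etc/munge/munge.key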
We will spread this key over all servers of the cluster.
Let's start and enable the munge daemon.
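As will be done on the compute nodes below, this is a standard systemd operation:
systemctl start munge
systemctl enable munge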
Now install the Slurm controller (slurmctld) packages:
dnf install slurm slurm-slurmctld -y
groupadd -g 567 slurm
useradd -m -c "Slurm workload manager" -d /etc/slurm -u 567 -g slurm -s /bin/false slurm
mkdir /etc/slurm
mkdir /var/log/slurm
mkdir -p /var/spool/slurmd/StateSave
chown -R slurm:slurm /var/log/slurm
chown -R slurm:slurm /var/spool/slurmd
Let's create a very minimal Slurm configuration.
Create the file /etc/slurm/slurm.conf with the following content:
# Documentation:
# https://slurm.schedmd.com/slurm.conf.html
## Controller
ClusterName=valhalla
ControlMachine=odin
## Authentication
SlurmUser=slurm
AuthType=auth/munge
CryptoType=crypto/munge
## Files path
StateSaveLocation=/var/spool/slurmd/StateSave
SlurmdSpoolDir=/var/spool/slurmd/slurmd
SlurmctldPidFile=/var/run/slurmctld.pid
SlurmdPidFile=/var/run/slurmd.pid
## Logging
SlurmctldDebug=5
SlurmdDebug=5
## We don't want a node to go back in pool without sys admin acknowledgement
ReturnToService=0
## Using pmi/pmi2/pmix interface for MPI
MpiDefault=pmi2
## Basic scheduling based on nodes
SchedulerType=sched/backfill
SelectType=select/linear
## Nodes definition
NodeName=valkyrie01 Procs=1 # Add more procs if your node has more CPU cores
NodeName=valkyrie02 Procs=1
## Partitions definition
PartitionName=all MaxTime=INFINITE State=UP Default=YES Nodes=valkyrie01,valkyrie02
Also create the file /etc/slurm/cgroup.conf.
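A minimal content, as a sketch (the exact parameters here are an assumption and should be tuned to your needs), could be:
CgroupAutomount=yes
ConstrainCores=no
ConstrainRAMSpace=no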
And start the Slurm controller.
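For example, with systemd (the slurmctld service is installed by the slurm-slurmctld package):
systemctl start slurmctld
systemctl enable slurmctld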
Using the sinfo command, you should now see the cluster started, with both compute nodes down for now.
Compute nodes
On both valkyrie01 and valkyrie02 nodes, install munge the same way as on the controller.
Ensure the munge key generated on the controller node is spread to each client. From odin, copy the file:
clush -w valkyrie01,valkyrie02 --copy /etc/munge/munge.key --dest /etc/munge/munge.key
clush -bw valkyrie01,valkyrie02 chown munge:munge /etc/munge/munge.key
And start munge on each compute node:
clush -bw valkyrie01,valkyrie02 systemctl start munge
clush -bw valkyrie01,valkyrie02 systemctl enable munge
Now, on each compute node, install the needed slurmd packages:
clush -bw valkyrie01,valkyrie02 dnf clean all
clush -bw valkyrie01,valkyrie02 dnf install slurm slurm-slurmd -y
Now create the slurm user and directories on the compute nodes, and spread the same Slurm configuration files from odin to each compute node:
clush -bw valkyrie01,valkyrie02 groupadd -g 567 slurm
clush -bw valkyrie01,valkyrie02 'useradd -m -c "Slurm workload manager" -d /etc/slurm -u 567 -g slurm -s /bin/false slurm'
clush -bw valkyrie01,valkyrie02 mkdir /etc/slurm
clush -bw valkyrie01,valkyrie02 mkdir /var/log/slurm
clush -bw valkyrie01,valkyrie02 mkdir -p /var/spool/slurmd/slurmd
clush -bw valkyrie01,valkyrie02 chown -R slurm:slurm /var/log/slurm
clush -bw valkyrie01,valkyrie02 chown -R slurm:slurm /var/spool/slurmd
clush -w valkyrie01,valkyrie02 --copy /etc/slurm/slurm.conf --dest /etc/slurm/slurm.conf
clush -w valkyrie01,valkyrie02 --copy /etc/slurm/cgroup.conf --dest /etc/slurm/cgroup.conf
And start the slurmd service on each compute node:
clush -bw valkyrie01,valkyrie02 systemctl start slurmd
clush -bw valkyrie01,valkyrie02 systemctl enable slurmd
And simply test that the cluster works. sinfo now shows that one or two nodes are idle, and srun allows launching a basic job:
[root@odin ~]# sinfo
PARTITION AVAIL TIMELIMIT NODES STATE NODELIST
all* up infinite 1 unk* valkyrie02
all* up infinite 1 idle valkyrie01
[root@odin ~]# srun -N 1 hostname
valkyrie01.cluster.local
[root@odin ~]#
Submitter
The last step to deploy Slurm is to install the login node, heimdall, that will act as a submitter. Users should submit jobs from here.
A Slurm submitter only needs the configuration files and an active munge daemon.
Install munge the same way as on the controller.
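For example, assuming the internal repository is also configured on heimdall:
dnf install munge munge-libs -y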
Ensure the munge key generated on the controller node is copied here.
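For example, from odin (assuming root SSH access to heimdall):
scp /etc/munge/munge.key heimdall:/etc/munge/munge.key
ssh heimdall chown munge:munge /etc/munge/munge.key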
And start munge on heimdall.
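On heimdall, the same systemd commands as before apply:
systemctl start munge
systemctl enable munge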
Now install the minimal Slurm packages.
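The base slurm package should be enough here, since only the client commands (sinfo, squeue, srun, sbatch) are needed (again assuming the internal repository is configured on heimdall):
dnf install slurm -y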
Now again, spread the same Slurm configuration files from odin to heimdall:
scp /etc/slurm/slurm.conf heimdall:/etc/slurm/slurm.conf
scp /etc/slurm/cgroup.conf heimdall:/etc/slurm/cgroup.conf
Nothing to start here; you can run the sinfo command from heimdall to ensure it works.
The Slurm cluster is now ready.
Submitting jobs
To execute calculations on the cluster, users will rely on Slurm to submit jobs and get calculation resources.
The submit commands are srun and sbatch.
Before using Slurm, it is important to understand how resources are requested. A calculation node is composed of multiple calculation cores. When asking for resources, it is possible to ask for the following (an illustration is given after this list):
- I want this many calculation processes (one per core), do what is needed to provide them to me -> use -n.
- I want this many nodes, I will handle the rest -> use -N.
- I want this many nodes and I want you to start this many processes per node -> use -N combined with --ntasks-per-node.
- I want this many calculation processes (one per core) and this many processes per node, calculate yourself the number of nodes required for that -> use --ntasks-per-node combined with -n.
- Etc.
-N, -n and --ntasks-per-node are complementary, and only two of them should be used at a time (Slurm will deduce the last one using the number of cores available on the compute nodes, as written in the Slurm configuration file).
-N specifies the total number of nodes to allocate to the job, -n the total number of processes to start, and --ntasks-per-node the number of processes to launch per node.
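For example, on this cluster made of two single-core compute nodes, the two following requests are equivalent (this is only an illustration; adapt the numbers to your own node sizes):
srun -N 2 --ntasks-per-node=1 hostname
srun -n 2 --ntasks-per-node=1 hostname
Both allocate two nodes and launch one process per node, for a total of two processes.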
Here, we will see the following submission ways:
- Submitting without a script
- Submitting a basic job script
- Submitting a serial job script
- Submitting an OpenMP job script
- Submitting an MPI job script
- A real life example: submitting a 3D animation render on the cluster, combining Blender and Slurm job arrays.
Submitting without a script
It is possible to launch a very simple job without a script, using the srun command directly and specifying the number of nodes required.
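For example, requesting a single node and running the hostname command on it, as already shown earlier:
srun -N 1 hostname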
The result can be, for example: valkyrie01
Using this method is a good way to test the cluster, compile code directly on compute nodes, or simply use the compute and memory capacity of a node for simple tasks.
Basic job script
To submit a basic job script, the user needs to use the sbatch command and provide it a script to execute, which contains some Slurm directives at the beginning. A very basic script is:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -o myjob.out.%j
#SBATCH -e myjob.err.%j
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --ntasks-per-node=1
#SBATCH -p all
#SBATCH --exclusive
#SBATCH -t 00:10:00
echo "###"
date
echo "###"
echo "Hello World ! "
hostname
sleep 30s
echo "###"
date
echo "###"
It is very important to understand Slurm parameters here:
* -J sets the name of the job
* -o sets the output file of the job
* -e sets the error output file of the job
* -p selects the partition to use (optional)
* --exclusive specifies that the nodes used must not be shared with other users (optional)
* -t specifies the maximum time allocated to the job (beware, the job will be killed if it goes beyond this limit). Requesting a small time allows the job to leave the waiting queue quickly; requesting a large time will force it to wait longer.
* -N, -n and --ntasks-per-node were already described.
To submit this script, the user needs to use sbatch.
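Assuming the script above was saved as myjob.sh (the file name is just an example):
sbatch myjob.sh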
If the script syntax is OK, sbatch will return a job ID number. This number can be used to follow the job progress, using squeue (assuming the job number is 91487).
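For example:
squeue -j 91487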
Check the status of the job under the ST column: PD (pending), R (running), CA (cancelled), CG (completing), CD (completed), F (failed), TO (timeout), and NF (node failure).
It is also possible to check all of the user's running jobs.
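For example, filtering on the submitting user:
squeue -u $USER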
In this example, execution results will be written by Slurm into myjob.out.91487 and myjob.err.91487.
Serial job
To launch a very basic serial job, use the following template as a script for sbatch:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -o myjob.out.%j
#SBATCH -e myjob.err.%j
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH -t 03:00:00
echo "############### START #######"
date
echo "############### "
/home/myuser/./myexecutable.exe
echo "############### END #######"
date
echo "############### "
OpenMP job
To launch an OpenMP job (multithreaded), assuming the code was compiled with OpenMP flags, use:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -o myjob.out.%j
#SBATCH -e myjob.err.%j
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH -t 03:00:00
## If compute node has 24 cores
export OMP_NUM_THREADS=24
## If needed, to be tuned to needs
export OMP_SCHEDULE="dynamic, 100"
echo "############### START #######"
date
echo "############### "
/home/myuser/./myparaexecutable.exe
echo "############### END #######"
date
echo "############### "
Note that it is assumed here that a node has 24 cores.
MPI job
To submit an MPI job, assuming the code was parallelized and compiled with MPI, use the following (note the srun command, which replaces mpirun):
#!/bin/bash
#SBATCH -J myjob
#SBATCH -o myjob.out.%j
#SBATCH -e myjob.err.%j
#SBATCH -N 4
#SBATCH --ntasks-per-node=24
#SBATCH --exclusive
#SBATCH -t 03:00:00
echo "############### START #######"
date
echo "############### "
srun /home/myuser/./mympiexecutable.exe
echo "############### END #######"
date
echo "############### "
srun will act as mpirun, but automatically provides all the arguments already tuned for the cluster.
Real life example with Blender job
Blender animations/movies are rendered using CPUs or GPUs. In this tutorial, we will focus on CPU rendering since we do not have GPUs (if you have some, lucky you).
We will render an animation of 40 frames.
We could create a simple job asking Blender to render this animation, but Blender would then use a single compute node. We have a cluster at our disposal, so let's take advantage of it.
We will use Slurm job arrays (so, an array of jobs) to split these 40 frames into chunks of 5 frames. Each chunk will be a unique job. Using this method, we will use all available compute nodes of our small cluster.
Note that 5 is an arbitrary number, and this depends on how difficult each frame is to render. If a single frame takes 10 minutes to render, you can create chunks of 1 frame. If, on the other hand, each frame takes 10 s to render, it is better to group frames into chunks, since Blender has a "startup time" for each new chunk.
First download Blender and the demo:
wget https://download.blender.org/demo/geometry-nodes/candy_bounce_geometry-nodes_demo.blend
wget https://ftp.nluug.nl/pub/graphics/blender/release/Blender2.93/blender-2.93.1-linux-x64.tar.xz
Extract Blender into /software and copy the demo file into /home:
cp candy_bounce_geometry-nodes_demo.blend /home
tar xJvf blender-2.93.1-linux-x64.tar.xz -C /software
Now let's create the job file. Create the file /home/blender_job.job with the following content:
#!/bin/bash
#SBATCH -J myjob
#SBATCH -o myjob.out.%j
#SBATCH -e myjob.err.%j
#SBATCH -N 1
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive
#SBATCH -t 01:00:00
set -x
# We set chunk size to 5 frames
chunk_size=5
# We calculate the frame span for each job depending on SLURM_ARRAY_TASK_ID
range_min=$(((SLURM_ARRAY_TASK_ID-1)*chunk_size+1))
range_max=$(((SLURM_ARRAY_TASK_ID)*chunk_size))
# We include blender binary into our PATH
export PATH=/software/blender-2.93.1-linux-x64/:$PATH
# We start the render
# -b is the input blender file
# -o is the output target folder, with the output file name format
# -F is the output format
# -f specifies the frame range
# -noaudio is self-explanatory
# IMPORTANT: Blender arguments must be given in that specific order.
eval blender -b /home/candy_bounce_geometry-nodes_demo.blend -o /home/frame##### -F PNG -f $range_min..$range_max -noaudio
# Note: if you have issues with default engine, try using CYCLES. Slower.
# eval blender -b /home/candy_bounce_geometry-nodes_demo.blend -E CYCLES -o /home/frame##### -F PNG -f $range_min..$range_max -noaudio
This job file will be executed for each job of the array. Since we have 40 frames to render and we create chunks of 5 frames, we need to ask Slurm to create a job array of 40/5 = 8 jobs.
Launch the array of jobs.
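The --array range matches the 8 chunks computed above; each array task receives its own SLURM_ARRAY_TASK_ID:
sbatch --array=1-8 /home/blender_job.job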
If all goes well, using the squeue command, you should be able to see the jobs currently running, and the ones currently pending for resources.
You can follow jobs by watching their output files (refreshed by Slurm regularly). After a few seconds or minutes, depending on your hardware, you should see the first animation frames appear as PNG images in the /home folder. You can use any kind of image/movie tool to generate a movie from these images.
This example shows how to use Slurm to create a Blender render farm.