Differences

This shows you the differences between two versions of the page.

--- slurm [2022/08/29 16:40]
vodrazka created
+++ slurm [2024/01/09 19:54] (current)
popel
@@ Line 1: / Line 1: @@
 ====== ÚFAL Grid Engine (LRC) ======
-LRC (Linguistic Research Cluster) is a name of ÚFAL's computational grid/cluster.
+LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].
+See Milan Straka's intro to Slurm (and Spark and possibly also the [[https://ufal.mff.cuni.cz/courses/npfl118#assignments|NPFL118 assignments]] if you want). Use the username=ufal and small linguistic password:
+  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-slurm.mp4
+  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-spark.mp4
+  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-assignments.mp4
+The following partitions (queues) are currently available for computing:
+===== Node list by partitions =====
+The naming convention is straightforward for CPU nodes: nodes in each group are numbered. For GPU nodes the format is [t]dll-**X**gpu**N**, where **X** is the total number of GPUs in the node and **N** enumerates the nodes with the given configuration.
+The prefix **t** is for nodes at Troja and **dll** stands for Deep Learning Laboratory.
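The naming scheme can be unpacked mechanically. The following is a minimal sketch: ''parse_gpu_node'' is a hypothetical helper (not a cluster tool), and the mapping from prefix to partition is an assumption read off the tables on this page.

```shell
#!/bin/bash
# Split a GPU node name of the form [t]dll-<X>gpu<N> into its parts.
# parse_gpu_node is a hypothetical helper written for illustration only.
parse_gpu_node() {
    local name rest partition
    name="$1"
    case "$name" in
        # Prefix "t" marks Troja nodes (partition gpu-troja in the tables below).
        tdll-*gpu*) partition="gpu-troja"; rest="${name#tdll-}" ;;
        # No prefix: assumed to belong to partition gpu-ms.
        dll-*gpu*)  partition="gpu-ms";    rest="${name#dll-}" ;;
        *) echo "not a GPU node name: $name" >&2; return 1 ;;
    esac
    # rest looks like "8gpu3": X is before "gpu", N is after it.
    echo "partition=$partition gpus=${rest%%gpu*} index=${rest#*gpu}"
}

parse_gpu_node tdll-8gpu3   # prints: partition=gpu-troja gpus=8 index=3
parse_gpu_node dll-10gpu1   # prints: partition=gpu-ms gpus=10 index=1
```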
+==== cpu-troja ====
+| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
+| achilles[1-8] | 32 | 2:8:2 | 128810 |
+| hector[1-8] | 32 | 2:8:2 | 128810 |
+| helena[1-8] | 32 | 2:8:2 | 128811 |
+| paris[1-8] | 32 | 2:8:2 | 128810 |
+| hyperion[2-8] | 64 | 2:16:2 | 257667 |
+==== cpu-ms ====
+| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
+| iridium | 16 | 2:4:2 | 515977 |
+| orion[1-8] | 40 | 2:10:2 | 128799 |
+==== gpu-troja ====
+| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
+| tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
+| tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
+==== gpu-ms ====
+| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
+| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
+| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
+| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
+| dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
+| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
+| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
+| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
+==== Submit nodes ====
+In order to submit a job you need to log in to one of the head nodes:
+   lrc1.ufal.hide.ms.mff.cuni.cz
+   lrc2.ufal.hide.ms.mff.cuni.cz
+   sol1.ufal.hide.ms.mff.cuni.cz
+   sol2.ufal.hide.ms.mff.cuni.cz
+   sol3.ufal.hide.ms.mff.cuni.cz
+   sol4.ufal.hide.ms.mff.cuni.cz
 ===== Basic usage =====
+==== Batch mode ====
+The core idea is that you write a batch script containing the commands you wish to run as well as a list of ''SBATCH'' directives specifying the resources or parameters that you need for your job.
+Then the script is submitted to the cluster with:
+<code>sbatch myJobScript.sh</code>
+Here is a simple working example:
+<code>
+#!/bin/bash
+#SBATCH -J helloWorld					  # name of job
+#SBATCH -p cpu-troja					  # name of partition or queue (default=cpu-troja)
+#SBATCH -o helloWorld.out				  # name of output file for this submission script
+#SBATCH -e helloWorld.err				  # name of error file for this submission script
+# run my job (some executable)
+sleep 5
+echo "Hello I am running on cluster!"
+</code>
+After submitting this simple code you should end up with the two files (''helloWorld.out'' and ''helloWorld.err'') in the directory where you called the ''sbatch'' command.
+Here is the list of other useful ''SBATCH'' directives:
+<code>
+#SBATCH -D /some/path/                        # change directory before executing the job
+#SBATCH -N 2                                  # number of nodes (default 1)
+#SBATCH --nodelist=node1,node2...             # execute on *all* the specified nodes (and possibly more)
+#SBATCH --cpus-per-task=4                     # number of cores/threads per task (default 1)
+#SBATCH --gres=gpu:1                          # number of GPUs to request (default 0)
+#SBATCH --mem=10G                             # request 10 gigabytes memory (per node, default depends on node)
+</code>
+If needed, Slurm can report to you by e-mail:
+<code>
+#SBATCH --mail-type=begin        # send email when job begins
+#SBATCH --mail-type=end          # send email when job ends
+#SBATCH --mail-type=fail         # send email if job fails
+#SBATCH --mail-user=<YourUFALEmailAccount>
+</code>
+As usual, the complete set of options can be found by typing:
+<code>
+man sbatch
+</code>
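The directives above can be combined in a single job script. The following is a sketch under stated assumptions: the job name ''trainDemo'' is hypothetical, and the fallback values are there only so that the script also runs outside Slurm for a quick dry test. ''SLURM_JOB_ID'' and ''SLURM_CPUS_PER_TASK'' are environment variables that Slurm sets inside a job.

```shell
#!/bin/bash
#SBATCH -J trainDemo                 # hypothetical job name
#SBATCH -p gpu-troja,gpu-ms          # accept a node from either GPU partition
#SBATCH --cpus-per-task=4            # threads for the single task
#SBATCH --gres=gpu:1                 # one GPU
#SBATCH --mem=10G                    # memory per node
#SBATCH --mail-type=END,FAIL         # several events as a comma-separated list
#SBATCH --mail-user=<YourUFALEmailAccount>

# Inside a job Slurm exports these variables; the fallbacks are an
# assumption made so the script can be tried with plain bash as well.
job_id="${SLURM_JOB_ID:-none}"
threads="${SLURM_CPUS_PER_TASK:-1}"

echo "job ${job_id} on $(hostname) using ${threads} thread(s)"
```

Submit it with ''sbatch'' as shown earlier. Since ''-o'' and ''-e'' are not given here, the output should end up in Slurm's default output file.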
+=== Rudolf's template ===
+The main point is for log files to have the job name and job id in them automatically.
+<code>
+#SBATCH -J RuRjob
+#SBATCH -o %x.%j.out
+#SBATCH -e %x.%j.err
+#SBATCH -p gpu-troja
+#SBATCH --gres=gpu:1
+#SBATCH --mem=16G
+#SBATCH --constraint="gpuram16G|gpuram24G"
+# Print each command to STDERR before executing (expanded), prefixed by "+ "
+set -o xtrace
+</code>
+==== Inspecting jobs ====
+In order to inspect all running jobs on the cluster use:
+<code>
+squeue
+</code>
+filter only jobs of user ''linguist'':
+<code>
+squeue -u linguist
+</code>
+filter only jobs on partition ''gpu-ms'':
+<code>
+squeue -p gpu-ms
+</code>
+filter jobs in specific state (see ''man squeue'' for list of valid job states):
+<code>
+squeue -t RUNNING
+</code>
+filter jobs running on a specific node:
+<code>
+squeue -w dll-3gpu1
+</code>
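The filters can also be combined in one call. A small sketch, assuming you want your own running jobs on ''gpu-ms''. The fallback user name ''linguist'' is just the example user from above, and the ''else'' branch exists only so that the snippet does something sensible on a machine without Slurm.

```shell
#!/bin/bash
# Show only my own running jobs on the gpu-ms partition.
# "linguist" is the example user from this page, used when $USER is unset.
user_name="${USER:-linguist}"

if command -v squeue >/dev/null 2>&1; then
    # The filters are combined: a job must match all of them to be listed.
    squeue -u "$user_name" -p gpu-ms -t RUNNING
else
    echo "squeue not found; run this on one of the submit nodes"
fi
```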
+==== Cluster info ====
+The command ''sinfo'' can give you useful information about the nodes available in the cluster. Here is a short list of examples.
+List available partitions (queues). The default partition is marked with ''*'':
+<code>
+sinfo
+</code>
+List detailed info about nodes:
+<code>
+sinfo -l -N
+</code>
+List nodes with some custom format info:
+<code>
+sinfo -N -o "%N %P %.11T %.15f"
+</code>
+=== CPU core allocation ===
+The minimal computing resource in SLURM is one CPU core. However, CPU count advertised by SLURM corresponds to the number of CPU threads.
+If you ask for 1 CPU core with ''-''''-cpus-per-task=1'', SLURM will allocate all threads of 1 CPU core.
+For example ''dll-8gpu1'' will allocate 2 threads since its ThreadsPerCore=2:
+<code>
+$ scontrol show node dll-8gpu1
+NodeName=dll-8gpu1 Arch=x86_64 CoresPerSocket=16
+   CPUAlloc=0 CPUTot=64 CPULoad=0.05                                               // CPUAlloc - allocated threads, CPUTot - total threads
+   AvailableFeatures=gpuram24G
+   ActiveFeatures=gpuram24G
+   Gres=gpu:nvidia_a30:8(S:0-1)
+   NodeAddr=10.10.24.63 NodeHostName=dll-8gpu1 Version=21.08.8-2
+   OS=Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-3 (Wed, 11 May 2022 07:57:51 +0200)
+   RealMemory=515838 AllocMem=0 FreeMem=507650 Sockets=2 Boards=1
+   CoreSpecCount=1 CPUSpecList=62-63                                               // CoreSpecCount - cores reserved for OS, CPUSpecList - list of threads reserved for system
+   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A          // ThreadsPerCore - count of threads for 1 CPU core
+   Partitions=gpu-ms
+   BootTime=2022-09-01T14:07:50 SlurmdStartTime=2022-09-02T13:54:05
+   LastBusyTime=2022-10-02T20:17:09
+   CfgTRES=cpu=64,mem=515838M,billing=64
+   AllocTRES=
+   CapWatts=n/a
+   CurrentWatts=0 AveWatts=0
+   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
+</code>
+In the example above you can see comments at all lines relevant to CPU allocation.
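The arithmetic behind this can be sketched in a few lines. Both functions are hypothetical helpers, and the rounding rule is an assumption generalized from the single-core case described above (requests are rounded up to whole cores).

```shell
#!/bin/bash
# Threads actually allocated when requesting N "CPUs" on a node with a given
# ThreadsPerCore. Assumption: requests are rounded up to whole cores.
threads_allocated() {
    local requested threads_per_core cores
    requested="$1"
    threads_per_core="$2"
    cores=$(( (requested + threads_per_core - 1) / threads_per_core ))
    echo $(( cores * threads_per_core ))
}

# Threads left for jobs after the cores reserved for the system
# (CoreSpecCount) are subtracted from CPUTot.
usable_threads() {
    local total spec_cores threads_per_core
    total="$1"
    spec_cores="$2"
    threads_per_core="$3"
    echo $(( total - spec_cores * threads_per_core ))
}

threads_allocated 1 2    # prints 2 (the dll-8gpu1 case above)
threads_allocated 3 2    # prints 4
usable_threads 64 1 2    # prints 62 (CPUTot=64, CoreSpecCount=1, ThreadsPerCore=2)
```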
+=== Priority ===
+When running ''srun'' or ''sbatch'', you can pass ''-q high/normal/low/preempt-low''. These represent priorities 300/200/100/100, with ''normal'' (200) being the default. Furthermore, the ''preempt-low'' QOS is actually preemptible: a job with normal or high QOS can interrupt your ''preempt-low'' job.
+The preemption has probably not been used by anyone yet. Some documentation about it is at https://slurm.schedmd.com/preempt.html. We use the REQUEUE regime: your job is killed (very likely with some signal, so you could watch for it and, for example, save a checkpoint, but currently I do not know any details) and then started again when resources are available.
+==== Interactive mode ====
+This mode can be useful for testing. You should use batch mode for any serious computation.
+You can use the **''srun''** command to get an interactive shell on an arbitrary node from the default partition (queue):
+<code>srun --pty bash</code>
+There are many more parameters available to use. For example:
+**To get an interactive CPU job with 64GB of reserved memory:**
+<code>srun -p cpu-troja,cpu-ms --mem=64G --pty bash</code>
+  * ''-p cpu-troja,cpu-ms'' restricts the job to nodes from these two partitions. If not specified, Slurm will use the default partition.
+  * ''-''''-mem=64G'' requires 64G of memory for the job
+**To get an interactive job with a single GPU of any kind:**
+<code>srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash</code>
+  * ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions
+  * ''-''''-gres=gpu:1'' requires 1 GPU
+<code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code>
+  * ''-p gpu-troja,gpu-ms'' require only nodes from these two partitions
+  * ''-''''-nodelist=tdll-3gpu1'' explicitly requires one specific node
+  * Note that e.g. ''-''''-nodelist=tdll-3gpu[1-4]'' would execute 4 jobs on **all** the four machines ''tdll-3gpu[1-4]''. The documentation says "The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements." I am not aware of any [[https://stackoverflow.com/a/37555321/3310232|simple way]] to specify that **any** of the listed nodes can be used, i.e. an equivalent of SGE ''-q '*@hector[14]'''.
+  * ''-''''-gres=gpu:2'' requires 2 GPUs
+<code>srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash</code>
+  * ''-''''-constraint="gpuram48G|gpuram40G"'' only consider nodes that have either ''gpuram48G'' or ''gpuram40G'' feature defined
+\\
+**Unexpected Behavior of ''srun -c1''**
+When you execute a command using ''srun'' and pass ''-c1'' like
+<code>srun -c1 date</code>
+then the command is actually executed **twice in parallel**. To avoid it, you have to either **remove the ''-c1''** or also **add explicit ''-n1''.**
+==== Delete Job ====
+<code>scancel <job_id> </code>
+<code>scancel -n <job_name> </code>
+To see all the available options type:
+<code>man scancel</code>
+==== Basic commands on cluster machines ====
+  lspci
+    # is any such hardware there?
+  nvidia-smi
+    # more details, incl. running processes on the GPU
+    # nvidia-* are typically located in /usr/bin
+  watch nvidia-smi
+    # For monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
+    # You can also use nvidia-smi -l TIME
+  nvcc --version
+    # this should tell CUDA version
+    # nvcc is typically installed in /usr/local/cuda/bin/
+  theano-test
+    # does it actually do anything useful? :-)
+    # theano-* are typically located in /usr/local/bin/
+  /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
+    # shows CUDA capability etc.
+  ssh dll1; ~popel/bin/gpu_allocations
+    # who occupies which card on a given machine
+===== See also =====
+https://www.msi.umn.edu/slurm/pbs-conversion


Institute of Formal and Applied Linguistics Wiki
