Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

slurm [2022/08/31 13:51]
vodrazka [Batch mode]
slurm [2024/01/09 19:54] (current)
popel
Line 1: Line 1:
 ====== ÚFAL Grid Engine (LRC) ====== ====== ÚFAL Grid Engine (LRC) ======
  
-LRC (Linguistic Research Cluster) is name of ÚFAL's computational grid/cluster.+LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].
  
 +See Milan Straka's intro to Slurm (and Spark, and possibly also the [[https://ufal.mff.cuni.cz/courses/npfl118#assignments|NPFL118 assignments]] if you want). Use the username ''ufal'' and the small linguistic password:
 +
 +  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-slurm.mp4
 +  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-spark.mp4
 +  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-assignments.mp4
 +
 +The following partitions (queues) are currently available for computing:
 +
 +===== Node list by partitions =====
 +
 +The naming convention for CPU nodes is straightforward: the nodes in each group are numbered. GPU nodes are named [t]dll-**X**gpu**N**, where **X** is the total number of GPUs in the node and **N** is the index of the node among those with the same configuration.
 +The prefix **t** marks nodes located at Troja, and **dll** stands for Deep Learning Laboratory. 
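The naming scheme above can be taken apart mechanically. The following is a minimal sketch, not a cluster tool: ''parse_gpu_node'' is a hypothetical helper name, and the mapping of the ''t'' prefix to the ''gpu-troja'' partition (and of its absence to ''gpu-ms'') is read off the node tables on this page.

```shell
#!/bin/sh
# Sketch: split a GPU node name of the form [t]dll-<X>gpu<N> into its parts.
# parse_gpu_node is a hypothetical helper, not something installed on the cluster.
parse_gpu_node() {
    name=$1
    case $name in
        tdll-*gpu*) partition=gpu-troja; rest=${name#tdll-} ;;  # "t" prefix: node at Troja
        dll-*gpu*)  partition=gpu-ms;    rest=${name#dll-}  ;;  # no prefix: gpu-ms partition
        *) echo "not a GPU node name: $name" >&2; return 1 ;;
    esac
    gpus=${rest%%gpu*}   # text before "gpu": number of GPUs in the node (X)
    index=${rest#*gpu}   # text after "gpu": index of the node (N)
    echo "partition=$partition gpus=$gpus index=$index"
}

parse_gpu_node tdll-8gpu3   # prints: partition=gpu-troja gpus=8 index=3
parse_gpu_node dll-10gpu1   # prints: partition=gpu-ms gpus=10 index=1
```

The GPU count in the name says nothing about the GPU type; for that, see the Features and GPU type columns in the tables.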
 +==== cpu-troja ====
 +
 +| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
 +| achilles[1-8] | 32 | 2:8:2 | 128810 |
 +| hector[1-8] | 32 | 2:8:2 | 128810 |
 +| helena[1-8] | 32 | 2:8:2 | 128811 |
 +| paris[1-8] | 32 | 2:8:2 | 128810 |
 +| hyperion[2-8] | 64 | 2:16:2 | 257667 |
 +==== cpu-ms ====
 +
 +| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
 +| iridium | 16 | 2:4:2 | 515977 |
 +| orion[1-8] | 40 | 2:10:2 | 128799 |
 +==== gpu-troja ====
 +
 +| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
 +| tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
 +| tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
 +| tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
 +==== gpu-ms ====
 +
 +| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
 +| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
 +| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
 +| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
 +| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
 +| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
 +| dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
 +| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
 +| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
 +| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
 +
 +
 +==== Submit nodes ====
 +
 +
 +In order to submit a job you need to log in to one of the head nodes:
 +
 +   lrc1.ufal.hide.ms.mff.cuni.cz
 +   lrc2.ufal.hide.ms.mff.cuni.cz
 +   sol1.ufal.hide.ms.mff.cuni.cz
 +   sol2.ufal.hide.ms.mff.cuni.cz
 +   sol3.ufal.hide.ms.mff.cuni.cz
 +   sol4.ufal.hide.ms.mff.cuni.cz
 ===== Basic usage ===== ===== Basic usage =====
  
Line 17: Line 73:
 #!/bin/bash #!/bin/bash
 #SBATCH -J helloWorld   # name of job #SBATCH -J helloWorld   # name of job
-#SBATCH -p cpu-troja   # name of partition or queue+#SBATCH -p cpu-troja   # name of partition or queue (default=cpu-troja)
 #SBATCH -o helloWorld.out   # name of output file for this submission script #SBATCH -o helloWorld.out   # name of output file for this submission script
 #SBATCH -e helloWorld.err   # name of error file for this submission script #SBATCH -e helloWorld.err   # name of error file for this submission script
Line 28: Line 84:
 After submitting this simple code you should end up with the two files (''helloWorld.out'' and ''helloWorld.err'') in the directory where you called the ''sbatch'' command. After submitting this simple code you should end up with the two files (''helloWorld.out'' and ''helloWorld.err'') in the directory where you called the ''sbatch'' command.
  
-Here is the list of other useful ''SBATCH'' directives ):+Here is the list of other useful ''SBATCH'' directives:
 <code> <code>
 +#SBATCH -D /some/path/                        # change directory before executing the job   
 #SBATCH -N 2                                  # number of nodes (default 1) #SBATCH -N 2                                  # number of nodes (default 1)
-#SBATCH --nodelist=node1,node2...             # required node, or comma separated list of required nodes +#SBATCH --nodelist=node1,node2...             # execute on *all* the specified nodes (and possibly more) 
-#SBATCH -                                 # number of cores/threads per task (default 1)+#SBATCH --cpus-per-task=<N>                 # number of cores/threads per task (default 1)
 #SBATCH --gres=gpu:<N>                      # number of GPUs to request (default 0) #SBATCH --gres=gpu:<N>                      # number of GPUs to request (default 0)
 #SBATCH --mem=10G                             # request 10 gigabytes memory (per node, default depends on node) #SBATCH --mem=10G                             # request 10 gigabytes memory (per node, default depends on node)
 </code> </code>
 +
 +If needed, you can have Slurm notify you by email:
 +
 +<code>
 +#SBATCH --mail-type=begin        # send email when job begins
 +#SBATCH --mail-type=end          # send email when job ends
 +#SBATCH --mail-type=fail         # send email if job fails
 +#SBATCH --mail-user=<YourUFALEmailAccount>
 +</code>
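As a possible shorthand, ''sbatch'' also accepts a comma-separated list of event types in a single ''--mail-type'' directive. The fragment below is a sketch of the same three notifications written that way; the address placeholder is kept from the example above and must be replaced with your own.

```shell
#!/bin/bash
# Sketch: request begin, end and fail notifications with one directive
# (comma-separated list instead of three separate --mail-type lines).
#SBATCH --mail-type=begin,end,fail
#SBATCH --mail-user=<YourUFALEmailAccount>
```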
 +
 +As usual, the complete set of options can be found by typing:
 +
 +<code>
 +man sbatch
 +</code>
 +
 +=== Rudolf's template ===
 +
 +The main point is for log files to have the job name and job id in them automatically.
 +
 +<code>
 +#SBATCH -J RuRjob
 +#SBATCH -o %x.%j.out
 +#SBATCH -e %x.%j.err
 +#SBATCH -p gpu-troja
 +#SBATCH --gres=gpu:1
 +#SBATCH --mem=16G
 +#SBATCH --constraint="gpuram16G|gpuram24G"
 +
 +# Print each command to STDERR before executing (expanded), prefixed by "+ "
 +set -o xtrace
 +</code>
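In the template, ''%x'' stands for the job name and ''%j'' for the job ID, so each submission writes to its own pair of log files. The sketch below only imitates that substitution locally to show the resulting file names; ''expand_log_name'' is a hypothetical helper and ''12345'' is a made-up job ID (Slurm assigns the real one at submission time).

```shell
#!/bin/sh
# Sketch: imitate how Slurm fills in %x (job name) and %j (job ID)
# in the -o / -e file name patterns. Not part of Slurm itself.
# Usage: expand_log_name PATTERN JOB_NAME JOB_ID
expand_log_name() {
    printf '%s\n' "$1" | sed -e "s/%x/$2/g" -e "s/%j/$3/g"
}

expand_log_name "%x.%j.out" RuRjob 12345   # prints: RuRjob.12345.out
expand_log_name "%x.%j.err" RuRjob 12345   # prints: RuRjob.12345.err
```

The helper assumes the job name contains no ''/'' characters, which would confuse the ''sed'' expression.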
 +
 +==== Inspecting jobs ====
 +
 +In order to inspect all running jobs on the cluster use:
 +
 +<code>
 +squeue
 +</code>
 +
 +Filter only jobs of user ''linguist'':
 +
 +<code>
 +squeue -u linguist
 +</code>
 +
 +Filter only jobs on partition ''gpu-ms'':
 +
 +<code>
 +squeue -p gpu-ms
 +</code>
 +
 +Filter jobs in a specific state (see ''man squeue'' for the list of valid job states):
 +<code>
 +squeue -t RUNNING
 +</code>
 +
 +Filter jobs running on a specific node:
 +<code>
 +squeue -w dll-3gpu1
 +</code>
 +
 +==== Cluster info ====
 +
 +The command ''sinfo'' can give you useful information about nodes available in the cluster. Here is a short list of some examples:
 +
 +List available partitions (queues). The default partition is marked with ''*'':
 +<code>
 +sinfo
 +</code>
 +
 +List detailed info about nodes:
 +<code>
 +sinfo -l -N
 +</code> 
 +
 +List nodes with some custom format info:
 +<code>
 +sinfo -N -o "%N %P %.11T %.15f"
 +</code>
 +
 +=== CPU core allocation ===
 +
 +The minimal computing resource in SLURM is one CPU core. However, the CPU count advertised by SLURM corresponds to the number of CPU threads.
 +If you ask for 1 CPU core with <code>--cpus-per-task=1</code>, SLURM will allocate all threads of 1 CPU core.
 +
 +For example, on ''dll-8gpu1'' such a request allocates 2 threads, since its ThreadsPerCore=2:
 +
 +<code>
 +$ scontrol show node dll-8gpu1
 +NodeName=dll-8gpu1 Arch=x86_64 CoresPerSocket=16 
 +   CPUAlloc=0 CPUTot=64 CPULoad=0.05                                               // CPUAlloc - allocated threads, CPUTot - total threads
 +   AvailableFeatures=gpuram24G
 +   ActiveFeatures=gpuram24G
 +   Gres=gpu:nvidia_a30:8(S:0-1)
 +   NodeAddr=10.10.24.63 NodeHostName=dll-8gpu1 Version=21.08.8-2
 +   OS=Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-3 (Wed, 11 May 2022 07:57:51 +0200) 
 +   RealMemory=515838 AllocMem=0 FreeMem=507650 Sockets=2 Boards=1
 +   CoreSpecCount=1 CPUSpecList=62-63                                               // CoreSpecCount - cores reserved for OS, CPUSpecList - list of threads reserved for system
 +   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A        // ThreadsPerCore - count of threads for 1 CPU core
 +   Partitions=gpu-ms 
 +   BootTime=2022-09-01T14:07:50 SlurmdStartTime=2022-09-02T13:54:05
 +   LastBusyTime=2022-10-02T20:17:09
 +   CfgTRES=cpu=64,mem=515838M,billing=64
 +   AllocTRES=
 +   CapWatts=n/a
 +   CurrentWatts=0 AveWatts=0
 +   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
 +</code>
 +
 +In the example above, the lines relevant to CPU allocation are annotated with ''//'' comments.
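The effect of whole-core allocation can be worked out with a little arithmetic. The sketch below assumes that a request is always rounded up to whole cores, which generalizes the single-core case described above; ''allocated_threads'' is a hypothetical helper, not a Slurm command.

```shell
#!/bin/sh
# Sketch: how many threads end up allocated if requests are rounded up
# to whole CPU cores (assumption based on the 1-core case described above).
# Usage: allocated_threads CPUS_PER_TASK THREADS_PER_CORE
allocated_threads() {
    cpus=$1
    tpc=$2
    cores=$(( (cpus + tpc - 1) / tpc ))   # round up to whole cores
    echo $(( cores * tpc ))               # every thread of those cores is allocated
}

allocated_threads 1 2   # prints: 2
allocated_threads 3 2   # prints: 4
allocated_threads 4 2   # prints: 4
```

With ThreadsPerCore=2, as on ''dll-8gpu1'', an odd ''--cpus-per-task'' value would therefore occupy one thread more than requested.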
 +
 +=== Priority ===
 +
 +When running ''srun'' or ''sbatch'', you can pass ''-q high/normal/low/preempt-low''. These QOS levels have priorities 300/200/100/100, with ''normal'' (200) being the default. Furthermore, the ''preempt-low'' QOS is preemptible: any job with ''normal'' or ''high'' QOS can interrupt your ''preempt-low'' job.
 +
 +Preemption has probably not been used by anyone yet. Some documentation about it is at https://slurm.schedmd.com/preempt.html. We use the REQUEUE mode: your job is killed and then started again when resources are available. The job is very likely killed with some signal, so you could watch for it and, for example, save a checkpoint, but I currently do not know the details.
  
 ==== Interactive mode ==== ==== Interactive mode ====
Line 46: Line 222:
 There are many more parameters available to use. For example: There are many more parameters available to use. For example:
  
-<code>srun -p cpu-troja --mem=64G --pty bash</code>+**To get an interactive CPU job with 64GB of reserved memory:** 
 +<code>srun -p cpu-troja,cpu-ms --mem=64G --pty bash</code> 
 + 
 +  * ''-p cpu-troja,cpu-ms'' restricts the job to nodes from these two partitions. If not specified, Slurm uses the default partition. 
 +  * ''-''''-mem=64G'' requires 64G of memory for the job 
 + 
 +**To get an interactive job with a single GPU of any kind:** 
 +<code>srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash</code> 
 +  * ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions 
 +  * ''-''''-gres=gpu:1'' requires 1 GPU 
 + 
 +<code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code> 
 +  * ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions 
 +  * ''-''''-nodelist=tdll-3gpu1'' explicitly requires one specific node 
 +  * Note that e.g. ''-''''-nodelist=tdll-3gpu[1-4]'' would execute the command on **all** four machines ''tdll-3gpu[1-4]''. The documentation says "The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements." I am not aware of any [[https://stackoverflow.com/a/37555321/3310232|simple way]] to specify that **any** of the listed nodes can be used, i.e. an equivalent of SGE ''-q '*@hector[14]'''
 +  * ''-''''-gres=gpu:2'' requires 2 GPUs 
 + 
 +<code>srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash</code> 
 +  * ''-''''-constraint="gpuram48G|gpuram40G"'' considers only nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined 
 + 
 + 
 +\\ 
 +**Unexpected Behavior of ''srun -c1''** 
 +When you execute a command using ''srun'' with ''-c1'', as in 
 +<code>srun -c1 date</code> 
 +the command is actually executed **twice in parallel**. To avoid this, either **remove the ''-c1''** or **add an explicit ''-n1''**. 
 +==== Delete Job ==== 
 +<code>scancel <job_id> </code> 
 + 
 +<code>scancel -n <job_name> </code>
  
-Where: 
-  * ''-p cpu-troja'' explicitly requires partition ''cpu-troja'' 
-  * ''--mem=64G'' requires 64G of memory for the job 
  
 To see all the available options type: To see all the available options type:
  
-<code>man srun</code>+<code>man scancel</code> 
 + 
 +==== Basic commands on cluster machines ==== 
 + 
 +  lspci 
 +    # is any such hardware there? 
 +  nvidia-smi 
 +    # more details, incl. running processes on the GPU 
 +    # nvidia-* are typically located in /usr/bin 
 +  watch nvidia-smi 
 +    # For monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!) 
 +    # You can also use nvidia-smi -l TIME 
 +  nvcc --version 
 +    # this should report the CUDA version 
 +    # nvcc is typically installed in /usr/local/cuda/bin/ 
 +  theano-test 
 +    # does it actually do anything useful? :-) 
 +    # theano-* are typically located in /usr/local/bin/ 
 +  /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery 
 +    # shows CUDA capability etc. 
 +  ssh dll1; ~popel/bin/gpu_allocations 
 +    # who occupies which card on a given machine 
 +     
 + 
 + 
 +===== See also ===== 
 + 
 +https://www.msi.umn.edu/slurm/pbs-conversion 
  
