  
LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].

See Milan Straka's intro to Slurm (and Spark if you want):

  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2223/npfl118-2223-winter-slurm.mp4
  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2223/npfl118-2223-winter-spark.mp4
  
Currently the following partitions (queues) are available for computing:
===== Node list by partitions =====
  
The naming convention is straightforward for CPU nodes: nodes in each group are simply numbered. GPU nodes follow the format [t]dll-**X**gpu**N**, where **X** is the total number of GPUs installed in the node and **N** enumerates the nodes with the given configuration.
The prefix **t** marks nodes located at Troja and **dll** stands for Deep Learning Laboratory.
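
You can also query the current node list, per-node resources, and defined features directly from Slurm with ''sinfo''; this is a minimal sketch (the partition name and the output format string are just one possible choice):

<code>
# one line per node of a partition: node name, CPUs, memory (MB), GRES (GPUs), features
sinfo -N -p gpu-ms -o "%N %c %m %G %f"

# summary of all partitions and their state
sinfo -s
</code>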
==== cpu-troja ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
| achilles[1-8] | 32 | 2:8:2 | 128810 |
| hector[1-8] | 32 | 2:8:2 | 128810 |
| helena[1-8] | 32 | 2:8:2 | 128811 |
| paris[1-8] | 32 | 2:8:2 | 128810 |
| hyperion[2-8] | 64 | 2:16:2 | 257667 |
==== cpu-ms ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
| iridium | 16 | 2:4:2 | 515977 |
| orion[1-8] | 40 | 2:10:2 | 128799 |
==== gpu-troja ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
| tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
| tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
==== gpu-ms ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
| dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
  
  
#!/bin/bash
#SBATCH -J helloWorld   # name of job
#SBATCH -p cpu-troja   # name of partition or queue (default=cpu-troja)
#SBATCH -o helloWorld.out   # name of output file for this submission script
#SBATCH -e helloWorld.err   # name of error file for this submission script
#SBATCH -D /some/path/                        # change directory before executing the job
#SBATCH -N 2                                  # number of nodes (default 1)
#SBATCH --nodelist=node1,node2...             # execute on *all* the specified nodes (and possibly more)
#SBATCH --cpus-per-task=4                     # number of cores/threads per task (default 1)
#SBATCH --gres=gpu:N                          # number of GPUs to request (default 0)
<code>
man sbatch
</code>
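
Once a submission script like the one above is ready, submitting and monitoring it can look like this (a sketch; ''job_script.sh'' is just an example file name):

<code>
sbatch job_script.sh          # prints "Submitted batch job <job_id>"
squeue -u $USER               # list your queued and running jobs
scontrol show job <job_id>    # detailed information about one job
</code>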

=== Rudolf's template ===

The main point is that the log file names automatically contain the job name and job ID.

<code>
#SBATCH -J RuRjob
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p gpu-troja
#SBATCH --gres=gpu:1
#SBATCH --mem=16G
#SBATCH --constraint="gpuram16G|gpuram24G"

# Print each command to STDERR before executing (expanded), prefixed by "+ "
set -o xtrace
</code>
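
For example, if this template is submitted with ''sbatch'' and the job ''RuRjob'' gets job ID 123456 (the ID here is just an illustration), the logs end up in ''RuRjob.123456.out'' and ''RuRjob.123456.err'': ''%x'' expands to the job name and ''%j'' to the job ID.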
  
  
  * ''-p cpu-troja'' explicitly requires partition ''cpu-troja''. If not specified, Slurm will use the default partition.
  * ''-''''-mem=64G'' requires 64 GB of memory for the job.
  
**To get an interactive job with a single GPU of any kind:**
<code>srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash</code>
  * ''-p gpu-troja,gpu-ms'' requires only nodes from these two partitions
  * ''-''''-gres=gpu:1'' requires 1 GPU
  
<code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code>
  * ''-p gpu-troja,gpu-ms'' requires only nodes from these two partitions
  * ''-''''-nodelist=tdll-3gpu1'' explicitly requires one specific node
  * Note that e.g. ''-''''-nodelist=tdll-3gpu[1-4]'' would allocate the job on **all** four machines ''tdll-3gpu[1-4]''. The documentation says "The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements." I am not aware of any [[https://stackoverflow.com/a/37555321/3310232|simple way]] to specify that **any** of the listed nodes can be used, i.e. an equivalent of SGE ''-q '*@hector[14]'''.
  * ''-''''-gres=gpu:2'' requires 2 GPUs
  
<code>srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash</code>
  * ''-''''-constraint="gpuram48G|gpuram40G"'' only considers nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined; the same constraint syntax also works in a batch script (see the sketch below)
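
A minimal sketch of how the same feature constraint might look in an ''sbatch'' header (the partition, GPU count, and memory values are just examples):

<code>
#!/bin/bash
#SBATCH -p gpu-troja
#SBATCH --gres=gpu:2
#SBATCH --mem=64G
#SBATCH --constraint="gpuram48G|gpuram40G"   # accept any node with one of these features

nvidia-smi   # e.g. check which GPUs were assigned
</code>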

==== Delete Job ====
<code>scancel <job_id></code>
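
A few more ''scancel'' variants that may come in handy (a sketch; ''helloWorld'' is just an example job name):

<code>
scancel -n helloWorld      # cancel all your jobs with this job name
scancel -u $USER           # cancel all your jobs
squeue -u $USER            # list your jobs (to find the job id in the first place)
</code>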
  
To see all the available options, type:
  
<code>man srun</code>

==== Basic commands on cluster machines ====

  lspci
    # is any such hardware there?
  nvidia-smi
    # more details, incl. running processes on the GPU
    # nvidia-* are typically located in /usr/bin
  watch nvidia-smi
    # for monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
    # you can also use nvidia-smi -l TIME
  nvcc --version
    # this should report the CUDA version
    # nvcc is typically installed in /usr/local/cuda/bin/
  theano-test
    # does this actually do anything useful? :-)
    # theano-* are typically located in /usr/local/bin/
  /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
    # shows CUDA capability etc.
  ssh dll1; ~popel/bin/gpu_allocations
    # who occupies which card on a given machine
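
From inside a running Slurm job you can also check which GPUs were assigned to it; a minimal sketch (whether ''CUDA_VISIBLE_DEVICES'' is set depends on how the job was submitted and on the cluster configuration, so treat that part as an assumption):

<code>
echo $CUDA_VISIBLE_DEVICES                      # GPU indices visible to this job
scontrol show job $SLURM_JOB_ID | grep -i tres  # resources actually allocated to the job
nvidia-smi                                      # processes currently running on the node's GPUs
</code>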
  
===== See also =====
