  
LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].

See Milan Straka's intro to Slurm (and Spark if you want):

  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2223/npfl118-2223-winter-slurm.mp4
  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2223/npfl118-2223-winter-spark.mp4
  
Currently, the following partitions (queues) are available for computing:
==== cpu-troja ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
| achilles[1-8] | 32 | 2:8:2 | 128810 |
| hector[1-8] | 32 | 2:8:2 | 128810 |
| helena[1-8] | 32 | 2:8:2 | 128811 |
| paris[1-8] | 32 | 2:8:2 | 128810 |
| hyperion[2-8] | 64 | 2:16:2 | 257667 |
==== cpu-ms ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
| iridium | 16 | 2:4:2 | 515977 |
| orion[1-8] | 40 | 2:10:2 | 128799 |
==== gpu-troja ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
| tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
| tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
==== gpu-ms ====
  
| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
| dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
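
The same information can also be queried directly from Slurm. A minimal sketch (the ''-o'' format string below is just one possible choice, not a prescribed command):

<code>
# list all partitions, their state and their nodes
sinfo

# show node list, CPUs, memory, features (e.g. gpuram48G) and GRES for one partition
sinfo -p gpu-troja -o "%N %c %m %f %G"
</code>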
  
  

An example submission script:

<code>
#!/bin/bash
#SBATCH -J helloWorld   # name of job
#SBATCH -p cpu-troja   # name of partition or queue (default=cpu-troja)
#SBATCH -o helloWorld.out   # name of output file for this submission script
#SBATCH -e helloWorld.err   # name of error file for this submission script

# ... the commands of the job itself follow here ...
</code>
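
Assuming the script above is saved as ''helloWorld.sh'' (the file name is only an example), it can be submitted and monitored roughly like this:

<code>
sbatch helloWorld.sh   # submit the script; prints the assigned job id
squeue -u $USER        # check the state of your own jobs
</code>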
  
=== Rudolf's template ===

The main point is that the log file names automatically contain the job name (''%x'') and the job id (''%j'').

<code>
#SBATCH -J RuRjob
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p gpu-troja
#SBATCH --gres=gpu:1   # number of GPUs
#SBATCH --mem=16G
#SBATCH --constraint="gpuram16G|gpuram24G"

# Print each command to STDERR before executing (expanded), prefixed by "+ "
set -o xtrace
</code>

==== Inspecting jobs ====
  
To inspect all running jobs on the cluster, use:
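
For example (''squeue'' accepts many filtering and formatting options; see ''man squeue''):

<code>
squeue            # all jobs on the cluster
squeue -u $USER   # only your own jobs
</code>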
 In the example above you can see comments at all lines relevant to CPU allocation. In the example above you can see comments at all lines relevant to CPU allocation.
  
=== Priority ===
  
When running ''srun'' or ''sbatch'', you can pass ''-q high/normal/low/preempt-low''. These represent priorities 300/200/100/100, with ''normal'' (200) being the default. Furthermore, the ''preempt-low'' QOS is actually preemptible: a job with normal or high QOS can interrupt your ''preempt-low'' job.
  
Preemption has probably not been used by anyone yet; some documentation about it is at https://slurm.schedmd.com/preempt.html. We use the REQUEUE regime: a preempted job is killed (very likely after receiving some signal, so you could watch for it and e.g. save a checkpoint; I currently do not know the details) and is started again when resources become available.
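
For example, a preemptible low-priority job could be submitted like this (''train.sh'' is just a placeholder script name):

<code>
sbatch -q preempt-low train.sh
</code>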
  
==== Interactive mode ====
  * ''-''''-constraint="gpuram48G|gpuram40G"'' only consider nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined (see the example below)
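
A sketch of an interactive GPU session using such a constraint (the particular partition and resource values are only an illustration):

<code>
srun -p gpu-troja --gres=gpu:1 --mem=16G --constraint="gpuram48G|gpuram40G" --pty bash
</code>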
  

\\
**Unexpected Behavior of ''srun -c1''**
When you execute a command using ''srun'' and pass ''-c1'', like
<code>srun -c1 date</code>
then the command is actually executed **twice in parallel**. To avoid this, either **remove the ''-c1''** or **add an explicit ''-n1''**.
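
For example, this variant launches exactly one task and should print the date only once:

<code>srun -c1 -n1 date</code>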
==== Delete Job ====
<code>scancel <job_id> </code>

<code>scancel -n <job_name> </code>
  
To see all available options, type:
  
<code>man scancel</code>

==== Basic commands on cluster machines ====

  lspci
    # is any such hardware there?
  nvidia-smi
    # more details, incl. running processes on the GPU
    # nvidia-* are typically located in /usr/bin
  watch nvidia-smi
    # for monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
    # you can also use nvidia-smi -l TIME
  nvcc --version
    # this should tell the CUDA version
    # nvcc is typically installed in /usr/local/cuda/bin/
  theano-test
    # does this actually do anything useful? :-)
    # theano-* are typically located in /usr/local/bin/
  /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
    # shows CUDA capability etc.
  ssh dll1; ~popel/bin/gpu_allocations
    # who occupies which card on a given machine
  
===== See also =====
