Differences

This shows you the differences between two versions of the page.

--- slurm [2023/01/19 15:37]
vodrazka [cpu-troja]
+++ slurm [2025/10/15 18:09] (current)
straka [ÚFAL Grid Engine (LRC)]
@@ Line 1: / Line 1: @@
 ====== ÚFAL Grid Engine (LRC) ======
+**As of 2024, all the documentation lives on a dedicated wiki at https://ufal.mff.cuni.cz/lrc (to access it from outside the ÚFAL network, log in with the username ''ufal'' and the [[internal:welcome-at-ufal#small-linguistic-password|small-linguistic password]]).**
 LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].
+See Milan Straka's intro to Slurm and Spark (and possibly also the [[https://ufal.mff.cuni.cz/courses/npfl118#assignments|NPFL118 assignments]] if you want). Log in with the username ''ufal'' and the small-linguistic password:
+  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2425/npfl118-2425-winter-slurm.mp4
+  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2425/npfl118-2425-winter-spark.mp4
 Currently the following partitions (queues) are available for computing:
@@ Line 21: / Line 28: @@
 | Node name | Thread count | Socket:Core:Thread | RAM (MB) |
 | iridium | 16 | 2:4:2 | 515977 |
-| orion1 | 40 | 2:10:2 | 128799 |
+| orion[1-8] | 40 | 2:10:2 | 128799 |
-| orion2 | 40 | 2:10:2 | 128799 |
-| orion3 | 40 | 2:10:2 | 128799 |
-| orion4 | 40 | 2:10:2 | 128799 |
-| orion5 | 40 | 2:10:2 | 128799 |
-| orion6 | 40 | 2:10:2 | 128799 |
-| orion7 | 40 | 2:10:2 | 128799 |
-| orion8 | 40 | 2:10:2 | 128799 |
 ==== gpu-troja ====
 | Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
-| tdll-3gpu1 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
-| tdll-3gpu2 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
-| tdll-3gpu3 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
-| tdll-3gpu4 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
-| tdll-8gpu1 | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
-| tdll-8gpu2 | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
-| tdll-8gpu3 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
-| tdll-8gpu4 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
-| tdll-8gpu5 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
-| tdll-8gpu6 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
-| tdll-8gpu7 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
 ==== gpu-ms ====
 | Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type |
-| dll-3gpu1 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
-| dll-3gpu2 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
-| dll-3gpu3 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
-| dll-3gpu4 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
-| dll-3gpu5 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
+| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
-| dll-4gpu1 | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
+| dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
-| dll-4gpu2 | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
+| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
-| dll-8gpu1 | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
-| dll-8gpu2 | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
-| dll-8gpu3 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
-| dll-8gpu4 | 32 | 2:8:2 | 253721 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
-| dll-8gpu5 | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
-| dll-8gpu6 | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
 | dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
-| dll-10gpu2 | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
+| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
-| dll-10gpu3 | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
@@ Line 125: / Line 110: @@
 </code>
-==== Running jobs ====
+=== Rudolf's template ===
+The main point is for log files to have the job name and job id in them automatically.
+<code>
+#!/bin/bash
+#SBATCH -J RuRjob
+#SBATCH -o %x.%j.out
+#SBATCH -e %x.%j.err
+#SBATCH -p gpu-troja
+#SBATCH --gres=gpu:1
+#SBATCH --mem=16G
+#SBATCH --constraint="gpuram16G|gpuram24G"
+# Print each command to STDERR before executing (expanded), prefixed by "+ "
+set -o xtrace
+</code>
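Inside the job script, the same naming scheme can be reused for other output files. The sketch below assumes that the standard Slurm environment variables `SLURM_JOB_NAME` and `SLURM_JOB_ID` are set for the running job. The helper name `job_file_prefix` and the fallback values `local` and `0` are hypothetical, chosen only so that the script also runs outside Slurm.

```shell
#!/bin/bash
# Build the same "<job name>.<job id>" prefix that %x.%j produces for the logs.
# job_file_prefix is a hypothetical helper; "local" and "0" are made-up
# fallbacks used only when the script runs outside of Slurm.
job_file_prefix() {
    printf '%s.%s\n' "${SLURM_JOB_NAME:-local}" "${SLURM_JOB_ID:-0}"
}

# Example: name a results file so it sorts next to the job's .out/.err logs.
results_file="$(job_file_prefix).results"
echo "results will be written to $results_file"
```

With this, every file a job produces can be matched to its log files by name alone.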
+==== Inspecting jobs ====
 In order to inspect all running jobs on the cluster use:
@@ Line 131: / Line 133: @@
 <code>
 squeue
+</code>
+To list only your own jobs, use:
+<code>
+squeue --me
 </code>
@@ Line 206: / Line 214: @@
 In the example above you can see comments at all lines relevant to CPU allocation.
+=== Priority ===
+When running ''srun'' or ''sbatch'', you can pass ''-q high/normal/low/preempt-low''. These QOS levels correspond to priorities 300/200/100/100, with ''normal'' (200) being the default. In addition, the ''preempt-low'' QOS is preemptible: a job with ''normal'' or ''high'' QOS can interrupt your ''preempt-low'' job.
+Preemption has probably not been used by anyone yet. Some documentation about it is at https://slurm.schedmd.com/preempt.html. We use the REQUEUE mode: a preempted job is killed (very likely after receiving some signal, so you could watch for it and, for example, save a checkpoint; I do not currently know the details) and is started again once resources are available.
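The checkpoint idea mentioned above could look roughly like the following sketch. It rests on an unverified assumption: that a preempted job's batch shell receives SIGTERM before the job is killed. The file name `checkpoint.txt`, the variable `CHECKPOINT_FILE`, and the functions `load_step` and `run_steps` are all hypothetical.

```shell
#!/bin/bash
# Sketch only: assumes the batch shell receives SIGTERM before the job is
# killed, which has not been verified on LRC. CHECKPOINT_FILE and all
# function names below are hypothetical.
CHECKPOINT_FILE="${CHECKPOINT_FILE:-checkpoint.txt}"
stop_requested=0

# On SIGTERM, only set a flag; the loop below stops at a safe point.
trap 'stop_requested=1' TERM

load_step() {
    # Resume from the saved step if a checkpoint exists, otherwise start at 0.
    if [ -f "$CHECKPOINT_FILE" ]; then cat "$CHECKPOINT_FILE"; else echo 0; fi
}

run_steps() {
    local total=$1
    local step
    step=$(load_step)
    while [ "$step" -lt "$total" ]; do
        # ... one unit of real work would go here ...
        step=$((step + 1))
        # Record progress after every completed step.
        echo "$step" > "$CHECKPOINT_FILE"
        if [ "$stop_requested" -eq 1 ]; then
            echo "SIGTERM received, stopping after step $step" >&2
            return 1
        fi
    done
    return 0
}

# Usage in a job script (hypothetical step count):
# run_steps 1000
```

Because progress is written after every step, a requeued job that calls `run_steps` again continues from the last saved step instead of starting over. Note that bash runs a trap handler only after the current foreground command finishes, so each unit of work should be short.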
 ==== Interactive mode ====
@@ Line 238: / Line 249: @@
   * ''-''''-constraint="gpuram48G|gpuram40G"'' only consider nodes that have either ''gpuram48G'' or ''gpuram40G'' feature defined
+\\
+**Unexpected Behavior of ''srun -c1''**
+When you execute a command using ''srun'' and pass ''-c1'', for example
+<code>srun -c1 date</code>
+the command is actually executed **twice in parallel**. To avoid this, either **remove the ''-c1''** or **add an explicit ''-n1''.**
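One way to avoid forgetting the explicit ''-n1'' is a small wrapper, sketched below. The function name `srun_single` and the `SRUN` override variable are hypothetical. The override exists only so that the wrapper can be dry-run on a machine without Slurm.

```shell
#!/bin/bash
# SRUN is a hypothetical override for dry runs; by default the real srun
# is called.
SRUN="${SRUN:-srun}"

# Always request exactly one task, then pass all remaining options through.
srun_single() {
    "$SRUN" -n1 "$@"
}

# Usage on the cluster:
# srun_single -c1 date
```

Setting `SRUN` to a command that merely prints its arguments shows what would be executed without submitting anything.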
 ==== Delete Job ====
 <code>scancel <job_id> </code>
+<code>scancel -n <job_name> </code>
 To see all the available options type:
-<code>man srun</code>
+<code>man scancel</code>
+==== Basic commands on cluster machines ====
+  lspci
    # lists PCI devices: is there any such hardware (e.g. a GPU) in the machine?
+  nvidia-smi
+    # more details, incl. running processes on the GPU
+    # nvidia-* are typically located in /usr/bin
+  watch nvidia-smi
+    # For monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
+    # You can also use nvidia-smi -l TIME
+  nvcc --version
    # this should report the CUDA version
+    # nvcc is typically installed in /usr/local/cuda/bin/
+  theano-test
    # does this actually do anything useful? :-)
+    # theano-* are typically located in /usr/local/bin/
+  /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery
+    # shows CUDA capability etc.
+  ssh dll1; ~popel/bin/gpu_allocations
+    # who occupies which card on a given machine
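For a quick look at which cards on the current machine appear idle, the output of `nvidia-smi` can be filtered as in the sketch below. The function name `free_gpus` and the 100 MiB default threshold are hypothetical choices. Low memory use is only a rough heuristic: Slurm's own allocation, not this check, decides which GPU a job gets.

```shell
#!/bin/bash
# Print the index and name of every GPU whose used memory is below a threshold.
# Input on stdin: CSV lines "index, name, memory.used" (MiB, without units).
# free_gpus and the 100 MiB default threshold are hypothetical choices.
free_gpus() {
    local max_used_mib="${1:-100}"
    awk -F', ' -v max="$max_used_mib" '$3 + 0 < max { print $1, $2 }'
}

# Query the local GPUs only when nvidia-smi is actually installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,memory.used \
               --format=csv,noheader,nounits | free_gpus
fi
```

Passing a different threshold, for example `free_gpus 500`, loosens the definition of "idle".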
 ===== See also =====


Institute of Formal and Applied Linguistics Wiki
