====== ÚFAL Grid Engine (LRC) ======
  
LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].
  
Currently, the following partitions (queues) are available for computing:

===== Node list by partitions =====

==== cpu-troja ====

| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
| achilles1 | 32 | 2:8:2 | 128810 |
| achilles2 | 32 | 2:8:2 | 128810 |
| achilles3 | 32 | 2:8:2 | 128810 |
| achilles4 | 32 | 2:8:2 | 128810 |
| achilles5 | 32 | 2:8:2 | 128810 |
| achilles6 | 32 | 2:8:2 | 128810 |
| achilles7 | 32 | 2:8:2 | 128810 |
| achilles8 | 32 | 2:8:2 | 128810 |
| hector1 | 32 | 2:8:2 | 128810 |
| hector2 | 32 | 2:8:2 | 128810 |
| hector3 | 32 | 2:8:2 | 128810 |
| hector4 | 32 | 2:8:2 | 128810 |
| hector5 | 32 | 2:8:2 | 128810 |
| hector6 | 32 | 2:8:2 | 128810 |
| hector7 | 32 | 2:8:2 | 128810 |
| hector8 | 32 | 2:8:2 | 128810 |
| helena1 | 32 | 2:8:2 | 128811 |
| helena2 | 32 | 2:8:2 | 128811 |
| helena3 | 32 | 2:8:2 | 128811 |
| helena4 | 32 | 2:8:2 | 128811 |
| helena5 | 32 | 2:8:2 | 128810 |
| helena6 | 32 | 2:8:2 | 128811 |
| helena7 | 32 | 2:8:2 | 128810 |
| helena8 | 32 | 2:8:2 | 128811 |
| paris1 | 32 | 2:8:2 | 128810 |
| paris2 | 32 | 2:8:2 | 128810 |
| paris3 | 32 | 2:8:2 | 128810 |
| paris4 | 32 | 2:8:2 | 128810 |
| paris5 | 32 | 2:8:2 | 128810 |
| paris6 | 32 | 2:8:2 | 128810 |
| paris7 | 32 | 2:8:2 | 128810 |
| paris8 | 32 | 2:8:2 | 128810 |
| hyperion2 | 64 | 2:16:2 | 257667 |
| hyperion3 | 64 | 2:16:2 | 257667 |
| hyperion4 | 64 | 2:16:2 | 257667 |
| hyperion5 | 64 | 2:16:2 | 257667 |
| hyperion6 | 64 | 2:16:2 | 257667 |
| hyperion7 | 64 | 2:16:2 | 257667 |
| hyperion8 | 64 | 2:16:2 | 257667 |
==== cpu-ms ====

| Node name | Thread count | Socket:Core:Thread | RAM (MB) |
| iridium | 16 | 2:4:2 | 515977 |
| orion1 | 40 | 2:10:2 | 128799 |
| orion2 | 40 | 2:10:2 | 128799 |
| orion3 | 40 | 2:10:2 | 128799 |
| orion4 | 40 | 2:10:2 | 128799 |
| orion5 | 40 | 2:10:2 | 128799 |
| orion6 | 40 | 2:10:2 | 128799 |
| orion7 | 40 | 2:10:2 | 128799 |
| orion8 | 40 | 2:10:2 | 128799 |
==== gpu-troja ====

| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features |
| tdll-3gpu1 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| tdll-3gpu2 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| tdll-3gpu3 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| tdll-3gpu4 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| tdll-8gpu1 | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 |
| tdll-8gpu2 | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 |
| tdll-8gpu3 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
| tdll-8gpu4 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
| tdll-8gpu5 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
| tdll-8gpu6 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
| tdll-8gpu7 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
==== gpu-ms ====

| Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features |
| dll-3gpu1 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| dll-3gpu2 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| dll-3gpu3 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| dll-3gpu4 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| dll-3gpu5 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
| dll-4gpu1 | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 |
| dll-4gpu2 | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 |
| dll-8gpu1 | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 |
| dll-8gpu2 | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 |
| dll-8gpu3 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 |
| dll-8gpu4 | 32 | 2:8:2 | 253721 | gpuram16G gpu_cc8.6 |
| dll-8gpu5 | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 |
| dll-8gpu6 | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 |
| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 |
| dll-10gpu2 | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 |
| dll-10gpu3 | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 |
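
The GPU features listed above (GPU RAM size and CUDA compute capability) can also be queried from SLURM itself; a short sketch using standard ''sinfo'' format fields:
<code>
sinfo -p gpu-troja,gpu-ms -N -o "%N %f %G"    # node name, features, and generic resources (GPU count/type)
</code>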
==== Submit nodes ====

In order to submit a job you need to log in to one of the head nodes:

   lrc1.ufal.hide.ms.mff.cuni.cz
   lrc2.ufal.hide.ms.mff.cuni.cz
   sol1.ufal.hide.ms.mff.cuni.cz
   sol2.ufal.hide.ms.mff.cuni.cz
   sol3.ufal.hide.ms.mff.cuni.cz
   sol4.ufal.hide.ms.mff.cuni.cz
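
For example, logging in from a workstation and checking that the SLURM client tools respond (a minimal sketch; ''LOGIN'' is a placeholder for your cluster user name):
<code>
ssh LOGIN@lrc1.ufal.hide.ms.mff.cuni.cz    # log in to one of the head nodes
sinfo --version                            # verify that SLURM commands are available
</code>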
===== Basic usage =====
  
<code>
#!/bin/bash
#SBATCH -J helloWorld   # name of job
#SBATCH -p cpu-troja   # name of partition or queue (if not specified, the default partition is used)
#SBATCH -o helloWorld.out   # name of output file for this submission script
#SBATCH -e helloWorld.err   # name of error file for this submission script
</code>

<code>
#SBATCH -N 2                                  # number of nodes (default 1)
#SBATCH --nodelist=node1,node2...             # required node, or comma separated list of required nodes
#SBATCH --cpus-per-task=<n>                   # number of cores/threads per task (default 1)
#SBATCH --gres=gpu:<n>                        # number of GPUs to request (default 0)
#SBATCH --mem=10G                             # request 10 gigabytes memory (per node, default depends on node)
</code>
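
Once the script is saved to a file (''helloWorld.sh'' here is just an assumed name), it can be submitted and monitored with the standard SLURM commands; a minimal sketch:
<code>
sbatch helloWorld.sh    # submit the batch script; prints the assigned job ID
squeue -u $USER         # list your own pending and running jobs
scancel <job_id>        # cancel a job using the numeric ID reported by sbatch
</code>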
List available partitions and nodes:
<code>
sinfo
</code>

List detailed info about nodes:
<code>
sinfo -l -N
</code>
  
List nodes with custom-formatted info:
<code>
sinfo -N -o "%N %P %.11T %.15f"
</code>
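
To show only nodes in a particular state, e.g. idle nodes in the GPU partitions (an additional sketch, not from the original page):
<code>
sinfo -p gpu-troja,gpu-ms -t idle
</code>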
  
=== CPU core allocation ===

The minimal computing resource in SLURM is one CPU core. However, the CPU count advertised by SLURM corresponds to the number of CPU threads.
If you ask for 1 CPU core with ''--cpus-per-task=1'', SLURM will allocate all threads of that core.

For example, on ''dll-8gpu1'' such a request will allocate 2 threads, since its ThreadsPerCore=2:
 +
<code>
$ scontrol show node dll-8gpu1
NodeName=dll-8gpu1 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=0.05                                               // CPUAlloc - allocated threads, CPUTot - total threads
   AvailableFeatures=gpuram24G
   ActiveFeatures=gpuram24G
   Gres=gpu:nvidia_a30:8(S:0-1)
   NodeAddr=10.10.24.63 NodeHostName=dll-8gpu1 Version=21.08.8-2
   OS=Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-3 (Wed, 11 May 2022 07:57:51 +0200)
   RealMemory=515838 AllocMem=0 FreeMem=507650 Sockets=2 Boards=1
   CoreSpecCount=1 CPUSpecList=62-63                                               // CoreSpecCount - cores reserved for OS, CPUSpecList - list of threads reserved for system
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A          // ThreadsPerCore - count of threads for 1 CPU core
   Partitions=gpu-ms
   BootTime=2022-09-01T14:07:50 SlurmdStartTime=2022-09-02T13:54:05
   LastBusyTime=2022-10-02T20:17:09
   CfgTRES=cpu=64,mem=515838M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
</code>

In the example above, the lines relevant to CPU allocation are annotated with comments.
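
A quick way to observe this (a sketch, assuming a ThreadsPerCore=2 node such as those in ''cpu-troja''): even when only one CPU is requested, SLURM reports both threads of the allocated core inside the job.
<code>
srun -p cpu-troja --cpus-per-task=1 --pty bash
echo $SLURM_CPUS_ON_NODE    # typically prints 2 here: both threads of the single allocated core
</code>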
  
  
There are many more parameters available to use. For example:
  
**To get an interactive CPU job with 64GB of reserved memory:**
<code>srun -p cpu-troja,cpu-ms --mem=64G --pty bash</code>
  
  * ''-p cpu-troja,cpu-ms'' explicitly restricts the job to the ''cpu-troja'' and ''cpu-ms'' partitions. If no partition is specified, SLURM uses the default partition.
  * ''--mem=64G'' requires 64G of memory for the job
  
**To get an interactive job with a single GPU of any kind:**
<code>srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash</code>
  * ''-p gpu-troja,gpu-ms'' requires only nodes from these two partitions
  * ''--gres=gpu:1'' requires 1 GPU
  
<code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code>
  * ''-p gpu-troja,gpu-ms'' requires only nodes from these two partitions
  * ''--nodelist=tdll-3gpu1'' explicitly requires one specific node
  * ''--gres=gpu:2'' requires 2 GPUs

<code>srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash</code>
  * ''--constraint="gpuram48G|gpuram40G"'' only considers nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined
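
Inside an interactive GPU job it is worth checking what was actually allocated; a small sketch (assumes the NVIDIA driver tools are present on the GPU nodes and the usual SLURM GPU isolation is in effect):
<code>
nvidia-smi                    # list the GPUs visible to this job
echo $CUDA_VISIBLE_DEVICES    # typically limited to the GPU indices allocated by SLURM
</code>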
  
To see all the available options type:
  
<code>man srun</code>

===== See also =====

https://www.msi.umn.edu/slurm/pbs-conversion
  
