====== ÚFAL Grid Engine (LRC) ======
LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].
  
See Milan Straka's intro to Slurm (and Spark, and possibly also the [[https://ufal.mff.cuni.cz/courses/npfl118#assignments|NPFL118 assignments]] if you want). Use username=ufal and the small linguistic password:
  
  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-slurm.mp4
  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-spark.mp4
  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2324/npfl118-2324-winter-assignments.mp4
  
Currently, the following partitions (queues) are available for computing:
^ Node ^ CPUs ^ Sockets:Cores:Threads ^ RAM (MB) ^ Features ^ GPU type ^
| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
| dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
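
The table is a snapshot; you can query the current node parameters yourself with the standard ''sinfo'' command (the format string below is just one possible choice, showing partition, nodes, CPUs, memory, GRES, and features):
<code>sinfo -o "%P %N %c %m %G %f"</code>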
  
==== Inspecting jobs ====
  
To inspect all running jobs on the cluster, use:
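
<code>squeue</code>

Plain ''squeue'' (the standard Slurm job-listing command) lists jobs of all users; ''squeue -u <username>'' restricts the listing to your own jobs.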
In the example above, you can see comments on all lines relevant to CPU allocation.
  
=== Priority ===
  
When running ''srun'' or ''sbatch'', you can pass ''-q high/normal/low/preempt-low''. These represent priorities 300/200/100/100, with ''normal'' (200) being the default. Furthermore, the ''preempt-low'' QOS is actually preemptible: jobs with the ''normal'' or ''high'' QOS can interrupt your ''preempt-low'' job.
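
For example (''job.sh'' is a placeholder for your own batch script):
<code>
# submit with high priority (300)
sbatch -q high job.sh

# submit as preemptible; normal/high jobs may interrupt this one
sbatch -q preempt-low job.sh
</code>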
  
Preemption has probably not been used by anyone yet; some documentation about it is at https://slurm.schedmd.com/preempt.html. We use the REQUEUE regime: your job is killed (very likely with some signal first, so you could monitor it and, for example, save a checkpoint; currently I do not know the details) and then started again when resources are available.
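
A sketch of reacting to such a signal in a batch script; it assumes SIGTERM arrives before the final kill, which is plausible under REQUEUE but not confirmed here:
<code>
#!/bin/bash
#SBATCH -q preempt-low

# run the trap when SIGTERM is received (assumed preemption signal);
# save_checkpoint is a placeholder for your own checkpointing logic
trap 'echo "preempted, saving checkpoint"; save_checkpoint' TERM

# run the real work in the background and wait, so the trap can fire
my_training_command &    # placeholder for your actual program
wait
</code>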
  
==== Interactive mode ====
  * ''-''''-constraint="gpuram48G|gpuram40G"'' only considers nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined (see the example below)
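
For example, the following asks for any node with at least 40 GB of GPU RAM and just prints its hostname (the ''-''''-gres=gpu:1'' part is an assumption; use whatever GPU-allocation flags you normally use):
<code>srun --constraint="gpuram48G|gpuram40G" --gres=gpu:1 hostname</code>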
  
\\
**Unexpected Behavior of ''srun -c1''**
When you execute a command using ''srun'' and pass ''-c1'', like
<code>srun -c1 date</code>
then the command is actually executed **twice in parallel**: on hyperthreaded nodes, ''-c1'' gets a whole core (two hardware threads) allocated, and ''srun'' then launches one task per allocated CPU. To avoid this, either **remove the ''-c1''** or **add an explicit ''-n1''**.
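
For example, this prints the date exactly once:
<code>srun -n1 -c1 date</code>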
==== Delete Job ====
<code>scancel <job_id></code>

You can also cancel a job by its name:
<code>scancel -n <job_name></code>
  
To see all the available options, type:
  
<code>man scancel</code>
  
==== Basic commands on cluster machines ====
