====== ÚFAL Grid Engine (LRC) ======

**IN 2024: All the documentation has newly moved to a dedicated wiki at https://ufal.mff.cuni.cz/lrc (you need the username ''ufal'' and the [[internal:welcome-at-ufal#small-linguistic-password|small linguistic password]] to access it from outside of the ÚFAL network).**

LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].

See Milan Straka's intro to Slurm (and Spark, and possibly also the [[https://ufal.mff.cuni.cz/courses/npfl118#assignments|NPFL118 assignments]] if you want). Use the username ''ufal'' and the small linguistic password:

  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2425/npfl118-2425-winter-slurm.mp4
  * https://lectures.ms.mff.cuni.cz/video/rec/npfl118/2425/npfl118-2425-winter-spark.mp4

Currently, the following partitions (queues) are available for computing:

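You can list the current partitions and the state of their nodes at any time with the standard SLURM command ''sinfo'':

<code>
# list all partitions, their time limits, and node states
sinfo

# a more compact per-partition summary
sinfo -s
</code>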

==== Inspecting jobs ====

In order to inspect all running jobs on the cluster use:
<code>
squeue
</code>

To filter only your own jobs:
<code>
squeue --me
</code>
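
If the default columns are too terse, ''squeue'' also accepts a custom output format (standard ''%''-format specifiers; adjust the fields and widths to taste):

<code>
# job id, partition, name, state, elapsed time, assigned node or pending reason
squeue --me -o "%.10i %.12P %.30j %.8T %.10M %R"
</code>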

=== Priority ===

When running ''srun'' or ''sbatch'', you can pass ''-q high/normal/low/preempt-low''. These represent priorities 300/200/100/100, with ''normal'' (200) being the default. Furthermore, the ''preempt-low'' QOS is actually preemptible -- jobs with normal or high QOS can interrupt your ''preempt-low'' job.
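
For example, to submit with a non-default QOS (the script name ''job.sh'' is just a placeholder):

<code>
# batch job with lower priority
sbatch -q low job.sh

# one-off command in the preemptible QOS
srun -q preempt-low date
</code>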

The preemption has probably not been used by anyone yet; some documentation about it is at https://slurm.schedmd.com/preempt.html. We use the REQUEUE regime, so your job is killed (very likely with some signal, so you could monitor it and for example save a checkpoint; but currently I do not know any details) and then started again when there are resources.
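
A minimal sketch of reacting to such a signal in a batch script, assuming the job receives SIGTERM before being requeued (the exact signal and grace period on our cluster are not confirmed, and ''save_checkpoint'' stands in for whatever checkpointing your program actually does):

<code>
#!/bin/bash
#SBATCH --requeue              # allow SLURM to requeue this job after preemption

save_checkpoint() {
    echo "Caught signal, saving checkpoint before requeue..."
    touch checkpoint.done      # replace with your real checkpointing command
    exit 0
}
trap save_checkpoint SIGTERM   # assumption: SIGTERM is what we get on preemption

# main work loop (placeholder); running 'sleep' in the background and using
# 'wait' lets the trap fire promptly when the signal arrives
while true; do
    sleep 60 &
    wait $!
done
</code>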

  * ''-''''-constraint="gpuram48G|gpuram40G"'' only consider nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined (see the example below)
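
For instance, an interactive shell on a node with at least 40 GB of GPU RAM might be requested like this (the ''--gres=gpu:1'' GRES name follows the usual SLURM convention and is an assumption here; adjust it to whatever the cluster actually defines):

<code>
srun --constraint="gpuram48G|gpuram40G" --gres=gpu:1 --mem=16G --pty bash
</code>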

\\
**Unexpected Behavior of ''srun -c1''**

When you execute a command using ''srun'' and pass ''-c1'', like
<code>srun -c1 date</code>
then the command is actually executed **twice in parallel**. To avoid this, either **remove the ''-c1''** or **add an explicit ''-n1''**.
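
For example, this runs the command exactly once while still requesting a single CPU per task:

<code>srun -c1 -n1 date</code>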

==== Delete Job ====

<code>scancel <job_id></code>

You can also cancel jobs by name:

<code>scancel -n <job_name></code>
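
To cancel all of your own jobs at once (a standard ''scancel'' option, so use with care):

<code>scancel -u $USER</code>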

To see all the available options type:

<code>man scancel</code>

==== Basic commands on cluster machines ====