====== ÚFAL Grid Engine (LRC) ======
  
LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of [[https://slurm.schedmd.com/|SLURM]] and uses [[https://www.lustre.org/|Lustre]] for [[internal:linux-network#directory-structure|data storage]].
  
Currently the following partitions (queues) are available for computing:

| **Partition name** | **Nodes**  | **Note** |
| cpu-troja      | 7x CPU | default partition |
| gpu-troja      | 6x GPU | features: gpuram48G,gpuram40G |
| gpu-ms         | 7x GPU | features: gpuram48G,gpuram24G |

In order to submit a job you need to log in to one of the head nodes:

   lrc1.ufal.hide.ms.mff.cuni.cz
   lrc2.ufal.hide.ms.mff.cuni.cz
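
You can log in to a head node with ''ssh''; for example (the username ''linguist'' is only an illustration, use your own ÚFAL login):

<code>
ssh linguist@lrc1.ufal.hide.ms.mff.cuni.cz
</code>
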
===== Basic usage =====

==== Batch mode ====

The core idea is that you write a batch script containing the commands you wish to run as well as a list of ''SBATCH'' directives specifying the resources or parameters that you need for your job.
Then the script is submitted to the cluster with:

<code>sbatch myJobScript.sh</code>

Here is a simple working example:

<code>
#!/bin/bash
#SBATCH -J helloWorld   # name of job
#SBATCH -p cpu-troja   # name of partition or queue (if not specified default partition is used)
#SBATCH -o helloWorld.out   # name of output file for this submission script
#SBATCH -e helloWorld.err   # name of error file for this submission script

# run my job (some executable)
sleep 5
echo "Hello I am running on cluster!"
</code>

After submitting this simple script you should end up with two files (''helloWorld.out'' and ''helloWorld.err'') in the directory where you called the ''sbatch'' command.
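
''sbatch'' prints the ID of the newly submitted job; you can use ''squeue'' to watch the job and then inspect its output files. A minimal sketch (the job ID below is illustrative):

<code>
$ sbatch myJobScript.sh
Submitted batch job 123456
$ squeue -u $USER       # check whether the job is still pending or running
$ cat helloWorld.out    # inspect the output once the job has finished
</code>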

Here is a list of other useful ''SBATCH'' directives:
<code>
#SBATCH -D /some/path/                        # change directory before executing the job
#SBATCH -N 2                                  # number of nodes (default 1)
#SBATCH --nodelist=node1,node2...             # required node, or comma separated list of required nodes
#SBATCH --cpus-per-task=4                     # number of cores/threads per task (default 1)
#SBATCH --gres=gpu:1                          # number of GPUs to request (default 0)
#SBATCH --mem=10G                             # request 10 gigabytes memory (per node, default depends on node)
</code>
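
To put these directives together, a GPU job script could look like the following sketch (the partition, resource amounts and the ''train.py'' script are only illustrative):

<code>
#!/bin/bash
#SBATCH -J trainModel               # name of job
#SBATCH -p gpu-troja                # GPU partition (see the partition table above)
#SBATCH --gres=gpu:1                # request 1 GPU
#SBATCH --cpus-per-task=4           # 4 CPU cores for the task
#SBATCH --mem=16G                   # 16 gigabytes of memory per node
#SBATCH -o trainModel.out           # name of output file
#SBATCH -e trainModel.err           # name of error file

# run the actual computation (train.py is a placeholder for your own program)
python train.py
</code>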

If you need, you can have SLURM send you email notifications about your job:

<code>
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-type=fail         # send email if job fails
#SBATCH --mail-user=<YourUFALEmailAccount>
</code>

As usual, the complete set of options can be found by typing:

<code>
man sbatch
</code>

==== Running jobs ====

In order to inspect all running jobs on the cluster use:

<code>
squeue
</code>

To filter only jobs of user ''linguist'':

<code>
squeue -u linguist
</code>

To filter only jobs on partition ''gpu-ms'':

<code>
squeue -p gpu-ms
</code>

To filter jobs in a specific state (see ''man squeue'' for the list of valid job states):
<code>
squeue -t RUNNING
</code>

To filter jobs running on a specific node:
<code>
squeue -w dll-3gpu1
</code>
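
These filters can be combined; for example, to list only the running jobs of user ''linguist'' on the GPU partitions:

<code>
squeue -u linguist -p gpu-troja,gpu-ms -t RUNNING
</code>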

==== Cluster info ====

The command ''sinfo'' can give you useful information about the nodes available in the cluster. Here is a short list of examples:

List available partitions (queues). The default partition is marked with ''*'':
<code>
sinfo
</code>

List detailed info about nodes:
<code>
sinfo -l -N
</code>

List nodes with some custom format info:
<code>
sinfo -N -o "%N %P %.11T %.15f"
</code>
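
Another common use is checking which nodes are currently free; for example, to list the idle nodes of one partition (the partition name is just an example):

<code>
sinfo -N -p gpu-troja -t idle
</code>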
  
==== Interactive mode ====
<code>srun -p cpu-troja --mem=64G --pty bash</code>
  
  * ''-p cpu-troja'' explicitly requires partition ''cpu-troja''. If not specified, SLURM will use the default partition.
  * ''--mem=64G'' requires 64G of memory for the job
  
<code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code>
  * ''-p gpu-troja,gpu-ms'' requires only nodes from these two partitions
  * ''--nodelist=tdll-3gpu1'' explicitly requires one specific node
  * ''--gres=gpu:2'' requires 2 GPUs

<code>srun -p gpu-troja --constraint="gpuram44G|gpuram39G" --mem=64G --gres=gpu:2 --pty bash</code>
  * ''--constraint="gpuram44G|gpuram39G"'' only considers nodes that have either the ''gpuram44G'' or the ''gpuram39G'' feature defined
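
''srun'' can also run a single command on an allocated node without opening an interactive shell; a small sketch:

<code>
srun -p cpu-troja --mem=4G hostname    # runs "hostname" on the allocated node and prints its name
</code>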
  
To see all the available options type:

<code>man srun</code>
  
