Differences
This shows you the differences between two versions of the page.
|
slurm [2022/08/31 13:51] vodrazka [Batch mode] |
slurm [2025/10/15 18:09] (current) straka [ÚFAL Grid Engine (LRC)] |
||
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== ÚFAL Grid Engine (LRC) ====== | ====== ÚFAL Grid Engine (LRC) ====== | ||
| - | LRC (Linguistic Research Cluster) | + | **IN 2024: Newly, all the documentation |
| + | LRC (Linguistic Research Cluster) is the name of ÚFAL' | ||
| + | |||
| + | See Milan Straka' | ||
| + | |||
| + | * https:// | ||
| + | * https:// | ||
| + | |||
| + | Currently, the following partitions (queues) are available for computing: | ||
| + | |||
| + | ===== Node list by partitions ===== | ||
| + | |||
| + | The naming convention is straightforward for CPU nodes: the nodes in each group are simply numbered. For GPU nodes the format is [t]dll-**X**gpu**N**, where **X** is the total number of GPUs installed in the node and **N** is the index of the node among those with the same configuration. | ||
| + | The prefix **t** is for nodes at Troja and **dll** stands for Deep Learning Laboratory. | ||
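The naming scheme can be checked mechanically. The following is only a minimal sketch: the function name ''parse_gpu_node'' is hypothetical, and the mapping of the **t** prefix to the ''gpu-troja'' partition (and of its absence to ''gpu-ms'') is inferred from the node tables on this page, not from any SLURM setting.

```shell
#!/bin/bash
# Hypothetical helper (not part of SLURM): split a GPU node name such as
# "tdll-3gpu1" into partition, GPU count (X) and node index (N).
parse_gpu_node() {
    local name=$1
    # Optional "t" prefix, then "dll-", GPU count X, "gpu", node index N.
    local re='^(t?)dll-([0-9]+)gpu([0-9]+)$'
    if [[ $name =~ $re ]]; then
        # Assumption taken from the tables on this page:
        # "t" nodes are in gpu-troja, the others in gpu-ms.
        local partition="gpu-ms"
        if [[ ${BASH_REMATCH[1]} == "t" ]]; then
            partition="gpu-troja"
        fi
        echo "partition=$partition gpus=${BASH_REMATCH[2]} index=${BASH_REMATCH[3]}"
    else
        echo "not a GPU node name: $name" >&2
        return 1
    fi
}

parse_gpu_node tdll-3gpu1
parse_gpu_node dll-10gpu2
```

CPU node names such as ''achilles1'' do not follow this pattern, so the helper rejects them with a non-zero exit status.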
| + | ==== cpu-troja ==== | ||
| + | |||
| + | | Node name | Thread count | Socket: | ||
| + | | achilles[1-8] | 32 | 2:8:2 | 128810 | | ||
| + | | hector[1-8] | 32 | 2:8:2 | 128810 | | ||
| + | | helena[1-8] | 32 | 2:8:2 | 128811 | | ||
| + | | paris[1-8] | 32 | 2:8:2 | 128810 | | ||
| + | | hyperion[2-8] | 64 | 2:16:2 | 257667 | | ||
| + | ==== cpu-ms ==== | ||
| + | |||
| + | | Node name | Thread count | Socket: | ||
| + | | iridium | 16 | 2:4:2 | 515977 | | ||
| + | | orion[1-8] | 40 | 2:10:2 | 128799 | | ||
| + | ==== gpu-troja ==== | ||
| + | |||
| + | | Node name | Thread count | Socket: | ||
| + | | tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 | | ||
| + | | tdll-8gpu[1, | ||
| + | | tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 | | ||
| + | ==== gpu-ms ==== | ||
| + | |||
| + | | Node name | Thread count | Socket: | ||
| + | | dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 | | ||
| + | | dll-4gpu[1, | ||
| + | | dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 | | ||
| + | | dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 | | ||
| + | | dll-8gpu[1, | ||
| + | | dll-8gpu[3, | ||
| + | | dll-8gpu[5, | ||
| + | | dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 | | ||
| + | | dll-10gpu[2, | ||
| + | |||
| + | |||
| + | ==== Submit nodes ==== | ||
| + | |||
| + | |||
| + | In order to submit a job you need to log in to one of the head nodes: | ||
| + | |||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| ===== Basic usage ===== | ===== Basic usage ===== | ||
| Line 17: | Line 74: | ||
| #!/bin/bash | #!/bin/bash | ||
| #SBATCH -J helloWorld | #SBATCH -J helloWorld | ||
| - | #SBATCH -p cpu-troja | + | #SBATCH -p cpu-troja |
| #SBATCH -o helloWorld.out | #SBATCH -o helloWorld.out | ||
| #SBATCH -e helloWorld.err | #SBATCH -e helloWorld.err | ||
| Line 28: | Line 85: | ||
| After submitting this simple code you should end up with the two files ('' | After submitting this simple code you should end up with the two files ('' | ||
| - | Here is the list of other useful '' | + | Here is the list of other useful '' |
| < | < | ||
| + | #SBATCH -D / | ||
| #SBATCH -N 2 # number of nodes (default 1) | #SBATCH -N 2 # number of nodes (default 1) | ||
| - | #SBATCH --nodelist=node1, | + | #SBATCH --nodelist=node1, |
| - | #SBATCH -c 4 # number of cores/ | + | #SBATCH --cpus-per-task=4 |
| #SBATCH --gres=gpu: | #SBATCH --gres=gpu: | ||
| #SBATCH --mem=10G | #SBATCH --mem=10G | ||
| </ | </ | ||
| + | |||
| + | If needed, you can have SLURM notify you by e-mail: | ||
| + | |||
| + | < | ||
| + | #SBATCH --mail-type=begin | ||
| + | #SBATCH --mail-type=end | ||
| + | #SBATCH --mail-type=fail | ||
| + | #SBATCH --mail-user=< | ||
| + | </ | ||
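The three ''--mail-type'' lines can also be written as one, because ''sbatch'' accepts a comma-separated list of event types. A minimal sketch follows; the address is a placeholder (an assumption for illustration), so replace it with your own.

```shell
#!/bin/bash
# Send mail when the job begins, ends, or fails (one combined directive).
#SBATCH --mail-type=BEGIN,END,FAIL
# Placeholder address, replace with your own e-mail.
#SBATCH --mail-user=your.login@example.org

echo "job body goes here"
```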
| + | |||
| + | As usual, the complete set of options can be found by typing: | ||
| + | |||
| + | < | ||
| + | man sbatch | ||
| + | </ | ||
| + | |||
| + | === Rudolf' | ||
| + | |||
| + | The main point is for log files to have the job name and job id in them automatically. | ||
| + | |||
| + | < | ||
| + | #SBATCH -J RuRjob | ||
| + | #SBATCH -o %x.%j.out | ||
| + | #SBATCH -e %x.%j.err | ||
| + | #SBATCH -p gpu-troja | ||
| + | #SBATCH --gres=gpu: | ||
| + | #SBATCH --mem=16G | ||
| + | #SBATCH --constraint=" | ||
| + | |||
| + | # Print each command to STDERR before executing (expanded), prefixed by "+ " | ||
| + | set -o xtrace | ||
| + | </ | ||
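In the ''-o'' and ''-e'' patterns, ''%x'' is replaced by the job name and ''%j'' by the job ID. The sketch below only imitates that substitution to show the resulting file names. The function ''expand_log_pattern'' and the job ID ''12345'' are made up for illustration, and this is not how SLURM itself implements the expansion.

```shell
#!/bin/bash
# Illustration only: mimic how "%x" (job name) and "%j" (job ID)
# are filled into an sbatch output file name pattern.
expand_log_pattern() {
    local pattern=$1 job_name=$2 job_id=$3
    # Replace every literal "%x" with the job name ...
    local result=${pattern//\%x/$job_name}
    # ... and every literal "%j" with the job ID.
    result=${result//\%j/$job_id}
    echo "$result"
}

expand_log_pattern "%x.%j.out" RuRjob 12345
expand_log_pattern "%x.%j.err" RuRjob 12345
```

With the job name from the template above, each submission therefore gets its own pair of log files, and repeated runs do not overwrite each other.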
| + | |||
| + | ==== Inspecting jobs ==== | ||
| + | |||
| + | In order to inspect all running jobs on the cluster use: | ||
| + | |||
| + | < | ||
| + | squeue | ||
| + | </ | ||
| + | |||
| + | filter only my jobs | ||
| + | |||
| + | < | ||
| + | squeue --me | ||
| + | </ | ||
| + | |||
| + | filter only jobs of user '' | ||
| + | |||
| + | < | ||
| + | squeue -u linguist | ||
| + | </ | ||
| + | |||
| + | filter only jobs on partition '' | ||
| + | |||
| + | < | ||
| + | squeue -p gpu-ms | ||
| + | </ | ||
| + | |||
| + | filter jobs in specific state (see '' | ||
| + | < | ||
| + | squeue -t RUNNING | ||
| + | </ | ||
| + | |||
| + | filter jobs running on a specific node: | ||
| + | < | ||
| + | squeue -w dll-3gpu1 | ||
| + | </ | ||
| + | |||
| + | ==== Cluster info ==== | ||
| + | |||
| + | The command '' | ||
| + | |||
| + | List available partitions (queues). The default partition is marked with '' | ||
| + | < | ||
| + | sinfo | ||
| + | </ | ||
| + | |||
| + | List detailed info about nodes: | ||
| + | < | ||
| + | sinfo -l -N | ||
| + | </ | ||
| + | |||
| + | List nodes with some custom format info: | ||
| + | < | ||
| + | sinfo -N -o "%N %P %.11T %.15f" | ||
| + | </ | ||
| + | |||
| + | === CPU core allocation === | ||
| + | |||
| + | The minimal computing resource in SLURM is one CPU core. However, the CPU count advertised by SLURM corresponds to the number of CPU threads. | ||
| + | If you ask for 1 CPU core with < | ||
| + | |||
| + | For example '' | ||
| + | |||
| + | < | ||
| + | $ scontrol show node dll-8gpu1 | ||
| + | NodeName=dll-8gpu1 Arch=x86_64 CoresPerSocket=16 | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | | ||
| + | </ | ||
| + | |||
| + | In the example above you can see comments at all lines relevant to CPU allocation. | ||
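As a rough cross-check, the thread count of a node can be computed from the triplets shown in the node tables. This sketch assumes that a triplet such as ''2:8:2'' means sockets : cores per socket : threads per core; the function name ''threads_from_sct'' is hypothetical.

```shell
#!/bin/bash
# Multiply a "sockets:cores:threads" triplet to get the number of CPU threads.
# Assumption: the triplet is sockets : cores per socket : threads per core.
threads_from_sct() {
    local sockets cores threads
    IFS=: read -r sockets cores threads <<< "$1"
    echo $(( sockets * cores * threads ))
}

threads_from_sct 2:8:2    # e.g. achilles[1-8]
threads_from_sct 2:16:2   # e.g. hyperion[2-8]
```

For most nodes the result equals the listed thread count. Some rows differ (for instance ''dll-4gpu3'' lists 62 threads for ''1:32:2''), so treat the product as an upper bound rather than the exact number SLURM advertises.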
| + | |||
| + | === Priority === | ||
| + | |||
| + | When running srun or sbatch, you can pass '' | ||
| + | |||
| + | The preemption has probably not been used by anyone yet; some documentation about it is on https:// | ||
| ==== Interactive mode ==== | ==== Interactive mode ==== | ||
| Line 46: | Line 229: | ||
| There are many more parameters available to use. For example: | There are many more parameters available to use. For example: | ||
| - | < | + | **To get an interactive CPU job with 64GB of reserved memory:** |
| + | < | ||
| + | |||
| + | * '' | ||
| + | * '' | ||
| + | |||
| + | **To get an interactive job with a single GPU of any kind:** | ||
| + | < | ||
| + | * '' | ||
| + | * '' | ||
| + | |||
| + | < | ||
| + | * '' | ||
| + | * '' | ||
| + | * Note that e.g. '' | ||
| + | * '' | ||
| + | |||
| + | < | ||
| + | * '' | ||
| + | |||
| + | |||
| + | \\ | ||
| + | **Unexpected Behavior of '' | ||
| + | When you execute a command using '' | ||
| + | < | ||
| + | then the command is actually executed **twice in parallel**. To avoid it, you have to either **remove the '' | ||
| + | ==== Delete Job ==== | ||
| + | < | ||
| + | |||
| + | < | ||
| - | Where: | ||
| - | * '' | ||
| - | * '' | ||
| To see all the available options type: | To see all the available options type: | ||
| - | < | + | < |
| + | |||
| + | ==== Basic commands on cluster machines ==== | ||
| + | |||
| + | lspci | ||
| + | # is any such hardware there? | ||
| + | nvidia-smi | ||
| + | # more details, incl. running processes on the GPU | ||
| + | # nvidia-* are typically located in /usr/bin | ||
| + | watch nvidia-smi | ||
| + | # For monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!) | ||
| + | # You can also use nvidia-smi -l TIME | ||
| + | nvcc --version | ||
| + | # this should report the CUDA version | ||
| + | # nvcc is typically installed in / | ||
| + | theano-test | ||
| + | # does it do anything useful at all? :-) | ||
| + | # theano-* are typically located in / | ||
| + | / | ||
| + | # shows CUDA capability etc. | ||
| + | ssh dll1; ~popel/ | ||
| + | # who occupies which card on a given machine | ||
| + | |||
| + | |||
| + | |||
| + | ===== See also ===== | ||
| + | |||
| + | https:// | ||
