slurm [2022/10/25 15:26] vodrazka [Submit nodes]
slurm [2024/10/02 15:22] (current) popel
====== ÚFAL Grid Engine (LRC) ======

**IN 2024: Newly, all the documentation is at a dedicated wiki https://**

LRC (Linguistic Research Cluster) is the name of ÚFAL's computational cluster.
+ | |||
+ | See Milan Straka' | ||
+ | |||
+ | * https:// | ||
+ | * https:// | ||
+ | * https:// | ||
Currently there are following partitions (queues) available for computing: | Currently there are following partitions (queues) available for computing: | ||
===== Node list by partitions =====
The naming convention is straightforward for CPU nodes: nodes in each group are simply numbered. For GPU nodes the format is [t]dll-**X**gpu**N**, where **X** gives the total number of GPUs in the node and **N** enumerates the nodes with the given configuration. The prefix **t** is for nodes at Troja, and **dll** stands for Deep Learning Laboratory.
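Slurm itself can expand this bracket notation on the cluster, e.g. ''scontrol show hostnames achilles[1-8]''. As a stand-alone illustration, here is a plain-bash sketch of the same expansion (the ''expand_nodes'' helper is hypothetical, not part of the cluster tooling):

```shell
#!/bin/bash
# Hypothetical helper: expand the bracket notation from the node tables,
# e.g. achilles[1-8] -> achilles1 ... achilles8 (one name per line).
# On the cluster, `scontrol show hostnames 'achilles[1-8]'` does the same.
expand_nodes() {
  local prefix=$1 first=$2 last=$3
  seq -f "${prefix}%g" "$first" "$last"
}

expand_nodes achilles 1 8   # prints achilles1 through achilles8, one per line
```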
==== cpu-troja ====
| Node name | Thread count | Socket:Core:Thread | Memory (MB) |
| achilles[1-8] | 32 | 2:8:2 | 128810 |
| hector[1-8] | 32 | 2:8:2 | 128810 |
| helena[1-8] | 32 | 2:8:2 | 128811 |
| paris[1-8] | 32 | 2:8:2 | 128810 |
| hyperion[2-8] | 64 | 2:16:2 | 257667 |
==== cpu-ms ====
| Node name | Thread count | Socket:Core:Thread | Memory (MB) |
| iridium | 16 | 2:4:2 | 515977 |
| orion[1-8] | 40 | 2:10:2 | 128799 |
==== gpu-troja ====
| Node name | Thread count | Socket:Core:Thread | Memory (MB) | Features | GPU type |
| tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 |
| tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 |
==== gpu-ms ====
| Node name | Thread count | Socket:Core:Thread | Memory (MB) | Features | GPU type |
| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | |
| dll-4gpu3 | 62 | 1:32:2 | 515652 | | |
| dll-4gpu4 | 30 | 1:16:2 | 257616 | | |
| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | |
| dll-8gpu[3,4] | 32 | 2:8:2 | 253721 | gpuram16G gpu_cc8.6 | |
| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | |
| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | |
#!/bin/bash
#SBATCH -J helloWorld
#SBATCH -p cpu-troja
#SBATCH -o helloWorld.out
#SBATCH -e helloWorld.err
#SBATCH -D /
#SBATCH -N 2                     # number of nodes (default 1)
#SBATCH --nodelist=node1,
#SBATCH --cpus-per-task=4
#SBATCH --gres=gpu:
</code>
=== Rudolf's template ===

The main point is for the log files to have the job name and job id in them automatically.

<code>
#SBATCH -J RuRjob
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p gpu-troja
#SBATCH --gres=gpu:
#SBATCH --mem=16G
#SBATCH --constraint="

# Print each command to STDERR before executing (expanded), prefixed by "+ "
set -o xtrace
</code>
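To see what the ''set -o xtrace'' line in the template buys you, here is a small cluster-independent demonstration of the trace lines prefixed with "+ " (a sketch, nothing Slurm-specific):

```shell
#!/bin/bash
# Demonstrate the effect of `set -o xtrace`: each command is echoed
# to STDERR (prefixed with "+ ") before it is executed, which is what
# ends up in the .err log when using the template above.
output=$(bash -c 'set -o xtrace; echo hello' 2>&1)
printf '%s\n' "$output"
# prints:
# + echo hello
# hello
```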
+ | |||
+ | ==== Inspecting | ||
In order to inspect all running jobs on the cluster use:
<code>
squeue
</code>

To filter only your own jobs:

<code>
squeue --me
</code>
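Beyond the default columns, ''squeue'' accepts a format string. The fragment below is a sketch using standard format specifiers (''%i'' job id, ''%P'' partition, ''%j'' job name, ''%T'' state, ''%M'' elapsed time, ''%R'' reason or node list); it is only meaningful on a cluster login node, so no output is shown:

```shell
# Choose your own squeue columns; the widths (the .N part) are a matter of taste.
squeue --me -o "%.10i %.9P %.25j %.8T %.10M %R"
```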
In the example above you can see comments on all the lines relevant to CPU allocation.
=== Priority ===
When running ''srun'' or ''sbatch'', you can pass ''
Preemption has probably not been used by anyone yet; some documentation about it is at https://
==== Interactive mode ====
  * ''
  * ''

**To get an interactive job with a single GPU of any kind:**

  * ''
  * ''
  * ''
  * ''
  * Note that e.g. ''
  * ''
+ | |||
+ | |||
+ | \\ | ||
+ | **Unexpected Behavior of '' | ||
+ | When you execute a command using '' | ||
+ | < | ||
+ | then the command is actually executed **twice in parallel**. To avoid it, you have to either **remove the '' | ||
==== Delete Job ====

<code>
scancel <job_id>
</code>

To see all the available options type:

<code>
scancel --help
</code>
==== Basic commands on cluster machines ====

<code>
lspci
  # is any such hardware there?
nvidia-smi
  # more details, incl. running processes on the GPU
  # nvidia-* are typically located in /usr/bin
watch nvidia-smi
  # for monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
  # you can also use nvidia-smi -l TIME
nvcc --version
  # this should tell the CUDA version
  # nvcc is typically installed in /
theano-test
  # does this actually do anything useful? :-)
  # theano-* are typically located in /
/
  # shows CUDA capability etc.
ssh dll1; ~popel/
  # who occupies which card on a given machine
</code>
===== See also =====