Page revisions: slurm [2023/01/19 15:41] vodrazka [gpu-troja] → slurm [2025/10/15 18:09] (current) straka [ÚFAL Grid Engine (LRC)]
====== ÚFAL Grid Engine (LRC) ======
| + | |||
| + | **IN 2024: Newly, all the documentation is at a dedicated wiki https:// | ||
| LRC (Linguistic Research Cluster) is the name of ÚFAL' | LRC (Linguistic Research Cluster) is the name of ÚFAL' | ||
| + | |||
| + | See Milan Straka' | ||
| + | |||
| + | * https:// | ||
| + | * https:// | ||
Currently, the following partitions (queues) are available for computing:
| Node name | Thread count | Socket:Core:Thread | Memory (MB) | Features | GPU type |
| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 |
| dll-4gpu3 | 62 | 1:32:2 | 515652 | | |
| dll-4gpu4 | 30 | 1:16:2 | 257616 | | |
| dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 |
| dll-8gpu[3,4] | 32 | 2:8:2 | | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
| dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 |
| dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 |
| dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti |
</code>
=== Rudolf's sbatch header ===

The main point is for the log files to have the job name and job id in them automatically.

<code>
#SBATCH -J RuRjob
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p gpu-troja
#SBATCH --gres=gpu:
#SBATCH --mem=16G
#SBATCH --constraint="

# Print each command to STDERR before executing (expanded), prefixed by "+ "
set -o xtrace
</code>
==== Inspecting jobs ====
| In order to inspect all running jobs on the cluster use: | In order to inspect all running jobs on the cluster use: | ||
<code>
squeue
</code>

To filter only your own jobs:

<code>
squeue --me
</code>
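''squeue --me'' is a shortcut for filtering by your own user name (''squeue -u $USER''). As a sketch of what that filter does, here it is reproduced with ''awk'' on fabricated sample output (the column layout is abbreviated; real ''squeue'' output has more columns):

```shell
# Reproduce the effect of 'squeue --me' (= 'squeue -u $USER') with awk,
# on a fabricated, abbreviated sample of squeue output:
sample='JOBID PARTITION NAME USER ST
101 cpu-troja train alice R
102 gpu-troja eval bob R'
# keep the header line plus only the rows whose USER column matches
printf '%s\n' "$sample" | awk -v u="alice" 'NR==1 || $4 == u'
```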
In the example above you can see comments on all lines relevant to CPU allocation.
=== Priority ===

When running srun or sbatch, you can pass ''
Preemption has probably not been used by anyone yet; some documentation about it is at https://
==== Interactive mode ====
  * ''
| + | |||
| + | \\ | ||
| + | **Unexpected Behavior of '' | ||
| + | When you execute a command using '' | ||
| + | < | ||
| + | then the command is actually executed **twice in parallel**. To avoid it, you have to either **remove the '' | ||
==== Delete Job ====

<code>
scancel <job_id>
</code>

To see all the available options type:

<code>
scancel --help
</code>
==== Basic commands on cluster machines ====

<code>
lspci
  # is any such hardware there?
nvidia-smi
  # more details, incl. running processes on the GPU
  # nvidia-* are typically located in /usr/bin
watch nvidia-smi
  # for monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
  # you can also use nvidia-smi -l TIME
nvcc --version
  # this should tell the CUDA version
  # nvcc is typically installed in /
theano-test
  # does it actually do anything useful? :-)
  # theano-* are typically located in /
/
  # shows CUDA capability etc.
ssh dll1; ~popel/
  # who occupies which card on a given machine
</code>
===== See also =====
