====== ÚFAL Grid Engine (LRC) ======

**As of 2024, all the documentation is at a dedicated wiki: https://**

LRC (Linguistic Research Cluster) is the name of ÚFAL's computing grid/cluster.

See Milan Straka's materials:

  * https://
  * https://
  * https://

Currently, the following partitions (queues) are available for computing:
^ Node ^ CPUs ^ Sockets:Cores:Threads ^ RAM (MB) ^ Features ^ GPU ^
| dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-4gpu[1, | | | | | |
| dll-4gpu3 | 62 | 1:32:2 | 515652 | gpuram48G gpu_cc8.9 | NVIDIA L40 |
| dll-4gpu4 | 30 | 1:16:2 | 257616 | gpuram48G gpu_cc8.6 | NVIDIA A40 |
| dll-8gpu[1, | | | | | |
| dll-8gpu[3, | | | | | |
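
The partition and node parameters can also be checked directly with standard Slurm commands (a hedged sketch; the node name is only an example taken from the table above):

<code>
# list partitions with their nodes, CPU counts, memory (MB) and GRES (GPUs)
sinfo -o "%P %N %c %m %G"

# show full details of one node, including its Features (usable with --constraint)
scontrol show node dll-3gpu1
</code>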

=== Rudolf's sbatch header template ===

The main point is for the log files to have the job name and the job ID in them automatically.

<code>
#SBATCH -J RuRjob
#SBATCH -o %x.%j.out
#SBATCH -e %x.%j.err
#SBATCH -p gpu-troja
#SBATCH --gres=gpu:
#SBATCH --mem=16G
#SBATCH --constraint="

# Print each command to STDERR before executing (expanded), prefixed by "+ "
set -o xtrace
</code>
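
A minimal usage sketch (the script name below is made up; ''%x'' and ''%j'' are the standard sbatch placeholders for job name and job ID): put the header at the top of your batch script, submit it, and the logs end up in files such as ''RuRjob.123456.out''.

<code>
# submit the batch script (hypothetical file name)
sbatch my_experiment.sh

# options from the header can also be overridden on the command line
sbatch --mem=32G my_experiment.sh
</code>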
+ | |||
+ | ==== Inspecting | ||
In order to inspect all running jobs on the cluster use: | In order to inspect all running jobs on the cluster use: | ||

<code>
squeue
</code>

To filter only your own jobs:

<code>
squeue --me
</code>
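
The displayed columns can be customized with the standard ''-o''/''--format'' option of squeue (a sketch, not LRC-specific):

<code>
# job ID, name, state, elapsed time, node count, and reason/nodelist
squeue --me -o "%.10i %.20j %.8T %.10M %.6D %R"
</code>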

In the example above, you can see comments on all lines relevant to CPU allocation.

=== Priority ===
When running srun or sbatch, you can pass ''
Preemption has probably not been used by anyone yet; some documentation about it is at https://
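
As a generic illustration of stock Slurm priority tools (not necessarily the option meant above, and subject to the cluster configuration): ''--nice'' voluntarily lowers a job's priority, and ''sprio'' shows the priority factors of a pending job.

<code>
# submit a job with lowered scheduling priority (script name is made up)
sbatch --nice=100 my_experiment.sh

# show the priority factors of a pending job (made-up job ID)
sprio -j 123456
</code>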

==== Interactive mode ====

  * ''
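
As a hedged sketch (the partition and resource values are only examples), an interactive shell on a GPU node can typically be started with:

<code>
# request an interactive shell with 1 GPU and 16G RAM (example values)
srun -p gpu-troja --gres=gpu:1 --mem=16G --pty bash
</code>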
+ | |||
+ | \\ | ||
+ | **Unexpected Behavior of '' | ||
+ | When you execute a command using '' | ||
+ | < | ||
+ | then the command is actually executed **twice in parallel**. To avoid it, you have to either **remove the '' | ||
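
One common way this happens in stock Slurm (not necessarily the exact case meant above) is that ''srun'' launches the given command once per allocated task, so with two tasks it runs twice:

<code>
# "hostname" is executed twice in parallel, once per task
srun --ntasks=2 hostname
</code>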

==== Delete Job ====

<code>scancel <JOB_ID></code>

To see all the available options type:

<code>scancel --help</code>

==== Basic commands on cluster machines ====

<code>
lspci
  # is any such hardware there?
nvidia-smi
  # more details, incl. running processes on the GPU
  # nvidia-* are typically located in /usr/bin
watch nvidia-smi
  # For monitoring GPU activity in a separate terminal (thanks to Jindrich Libovicky for this!)
  # You can also use nvidia-smi -l TIME
nvcc --version
  # this should tell the CUDA version
  # nvcc is typically installed in /
theano-test
  # does this actually do anything useful? :-)
  # theano-* are typically located in /
/
  # shows CUDA capability etc.
ssh dll1; ~popel/
  # who occupies which card on a given machine
</code>

===== See also =====