slurm [2022/10/25 14:01] vodrazka [ÚFAL Grid Engine (LRC)] |
slurm [2023/01/19 15:44] vodrazka [gpu-ms] |
===== Node list by partitions ===== | ===== Node list by partitions ===== |
| |
| The naming convention is straightforward for CPU nodes: the nodes in each group are simply numbered. For GPU nodes the format is [t]dll-**X**gpu**N**, where **X** is the total number of GPUs in the node and **N** is the index of the node among those with the same configuration. |
| The prefix **t** is for nodes at Troja and **dll** stands for Deep Learning Laboratory. |
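
The naming convention for GPU nodes can be decoded mechanically. The following is a minimal sketch, not part of the cluster tooling: the function name ''decode_gpu_node'' and its output format are made up for illustration.

```shell
#!/bin/bash
# Decode a GPU node name of the form [t]dll-<X>gpu<N>.
# decode_gpu_node is a hypothetical helper, not a Slurm or cluster command.
decode_gpu_node() {
    local name="$1"
    if [[ "$name" =~ ^(t?)dll-([0-9]+)gpu([0-9]+)$ ]]; then
        local troja="no"
        # A leading "t" marks a node located at Troja.
        if [[ -n "${BASH_REMATCH[1]}" ]]; then
            troja="yes"
        fi
        echo "troja=${troja} gpus=${BASH_REMATCH[2]} index=${BASH_REMATCH[3]}"
    else
        echo "not a GPU node name: ${name}" >&2
        return 1
    fi
}

decode_gpu_node tdll-8gpu3
decode_gpu_node dll-10gpu1
```

CPU node names such as ''achilles1'' do not match the pattern, so the function returns a non-zero status for them.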
==== cpu-troja ==== | ==== cpu-troja ==== |
| |
| | Node name | Thread count | Socket:Core:Thread | RAM (MB) | |
| | achilles[1-8] | 32 | 2:8:2 | 128810 | |
| | hector[1-8] | 32 | 2:8:2 | 128810 | |
| | helena[1-8] | 32 | 2:8:2 | 128811 | |
| | paris[1-8] | 32 | 2:8:2 | 128810 | |
| | hyperion[2-8] | 64 | 2:16:2 | 257667 | |
==== cpu-ms ==== | ==== cpu-ms ==== |
| |
| | Node name | Thread count | Socket:Core:Thread | RAM (MB) | |
| | iridium | 16 | 2:4:2 | 515977 | |
| | orion[1-8] | 40 | 2:10:2 | 128799 | |
==== gpu-troja ==== | ==== gpu-troja ==== |
| |
| | Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type | |
| | tdll-3gpu[1-4] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 | |
| | tdll-8gpu[1,2] | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 | NVIDIA A100 | |
| | tdll-8gpu[3-7] | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 | NVIDIA Quadro P5000 | |
==== gpu-ms ==== | ==== gpu-ms ==== |
| |
| | Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features | GPU type | |
| | dll-3gpu[1-5] | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 | NVIDIA A40 | |
| | dll-4gpu[1,2] | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 | NVIDIA RTX 3090 | |
| | dll-8gpu[1,2] | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 | NVIDIA A30 | |
| | dll-8gpu[3,4] | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 | |
| | dll-8gpu[5,6] | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 | NVIDIA Quadro RTX 5000 | |
| | dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 | NVIDIA RTX A4000 | |
| | dll-10gpu[2,3] | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 | NVIDIA GeForce GTX 1080 Ti | |
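
The tables above are a snapshot and may lag behind the real configuration. As a rough sketch, the same columns can be queried from Slurm itself with ''sinfo''; the variable name ''SINFO_FORMAT'' is only illustrative, and the command has to be run on a machine where the Slurm client tools are installed.

```shell
#!/bin/bash
# Format specifiers: %N node name, %c CPU (thread) count,
# %z sockets:cores:threads, %m memory in MB, %f features, %G generic resources (GPUs).
SINFO_FORMAT="%N %c %z %m %f %G"

if command -v sinfo >/dev/null 2>&1; then
    # -N prints one line per node instead of grouping identical nodes.
    sinfo -N -p gpu-troja,gpu-ms -o "$SINFO_FORMAT"
else
    echo "sinfo not found; run this on one of the head nodes" >&2
fi
```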
| |
| |
| ==== Submit nodes ==== |
| |
| |
In order to submit a job you need to log in to one of the head nodes: | In order to submit a job you need to log in to one of the head nodes: |
lrc1.ufal.hide.ms.mff.cuni.cz | lrc1.ufal.hide.ms.mff.cuni.cz |
lrc2.ufal.hide.ms.mff.cuni.cz | lrc2.ufal.hide.ms.mff.cuni.cz |
| sol1.ufal.hide.ms.mff.cuni.cz |
| sol2.ufal.hide.ms.mff.cuni.cz |
| sol3.ufal.hide.ms.mff.cuni.cz |
| sol4.ufal.hide.ms.mff.cuni.cz |
===== Basic usage ===== | ===== Basic usage ===== |
| |
#!/bin/bash | #!/bin/bash |
#SBATCH -J helloWorld # name of job | #SBATCH -J helloWorld # name of job |
#SBATCH -p cpu-troja # name of partition or queue (if not specified default partition is used) | #SBATCH -p cpu-troja # name of partition or queue (default=cpu-troja) |
#SBATCH -o helloWorld.out # name of output file for this submission script | #SBATCH -o helloWorld.out # name of output file for this submission script |
#SBATCH -e helloWorld.err # name of error file for this submission script | #SBATCH -e helloWorld.err # name of error file for this submission script |
#SBATCH -D /some/path/ # change directory before executing the job | #SBATCH -D /some/path/ # change directory before executing the job |
#SBATCH -N 2 # number of nodes (default 1) | #SBATCH -N 2 # number of nodes (default 1) |
#SBATCH --nodelist=node1,node2... # required node, or comma separated list of required nodes | #SBATCH --nodelist=node1,node2... # execute on *all* the specified nodes (and possibly more) |
#SBATCH --cpus-per-task=4 # number of cores/threads per task (default 1) | #SBATCH --cpus-per-task=4 # number of cores/threads per task (default 1) |
#SBATCH --gres=gpu:1 # number of GPUs to request (default 0) | #SBATCH --gres=gpu:1 # number of GPUs to request (default 0) |
| |
* ''-p cpu-troja'' explicitly requires partition ''cpu-troja''. If not specified, Slurm uses the default partition. | * ''-p cpu-troja'' explicitly requires partition ''cpu-troja''. If not specified, Slurm uses the default partition. |
* ''--mem=64G'' requires 64G of memory for the job | * ''-''''-mem=64G'' requires 64G of memory for the job |
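
Putting the directives above together, a complete submission could look roughly like the sketch below. The file name ''helloWorld.sh'', the ''--mem=1G'' value and the ''echo'' payload are assumptions chosen for illustration; the script is only submitted if ''sbatch'' is actually available.

```shell
#!/bin/bash
# Write a minimal job script to a file (the file name is arbitrary).
job_script="helloWorld.sh"
cat > "$job_script" <<'EOF'
#!/bin/bash
#SBATCH -J helloWorld
#SBATCH -p cpu-troja
#SBATCH -o helloWorld.out
#SBATCH -e helloWorld.err
#SBATCH --mem=1G

echo "Hello from $(hostname)"
EOF

# Submit only where the Slurm client tools exist (i.e. on a head node).
if command -v sbatch >/dev/null 2>&1; then
    sbatch "$job_script"
else
    echo "sbatch not found; copy $job_script to a head node and submit it there" >&2
fi
```

After a successful submission the job's standard output ends up in ''helloWorld.out'' and its standard error in ''helloWorld.err'', as requested by the ''-o'' and ''-e'' directives.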
| |
**To get an interactive job with a single GPU of any kind:** | **To get an interactive job with a single GPU of any kind:** |
<code>srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash</code> | <code>srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash</code> |
* ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions | * ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions |
* ''--gres=gpu:1'' requires 1 GPU | * ''-''''-gres=gpu:1'' requires 1 GPU |
| |
<code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code> | <code>srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash</code> |
* ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions | * ''-p gpu-troja,gpu-ms'' restricts the job to nodes from these two partitions |
* ''--nodelist=tdll-3gpu1'' explicitly requires one specific node | * ''-''''-nodelist=tdll-3gpu1'' explicitly requires one specific node |
* ''--gres=gpu:2'' requires 2 GPUs | * Note that e.g. ''-''''-nodelist=tdll-3gpu[1-4]'' would allocate **all** four machines ''tdll-3gpu[1-4]'' to the job, not just one of them. The documentation says "The job will contain all of these hosts and possibly additional hosts as needed to satisfy resource requirements." I am not aware of any [[https://stackoverflow.com/a/37555321/3310232|simple way]] to specify that **any** of the listed nodes can be used, i.e. an equivalent of SGE ''-q '*@hector[14]'''. |
| * ''-''''-gres=gpu:2'' requires 2 GPUs |
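
One possible workaround for the "any of these nodes" case is to turn the request around and exclude the unwanted nodes with ''--exclude'', so that the scheduler may pick any node that remains in the partition. The sketch below assumes that ''gpu-troja'' contains exactly the nodes listed in the table above; the variable names are illustrative, and the command is only printed so it can be inspected before being run by hand.

```shell
#!/bin/bash
# Nodes of gpu-troja we do NOT want; what remains is tdll-3gpu[1-4]
# (assuming the partition matches the node table above).
unwanted="tdll-8gpu[1-7]"

# Build the command as an array so the brackets are never glob-expanded.
cmd=(srun -p gpu-troja --exclude="$unwanted" --gres=gpu:1 --pty bash)

# Print the command instead of executing it.
echo "${cmd[*]}"
```

The exclusion list has to be updated whenever nodes are added to the partition, which is why this is a workaround rather than a true equivalent of the SGE syntax.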
| |
<code>srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash</code> | <code>srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash</code> |
* ''--constraint="gpuram48G|gpuram40G"'' considers only nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined | * ''-''''-constraint="gpuram48G|gpuram40G"'' considers only nodes that have either the ''gpuram48G'' or the ''gpuram40G'' feature defined |
| |
| ==== Delete Job ==== |
| <code>scancel <job_id> </code> |
| |
To see all the available options type: | To see all the available options type: |