Differences
This shows you the differences between two versions of the page.
gpu [2019/08/27 19:10] popel [Rules] |
gpu [2019/08/27 19:12] popel [Rules] |
* For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you no longer need the GPU (so don't use qrsh for long training). **Warning: ''-pty yes bash -l'' is necessary**, otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q 'gpu*' -l gpu=1,gpu_ram=2G -pty yes bash -l</code> In general: don't reserve a GPU (as described above) for a long time without actually using it. (E.g. try separating the steps which need a GPU from those which do not, and run them separately on our GPU and CPU clusters, respectively.) Ondřej Bojar has a script, /home/bojar/tools/servers/watch_gpus, which watches for reserved but unused GPUs on most machines and will e-mail you, but don't rely on it alone. | * For interactive jobs, you can use ''qrsh'', but make sure to end your job as soon as you no longer need the GPU (so don't use qrsh for long training). **Warning: ''-pty yes bash -l'' is necessary**, otherwise the variable ''$CUDA_VISIBLE_DEVICES'' will not be set correctly. E.g. <code>qrsh -q 'gpu*' -l gpu=1,gpu_ram=2G -pty yes bash -l</code> In general: don't reserve a GPU (as described above) for a long time without actually using it. (E.g. try separating the steps which need a GPU from those which do not, and run them separately on our GPU and CPU clusters, respectively.) Ondřej Bojar has a script, /home/bojar/tools/servers/watch_gpus, which watches for reserved but unused GPUs on most machines and will e-mail you, but don't rely on it alone. |
* Note that the dll machines typically have 10 cards, but "just" 250 GB RAM (DLL7 has only 128 GB). So the expected (maximal) ''mem_free'' requirement for a 1-GPU job is 25 GB. If your 1-GPU job takes e.g. 80 GB and you submit three such jobs to the same machine, you have effectively blocked the whole machine and seven GPUs remain unused. If you really need to submit several high-memory jobs, send each to a different machine. | * Note that the dll machines typically have 10 cards, but "just" 250 GB RAM (DLL7 has only 128 GB). So the expected (maximal) ''mem_free'' requirement for a 1-GPU job is 25 GB. If your 1-GPU job takes e.g. 80 GB and you submit three such jobs to the same machine, you have effectively blocked the whole machine and seven GPUs remain unused. If you really need to submit several high-memory jobs, send each to a different machine. |
* If you know the approximate runtime of your job, please specify it as a soft constraint with ''-l s_rt=hh:mm:ss''. This helps SGE schedule jobs better, especially multi-GPU reservations (see ''qconf -ssconf''). | * If you know the approximate runtime of your job, please specify it with ''-l s_rt=hh:mm:ss''. This is a soft constraint, so your job won't be killed if it runs longer than specified. This helps SGE schedule jobs better, especially multi-GPU reservations (see ''qconf -ssconf''). |
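The runtime rule above can be combined with the GPU resource request from the ''qrsh'' example. The following is only a sketch: it assumes batch jobs are submitted with the standard SGE ''qsub'' command and that the same ''-q'' and ''-l'' values apply to it. The helper name ''build_qsub_cmd'', the script name ''train.sh'' and the 12-hour estimate are hypothetical. The helper only prints the command so that it can be checked before running it.

```shell
#!/bin/sh
# Sketch only: prints (does not run) a batch submission with a runtime estimate.
# Assumption: qsub accepts the same queue pattern and resources as the qrsh example.
# build_qsub_cmd, train.sh and the 12:00:00 estimate are hypothetical placeholders.

build_qsub_cmd() {
    runtime="$1"   # approximate runtime as hh:mm:ss
    script="$2"    # job script to submit
    # s_rt is appended to the comma-separated resource list given to -l
    printf "qsub -q 'gpu*' -l gpu=1,gpu_ram=2G,s_rt=%s %s\n" "$runtime" "$script"
}

build_qsub_cmd 12:00:00 train.sh
```

Printing the command first is a deliberate choice here: it lets you review the resource list (and adjust ''gpu_ram'' or add ''mem_free'') before anything is reserved on the cluster.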
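The memory arithmetic behind the ''mem_free'' rule can be worked through as follows. This is a rough sketch rather than anything SGE itself computes: the function names are hypothetical, and it makes the simplifying assumption that every other job on the machine needs the full per-GPU budget of RAM.

```shell
#!/bin/sh
# Sketch only: reproduces the example numbers from the rule above.
# Function names are hypothetical; values are whole gigabytes.

# Fair share of RAM per GPU: total RAM divided by the number of cards.
per_gpu_budget_gb() {
    echo $(( $1 / $2 ))
}

# GPUs left without enough RAM after submitting $4 one-GPU jobs of $3 GB each
# on a machine with $1 GB RAM and $2 GPUs.
stranded_gpus() {
    total_ram=$1; gpus=$2; job_mem=$3; jobs=$4
    budget=$(( total_ram / gpus ))
    free_ram=$(( total_ram - job_mem * jobs ))
    free_gpus=$(( gpus - jobs ))
    # Assumption: each remaining job needs the full per-GPU budget of RAM.
    usable=$(( free_ram / budget ))
    if [ "$usable" -gt "$free_gpus" ]; then
        usable=$free_gpus
    fi
    echo $(( free_gpus - usable ))
}

per_gpu_budget_gb 250 10    # prints 25
stranded_gpus 250 10 80 3   # prints 7
```

With three 80 GB jobs only 10 GB of RAM remain, which is less than one 25 GB share, so all seven remaining GPUs are unusable. With three 25 GB jobs the same call returns 0.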