grid [2017/09/27 20:45] popel [Rules] |
grid [2017/09/27 21:16] popel [Advanced usage] |
* If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> As you can see in [[#List of Machines]], the maximum is 32 cores. If your job needs e.g. up to 110% CPU most of the time and only occasionally 200%, it is OK to reserve just one core (so you don't waste resources). TODO: when using ''-pe smp -l mf=8G,amf=8G,h_vmem=12G'', which memory limits are per machine and which are per core?
* If you are sure your job needs less than 1 GB of RAM, you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use ''qsub'' (or ''qrsh'') with ''-l mem_free=8G''. You should also specify ''act_mem_free'' with the same value and ''h_vmem'' with a possibly slightly bigger value. See [[#memory]] for details. TL;DR: <code>qsub -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code>
* Be kind to your colleagues. If you are going to submit jobs that effectively occupy **more than one fifth of our cluster for more than several hours**, check whether the cluster is free (with ''qstat -g c'' or ''qstat -u \*'') and/or ask your colleagues whether they plan to use the cluster intensively in the near future. Note that if you allocate one slot (CPU core) on a machine but (almost) all its RAM, you have effectively occupied the whole machine and all its cores. If you are submitting **more than 100 jobs**, consider setting them a low priority (e.g. ''-p -1024'', see below) or use [[#qunhold]]. Don't submit more than ca. 2000 jobs at once (this can overload SGE).
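
The memory and job-count rules above can be sketched as a small dry-run helper. This is only an illustration, not an official tool: the function name ''suggest_qsub'' and the script name ''job.sh'' are hypothetical, and the factor 1.5 for ''h_vmem'' is an assumption that merely mirrors the 8G/12G example. The helper prints a command line and never calls ''qsub'' itself.

```shell
#!/bin/bash
# Dry-run helper: prints a suggested qsub command line, submits nothing.
# Usage: suggest_qsub <mem-in-GiB> <number-of-jobs> <script>
suggest_qsub() {
    local mem_gb=$1
    local njobs=$2
    local script=$3

    # Assumption: h_vmem = 1.5 x mem_free, mirroring the 8G -> 12G example.
    local vmem_gb=$(( mem_gb * 3 / 2 ))
    local cmd="qsub -l mem_free=${mem_gb}G,act_mem_free=${mem_gb}G,h_vmem=${vmem_gb}G"

    # Too many jobs at once can overload SGE, so refuse.
    if [ "$njobs" -gt 2000 ]; then
        echo "refusing: more than ca. 2000 jobs at once" >&2
        return 1
    fi

    # More than 100 jobs: suggest a low priority.
    if [ "$njobs" -gt 100 ]; then
        cmd="$cmd -p -1024"
    fi

    echo "$cmd $script"
}

suggest_qsub 8 500 job.sh
```

The printed line can be inspected first and pasted into the terminal once it looks right.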

===== Advanced usage =====

''qsub **-q** troja-all.q''
This way your job is submitted to the Troja queue. The default is ''ms-all.q''. You can also use e.g.
''-q '(troja*|ms*)''' to submit on any machine in those two queues (but **don't use ''-q '*'''** as this also includes [[:gpu|gpu.q]]),
''-q '*@hector[14]''' to submit on hector1 or hector4,
''-q '[tm]*@!(hector*|iridium)''' to submit on any troja/ms machine except hectors and iridium.
However, usually you should specify just the queue (troja-all.q vs. ms-all.q), not a particular machine, and instead use ''-l'' to specify the needed resources in a general way.
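
Note that all the queue patterns above are wrapped in single quotes. A minimal sketch of why: without the quotes the shell itself would try to interpret characters such as ''*'', ''['' or ''('' before ''qsub'' ever sees them. The helper ''show_args'' is hypothetical and only prints the arguments exactly as a command would receive them.

```shell
#!/bin/bash
# show_args: print each argument on its own line, exactly as received.
show_args() {
    printf '%s\n' "$@"
}

# Quoted patterns arrive unchanged, ready to be interpreted by SGE.
show_args -q '(troja*|ms*)'
show_args -q '*@hector[14]'
```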

''qsub **-l** ...''
See ''man complex'' (run it on lrc or sol machines) for a list of possible resources you may require (in addition to ''mem_free'' etc. discussed above).

''qsub **-p** -100''
''qsub **-soft** -l ... **-hard** -l ... -q ...''
By default, all the resource requirements (specified with ''-l'') and queue requirements (specified with ''-q'') are //hard//, i.e. your job won't be scheduled unless they can be fulfilled. You can use ''-soft'' to mark all following requirements as nice-to-have. And with ''-hard'' you can switch back to hard requirements.

''qsub **-sync** y''
This causes qsub to wait for the job to complete before exiting (with the same exit code as the job). Useful in scripts.
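
A minimal sketch of how ''-sync y'' might be used in a pipeline script. The function name ''run_step'' and the ''QSUB'' variable are hypothetical; ''QSUB'' exists only so that the real ''qsub'' can be replaced by a stand-in command when trying the script outside the cluster.

```shell
#!/bin/bash
# QSUB may be overridden for a dry run; by default the real qsub is used.
QSUB=${QSUB:-qsub}

# run_step <qsub arguments...>: submit a job, wait for it to finish,
# and report whether it succeeded based on qsub's exit code.
run_step() {
    if "$QSUB" -sync y "$@"; then
        echo "step finished OK"
    else
        echo "step failed with exit code $?" >&2
        return 1
    fi
}

# Example (hypothetical scripts), stopping at the first failed step:
#   run_step -S /bin/bash prepare.sh && run_step -S /bin/bash train.sh
```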

''**qalter**''
=== Array jobs ===

If you have a set of tasks (of the same type) and want to run them on multiple machines, use ''qsub -t''...
TODO
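
Until this section is written, here is a rough sketch of what an array-job script could look like. It relies on SGE setting the ''SGE_TASK_ID'' environment variable for each task; the file name ''inputs.txt'', the script name ''array_job.sh'' and the helper ''pick_task'' are assumptions made up for this illustration.

```shell
#!/bin/bash
# Hypothetical array-job script, submitted e.g. with:
#   qsub -t 1-3 -S /bin/bash array_job.sh

# pick_task <list-file> <task-id>: print the line belonging to this task.
pick_task() {
    sed -n "${2}p" "$1"
}

# SGE sets SGE_TASK_ID for each task; fall back to 1 for a local dry run.
task_id=${SGE_TASK_ID:-1}

# inputs.txt is an assumed file with one input path per line.
if [ -f inputs.txt ]; then
    echo "task $task_id processes: $(pick_task inputs.txt "$task_id")"
fi
```

Each task then handles exactly one line of the list, so the number of lines should match the range given to ''-t''.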

===== Job monitoring =====


If you want to run the submitted program in your favorite environment (e.g. with your ''PATH'' settings), you must source the relevant ''.bash*'' files in the wrapper script. However, it is always safer to write full paths everywhere than to rely on PATH.

==== A different shell ====

To submit a job to the queue, I have to create a script for it. Fine, I have created one:

<code>
#!/bin/bash
program > log.out 2> log.err
</code>

When I run this script directly, the expected happens: the program's outputs are redirected to the files, and that is all.

When I submit such a script, the program **does not run**. In the log I find that the (standard error) output of the shell that ran my script contains the cryptic message "Ambiguous redirect".

I won't keep you in suspense; here is the explanation: SGE ignores the first line of the script (the truth is actually worse: it searches it for parameters meant for itself) and runs the script in ''csh''. In primitive scripts the difference between bash and csh is not apparent at first glance; it shows up in if-then-else constructs and also in redirection. csh does not understand the ''2>'' redirection.

This is how you force SGE to use bash:

<code>
qsub -S /bin/bash skript
</code>

Another option is to redirect stderr and stdout using csh syntax:

<code>
( command >stdout_file ) >&stderr_file
</code>


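A possible third variant, sketched under the assumption that the cluster uses the default SGE directive prefix ''#$'': the shell can be requested inside the script itself, so that ''-S /bin/bash'' does not have to be repeated on every ''qsub'' call. The function ''run_logged'' and the ''sh -c'' payload are hypothetical stand-ins for the real program.

```shell
#!/bin/bash
#$ -S /bin/bash
#$ -cwd
# The two "#$" lines above are read by qsub as embedded options
# (assumption: default directive prefix); for bash they are plain comments.

# run_logged <command> [args...]: run a command with stdout and stderr
# redirected to separate files, using the bash syntax that csh rejects.
run_logged() {
    "$@" > log.out 2> log.err
}

# Hypothetical payload standing in for the real program:
run_logged sh -c 'echo result; echo warning >&2'
```
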
==== bashrc and similar files must not print anything to the console ====
hard resource_list: mem_free=16g
hard resource_list: mem_free=31g</code>

==== How to reserve multiple cores on the same machine for one job ====

<code>
qsub -pe smp <number of cores>
</code>

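Inside the job, the number of granted slots is available in the ''NSLOTS'' environment variable set by SGE, so the program can be told how many threads to use instead of hard-coding the number. A minimal sketch (the function name ''thread_count'' is hypothetical):

```shell
#!/bin/bash
# thread_count: number of slots granted by SGE (NSLOTS), or 1 outside of SGE.
thread_count() {
    echo "${NSLOTS:-1}"
}

echo "using $(thread_count) thread(s)"
```

Submitted with e.g. ''qsub -pe smp 4 -S /bin/bash job.sh'', such a script would see four slots; run locally, it falls back to a single thread.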