grid [2017/10/10 13:46] belohlavek [Other] |
grid [2018/03/01 22:04] kruza [Advanced usage] |
[ -f ~/.bashrc ] && source ~/.bashrc | [ -f ~/.bashrc ] && source ~/.bashrc |
| |
| Make sure your locale is configured correctly (otherwise ''qrsh'' may not show accented letters in ''less'' and you may get errors when printing UTF-8 to stdout/stderr from your script in ''qsub''). For example, add the following line to your ''~/.bashrc'': |
| |
| export LC_ALL=en_US.UTF-8 |
===== Basic usage ===== | ===== Basic usage ===== |
| |
* If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> As you can see in [[#List of Machines]], the maximum is 32 cores. If your job needs e.g. up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste). TODO: when using ''-pe smp -l mf=8G,amf=8G,h_vmem=12G'', which memory limits are per machine and which are per core? | * If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> As you can see in [[#List of Machines]], the maximum is 32 cores. If your job needs e.g. up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste). TODO: when using ''-pe smp -l mf=8G,amf=8G,h_vmem=12G'', which memory limits are per machine and which are per core? |
* If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use ''qsub'' (or ''qrsh'') with ''-l mem_free=8G''. You should specify also ''act_mem_free'' with the same value and ''h_vmem'' with possibly a slightly bigger value. See [[#memory]] for details. TL;DR: <code>qsub -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code> | * If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use ''qsub'' (or ''qrsh'') with ''-l mem_free=8G''. You should specify also ''act_mem_free'' with the same value and ''h_vmem'' with possibly a slightly bigger value. See [[#memory]] for details. TL;DR: <code>qsub -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code> |
* Be kind to your colleagues. If you are going to submit jobs that effectively occupy **more than one fifth of our cluster for more than several hours**, check if the cluster is free (with ''qstat -g c'' or ''qstat -u \*'') and/or ask your colleagues if they don't plan to use the cluster intensively in the near future. Note that if you allocate one slot (CPU core) on a machine, but (almost) all its RAM, you have effectively occupied the whole machine and all its cores. If you are submitting **more than 100 jobs**, consider setting them a low priority (e.g. ''-p -1024'', see below) or use [[#qunhold]]. Don't submit more than ca 2000 jobs at once (this can overload the SGE). | * Be kind to your colleagues. If you are going to submit jobs that effectively occupy **more than one fifth of our cluster for more than several hours**, check if the cluster is free (with ''qstat -g c'' or ''qstat -u \*'') and/or ask your colleagues if they don't plan to use the cluster intensively in the near future. Note that if you allocate one slot (CPU core) on a machine, but (almost) all its RAM, you have effectively occupied the whole machine and all its cores. If you are submitting **more than 100 jobs**, consider setting them a low priority (e.g. ''-p -1024'', see below) or use [[#qunhold]]. |
| * **Don't submit more than ca 5000 jobs at once**, even if you make sure that at most 100 are running/waiting and the rest are in the //hold// state (e.g. using ''qunhold''). More than 5000 jobs in the queue can overload the SGE, so that no one can execute ''qstat'' (or it takes too long). |
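The memory bullet above asks for three resources at once: ''mem_free'' and ''act_mem_free'' with the same value, and ''h_vmem'' with a somewhat larger one. A minimal sketch of building that ''-l'' string follows. The helper name ''mem_opts'' and the script name ''job.sh'' are hypothetical, and ''echo'' stands in for the real submission so nothing is sent to the cluster.

```shell
#!/bin/bash
# mem_opts MEM HVMEM: print the resource string for "qsub -l".
# MEM is used for both mem_free and act_mem_free; HVMEM is the hard limit h_vmem.
# Hypothetical helper, not part of SGE.
mem_opts() {
    echo "mem_free=$1,act_mem_free=$1,h_vmem=$2"
}

# Print the command instead of running it; job.sh is a placeholder script name.
echo qsub -l "$(mem_opts 8G 12G)" job.sh
```

With the values from the bullet above this prints the same ''-l'' string as the TL;DR example. Keeping the two values as separate arguments avoids hard-coding a ratio between the requested memory and ''h_vmem''.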
| |
| |
This way your job is submitted to the Troja queue. The default is ''ms-all.q''. You can also use e.g. | This way your job is submitted to the Troja queue. The default is ''ms-all.q''. You can also use e.g. |
''-q '(troja*|ms*)''' to submit on any machine in those two queues (but **don't use ''-q '*'''** as this includes also [[:gpu|gpu.q]]), | ''-q '(troja*|ms*)''' to submit on any machine in those two queues (but **don't use ''-q '*'''** as this includes also [[:gpu|gpu.q]]), |
''-q '*@hector[14]''' to submit on hecor1 or hector4, | ''-q '*@hector[14]''' to submit on hector1 or hector4, |
''-q '[tm]*@!(hector*|iridium)''' to submit on any troja/ms machine except hectors and iridium. | ''-q '[tm]*@!(hector*|iridium)''' to submit on any troja/ms machine except hectors and iridium. |
However, usually you should specify just the queue (troja-all.q vs. ms-all.q), not a particular machine, and instead use ''-l'' to specify the needed resources in a general way. | However, usually you should specify just the queue (troja-all.q vs. ms-all.q), not a particular machine, and instead use ''-l'' to specify the needed resources in a general way. |
See ''man complex'' (run it on lrc or sol machines) for a list of possible resources you may require (in addition to ''mem_free'' etc. discussed above). | See ''man complex'' (run it on lrc or sol machines) for a list of possible resources you may require (in addition to ''mem_free'' etc. discussed above). |
| |
''qsub **-p** -100'' | ''qsub **-p** -99'' |
Define a priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. The default is 0, i.e. the highest possible priority. SGE uses the priority to decide when to start which pending job in the queue (it computes a real number called ''prior'', which is reported in ''qstat'', which grows as the job is waiting in the queue). Note that once a job is started, you cannot "unschedule" it, so from that moment on, it is irrelevant what was its priority. | Define a priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. In January 2018, we changed the default to -100 (it used to be 0). SGE uses the priority to decide when to start which pending job in the queue (it computes a real number called ''prior'', which is reported in ''qstat'', which grows as the job is waiting in the queue). Note that once a job is started, you cannot "unschedule" it, so from that moment on, it is irrelevant what was its priority. You can ask for a higher priority (-99..0) if your job is urgent and/or will finish soon and you want to skip your colleagues' jobs in the queue. You should ask for a lower priority (-1024..-101) if you submit many jobs at once or if the jobs are not urgent. |
| |
''qsub **-o** LOG.stdout **-e** LOG.stderr'' | ''qsub **-o** LOG.stdout **-e** LOG.stderr'' |
# immediately on the remote machine | # immediately on the remote machine |
history -a; | history -a; |
# setup the working directory by setting WD | # setup the working directory by setting WD, delete possible ".nfs/" |
ssh -X -Y -C -t $@ "WD='$PWD' /bin/bash --login -i"; | ssh -X -Y -C -t $@ "WD='${PWD/.nfs\//}' /bin/bash --login -i"; |
} | } |
| |
* ''-tc j'' run at most j jobs simultaneously | * ''-tc j'' run at most j jobs simultaneously |
* ''-hold_jid_ad comma_separated_job_list'' array jobs that must finish before this job starts; task //i// of the current job depends only on task //i// of the specified jobs | * ''-hold_jid_ad comma_separated_job_list'' array jobs that must finish before this job starts; task //i// of the current job depends only on task //i// of the specified jobs |
| |
| === Delete many jobs at once === |
| To delete all your jobs, use ''qdel \*''. To delete array jobs (see above), you can use comma-separated ranges of task ids, e.g. ''qdel -t 1,10,50-100''. |
| To delete a range of (normal) job ids, you can use bash brace expansion (as an alternative to ''seq''), e.g. ''qdel {17979..18028}''. |
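Bash performs the brace expansion before ''qdel'' runs, so the command receives one argument per job id. Brace expansion does not expand variables, though, so ''seq'' is the usual fallback when the bounds are stored in variables. A minimal sketch of both variants follows. The helper name ''job_range'' is hypothetical, and ''echo'' stands in for ''qdel'' so that nothing is actually deleted.

```shell
#!/bin/bash
# job_range FIRST LAST: print the job ids FIRST..LAST, one per line.
# Hypothetical helper for the case where the bounds are held in variables.
job_range() {
    seq "$1" "$2"
}

# Literal bounds: bash expands the braces before the command is run.
echo qdel {17979..17982}

# Bounds in variables: the unquoted $(...) is split into one argument per id.
first=17979
last=17982
echo qdel $(job_range "$first" "$last")
```

Once the printed command looks right, replace ''echo qdel'' with ''qdel''.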
| |
| === Ssh to random sol === |
| Ondřej Bojar suggests adding the following alias to your .bashrc (cf. [[#sshcwd]]): |
| <code>alias cluster='comp=$(($RANDOM /4095 +1)); ssh -o "StrictHostKeyChecking no" sol$comp'</code> |
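Since ''$RANDOM'' ranges from 0 to 32767, the expression ''$RANDOM /4095 +1'' yields a number from 1 to 9, where 9 comes up only for the eight values 32760 to 32767. A possible variant that makes the number of machines explicit is sketched below. The function name ''pick_sol'' is hypothetical, and the machine count 8 is a placeholder assumption, so check the real number of sol machines before using it. The ssh command is only printed, not run.

```shell
#!/bin/bash
# pick_sol N: print a host name from sol1..solN chosen pseudo-randomly.
# N is the number of sol machines; the caller must supply it (assumption).
pick_sol() {
    local n=$1
    echo "sol$(( RANDOM % n + 1 ))"
}

# Print the ssh command instead of running it; 8 is a placeholder machine count.
echo ssh -o "StrictHostKeyChecking no" "$(pick_sol 8)"
```

Using the modulo operator keeps the result within 1..N for any N, at the cost of a slight bias when N does not divide 32768.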
| |
===== Job monitoring ===== | ===== Job monitoring ===== |