Differences

This shows you the differences between two versions of the page.

--- grid [2017/10/10 13:46]
belohlavek [Other]
+++ grid [2018/03/01 22:04]
kruza [Advanced usage]
@@ Line 55: / Line 55: @@
   [ -f ~/.bashrc ] && source ~/.bashrc
+Make sure your locale is configured correctly (otherwise ''qrsh'' may not show accented letters in ''less'' and you may get errors when your script prints UTF-8 to stdout/stderr under ''qsub''). For example, add the following line to your ''~/.bashrc'':
+  export LC_ALL=en_US.UTF-8
 ===== Basic usage =====
@@ Line 119: / Line 122: @@
     * If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> As you can see in [[#List of Machines]], the maximum is 32 cores. If your job needs e.g. up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste resources). TODO: when using ''-pe smp -l mf=8G,amf=8G,h_vmem=12G'', which memory limits are per machine and which are per core?
     * If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use ''qsub'' (or ''qrsh'') with ''-l mem_free=8G''. You should also specify ''act_mem_free'' with the same value and ''h_vmem'' with a possibly slightly bigger value. See [[#memory]] for details. TL;DR: <code>qsub -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code>
-  * Be kind to your colleagues. If you are going to submit jobs that effectively occupy **more than one fifth of our cluster for more than several hours**, check if the cluster is free (with ''qstat -g c'' or ''qstat -u \*'') and/or ask your colleagues if they don't plan to use the cluster intensively in the near future. Note that if you allocate one slot (CPU core) on a machine, but (almost) all its RAM, you have effectively occupied the whole machine and all its cores. If you are submitting **more than 100 jobs**, consider using setting them a low priority (e.g. ''-p -1024'', see below) or use [[#qunhold]]. Don't submit more than ca 2000 jobs at once (this can overload the SGE).
+  * Be kind to your colleagues. If you are going to submit jobs that effectively occupy **more than one fifth of our cluster for more than several hours**, check if the cluster is free (with ''qstat -g c'' or ''qstat -u \*'') and/or ask your colleagues whether they plan to use the cluster intensively in the near future. Note that if you allocate one slot (CPU core) on a machine, but (almost) all its RAM, you have effectively occupied the whole machine and all its cores. If you are submitting **more than 100 jobs**, consider giving them a low priority (e.g. ''-p -1024'', see below) or use [[#qunhold]].
+  * **Don't submit more than ca 5000 jobs at once**, even if you make sure that at most 100 are running/waiting and the rest are in the //hold// state (e.g. using ''qunhold''). More than 5000 jobs in the queue can overload the SGE, so that no one can run ''qstat'' (or it takes too long).
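Before a large submission it may help to check how many jobs are already listed. The following is a minimal sketch, assuming that ''qstat'' prints two header lines (column names and a dashed rule) followed by one line per listed entry; ''count_jobs'' is a hypothetical helper, and the sample listing is made up for illustration only.

```bash
#!/bin/bash
# count_jobs: hypothetical helper that counts entries in a qstat-style listing
# read from stdin. ASSUMPTION: the first two lines are headers.
count_jobs() {
  awk 'NR > 2 && NF { n++ } END { print n + 0 }'
}

# Made-up sample listing (not real cluster output), used only to show the idea.
sample='job-ID  prior   name  user  state
---------------------------------
17979 0.50000 train alice r
17980 0.00000 train alice hqw'

printf '%s\n' "$sample" | count_jobs

# On the cluster one would pipe the real listing instead, e.g.:
#   qstat -u "$USER" | count_jobs
```

An empty listing yields 0, so the result can be compared directly against a threshold such as the 5000-job limit mentioned above.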
@@ Line 138: / Line 142: @@
 This way your job is submitted to the Troja queue. The default is ''ms-all.q''. You can also use e.g.
 ''-q '(troja*|ms*)''' to submit on any machine in those two queues (but **don't use ''-q '*'''** as this also includes [[:gpu|gpu.q]]),
-''-q '*@hector[14]''' to submit on hecor1 or hector4,
+''-q '*@hector[14]''' to submit on hector1 or hector4,
 ''-q '[tm]*@!(hector*|iridium)''' to submit on any troja/ms machine except hectors and iridium.
 However, usually you should specify just the queue (troja-all.q vs. ms-all.q), not a particular machine, and instead use ''-l'' to specify the needed resources in a general way.
@@ Line 145: / Line 149: @@
 See ''man complex'' (run it on lrc or sol machines) for a list of possible resources you may require (in addition to ''mem_free'' etc. discussed above).
-''qsub **-p** -100''
+''qsub **-p** -99''
-Define a priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. The default is 0, i.e. the highest possible priority. SGE uses the priority to decide when to start which pending job in the queue (it computes a real number called ''prior'', which is reported in ''qstat'', which grows as the job is waiting in the queue). Note that once a job is started, you cannot "unschedule" it, so from that moment on, it is irrelevant what was its priority.
+Define a priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. In January 2018, we changed the default to -100 (it used to be 0). SGE uses the priority to decide when to start which pending job in the queue (it computes a real number called ''prior'', which is reported in ''qstat'', which grows as the job is waiting in the queue). Note that once a job is started, you cannot "unschedule" it, so from that moment on, it is irrelevant what was its priority. You can ask for a higher priority (-99..0) if your job is urgent and/or will finish soon and you want to skip your colleagues' jobs in the queue. You should ask for a lower priority (-1024..-101) if you submit many jobs at once or if the jobs are not urgent.
 ''qsub **-o** LOG.stdout **-e** LOG.stderr''
@@ Line 208: / Line 212: @@
   # immediately on the remote machine
   history -a;
-  # setup the working directory by setting WD
+  # setup the working directory by setting WD, delete possible ".nfs/"
-  ssh -X -Y -C -t $@ "WD='$PWD' /bin/bash --login -i";
+  ssh -X -Y -C -t $@ "WD='${PWD/.nfs\//}' /bin/bash --login -i";
 }
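The new ''${PWD/.nfs\//}'' expression is a bash pattern substitution that removes the first occurrence of ''.nfs/'' from the current directory before it is passed to the remote shell. A minimal sketch of that behaviour in isolation follows; ''strip_nfs'' is a hypothetical helper and the paths are invented for illustration.

```bash
#!/bin/bash
# strip_nfs: hypothetical helper that applies the same substitution as the
# function in the diff, but to an arbitrary path given as the first argument.
strip_nfs() {
  local dir=$1
  # ${var/pattern/replacement}: the pattern is ".nfs/" (the slash is escaped
  # so that it is not read as the delimiter); the replacement is empty.
  printf '%s\n' "${dir/.nfs\//}"
}

strip_nfs '/home/user/.nfs/project'   # prints /home/user/project
strip_nfs '/home/user/project'        # no ".nfs/" component, printed unchanged
```

Only the first match is removed; a path with several ''.nfs/'' components would need the ''${var//pattern/}'' form instead.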
@@ Line 243: / Line 247: @@
   * ''-tc j'' run at most j jobs simultaneously
   * ''-hold_jid_ad comma_separated_job_list'' array jobs that must finish before this job starts; task //i// of the current job depends only on task //i// of the specified jobs
+=== Delete many jobs at once ===
+To delete all your jobs, use ''qdel \*''. To delete tasks of an array job (see above), you can use comma-separated ranges of task ids, e.g. ''qdel -t 1,10,50-100''.
+To delete a range of (normal) job ids, you can use bash brace expansion (as an alternative to ''seq''), e.g. ''qdel {17979..18028}''.
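To see what the brace expansion hands to ''qdel'' before actually deleting anything, the range can be expanded into an array and inspected first. This is a dry-run sketch that only prints; it does not call ''qdel'', and the variable names are illustrative.

```bash
#!/bin/bash
# Brace expansion happens in the shell, before any command runs,
# so the same list can be captured in an array and checked.
job_ids=( {17979..18028} )
last=$(( ${#job_ids[@]} - 1 ))

# Dry run: report the first id, the last id and the number of ids.
echo "qdel ${job_ids[0]} ... ${job_ids[$last]} (${#job_ids[@]} ids)"
```

Brace expansion requires literal numbers; with variable bounds, ''seq'' is the simpler choice.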
+=== Ssh to random sol ===
+Ondřej Bojar suggests adding the following alias to your .bashrc (cf. [[#sshcwd]]):
+<code>alias cluster='comp=$(($RANDOM /4095 +1)); ssh -o "StrictHostKeyChecking no" sol$comp'</code>
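The arithmetic in the alias maps bash's ''$RANDOM'' (an integer from 0 to 32767) to a machine number by integer division. The sketch below reproduces only that computation for fixed inputs, so the resulting range can be checked; ''sol_index'' is a hypothetical helper, not part of the alias.

```bash
#!/bin/bash
# sol_index: hypothetical helper applying the alias's formula to a given value
# instead of $RANDOM, so that the boundaries can be inspected.
sol_index() {
  echo $(( $1 / 4095 + 1 ))
}

sol_index 0       # prints 1 (smallest possible $RANDOM)
sol_index 32759   # prints 8
sol_index 32760   # prints 9
sol_index 32767   # prints 9 (largest possible $RANDOM)
```

The formula therefore yields 1 to 8 almost always and 9 only for the eight largest values of ''$RANDOM''. Whether that matches the available sol machines depends on how many exist, which this page does not state.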
 ===== Job monitoring =====


Institute of Formal and Applied Linguistics Wiki
