Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

grid [2017/09/26 20:38]
popel
grid [2017/09/27 14:01]
popel [Rules]
Line 104: Line 104:
 <code> <code>
 qdel 121144 qdel 121144
-  # This way you can delete a job with a given number, or comma-or-space separated list of job numbers. +  # This way you can delete ("kill") a job with a given number, or a comma- or space-separated list of job numbers.
 qdel \* qdel \*
   # This way you can delete all your jobs. Don't be afraid - you cannot delete others' jobs.   # This way you can delete all your jobs. Don't be afraid - you cannot delete others' jobs.
Line 110: Line 110:
  
 ===== Rules ===== ===== Rules =====
 +The purpose of these rules is to prevent your jobs from damaging the work of your colleagues and to divide the resources among users fairly.
  
 +  * Read about our [[internal:linux-network|network]] first (so you know that e.g. reading big data from your home in 200 parallel jobs is not a good idea). Ask your colleagues (possibly via [[internal:mailing-lists|devel]]) if you are not sure, esp. if you plan to submit jobs with unusual/extreme disk/mem/CPU requirements.
 +  * While your jobs are running (or queued), check them (esp. previously untested setups) and your email (esp. [[internal:mailing-lists|devel]]) regularly. If you really need to be offline, e.g. for a two-week vacation, first ask it@ufal whether they can kill your jobs if needed.
   * You can ssh to any cluster machine, which can be useful e.g. to diagnose what's happening there (using ''htop'' etc.).   * You can ssh to any cluster machine, which can be useful e.g. to diagnose what's happening there (using ''htop'' etc.).
-  * However, **never execute any computing manually** on a cluster machine, unless via ''qsub'' or ''qrsh''. If you break this rule, your task will take CPU and memory, but the SGE will not know, so it may schedule other users' jobs on the same machine and **their jobs may fail** or run slowly. The sol machines are an exception from this rule. +  * However, **never run any computation manually** on a cluster machine you have sshed into (i.e. other than via ''qsub'' or ''qrsh''). If you break this rule, your task will take CPU and memory, but SGE will not know about it, so it may schedule other users' jobs on the same machine and **their jobs may fail** or run slowly. The sol machines are an exception to this rule.
-  * For interactive work, you can use ''qrsh'', but please try to end the job (exit with Ctrl+D) once finished with your work, especially if you ask for a lot of memory or CPUs (see below). +  * For interactive work, you can use ''qrsh'', but please try to end the job (exit with Ctrl+D) once finished with your work, especially if you ask for a lot of memory or CPUs (see below). One semi-permanent qrsh job (with non-extreme CPU/mem requirements) per user is acceptable.
-  * **Specify the memory requirements** of your job, e.g. with <code>qsub -hard -l mem_free=8G -l act_mem_free=8G -l h_vmem=8G</code> +  * **Specify the memory and CPU requirements** (if higher than the defaults) and **don't exceed them**. 
- +    * If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> (As you can see in [[#List of Machines]], the maximum is 32 cores). If your job needs e.g. up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste resources).
-TODO explanation +    * If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use ''qsub'' (or ''qrsh'') with ''-l mem_free=8G''. You should also specify ''act_mem_free'' with the same value and ''h_vmem'' with a possibly slightly bigger value. See [[#memory]] for details. TL;DR: <code>qsub -hard -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code>
-Tell SGE how much memory the job consumes, so that the machines do not run out of memory (and use the "hard" limit, with which SGE kills the job if it exceeds the reserved memory): +  * Be kind to your colleagues. If you are going to submit jobs that effectively take more than one fifth of our cluster for more than several hours, check if it is free (with ''qstat -g c'' or ''qstat -u \*'') and/or ask your colleagues. Note that if you allocate one slot (CPU core) on a machine, but (almost) all its RAM, you have effectively occupied the whole machine and all its cores.
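
The memory and CPU rules above can be combined into one command line. The following is only a minimal sketch: ''build_qsub_cmd'' is a hypothetical helper (not part of SGE), it prints the command instead of submitting anything, and the 50% ''h_vmem'' headroom is an assumption that merely mirrors the 8G/12G example above.

```shell
#!/bin/bash
# Hypothetical helper (not part of SGE): print a qsub command line that
# follows the rules above. Nothing is submitted.
# Usage: build_qsub_cmd <mem_GB> <cores> <script>
build_qsub_cmd() {
    local mem="$1" cores="$2" script="$3"
    # Assumption: give h_vmem 50% headroom, as in the 8G -> 12G example.
    local vmem=$(( mem + mem / 2 ))
    local cmd="qsub -hard -l mem_free=${mem}G,act_mem_free=${mem}G,h_vmem=${vmem}G"
    # Reserve extra cores (SGE slots) only if the job needs more than one.
    if [ "$cores" -gt 1 ]; then
        cmd="$cmd -pe smp $cores"
    fi
    echo "$cmd $script"
}

# Example: 8 GB of RAM and 4 cores for a hypothetical script train.sh.
build_qsub_cmd 8 4 train.sh
```

Printing the command first makes it easy to check the requested resources before actually submitting the job.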
  
 +  
 Further recommendations: Further recommendations:
-  * Use ''nice'' whenever possible.
-      *  Question: how does ''nice'' combine with ''qsub''? SGE is hopefully now configured so that everything is niced. In any case, it is a good idea to put ''renice 10 $$'' at the beginning of the submitted script.
   * Clean up your local data when you are done, because otherwise nobody else will be able to run anything useful there.   * Clean up your local data when you are done, because otherwise nobody else will be able to run anything useful there.
   * Avoid heavily parallel access to the shared disks; it slows the NFS server down badly for everyone. So distribute your data as well.   * Avoid heavily parallel access to the shared disks; it slows the NFS server down badly for everyone. So distribute your data as well.
   * If you want to run jobs that will take a long time (hours, days), do not start them all at once, so that others can use the cluster too.   * If you want to run jobs that will take a long time (hours, days), do not start them all at once, so that others can use the cluster too.
 +
 +=== Memory ===
 +
 +mem_free (or mf): this is a 'consumable resource' tracked by SGE.
 +  It affects job scheduling. Every machine has an initial value assigned.
 +  When you specify
 +    qsub -l mem_free=4G
 +  SGE finds a machine with mem_free >= 4GB, and subtracts 4GB from it.
 +
 +  This limit is not enforced, so if a job exceeds this limit, the
 +  SGE value of mem_free may not represent the real free memory.
 +
 +  Default value is 1GB.
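
The bookkeeping described above can be illustrated with a toy model. This is only a sketch of the idea, not how SGE is implemented: the 16 GB initial value and the function name ''try_schedule'' are made-up assumptions for one hypothetical machine.

```shell
#!/bin/bash
# Toy model of a consumable resource; SGE does this bookkeeping internally.
# Assumption: one hypothetical machine with an initial mem_free of 16 GB.
mem_free_gb=16

# try_schedule <requested_GB>: accept the job if enough mem_free remains
# and subtract the request from the counter; otherwise reject it.
try_schedule() {
    local req="$1"
    if [ "$req" -le "$mem_free_gb" ]; then
        mem_free_gb=$(( mem_free_gb - req ))
        echo "scheduled ${req}G, mem_free now ${mem_free_gb}G"
    else
        echo "rejected ${req}G, mem_free is only ${mem_free_gb}G"
        return 1
    fi
}

try_schedule 4          # fits: 16 -> 12
try_schedule 10         # fits: 12 -> 2
try_schedule 4 || true  # does not fit, only 2 GB left in the counter
```

Note that the counter only reflects what was requested. As stated above, the limit is not enforced, so a job that uses more memory than it asked for makes the counter diverge from the real free memory.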
  
 ===== Advanced usage ===== ===== Advanced usage =====
