
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


grid [2017/09/27 11:52]
popel
grid [2017/09/27 14:01]
popel [Rules]
Line 112: Line 112:
 The purpose of these rules is to prevent your jobs from damaging the work of your colleagues and to divide the resources among users in a fair way.
  
-  * Read about our [[internal:linux-network|network]] first (so you know that e.g. reading big data from you home in 200 parallel jobs is not a good idea). Ask your colleagues (possibly via [[internal:mailing-lists|devel]]) if you are not sure (esp. if you plan to submit jobs with unusual/extreme disk/mem/CPU requirements)
+  * Read about our [[internal:linux-network|network]] first (so you know that e.g. reading big data from your home in 200 parallel jobs is not a good idea). Ask your colleagues (possibly via [[internal:mailing-lists|devel]]) if you are not sure, esp. if you plan to submit jobs with unusual/extreme disk/mem/CPU requirements.
-  * While your jobs are running (or queued), check your jobs (esp. previously untested setups) and email (including [[internal:mailing-lists|devel]]) regularly. If you really need to leave e.g. for two-week vacation offline, consult it first with it@ufal (whether they can kill your jobs if needed).
+  * While your jobs are running (or queued), check your jobs (esp. previously untested setups) and your email (esp. [[internal:mailing-lists|devel]]) regularly. If you really need to leave e.g. for two-week vacation offline, consult it first with it@ufal (whether they can kill your jobs if needed).
   * You can ssh to any cluster machine, which can be useful e.g. to diagnose what's happening there (using ''htop'' etc.).
   * However, **never execute any computing manually** on a cluster machine where you are sshed (i.e. not via ''qsub'' or ''qrsh''). If you break this rule, your task will take CPU and memory, but the SGE will not know, so it may schedule other users' jobs on the same machine and **their jobs may fail** or run slowly. The sol machines are an exception to this rule.
Line 119: Line 119:
   * **Specify the memory and CPU requirements** (if higher than the defaults) and **don't exceed them**.
     * If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> (As you can see in [[#List of Machines]], the maximum is 32 cores). If your job needs e.g. up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste resources).
-    * <code>qsub -hard -l mem_free=8G -l act_mem_free=8G -l h_vmem=8G</code>
+    * If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use ''qsub'' (or ''qrsh'') with ''-l mem_free=8G''. You should specify also ''act_mem_free'' with the same value and ''h_vmem'' with possibly a slightly bigger value. See [[#memory]] for details. TL;DR: <code>qsub -hard -l mem_free=8G,act_mem_free=8G,h_vmem=12G</code>
 +  * Be kind to your colleagues. If you are going to submit jobs that effectively take more than one fifth of our cluster for more than several hours, check if it is free (with ''qstat -g c'' or ''qstat -u \*''), ask your colleagues. Note that if you allocate one slot (CPU core) on a machine, but (almost) all its RAM, you have effectively occupied the whole machine and all its cores.
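
The memory rule above can be sketched as a small shell helper that assembles the three resource requests. This is only an illustration: the function name ''mem_flags'' and the default headroom of 1.5x for ''h_vmem'' are assumptions chosen to reproduce the 8G/12G example, not something the cluster prescribes. Only the flag form itself is taken from the rule.

```shell
#!/bin/sh
# mem_flags MEM_GB [VMEM_GB]
# Prints qsub resource flags for a job that needs MEM_GB gigabytes of RAM.
# mem_free and act_mem_free get the same value, as the rule recommends.
# If VMEM_GB is omitted, h_vmem defaults to MEM_GB * 3 / 2; this headroom
# is a hypothetical choice, picked only to match the 8G -> 12G example.
mem_flags() {
    mem="$1"
    vmem="${2:-$(( mem * 3 / 2 ))}"
    printf '%s\n' "-hard -l mem_free=${mem}G,act_mem_free=${mem}G,h_vmem=${vmem}G"
}

# Print the flags for an 8 GB job.
mem_flags 8
```

With such a helper, a submission could look like ''qsub $(mem_flags 8) my_job.sh'', where ''my_job.sh'' is a placeholder script name.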
  
 +  
 Further recommendations:
   * Clean up your local data after yourself; otherwise nobody else will be able to run anything useful there.
   * Avoid heavily parallel access to the shared disks. It slows the NFS server down badly for everyone. So distribute the data as well.
   * If you want to run jobs that will take a long time (hours, days), do not start them all at once, so that others can use the cluster too.
 +
 +=== Memory ===
 +
 +mem_free (or mf): this is a 'consumable resource' tracked by SGE.
 +  It affects job scheduling. Every machine has an initial value assigned.
 +  When you specify
 +    qsub -l mem_free=4G
 +  SGE finds a machine with mem_free >= 4GB, and subtracts 4GB from it.
 +
 +  This limit is not enforced, so if a job exceeds this limit, the
 +  SGE value of mem_free may not represent the real free memory.
 +
 +  Default value is 1GB.
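
The bookkeeping described above can be pictured with a toy model. The sketch below is not how SGE is implemented; it only mimics the described behaviour (find a machine whose ''mem_free'' covers the request, then subtract the request from it). The function name ''pick_host'' and the host names are made up for the illustration.

```shell
#!/bin/sh
# pick_host REQUEST_GB host:free_gb [host:free_gb ...]
# Toy model of a consumable resource: prints the first host whose remaining
# mem_free is at least REQUEST_GB, followed by its mem_free after the
# request has been subtracted. Returns non-zero if no host qualifies.
pick_host() {
    request="$1"
    shift
    for entry in "$@"; do
        host="${entry%%:*}"   # part before the colon
        free="${entry##*:}"   # part after the colon
        if [ "$free" -ge "$request" ]; then
            echo "$host $(( free - request ))"
            return 0
        fi
    done
    return 1
}

# nodeA has too little mem_free for a 4 GB request, so nodeB is chosen.
pick_host 4 nodeA:2 nodeB:16   # prints "nodeB 12"
```

Because the value is only bookkeeping and not enforced, a job that uses more memory than it requested leaves the tracked number higher than the memory that is really free.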
  
 ===== Advanced usage =====
