Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

--- grid [2017/09/26 20:38] popel
+++ grid [2017/09/27 12:01] popel
Line 104:
 <code>
 qdel 121144
-  # This way you can delete a job with a given number, or a comma- or space-separated list of job numbers.
+  # This way you can delete ("kill") a job with a given number, or a comma- or space-separated list of job numbers.
 qdel \*
   # This way you can delete all your jobs. Don't be afraid: you cannot delete other users' jobs.
Line 110:
  
 ===== Rules =====
+The purpose of these rules is to prevent your jobs from damaging the work of your colleagues and to divide the resources among users fairly.
  
+  * Read about our [[internal:linux-network|network]] first (so you know that, e.g., reading big data from your home directory in 200 parallel jobs is not a good idea). Ask your colleagues (possibly via [[internal:mailing-lists|devel]]) if you are not sure, especially if you plan to submit jobs with unusual or extreme disk/memory/CPU requirements.
+  * While your jobs are running (or queued), check your jobs (especially previously untested setups) and your email (especially [[internal:mailing-lists|devel]]) regularly. If you really need to leave, e.g. for a two-week offline vacation, consult it@ufal first (whether they can kill your jobs if needed).
   * You can ssh to any cluster machine, which can be useful, e.g., to diagnose what's happening there (using ''htop'' etc.).
-  * However, **never execute any computing manually** on a cluster machine, unless via ''qsub'' or ''qrsh''. If you break this rule, your task will take CPU and memory, but SGE will not know, so it may schedule other users' jobs on the same machine and **their jobs may fail** or run slowly. The sol machines are an exception to this rule.
+  * However, **never execute any computing manually** on a cluster machine where you are logged in via ssh (i.e. not via ''qsub'' or ''qrsh''). If you break this rule, your task will take CPU and memory, but SGE will not know, so it may schedule other users' jobs on the same machine and **their jobs may fail** or run slowly. The sol machines are an exception to this rule.
-  * For interactive work, you can use ''qrsh'', but please try to end the job (exit with Ctrl+D) once you have finished your work, especially if you ask for a lot of memory or CPUs (see below).
+  * For interactive work, you can use ''qrsh'', but please try to end the job (exit with Ctrl+D) once you have finished your work, especially if you ask for a lot of memory or CPUs (see below). One semi-permanent ''qrsh'' job (with non-extreme CPU/memory requirements) per user is acceptable.
-  * **Specify the memory requirements** of your job, e.g. with <code>qsub -hard -l mem_free=8G -l act_mem_free=8G -l h_vmem=8G</code>
+  * **Specify the memory and CPU requirements** (if higher than the defaults) and **don't exceed them**.
-
+    * If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with <code>qsub -pe smp <number-of-CPU-cores></code> (as you can see in [[#List of Machines]], the maximum is 32 cores). If your job needs, e.g., up to 110% CPU most of the time and just occasionally 200%, it is OK to reserve just one core (so you don't waste resources).
-TODO explanation
+    * <code>qsub -hard -l mem_free=8G -l act_mem_free=8G -l h_vmem=8G</code>
-Tell SGE how much memory the job uses, so that the machines do not run out of memory (and use a "hard" limit, so that SGE kills the job if it exceeds the reserved memory):
+
  
  
 Further recommendations:
-  * Use ''nice'' if possible.
-      * Question: how does ''nice'' combine with ''qsub''? SGE is now hopefully configured so that everything gets niced. In any case, it is a good idea to put ''renice 10 $$'' at the beginning of the submitted script.
   * Clean up your local data after yourself; otherwise nobody else will be able to run anything useful there.
   * Avoid excessively parallel access to shared disks; it slows down the NFS server badly for everyone. So distribute the data, too.
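
Since ''qdel'' accepts a comma-separated list of job numbers, such a list can be built from ''qstat'' output, e.g. to delete only your queued jobs. A minimal sketch, assuming the default plain-text ''qstat'' layout (two header lines, job number in column 1, state in column 5); the function name ''job_ids_in_state'' is hypothetical, not an SGE command:

```shell
#!/bin/bash
# Hypothetical helper (not an SGE command): read plain `qstat` output on stdin
# and print the numbers of jobs in the given state (e.g. "qw" = waiting in the
# queue) as one comma-separated list, which is a form `qdel` accepts.
# Assumes the default qstat layout: two header lines, then one row per job
# (or per running array task) with the job number in column 1, state in column 5.
job_ids_in_state() {
  # !seen[$1]++ drops repeated numbers, e.g. from several tasks of one array job.
  awk -v st="$1" 'NR > 2 && $5 == st && !seen[$1]++ { print $1 }' | paste -sd, -
}

# Typical use on the cluster (commented out so the sketch runs anywhere):
#   ids=$(qstat -u "$USER" | job_ids_in_state qw)
#   [ -n "$ids" ] && qdel "$ids"
```

Checking the list before passing it to ''qdel'' is safer than deleting everything with ''qdel \*'' when only some jobs should go.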
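
The CPU and memory requests can also be written into the job script itself. A sketch under assumptions: the ''#$'' lines are SGE embedded options that ''qsub'' reads (bash treats them as comments), the values are examples only, ''act_mem_free'' is the site-specific resource used above, and whether the memory values count per slot or per job together with ''-pe smp'' depends on the cluster configuration (ask it@ufal if unsure). SGE sets ''NSLOTS'' to the number of granted slots; which thread-count variable your program actually honors is an assumption.

```shell
#!/bin/bash
# Sketch of a job script, submitted e.g. as `qsub job.sh`.
# qsub reads the "#$" lines as options; bash treats them as comments.
#$ -cwd
#$ -pe smp 4
#$ -hard -l mem_free=8G -l act_mem_free=8G -l h_vmem=8G

# SGE exports NSLOTS (number of granted slots); outside SGE default to 1.
threads=${NSLOTS:-1}

# Cap common thread-pool variables so the job does not use more cores than it
# reserved. Whether your program reads these variables is an assumption.
export OMP_NUM_THREADS=$threads
export MKL_NUM_THREADS=$threads

echo "running with $threads thread(s)"
# ... the actual computation would start here ...
```

Deriving the thread count from ''NSLOTS'' instead of hard-coding it keeps the reservation and the actual usage in sync when you change ''-pe smp''.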
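
The older advice to start a submitted script with ''renice 10 $$'' could look as follows. ''$$'' is the process ID of the job's shell, and programs it starts afterwards inherit its niceness. A sketch, assuming util-linux ''renice'' and a ''nice'' that prints the current niceness when run without arguments (as GNU ''nice'' does); if SGE already runs the job at a higher niceness, an unprivileged ''renice'' to 10 may be refused, which is harmless here.

```shell
#!/bin/bash
#$ -cwd
# Lower the CPU priority of this job script (PID $$) before doing any work,
# so that other processes on the machine are favored.
# If the job already runs at a higher niceness, renice may be refused;
# ignoring that error is fine, because the job is then niced anyway.
renice 10 $$ > /dev/null 2>&1 || true

# Processes started from now on inherit the niceness; `nice` without
# arguments just prints it.
echo "niceness: $(nice)"
# ... the actual computation would start here ...
```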
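
One way to make sure local data gets cleaned up even when a job fails is to work in a private directory on the local disk and remove it from an ''EXIT'' trap. A minimal sketch; the function name ''with_local_scratch'' is hypothetical, and it uses ''$TMPDIR'' (which SGE normally sets to a per-job directory on the execution host) with ''/tmp'' as a fallback, an assumption you may need to adapt to the local disks of our machines.

```shell
#!/bin/bash
# Hypothetical helper: run a command inside a fresh directory on the local
# disk and delete that directory afterwards, whether the command succeeds
# or fails. Uses $TMPDIR if set (SGE normally sets a per-job one), else /tmp.
with_local_scratch() {
  (
    scratch=$(mktemp -d "${TMPDIR:-/tmp}/scratch.XXXXXX") || exit 1
    # The EXIT trap fires when this subshell ends, also after errors.
    trap 'rm -rf "$scratch"' EXIT
    cd "$scratch" || exit 1
    "$@"
  )
}

# Example: with_local_scratch ./my-experiment.sh   (hypothetical script name)
```

Running the command in a subshell keeps the trap and the ''cd'' from leaking into the rest of the job script.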
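
To avoid many jobs reading the same big file from the shared disks over and over, each job can copy its input to the local disk once, work there, and copy only the result back. A minimal sketch; ''process_locally'' is a hypothetical name, the temporary-directory location is the same assumption as above, and the ''wc -l'' line is just a stand-in for the real computation:

```shell
#!/bin/bash
# Hypothetical sketch: copy the input from the shared (NFS) disk to the local
# disk once, do all the repeated reading locally, and write the result back
# to the shared disk once.
#   $1 = input file on the shared disk, $2 = output file on the shared disk
process_locally() {
  local shared_in=$1 shared_out=$2 local_dir status
  local_dir=$(mktemp -d "${TMPDIR:-/tmp}/stage.XXXXXX") || return 1

  # One sequential read from the shared disk.
  if cp "$shared_in" "$local_dir/input"; then
    # Stand-in for the real computation, reading only the local copy,
    # followed by one write back to the shared disk.
    wc -l < "$local_dir/input" > "$local_dir/output" &&
      cp "$local_dir/output" "$shared_out"
    status=$?
  else
    status=1
  fi

  rm -rf "$local_dir"   # clean up the local copy in any case
  return "$status"
}
```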
