Differences
This shows you the differences between two versions of the page.
grid [2017/09/26 18:30] popel delete outdated/old tricks |
grid [2017/10/02 16:56] popel [Advanced usage] |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== ÚFAL Grid Engine (LRC) ====== | ====== ÚFAL Grid Engine (LRC) ====== | ||
- | LRC (Linguistic Research Cluster) is a name of ÚFAL' | + | LRC (Linguistic Research Cluster) is a name of ÚFAL' |
If you need GPU processing, see a special page about our [[:gpu|GPU cluster called DLL]] (which is actually a subsystem of LRC with independent queue '' | If you need GPU processing, see a special page about our [[:gpu|GPU cluster called DLL]] (which is actually a subsystem of LRC with independent queue '' | ||
===== List of Machines ===== | ===== List of Machines ===== | ||
- | The list has been updated | + | Last update: |
Some machines are at Malá Strana (ground floor, new server room built from Lindat budget), some are at Troja (5 km north-east). | Some machines are at Malá Strana (ground floor, new server room built from Lindat budget), some are at Troja (5 km north-east). | ||
If you need to quickly distinguish which machine is located where, you can use your knowledge of [[https:// | If you need to quickly distinguish which machine is located where, you can use your knowledge of [[https:// | ||
Line 46: | Line 46: | ||
The **pandora machines** are in a special cluster (not accessible from lrc) and queue **ms-guests.q** available for our colleagues from KSVI and for students of [[http:// | The **pandora machines** are in a special cluster (not accessible from lrc) and queue **ms-guests.q** available for our colleagues from KSVI and for students of [[http:// | ||
- | queues of computations that are submitted to the compute machines in the cluster. No demanding computations should be run on the head, and conversely, computations should be sent to the other cluster machines exclusively via the head. The head has 2 Intel Pentium D 3 GHz processors and 1 GB of memory. In fact there are two heads, lrc1 and lrc2, which share the IP address lrc.ufal.hide.ms.mff.cuni.cz. If one of the heads fails, the other takes over. | + | ===== Installation ===== |
- | The queueing system makes it possible to: | + | |
- | * make maximum use of the computing power | + | Add the following line into your ' |
- | * submit many jobs at once; the jobs are started only when there is capacity for them | + | |
- | * " | + | |
- | ===== Getting started ===== | + | source / |
- | Once in a lifetime you must perform the [[Základní nastavení SGE]] (basic SGE setup) before you can use SGE. | + | This detects if you are on one of the cluster machines (including lrc and sol) and sets env variables accordingly. It also prints a status message. |
+ | Usually, this is the first line of your '~/.bash_profile' | ||
- | ===== Example of SGE usage ===== | + | [ -f ~/.bashrc ] && source ~/.bashrc |
- | This sequence of commands shows how to use SGE: | + | ===== Basic usage ===== |
+ | |||
+ | First, you need to ssh to the cluster head (lrc1 or lrc2) or to one of the sol machines. The full address is '' | ||
< | < | ||
- | ssh lrc2 | + | ssh lrc1 |
- | # log in to the cluster head | + | echo 'hostname; pwd; echo The second parameter is $2' |
- | echo "hostname; pwd" | + | # prepare a shell script describing your task |
- | # create a script describing what the job should do | + | qsub -cwd -j y script.sh Hello World |
- | qsub -cwd -j y skript.sh | + | # This submits your job to the default queue, which is currently '' |
- | # submit the job to the queue. | + | # Usually, there is a free slot, so the job will be scheduled within a few seconds. |
- | # Actually, it would suffice to call: qsub skript.sh | + | # We have used two handy qsub parameters: |
- | # But the additional parameters ensured that: | + | # -cwd ... the script is executed in the current directory |
- | # -cwd ... the script is run in the current directory | + | # -j y ... stdout and stderr outputs are merged and redirected to a file ('' |
- | # -V ... variables from your environment are copied to the script's environment | + | # We have also provided two parameters for our script " |
- | # -j y ... standard | + | # qsub prints something like |
- | # The order of the parameters **is** important, | + | # Your job 121144 (" |
qstat | qstat | ||
- | qstat -u ' | + | # This way we inspect all our jobs (both waiting in queue and scheduled, i.e. running). |
- | # Let's see which of your jobs are running. | + | qstat -u ' |
- | # SGE waits a while before it actually starts the script. For tiny jobs, SGE can therefore introduce | + | # This shows jobs of all users. |
- | # an unnecessary delay. | + | qstat -j 121144 |
- | # -u '*' shows the jobs of all users on the cluster | + | # This shows detailed info about the job with this number (if it is still running). |
- | cat skript.sh.oXXXXX | + | less script.sh.o* |
- | # print the script's output. XXXXX is the job ID that was assigned | + | # We can inspect the job's output (in our case stored in script.sh.o121144). |
- | # by qsub. So a second submission typically does not overwrite the older log. | + | # Hint: if the job is still running, press F in ' |
</ | </ | ||
- | And this is what the output of our script looked like: | + | The output of our job should look like this: |
< | < | ||
- | Warning: no access to tty (Bad file descriptor). | + | LRC:ubuntu 14.04: 8.1.7a Son of Grid Engine variables set... |
- | Thus no job control in this shell. | + | lucifer5 |
- | sol2.ufal.hide.ms.mff.cuni.cz | + | /home/popel/tmp |
- | /export/home/bojar | + | The second parameter is World |
+ | ======= EPILOG: Tue Sep 26 19:49:05 CEST 2017 | ||
+ | == Limits: | ||
+ | == Usage: | ||
+ | == Duration: 00:00:02 (2 s) | ||
</ | </ | ||
- | Other useful commands and parameters: | + | Our admins configured SGE to print some extra info on stderr: the first line and then the epilog. |
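Because `-j y` merges stderr into the log file, the banner line and the epilog end up next to the job's own output. The following is a minimal sketch of a filter that keeps only the job's own lines. The function name `strip_sge_extras` is hypothetical, and the two patterns are assumptions taken from the sample output shown above; if the admins change the banner or epilog text, the patterns must be adjusted.

```shell
#!/bin/bash
# strip_sge_extras (hypothetical helper): read a merged job log on stdin
# and print only the lines produced by the job itself.
# Assumptions, based on the sample output on this page:
#   - the banner, if present, is the first line and contains
#     "Grid Engine variables set"
#   - the epilog starts with a line beginning "======= EPILOG:"
#     and runs to the end of the file
strip_sge_extras() {
    sed -e '1{/Grid Engine variables set/d;}' \
        -e '/^======= EPILOG:/,$d'
}

# Example (not executed here): strip_sge_extras < script.sh.o121144
```

With the sample log above, this should leave just the hostname, the working directory and the "The second parameter is World" line.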
+ | The '' | ||
+ | The '' | ||
< | < | ||
- | qsub -o LOG.stdout -e LOG.stderr skript.sh | + | qdel 121144 |
- | # if you want to redirect the script's output to specific files | + | # This way you can delete (" |
- | qsub -S /bin/bash | + | |
- | # if you want the script to run in bash | + | |
- | qsub -V | + | |
- | # if you want to pass environment variables | + | |
qdel \* | qdel \* | ||
- | # if you want to cancel all your jobs (you may not cancel other users' jobs) | + | # This way you can delete all your jobs. Don't be afraid - you cannot delete other users' jobs. |
</ | </ | ||
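Both `qstat -j` and `qdel` need the job number, which qsub reports in its "Your job 121144 (...)" message. In scripts it can be handy to capture that number instead of copying it by hand. The sketch below assumes the message format of a plain (non-array) job as shown above; the helper name `job_id_from_qsub` is hypothetical.

```shell
#!/bin/bash
# job_id_from_qsub (hypothetical helper): read qsub's confirmation message
# on stdin and print the numeric job ID.
# Assumption: the message starts with "Your job <number> ", as in the
# example on this page. Array jobs report a different message and are
# not handled here.
job_id_from_qsub() {
    sed -n 's/^Your job \([0-9][0-9]*\) .*/\1/p'
}

# Example use on the cluster (not executed here):
#   id=$(qsub -cwd -j y script.sh Hello World | job_id_from_qsub)
#   qstat -j "$id"    # detailed info about the job
#   qdel "$id"        # delete the job
```

Grid Engine's qsub also has a `-terse` option that is meant to print only the job ID; check `man qsub` on the cluster before relying on it.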
+ | ===== Rules ===== | ||
+ | The purpose of these rules is to prevent your jobs from damaging the work of your colleagues and to divide the resources among users in a fair way. |
+ | * Read about our [[internal: | ||
+ | * While your jobs are running (or queued), check your jobs (esp. previously untested setups) and your email (esp. [[internal: | ||
+ | * You can ssh to any cluster machine, which can be useful e.g. to diagnose what's happening there (using '' | ||
+ | * However, **never execute any computing manually** on a cluster machine where you are sshed (i.e. not via '' | ||
+ | * For interactive work, you can use '' | ||
+ | * **Specify the memory and CPU requirements** (if higher than the defaults) and **don' | ||
+ | * If your job needs more than one CPU (on a single machine) for most of the time, reserve the given number of CPU cores (and SGE slots) with < | ||
+ | * If you are sure your job needs less than 1GB RAM, then you can skip this. Otherwise, if you need e.g. 8 GiB, you must always use '' | ||
+ | * Be kind to your colleagues. If you are going to submit jobs that effectively occupy **more than one fifth of our cluster for more than several hours**, check if the cluster is free (with '' | ||
+ | === Memory === | ||
- | ===== Rules for proper use of the cluster ===== | + | * There are three commonly used options for specifying memory requirements: |
+ | * '' | ||
+ | * **mem_free** (or mf) specifies a // | ||
+ | * **act_mem_free** (or amf) is a ÚFAL-specific option, which specifies the real amount of free memory (at the time of scheduling). You can specify it when submitting a job and it will be scheduled to a machine with at least this amount of memory free. In an ideal world, where no jobs are exceeding their '' | ||
+ | * **h_vmem** is equivalent to setting '' | ||
+ | * It is recommended to **profile your task first**, so you can estimate reasonable memory requirements before submitting many jobs with the same task (varying in parameters which do not affect memory consumption). So for the first time, declare mem_free with much more memory than expected and ssh to a given machine and check '' | ||
+ | * **s_vmem** is similar to '' | ||
- | The basic rule we all must respect so that SGE does its job well: | ||
- | * Do not run jobs manually. (SGE has no information about manually started jobs, so it may well send further jobs from the queue to that node.) | + | ===== Advanced usage ===== |
- | * An interactive shell can be obtained with the command '' | + | |
- | Further recommendations: | + | '' |
- | | + | This way your job is submitted to the Troja queue. The default is '' |
- | | + | '' |
- | * Clean up your local data after yourself, because otherwise nobody will be able to run anything useful there. | + | '' |
- | * Avoid very heavy parallel access to shared disks. The NFS server | + | '' |
- | * Tell SGE how much memory the job consumes so that the machines do not run out of memory | + | However, usually you should specify just the queue (troja-all.q vs. ms-all.q), not a particular machine, and instead use '' |
+ | '' | ||
+ | See '' | ||
- | There are no more rules. | + | '' |
+ | Define a priority of your job as a number between -1024 and 0. Only SGE admins may use a number higher than 0. The default is 0, i.e. the highest possible priority. SGE uses the priority to decide when to start which pending job in the queue (it computes a real number called '' | ||
- | ===== Good manners ===== | + | '' |
+ | redirect std{out, | ||
- | If I want to run jobs that will take a long time (hours, days), I do not start them all at once, so that others can use the cluster too. | + | '' |
+ | Instead of specifying all the '' | ||
- | ===== Tricks and embellishments ===== | + | '' |
+ | Execute your job no sooner than at the given time (in '' | ||
+ | '' | ||
+ | Treat '' | ||
- | ==== ~bojar/ | + | '' |
+ | Specify the emails where you want to be notified when the job has been **b** started, **e** ended, **a** aborted or rescheduled, | ||
- | qsubmit is like qsub, but more pleasant: | + | '' |
+ | The current job is not executed before all the specified jobs are completed. | ||
- | | + | '' |
- | * you do not have to add | + | Start the job immediately or not at all, i.e. don't put it as pending to the queue. This is the default for '' |
+ | |||
+ | '' | ||
+ | By default the name of a job (which you can see e.g. in '' | ||
+ | |||
+ | |||
+ | '' | ||
+ | The hashbang ('' | ||
+ | |||
+ | '' | ||
+ | Export a given environment variable from the current shell to the job. | ||
+ | |||
+ | '' | ||
+ | Export all environment variables. (This is not so needed now, when bash is the default interpreter and it seems your '' | ||
+ | |||
+ | '' | ||
+ | By default, all the resource requirements (specified with '' | ||
+ | |||
+ | '' | ||
+ | This causes qsub to wait for the job to complete before exiting (with the same exit code as the job). Useful in scripts. | ||
+ | |||
+ | '' | ||
+ | You can change some properties of already submitted jobs, which are still waiting in the queue (// | ||
+ | |||
+ | '' | ||
+ | Find out all the gory details which are missing here. You'll have to do it one day anyway:-). | ||
+ | |||
+ | === qsub wrappers === | ||
+ | |||
+ | If you often run (ad-hoc) bash commands via '' | ||
+ | |||
+ | === qunhold === | ||
+ | '' | ||
+ | |||
+ | === sshcwd === | ||
+ | This is useful not only when sshing to sol machines. Add the following lines to your '' | ||
< | < | ||
- | ~bojar/ | + | function sshcwd () { |
- | </ | + | # save the current history so that it is available |
+ | # immediately on the remote machine | ||
+ | history -a; | ||
+ | # setup the working directory by setting WD | ||
+ | ssh -X -Y -C -t $@ " | ||
+ | } | ||
+ | |||
+ | # use WD to setup the working directory | ||
+ | if [ -n " | ||
+ | echo " | ||
+ | cd "$WD";
+ | fi | ||
+ | |||
+ | alias sol1=" | ||
+ | </ | ||
+ | |||
+ | === In-script options === | ||
+ | |||
+ | If you don't use '' | ||
+ | |||
+ | #$ -l mem_free=10G, | ||
+ | #$ -pe smp 8 | ||
+ | #$ -cwd -j y | ||
+ | #$ -N name-of-my-job | ||
+ | ./ | ||
+ | |||
+ | and you execute it now simply with '' | ||
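Before submitting a script with embedded options, it can help to see at a glance which options it carries. The sketch below writes a demo job script using three of the directives from the example above and lists them. Both function names are hypothetical, the `echo` line is a placeholder for the real workload, and the listing only approximates what qsub itself would parse.

```shell
#!/bin/bash
# make_demo_job FILE (hypothetical helper): write a small job script with
# embedded "#$" options. The option values are copied from the example on
# this page; the workload line is only a placeholder.
make_demo_job() {
    cat > "$1" <<'EOF'
#!/bin/bash
#$ -pe smp 8
#$ -cwd -j y
#$ -N name-of-my-job
echo "running on $(hostname)"
EOF
}

# list_sge_directives FILE (hypothetical helper): print every option line,
# i.e. each line starting with "#$", without the "#$" prefix.
list_sge_directives() {
    sed -n 's/^#\$ *//p' "$1"
}

# Example:
#   make_demo_job job.sh
#   list_sge_directives job.sh
```

Keeping the options inside the script means the resource requirements travel with the job and do not have to be retyped on every submission.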
+ | |||
+ | === Array jobs === | ||
- | works better | + | If you have a set of tasks (of the same type) and want to run them on multiple machines, use '' |
+ | TODO | ||
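Until this section is filled in, here is a minimal sketch of the usual Grid Engine array-job pattern; it is standard SGE behaviour, not something this page confirms for LRC. When a job is submitted with `qsub -t 1-N`, each task gets its index in the `SGE_TASK_ID` environment variable and can use it to pick its own input. The function names and the file `inputs.txt` are hypothetical.

```shell
#!/bin/bash
# nth_line FILE N (hypothetical helper): print the N-th line of FILE (1-based).
nth_line() {
    sed -n "${2}p" "$1"
}

# task_input FILE (hypothetical helper): print the line of FILE that belongs
# to the current array task. Outside an array job SGE_TASK_ID is unset or not
# a number (SGE sets it to "undefined"), so we fall back to line 1.
task_input() {
    local id="${SGE_TASK_ID:-1}"
    case "$id" in
        ''|*[!0-9]*) id=1 ;;
    esac
    nth_line "$1" "$id"
}

# Example job script body (not executed here), with one input file per line
# of a hypothetical inputs.txt:
#   input=$(task_input inputs.txt)
#   ./process "$input"
# submitted e.g. as: qsub -cwd -j y -t 1-3 task.sh
```

Each task then behaves like an ordinary job, so the memory and CPU rules above apply per task.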
===== Job monitoring ===== | ===== Job monitoring ===== | ||
Line 187: | Line 279: | ||
If you want to run the submitted program in your favorite environment (e.g. the setting of '' | If you want to run the submitted program in your favorite environment (e.g. the setting of '' | ||
- | |||
- | ==== A different shell ==== | ||
- | |||
- | To submit a job to the queue, I have to create a script for it. Fine, I have created a script: | ||
- | |||
- | < | ||
- | #!/bin/bash | ||
- | program > log.out 2> log.err | ||
- | </ | ||
- | |||
- | When I run this script, the expected thing happens: the program's outputs are redirected to files, and that's it. | ||
- | |||
- | When I submit such a script, the program does **not run**. In the log I find that the (standard error) output of the shell that ran my script contains the cryptic message " | ||
- | |||
- | I won't keep you in suspense; here is the explanation: | ||
- | |||
- | This is how you force SGE, | ||
- | |||
- | < | ||
- | qsub -S /bin/bash skript | ||
- | </ | ||
- | |||
- | Another option is to redirect stderr and stdout using csh syntax: | ||
- | |||
- | < | ||
- | ( command > | ||
- | </ | ||
- | |||
==== bashrc and similar files must not print anything to the console ==== | ==== bashrc and similar files must not print anything to the console ==== | ||
Line 241: | Line 305: | ||
hard resource_list: | hard resource_list: | ||
hard resource_list: | hard resource_list: | ||
- | |||
- | ==== How to reserve multiple cores on the same machine for one job ==== | ||
- | |||
- | < | ||
- | qsub -pe smp <number of cores> | ||
- | </ | ||