ÚFAL Grid Engine (LRC)
LRC (Linguistic Research Cluster) is the name of ÚFAL's computational grid/cluster. The cluster is built on top of SLURM and uses Lustre for data storage.
Currently the following partitions (queues) are available for computing:
Node list by partitions
The naming convention is straightforward for CPU nodes: nodes in each group are simply numbered. For GPU nodes the format is [t]dll-XgpuN, where X is the total number of GPUs in the node and N enumerates the nodes with the given configuration.
The prefix t is for nodes at Troja and dll stands for Deep Learning Laboratory.
cpu-troja
Node name | Thread count | Socket:Core:Thread | RAM (MB) |
achilles1 | 32 | 2:8:2 | 128810 |
achilles2 | 32 | 2:8:2 | 128810 |
achilles3 | 32 | 2:8:2 | 128810 |
achilles4 | 32 | 2:8:2 | 128810 |
achilles5 | 32 | 2:8:2 | 128810 |
achilles6 | 32 | 2:8:2 | 128810 |
achilles7 | 32 | 2:8:2 | 128810 |
achilles8 | 32 | 2:8:2 | 128810 |
hector1 | 32 | 2:8:2 | 128810 |
hector2 | 32 | 2:8:2 | 128810 |
hector3 | 32 | 2:8:2 | 128810 |
hector4 | 32 | 2:8:2 | 128810 |
hector5 | 32 | 2:8:2 | 128810 |
hector6 | 32 | 2:8:2 | 128810 |
hector7 | 32 | 2:8:2 | 128810 |
hector8 | 32 | 2:8:2 | 128810 |
helena1 | 32 | 2:8:2 | 128811 |
helena2 | 32 | 2:8:2 | 128811 |
helena3 | 32 | 2:8:2 | 128811 |
helena4 | 32 | 2:8:2 | 128811 |
helena5 | 32 | 2:8:2 | 128810 |
helena6 | 32 | 2:8:2 | 128811 |
helena7 | 32 | 2:8:2 | 128810 |
helena8 | 32 | 2:8:2 | 128811 |
paris1 | 32 | 2:8:2 | 128810 |
paris2 | 32 | 2:8:2 | 128810 |
paris3 | 32 | 2:8:2 | 128810 |
paris4 | 32 | 2:8:2 | 128810 |
paris5 | 32 | 2:8:2 | 128810 |
paris6 | 32 | 2:8:2 | 128810 |
paris7 | 32 | 2:8:2 | 128810 |
paris8 | 32 | 2:8:2 | 128810 |
hyperion2 | 64 | 2:16:2 | 257667 |
hyperion3 | 64 | 2:16:2 | 257667 |
hyperion4 | 64 | 2:16:2 | 257667 |
hyperion5 | 64 | 2:16:2 | 257667 |
hyperion6 | 64 | 2:16:2 | 257667 |
hyperion7 | 64 | 2:16:2 | 257667 |
hyperion8 | 64 | 2:16:2 | 257667 |
cpu-ms
Node name | Thread count | Socket:Core:Thread | RAM (MB) |
iridium | 16 | 2:4:2 | 515977 |
orion1 | 40 | 2:10:2 | 128799 |
orion2 | 40 | 2:10:2 | 128799 |
orion3 | 40 | 2:10:2 | 128799 |
orion4 | 40 | 2:10:2 | 128799 |
orion5 | 40 | 2:10:2 | 128799 |
orion6 | 40 | 2:10:2 | 128799 |
orion7 | 40 | 2:10:2 | 128799 |
orion8 | 40 | 2:10:2 | 128799 |
gpu-troja
Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features |
tdll-3gpu1 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
tdll-3gpu2 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
tdll-3gpu3 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
tdll-3gpu4 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
tdll-8gpu1 | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 |
tdll-8gpu2 | 64 | 2:16:2 | 257666 | gpuram40G gpu_cc8.0 |
tdll-8gpu3 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
tdll-8gpu4 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
tdll-8gpu5 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
tdll-8gpu6 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
tdll-8gpu7 | 32 | 2:8:2 | 253725 | gpuram16G gpu_cc7.5 |
gpu-ms
Node name | Thread count | Socket:Core:Thread | RAM (MB) | Features |
dll-3gpu1 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
dll-3gpu2 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
dll-3gpu3 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
dll-3gpu4 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
dll-3gpu5 | 64 | 2:16:2 | 128642 | gpuram48G gpu_cc8.6 |
dll-4gpu1 | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 |
dll-4gpu2 | 40 | 2:10:2 | 187978 | gpuram24G gpu_cc8.6 |
dll-8gpu1 | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 |
dll-8gpu2 | 64 | 2:16:2 | 515838 | gpuram24G gpu_cc8.0 |
dll-8gpu3 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 |
dll-8gpu4 | 32 | 2:8:2 | 253721 | gpuram16G gpu_cc8.6 |
dll-8gpu5 | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 |
dll-8gpu6 | 40 | 2:10:2 | 385595 | gpuram16G gpu_cc7.5 |
dll-10gpu1 | 32 | 2:8:2 | 257830 | gpuram16G gpu_cc8.6 |
dll-10gpu2 | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 |
dll-10gpu3 | 32 | 2:8:2 | 257830 | gpuram11G gpu_cc6.1 |
Submit nodes
In order to submit a job you need to login to one of the head nodes:
lrc1.ufal.hide.ms.mff.cuni.cz
lrc2.ufal.hide.ms.mff.cuni.cz
sol1.ufal.hide.ms.mff.cuni.cz
sol2.ufal.hide.ms.mff.cuni.cz
sol3.ufal.hide.ms.mff.cuni.cz
sol4.ufal.hide.ms.mff.cuni.cz
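Logging in is done via plain SSH; for example (a sketch; replace <login> with your ÚFAL user name, any of the head nodes above works):
ssh <login>@lrc1.ufal.hide.ms.mff.cuni.cz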
Basic usage
Batch mode
The core idea is that you write a batch script containing the commands you wish to run as well as a list of SBATCH directives specifying the resources or parameters that you need for your job.
Then the script is submitted to the cluster with:
sbatch myJobScript.sh
Here is a simple working example:
#!/bin/bash
#SBATCH -J helloWorld             # name of job
#SBATCH -p cpu-troja              # name of partition or queue (if not specified default partition is used)
#SBATCH -o helloWorld.out         # name of output file for this submission script
#SBATCH -e helloWorld.err         # name of error file for this submission script

# run my job (some executable)
sleep 5
echo "Hello I am running on cluster!"
After submitting this simple script you should end up with two files (helloWorld.out and helloWorld.err) in the directory where you called the sbatch command.
Here is the list of other useful SBATCH directives:
#SBATCH -D /some/path/                  # change directory before executing the job
#SBATCH -N 2                            # number of nodes (default 1)
#SBATCH --nodelist=node1,node2...       # required node, or comma separated list of required nodes
#SBATCH --cpus-per-task=4               # number of cores/threads per task (default 1)
#SBATCH --gres=gpu:1                    # number of GPUs to request (default 0)
#SBATCH --mem=10G                       # request 10 gigabytes memory (per node, default depends on node)
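Putting several of these directives together, a GPU batch script might look like the following sketch (the job name, output files and the train.py command are placeholders, not part of the cluster setup):
#!/bin/bash
#SBATCH -J gpuExample                   # placeholder job name
#SBATCH -p gpu-ms                       # one of the GPU partitions
#SBATCH --gres=gpu:1                    # request a single GPU
#SBATCH --cpus-per-task=4               # four CPU threads for the task
#SBATCH --mem=32G                       # 32 GB of RAM on the node
#SBATCH -o gpuExample.out               # standard output
#SBATCH -e gpuExample.err               # standard error

# the command below is only an illustration of the payload
python train.py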
If you need, you can have SLURM send you email notifications:
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-type=fail         # send email if job fails
#SBATCH --mail-user=<YourUFALEmailAccount>
As usual, the complete set of options can be found by typing:
man sbatch
Running jobs
In order to inspect all running jobs on the cluster use:
squeue
filter only jobs of user linguist:
squeue -u linguist
filter only jobs on partition gpu-ms:
squeue -p gpu-ms
filter jobs in a specific state (see man squeue for the list of valid job states):
squeue -t RUNNING
filter jobs running on a specific node:
squeue -w dll-3gpu1
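The displayed columns can also be customized with a format string; the combination below is just one example built from standard squeue format fields:
# job id, partition, name, user, state, elapsed time, nodes/reason
squeue -o "%.10i %.12P %.20j %.10u %.2t %.10M %R"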
Cluster info
The command sinfo can give you useful information about nodes available in the cluster. Here is a short list of some examples:
List available partitions (queues). The default partition is marked with *:
sinfo
List detailed info about nodes:
sinfo -l -N
List nodes with some custom format info:
sinfo -N -o "%N %P %.11T %.15f"
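For GPU nodes it can be handy to also show the generic resources (GPUs); the format string below is just one possible combination of standard sinfo fields:
# node name, partition, generic resources (GPUs) and features
sinfo -N -o "%N %P %G %f"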
CPU core allocation
The minimal computing resource in SLURM is one CPU core. However, the CPU count advertised by SLURM corresponds to the number of CPU threads.
If you ask for 1 CPU core with --cpus-per-task=1, SLURM will allocate all threads of that core. For example, dll-8gpu1 will allocate 2 threads, since its ThreadsPerCore=2:
$ scontrol show node dll-8gpu1
NodeName=dll-8gpu1 Arch=x86_64 CoresPerSocket=16
   CPUAlloc=0 CPUTot=64 CPULoad=0.05                         // CPUAlloc - allocated threads, CPUTot - total threads
   AvailableFeatures=gpuram24G
   ActiveFeatures=gpuram24G
   Gres=gpu:nvidia_a30:8(S:0-1)
   NodeAddr=10.10.24.63 NodeHostName=dll-8gpu1 Version=21.08.8-2
   OS=Linux 5.15.35-1-pve #1 SMP PVE 5.15.35-3 (Wed, 11 May 2022 07:57:51 +0200)
   RealMemory=515838 AllocMem=0 FreeMem=507650 Sockets=2 Boards=1
   CoreSpecCount=1 CPUSpecList=62-63                         // CoreSpecCount - cores reserved for OS, CPUSpecList - threads reserved for the system
   State=IDLE ThreadsPerCore=2 TmpDisk=0 Weight=1 Owner=N/A MCS_label=N/A   // ThreadsPerCore - count of threads per CPU core
   Partitions=gpu-ms
   BootTime=2022-09-01T14:07:50 SlurmdStartTime=2022-09-02T13:54:05
   LastBusyTime=2022-10-02T20:17:09
   CfgTRES=cpu=64,mem=515838M,billing=64
   AllocTRES=
   CapWatts=n/a
   CurrentWatts=0 AveWatts=0
   ExtSensorsJoules=n/s ExtSensorsWatts=0 ExtSensorsTemp=n/s
In the example above you can see comments on all lines relevant to CPU allocation.
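As a quick sanity check (a sketch; the job name and output file are placeholders), you can submit a one-core job and let it print how many CPUs SLURM actually allocated:
#!/bin/bash
#SBATCH -J cpuCheck               # placeholder job name
#SBATCH -p cpu-ms                 # any CPU partition works here
#SBATCH --cpus-per-task=1         # ask for a single core
#SBATCH -o cpuCheck.out

# on nodes with ThreadsPerCore=2 this typically prints 2
echo "CPUs allocated on this node: $SLURM_CPUS_ON_NODE"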
Interactive mode
This mode can be useful for testing. You should use batch mode for any serious computation.
You can use the srun command to get an interactive shell on an arbitrary node from the default partition (queue):
srun --pty bash
There are many more parameters available to use. For example:
To get an interactive CPU job with 64GB of reserved memory:
srun -p cpu-troja,cpu-ms --mem=64G --pty bash
-p cpu-troja explicitly requires the partition cpu-troja. If not specified, SLURM will use the default partition.
--mem=64G requests 64G of memory for the job
To get interactive job with a single GPU of any kind:
srun -p gpu-troja,gpu-ms --gres=gpu:1 --pty bash
-p gpu-troja,gpu-ms requires only nodes from these two partitions
--gres=gpu:1 requests 1 GPU
srun -p gpu-troja,gpu-ms --nodelist=tdll-3gpu1 --mem=64G --gres=gpu:2 --pty bash
-p gpu-troja,gpu-ms requires only nodes from these two partitions
--nodelist=tdll-3gpu1 explicitly requires one specific node
--gres=gpu:2 requests 2 GPUs
srun -p gpu-troja --constraint="gpuram48G|gpuram40G" --mem=64G --gres=gpu:2 --pty bash
--constraint="gpuram48G|gpuram40G" only considers nodes that have either the gpuram48G or the gpuram40G feature defined
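Once the interactive GPU job starts, you can verify the allocation from inside the shell; nvidia-smi and the CUDA_VISIBLE_DEVICES variable should show only the GPUs granted to the job (a quick sanity check, not a required step):
# inside the interactive job: list the GPUs visible to this job
nvidia-smi

# SLURM restricts the visible devices via this variable
echo $CUDA_VISIBLE_DEVICES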
Delete Job
scancel <job_id>
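Besides cancelling a single job by its ID, scancel also accepts filters; for example (helloWorld refers to the example job name used earlier):
# cancel all your own jobs
scancel -u $USER

# cancel jobs with a given name
scancel --name=helloWorld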
To see all the available options, type:
man srun