====== Running Spark on Single Machine or on Cluster ======

In order to use Spark, the environment has to be set up according to [[:spark#using-spark-in-ufal-environment|Using Spark in UFAL Environment]].

When a Spark computation starts, it uses the environment variable ''MASTER'' to determine the mode of computation. The following values are possible:
  * ''local'': Run locally using a single thread.
  * ''local[N]'' (e.g., ''local[2]'' or ''local[4]''): Run locally using ''N'' threads.
  * ''local[*]'' (default if the ''MASTER'' variable is not set): Run locally using as many threads as there are processor cores.
  * ''spark:/''''/master_address:master_port'': Run in a distributed fashion using the specified master.
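
In a shell, the mode can be selected by exporting the variable before the computation is started. A minimal sketch, using one of the values listed above; the quotes are there so that the shell does not treat the square brackets as a glob pattern:

```shell
# Run locally using four threads. Quote the value so that the shell
# does not try to expand the square brackets as a glob pattern.
export MASTER='local[4]'
echo "MASTER=$MASTER"
```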

===== Running Spark on Single Machine =====

Spark computations can be started both on desktop machines and on cluster machines, either by setting ''MASTER'' to one of the ''local'' modes, or by not setting ''MASTER'' at all (in which case ''local[*]'' is used).

Note that when you use ''sbatch'' or ''srun'' to run Spark locally on a cluster machine, your job is by default expected to use just a single core, so you should specify ''MASTER=local''. If you do not, Spark will use all cores on the machine, even though Slurm gave you only one.
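
If a job asks Slurm for more cores, the number of Spark threads can be kept in step with the allocation. The following is a sketch only, not part of the UFAL setup: it assumes the job was submitted with ''-c''/''--cpus-per-task'', in which case Slurm sets ''SLURM_CPUS_PER_TASK'' in the job environment, and it falls back to a single thread otherwise.

```shell
# SLURM_CPUS_PER_TASK is set by Slurm only when the job was submitted with
# -c / --cpus-per-task; fall back to a single thread when it is not set.
export MASTER="local[${SLURM_CPUS_PER_TASK:-1}]"
echo "Using MASTER=$MASTER"
```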

===== Starting Spark Cluster  =====

A Spark cluster can be started using Slurm. The cluster is user-specific, but it can be used for several consecutive Spark computations.

The Spark cluster can be started using one of the following two commands:
  * ''spark-sbatch'': start a Spark cluster via ''sbatch'' <file>spark-sbatch [sbatch args] workers memory_per_workerG[:python_memoryG] command [arguments...]</file>
  * ''spark-srun'': start a Spark cluster via ''srun'' <file>spark-srun [salloc args] workers memory_per_workerG[:python_memoryG] [command arguments...]</file>

Both ''spark-sbatch'' and ''spark-srun'' commands start a Spark cluster with the specified number of workers, each with the given amount of memory. Then they set ''MASTER'' and ''SPARK_ADDRESS'' to the address of the Spark master and ''SPARK_WEBUI'' to the URL of the master web interface. Both these values are also written on standard output, and the ''SPARK_WEBUI'' is added to the Slurm job Comment. Finally, the specified command is started; when ''spark-srun'' is used, the command may be empty, in which case ''bash'' is opened.

==== Memory Specification ====

TL;DR: A good default is ''2G''.

The memory for each worker is specified using the following format: <file>spark_memory_per_workerG[:memory_per_Python_processG]</file>

The Spark memory limits the Java heap, and half of it is reserved for memory storage of cached RDDs. The second value sets a memory limit of every Python process and is by default set to ''2G''.

==== Examples ====

Start a Spark cluster with 10 workers, 2GB RAM each, and then run an interactive shell. The cluster stops after the shell is exited.
<file>spark-srun 10 2G</file>

Start a Spark cluster with 20 workers, 4GB RAM each, in the ''cpu-ms'' partition, and run ''screen'' in it, so that several computations can be performed using this cluster. The cluster has to be stopped manually (either by quitting the screen or by calling ''scancel'').
<file>spark-sbatch -p cpu-ms 20 4G screen -D -m</file>