Both sides previous revision
Previous revision
Next revision
|
Previous revision
|
spark:running-spark-on-single-machine-or-on-cluster [2022/12/14 12:49] straka [Starting Spark Cluster] |
spark:running-spark-on-single-machine-or-on-cluster [2023/11/07 12:48] (current) straka [Starting Spark Cluster] |
Spark computations can be started both on desktop machines and on cluster machines, either by specifying ''MASTER'' to one of ''local'' modes, or by not specifying MASTER at all (''local[*]'' is used then). | Spark computations can be started both on desktop machines and on cluster machines, either by specifying ''MASTER'' to one of ''local'' modes, or by not specifying MASTER at all (''local[*]'' is used then). |
| |
Note that when you use ''qrsh'' or ''qsub'', your job is usually expected to use one core, so you should specify ''MASTER=local''. If you do not, Spark will use all cores on the machine, even though SGE gave you only one. | Note that when you use ''sbatch'' or ''srun'' to run a cluster locally, your job is by default expected to use just a single core, so you should specify ''MASTER=local''. If you do not, Spark will use all cores on the machine, even though Slurm gave you only one. |
| |
===== Starting Spark Cluster ===== | ===== Starting Spark Cluster ===== |
| |
The Spark cluster can be started using one of the following two commands: | The Spark cluster can be started using one of the following two commands: |
* ''spark-sbatch'': start a Spark cluster via an ''sbatch'' <file>spark-srun [sbatch args] workers memory_per_workerG[:python_memoryG] command [arguments...]</file> | * ''spark-sbatch'': start a Spark cluster via an ''sbatch'' <file>spark-sbatch [sbatch args] workers memory_per_workerG[:python_memoryG] command [arguments...]</file> |
* ''spark-srun'': start a Spark cluster via an ''srun'' <file>spark-srun [salloc args] workers memory_per_workerG[:python_memoryG] [command arguments...]</file> | * ''spark-srun'': start a Spark cluster via an ''srun'' <file>spark-srun [salloc args] workers memory_per_workerG[:python_memoryG] [command arguments...]</file> |
| |
Both ''spark-sbatch'' and ''spark-srun'' commands start a Spark cluster with the specified number of workers, each with the given amount of memory. Then they set ''MASTER'' and ''SPARK_ADDRESS'' to the address of the Spark master and ''SPARK_WEBUI'' to the URL of the master web interface. Both these values are also written on standard output and added to the Slurm job Comment. Finally, the specified command is started; when ''spark-srun'' is used, the command may be empty, in which case ''bash'' is opened. | Both ''spark-sbatch'' and ''spark-srun'' commands start a Spark cluster with the specified number of workers, each with the given amount of memory. Then they set ''MASTER'' and ''SPARK_ADDRESS'' to the address of the Spark master and ''SPARK_WEBUI'' to the URL of the master web interface. Both these values are also written on standard output, and the ''SPARK_WEBUI'' is added to the Slurm job Comment. Finally, the specified command is started; when ''spark-srun'' is used, the command may be empty, in which case ''bash'' is opened. |
| |
==== Memory Specification ==== | ==== Memory Specification ==== |
| |
Memory specification used for master and worker heap size (and for ''mem_free'' SGE constraint) must be specified. The memory can be specified either in bytes, or using ''kK/mM/gG'' suffix. A reasonable default value is 512M or 1G. | TL;DR: Good default is ''2G''. |
| |
| The memory for each worker is specified using the following format: <file>spark_memory_per_workerG[:memory_per_Python_processG]</file> |
| |
| The Spark memory limits the Java heap, and half of it is reserved for memory storage of cached RDDs. The second value sets a memory limit of every Python process and is by default set to ''2G''. |
| |
==== Examples ==== | ==== Examples ==== |
| |
Start Spark cluster with 10 machines 1GB RAM each and then run interactive shell. The cluster stops after the shell is exited. | Start Spark cluster with 10 workers 2GB RAM each and then run interactive shell. The cluster stops after the shell is exited. |
<file>spark-qrsh 10 1G</file> | <file>spark-srun 10 2G</file> |
| |
Start Spark cluster with 20 machines 512MB RAM each. The cluster has to be stopped manually using ''qdel''. | |
<file>spark-qsub 20 512m sleep infinity</file> | |
| |
Note that a running Spark cluster can currently be used only from other cluster machines (connections to a running SGE Spark cluster from my workstation ends with timeout). | |
| |
==== Additional SGE Options ==== | |
| |
Additional ''qrsh'' or ''qsub'' options can be specified in ''SGE_OPTS'' environmental variable (not as ''spark-qsub'' or ''spark-qrsh'' arguments), as in the following example which schedules the Spark master and workers to machines different then ''hyperion*'' and ''pandora*'': | Start Spark cluster with 20 workers 4GB RAM each in the ''cpu-ms'' partition, and run ''screen'' in it, so that several computations can be performed using this cluster. The cluster has to be stopped manually (either by quitting the scree or calling ''scancel''). |
<file>SGE_OPTS='-q *@!(hyperion*|pandora*)' spark-qrsh 10 1G</file> | <file>spark-sbatch -p cpu-ms 20 4G screen -D -m</file> |
| |