* ''spark-srun'': start a Spark cluster via an ''srun'' <file>spark-srun [salloc args] workers memory_per_workerG[:python_memoryG] [command arguments...]</file>

Both the ''spark-sbatch'' and ''spark-srun'' commands start a Spark cluster with the specified number of workers, each with the given amount of memory. They then set ''MASTER'' and ''SPARK_ADDRESS'' to the address of the Spark master and ''SPARK_WEBUI'' to the URL of the master web interface; these values are also written to standard output, and ''SPARK_WEBUI'' is added to the Slurm job Comment. Finally, the specified command is started; with ''spark-srun'' the command may be empty, in which case ''bash'' is opened.
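For instance, an interactive cluster could be started as follows (the worker count and memory size are purely illustrative):

<file>
# Start an interactive Spark cluster with 4 workers and 2G of memory per worker;
# no command is given, so a bash shell is opened inside the allocation.
spark-srun 4 2G

# Inside that shell, the cluster address and web UI are available:
echo $MASTER $SPARK_ADDRESS $SPARK_WEBUI
</file>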
==== Memory Specification ====

TL;DR: A good default is ''2G''.

The memory for each worker is specified using the following format: <file>spark_memory_per_workerG[:memory_per_Python_processG]</file>

The Spark memory limits the Java heap, and half of it is reserved for memory storage of cached RDDs. The second value sets the memory limit for every Python process and defaults to ''2G''.
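For example (the numbers are only illustrative), the following command starts 8 workers, each with 4G of Spark memory and an 8G limit for its Python processes:

<file>
# 8 workers, 4G of Spark (Java heap) memory per worker, 8G memory limit per Python process
spark-srun 8 4G:8G
</file>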
==== Examples ====