spark:using-python [2014/11/11 09:28] straka → spark:using-python [2019/11/12 13:19] straka
<file>pyspark</file>

A better interactive shell with code completion using ''ipython3'' (run ''pip3 install --user ipython'' if you do not have it) can be started using:
<file>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>

As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).

  * run the interactive shell using an existing Spark cluster (i.e., inside ''spark-qrsh''), or start a local Spark cluster using as many threads as there are cores if there is none:
<file>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>
  * run the interactive shell with a local Spark cluster using one thread:
<file>MASTER=local PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>
  * start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the interactive shell:
<file>PYSPARK_DRIVER_PYTHON=ipython3 spark-qrsh 10 1G pyspark</file>

Note that the ''PYSPARK_DRIVER_PYTHON'' variable can be left out or set in ''.bashrc'' (or similar).
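Inside the interactive shell a SparkContext is predefined as ''sc''. As a quick sanity check, a word count over the wiki-cs example data could be sketched as follows (a sketch only; the ''to_pairs'' helper is plain Python, and the snippet skips the cluster part when run outside pyspark):

```python
# Sketch of an interactive word count; inside the pyspark shell the
# SparkContext is predefined as `sc`.

def to_pairs(line):
    # Turn one input line into (word, 1) pairs for reduceByKey.
    return [(word, 1) for word in line.split()]

try:
    sc  # predefined only inside the pyspark shell
except NameError:
    sc = None  # running outside pyspark; skip the cluster part

if sc is not None:
    counts = (sc.textFile("/net/projects/spark-example-data/wiki-cs")
                .flatMap(to_pairs)
                .reduceByKey(lambda a, b: a + b))
    # Print the five most frequent words with their counts.
    print(counts.takeOrdered(5, key=lambda pair: -pair[1]))
```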

  * run the ''word_count.py'' script inside an existing Spark cluster (i.e., inside ''spark-qsub'' or ''spark-qrsh''), or start a local Spark cluster using as many threads as there are cores if there is none:
<file>spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
  * run the ''word_count.py'' script with a local Spark cluster using one thread:
<file>MASTER=local spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
  * start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the ''word_count.py'' script:
<file>spark-qsub 10 1G spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
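The ''word_count.py'' script itself is not shown on this page; a minimal sketch of such a script (the actual script may differ) could look like:

```python
# word_count.py -- a minimal sketch; the real script referenced above may differ.
# Usage: spark-submit word_count.py <input> <output>
import sys

def tokenize(line):
    # Split one line into lowercase words.
    return line.lower().split()

if __name__ == "__main__" and len(sys.argv) == 3:
    # Imported lazily so the helper above is usable without Spark installed.
    from pyspark import SparkContext

    sc = SparkContext(appName="word_count")
    counts = (sc.textFile(sys.argv[1])     # one RDD element per input line
                .flatMap(tokenize)         # one element per word
                .map(lambda word: (word, 1))
                .reduceByKey(lambda a, b: a + b))
    counts.saveAsTextFile(sys.argv[2])
    sc.stop()
```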

===== Using Virtual Environments =====

If you want to use a specific virtual environment in your Spark job, use
<file>PYSPARK_PYTHON=path_to_python_in_venv [pyspark|spark-submit]</file>
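For example (the venv path below is purely illustrative):

```shell
# Create a virtual environment and point Spark at its interpreter.
# The path ~/venvs/spark-venv is just an example.
python3 -m venv "$HOME/venvs/spark-venv"
# ...then install your job's dependencies with the venv's pip, e.g.:
#   "$HOME/venvs/spark-venv/bin/pip" install <your-packages>

# Run a job with that interpreter (inside a Spark allocation):
#   PYSPARK_PYTHON="$HOME/venvs/spark-venv/bin/python3" spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir
```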