spark:using-python: revision 2017/10/16 20:58 (ufal), updated 2019/11/12 13:37 (straka)
<file>pyspark</file>

A better interactive shell with code completion using ''ipython3'' (run ''pip3 install --user ipython'' if you do not have it) can be started using:
<file>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>

As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).
   .map(lambda word: (word, 1))
   .reduceByKey(lambda c1, c2: c1 + c2)
   .sortBy(lambda word_count: word_count[1], ascending=False)
   .take(10))
</file>

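The ''sortBy'' key uses indexing rather than ''lambda (word, count): count'' because Python 3 removed tuple parameter unpacking in lambdas (PEP 3113), so the old form is a syntax error under ''python3''. The aggregation logic can be checked in plain Python without Spark; the sample word list below is hypothetical:

```python
# Plain-Python sketch of the word-count pipeline above (sample data is made up).
words = ["spark", "python", "spark", "shell", "spark", "python"]

# .map(lambda word: (word, 1)) followed by .reduceByKey(lambda c1, c2: c1 + c2)
counts = {}
for word in words:
    counts[word] = counts.get(word, 0) + 1

# .sortBy(lambda word_count: word_count[1], ascending=False)
top = sorted(counts.items(), key=lambda word_count: word_count[1], reverse=True)

print(top[:10])  # .take(10)
```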
  * run the interactive shell using an existing Spark cluster (i.e., inside ''spark-qrsh''), or start a local Spark cluster using as many threads as there are cores if none exists:
<file>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>
  * run the interactive shell with a local Spark cluster using one thread:
<file>MASTER=local PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>
  * start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the interactive shell:
<file>PYSPARK_DRIVER_PYTHON=ipython3 spark-qrsh 10 1G pyspark</file>

Note that the ''PYSPARK_DRIVER_PYTHON'' variable can be left out or specified in ''.bashrc'' (or similar).
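A minimal ''.bashrc'' fragment for this, assuming ''ipython3'' is on your ''PATH'', could be:

```shell
# In ~/.bashrc: make pyspark always use ipython3 as the driver shell
export PYSPARK_DRIVER_PYTHON=ipython3
```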
   .map(lambda token: (token, 1))
   .reduceByKey(lambda x, y: x + y)
   .sortBy(lambda word_count: word_count[1], ascending=False)
   .saveAsTextFile(output))
sc.stop()