Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

spark:using-python [2019/11/12 13:37]
straka
spark:using-python [2022/12/14 13:25]
straka [Usage Examples]
Line 19: Line 19:
    .flatMap(lambda line: line.split())
    .map(lambda word: (word, 1))
-   .reduceByKey(lambda c1,c2: c1+c2)
+   .reduceByKey(lambda c1, c2: c1+c2)
    .sortBy(lambda word_count: word_count[1], ascending=False)
    .take(10))
 </file>
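For readers without a cluster at hand, the pipeline in this hunk can be mirrored in plain Python. The following is only a sketch of what the chained `flatMap`, `map`, `reduceByKey`, `sortBy` and `take(10)` calls compute. The function name `top_words` and the sample lines are hypothetical, and words with equal counts may come out in a different order than in Spark.

```python
from collections import Counter


def top_words(lines, n=10):
    """Return the n most frequent whitespace-separated tokens as (word, count) pairs."""
    counts = Counter()
    for line in lines:
        # Mirrors flatMap(line.split()), map((word, 1)) and reduceByKey(c1 + c2).
        counts.update(line.split())
    # Mirrors sortBy(count, ascending=False) followed by take(n).
    return sorted(counts.items(), key=lambda word_count: word_count[1], reverse=True)[:n]


if __name__ == "__main__":
    sample = ["a b a", "b a c"]  # hypothetical input lines
    print(top_words(sample, n=2))
```

Unlike the Spark version, this sketch holds all counts in one process, so it is only suitable for checking the logic on small inputs.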
  
-  * run interactive shell using existing Spark cluster (i.e., inside ''spark-qrsh''), or start local Spark cluster using as many threads as there are cores if there is none:
+  * run interactive shell using existing Spark cluster (i.e., inside ''spark-srun''), or start local Spark cluster using as many threads as there are cores if there is none:
   <file>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>
   * run interactive shell with local Spark cluster using one thread:
   <file>MASTER=local PYSPARK_DRIVER_PYTHON=ipython3 pyspark</file>
-  * start Spark cluster (10 machines, 1GB RAM each) on SGE and run interactive shell:
-  <file>PYSPARK_DRIVER_PYTHON=ipython3 spark-qrsh 10 1G pyspark</file>
+  * start Spark cluster (10 machines, 2GB RAM each) on Slurm and run interactive shell:
+  <file>PYSPARK_DRIVER_PYTHON=ipython3 spark-srun 10 2G pyspark</file>
  
-Note that ''PYSPARK_DRIVER_PYTHON'' variable can be left out or specified in ''.bashrc'' (or similar).
+Note that ''PYSPARK_DRIVER_PYTHON'' variable can be left out or specified in ''.bashrc'' (or other configuration files).
  
  
Line 59: Line 59:
    .flatMap(lambda line: line.split())
    .map(lambda token: (token, 1))
-   .reduceByKey(lambda x,y: x + y)
+   .reduceByKey(lambda x, y: x + y)
    .sortBy(lambda word_count: word_count[1], ascending=False)
    .saveAsTextFile(output))
Line 65: Line 65:
 </file>
  
-  * run ''word_count.py'' script inside existing Spark cluster (i.e., inside ''spark-qsub'' or ''spark-qrsh''), or start local Spark cluster using as many threads as there are cores if there is none:
+  * run ''word_count.py'' script inside existing Spark cluster (i.e., inside ''spark-sbatch'' or ''spark-srun''), or start local Spark cluster using as many threads as there are cores if there is none:
   <file>spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
   * run ''word_count.py'' script with local Spark cluster using one thread:
   <file>MASTER=local spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
-  * start Spark cluster (10 machines, 1GB RAM each) on SGE and run ''word_count.py'' script:
-  <file>spark-qsub 10 1G spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
+  * start Spark cluster (10 machines, 2GB RAM each) using Slurm and run ''word_count.py'' script:
+  <file>spark-sbatch 10 2G spark-submit word_count.py /net/projects/spark-example-data/wiki-cs outdir</file>
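
The commands above leave their results in ''outdir'', which ''saveAsTextFile'' creates as a directory rather than a single file. Assuming PySpark's usual behaviour of writing one `part-*` file per partition, with each `(word, count)` tuple stored as its Python string form, the results could be read back roughly as follows. The helper name `read_word_counts` is hypothetical and not part of the wiki's script.

```python
import ast
import glob
import os


def read_word_counts(outdir):
    """Collect (word, count) tuples from the part-* files of a saveAsTextFile output directory."""
    pairs = []
    # Reading part files in name order should keep the order produced by sortBy.
    for path in sorted(glob.glob(os.path.join(outdir, "part-*"))):
        with open(path, encoding="utf-8") as part:
            for line in part:
                line = line.strip()
                if line:
                    # Assumption: each line looks like ('word', 3).
                    pairs.append(ast.literal_eval(line))
    return pairs


if __name__ == "__main__":
    # "outdir" matches the output argument used in the commands above.
    for word, count in read_word_counts("outdir")[:10]:
        print(word, count)
```

`ast.literal_eval` is used instead of `eval` so that only Python literals are accepted from the files.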
  
 ===== Using Virtual Environments ===== ===== Using Virtual Environments =====