  
==== Usage Examples ====
Consider the following simple script, which computes the 10 most frequent words of the Czech Wikipedia:
<file python>
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(lambda line: line.split())                         # split lines into words
   .map(lambda word: (word, 1))                                # map each word to a (word, 1) pair
   .reduceByKey(lambda c1, c2: c1 + c2)                        # sum the counts of every word
   .sortBy(lambda word_count: word_count[1], ascending=False)  # sort by count, descending
   .take(10))                                                  # return the first 10 (word, count) pairs
</file>
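The full sort can be avoided when only a few top elements are needed; as a sketch, the standard ''takeOrdered'' action computes the same result without sorting the whole RDD:
<file python>
(sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
   .flatMap(lambda line: line.split())
   .map(lambda word: (word, 1))
   .reduceByKey(lambda c1, c2: c1 + c2)
   .takeOrdered(10, key=lambda word_count: -word_count[1]))  # top 10 pairs by count
</file>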
  
  * run an interactive shell using an existing Spark cluster (i.e., inside ''spark-qrsh''), or start a local Spark cluster with as many threads as there are cores if there is none:
  <file>IPYTHON=1 pyspark</file>
  * run an interactive shell with a local Spark cluster using one thread:
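  <file>MASTER=local IPYTHON=1 pyspark</file>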
===== Running Python Spark Scripts =====
  
Python Spark scripts can be started using:
<file>spark-submit</file>
  
As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (or whether to start a local one).
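For illustration, ''MASTER'' follows the standard Spark conventions; the ''spark://'' URL below is a placeholder for a real cluster address:
<file>
MASTER=local              # local cluster with one thread
MASTER=local[4]           # local cluster with 4 threads
MASTER=local[*]           # local cluster with as many threads as there are cores
MASTER=spark://host:7077  # existing Spark cluster (placeholder address)
</file>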
  
Example ''word_count.py'' script:
<file python>
import sys

# the input and output paths are given as command-line arguments
input = sys.argv[1]
output = sys.argv[2]

from pyspark import SparkContext

sc = SparkContext()
(sc.textFile(input, 3*sc.defaultParallelism)
   .flatMap(lambda line: line.split())                         # split lines into words
   .map(lambda token: (token, 1))                              # map each word to a (token, 1) pair
   .reduceByKey(lambda x, y: x + y)                            # sum the counts of every word
   .sortBy(lambda word_count: word_count[1], ascending=False)  # sort by count, descending
   .saveAsTextFile(output))                                    # write the result
sc.stop()
</file>
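Note that ''saveAsTextFile'' writes the result as a directory of partition files rather than a single file; as a sketch, the output directory contains the usual Hadoop-style part files:
<file>
ls output/
# _SUCCESS  part-00000  part-00001  ...
</file>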
  
  * run the ''word_count.py'' script inside an existing Spark cluster (i.e., inside ''spark-qsub'' or ''spark-qrsh''), or start a local Spark cluster with as many threads as there are cores if there is none:
  <file>spark-submit word_count.py input output</file>
  * run the ''word_count.py'' script with a local Spark cluster using one thread:
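  <file>MASTER=local spark-submit word_count.py input output</file>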
  * start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the ''word_count.py'' script:
  <file>spark-qsub 10 1G spark-submit word_count.py input output</file>