Using Python

To use Spark from Python, the environment has to be set up as described in Using Spark in UFAL Environment.

Starting Interactive Shell

The interactive shell can be started using:

pyspark

A better interactive shell with code completion, based on ipython (installed everywhere on the cluster; ask our IT if you want it installed on your workstation too), can be started using:

IPYTHON=1 pyspark

As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER specifies which Spark master to use (or whether to start a local one).

Usage Examples

Consider the following simple script computing the 10 most frequent words of Czech Wikipedia:

(sc.textFile("/net/projects/spark-example-data/wiki-cs")
   .flatMap(lambda line: line.split())
   .map(lambda word: (word, 1))
   .reduceByKey(lambda c1,c2: c1+c2)
   .sortBy(lambda word_count: word_count[1], ascending=False)
   .take(10))
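
To make the individual steps concrete, here is a minimal pure-Python sketch of the same computation on a tiny in-memory sample. It does not use Spark; the sample lines and the top_words helper are illustrative assumptions, and each statement mirrors one stage of the pipeline above.

```python
from collections import defaultdict


def top_words(lines, n=10):
    """Return the n most frequent whitespace-separated words in lines."""
    # flatMap: split every line and flatten the results into one word list
    words = [word for line in lines for word in line.split()]
    # map: pair each word with an initial count of 1
    pairs = [(word, 1) for word in words]
    # reduceByKey: sum the counts of identical words
    counts = defaultdict(int)
    for word, count in pairs:
        counts[word] += count
    # sortBy(..., ascending=False): order by count, highest first
    ranked = sorted(counts.items(), key=lambda pair: pair[1], reverse=True)
    # take(n): keep only the first n results
    return ranked[:n]


sample = ["a b a", "c a b"]  # hypothetical stand-in for the Wikipedia lines
print(top_words(sample, 2))  # prints [('a', 3), ('b', 2)]
```

Unlike this sketch, the Spark version distributes each stage over the workers, so only the final 10 pairs are brought back to the shell.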

Run the interactive shell inside spark-qrsh, or start a local Spark cluster using as many threads as there are cores:

IPYTHON=1 pyspark

Run the interactive shell with a local Spark cluster using one thread:

MASTER=local IPYTHON=1 pyspark

Start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the interactive shell:

IPYTHON=1 spark-qrsh 10 1G pyspark

Note that the IPYTHON variable can be left out or set in .bashrc (or similar).

Running Python Spark Scripts

Python Spark scripts can be started using:

spark-submit

As described in Running Spark on Single Machine or on Cluster, the environment variable MASTER specifies which Spark master to use (or whether to start a local one).
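
A script can inspect the same variable to report which master it is about to use. The following is a minimal sketch; the helper name describe_master is hypothetical, and treating an unset MASTER as local[*] (Spark's master URL for a local run with one thread per core) is an assumption based on the behaviour described on this page, not a documented guarantee of the UFAL wrappers.

```python
import os


def describe_master(environ=None):
    """Return the Spark master URL implied by the MASTER variable."""
    if environ is None:
        environ = os.environ
    # Assumption: an unset or empty MASTER means a local run on all cores.
    return environ.get("MASTER") or "local[*]"


print(describe_master({"MASTER": "local"}))  # single-threaded local master
print(describe_master({}))                   # assumed default
```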

Usage Examples

Consider the following simple word-count script word_count.py:

#!/usr/bin/python

import sys

from pyspark import SparkContext

# Require the input path and the output path on the command line.
if len(sys.argv) < 3:
    sys.stderr.write("Usage: %s input output\n" % sys.argv[0])
    sys.exit(1)
input_path = sys.argv[1]
output_path = sys.argv[2]

sc = SparkContext()
(sc.textFile(input_path)
   .flatMap(lambda line: line.split())              # split lines into tokens
   .map(lambda token: (token, 1))                   # pair each token with a count of 1
   .reduceByKey(lambda x, y: x + y)                 # sum the counts per token
   .sortBy(lambda pair: pair[1], ascending=False)   # most frequent tokens first
   .saveAsTextFile(output_path))
sc.stop()

Run the word_count.py script inside spark-qsub or spark-qrsh, or start a local Spark cluster using as many threads as there are cores:

spark-submit word_count.py input output

Run the word_count.py script with a local Spark cluster using one thread:

MASTER=local spark-submit word_count.py input output

Start a Spark cluster (10 machines, 1GB RAM each) on SGE and run the word_count.py script:

spark-qsub 10 1G spark-submit word_count.py input output
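
Because saveAsTextFile writes a directory of part-* files rather than a single file, with one element per line in its Python string form, the results may need to be read back for inspection. The sketch below is one way to do that, assuming the output directory is on a locally accessible filesystem, is UTF-8 encoded, and holds the (word, count) tuples written by word_count.py; the helper name read_counts is hypothetical.

```python
import ast
import glob
import os


def read_counts(output_dir):
    """Load (word, count) tuples from a saveAsTextFile output directory."""
    counts = []
    # Each partition is stored as a separate part-NNNNN file.
    for path in sorted(glob.glob(os.path.join(output_dir, "part-*"))):
        with open(path, encoding="utf-8") as part:
            for line in part:
                line = line.strip()
                if line:
                    # Each line is the string form of a tuple, e.g. ('word', 3).
                    counts.append(ast.literal_eval(line))
    return counts
```

Since the script sorts by descending count before saving, read_counts("output")[:10] should give the ten most frequent tokens, where "output" stands for whatever output path was passed to the script.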

Institute of Formal and Applied Linguistics Wiki
