Differences
This shows you the differences between two versions of the page.
spark:using-python [2014/11/10 15:32] straka |
spark:using-python [2014/11/11 09:31] straka |
||
---|---|---|---|
Line 14: | Line 14: | ||
==== Usage Examples ==== | ==== Usage Examples ==== | ||
+ | Consider the following simple script, which computes the 10 most frequent words in the Czech Wikipedia: | ||
+ | <file python> | ||
+ | (sc.textFile("/ | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | </ | ||
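The script body above is truncated in this diff, so its exact stages are not recoverable here. As a rough illustration of what a "most frequent words" computation involves, the following is a minimal plain-Python sketch that runs on an in-memory list of lines without Spark. The function name `top_words` and the sample data are hypothetical; the stage comments name the standard RDD methods (`flatMap`, `map`, `reduceByKey`, `sortBy`, `take`) that such a pipeline would typically use, which is an assumption about the original script, not a reconstruction of it.

```python
from collections import defaultdict


def top_words(lines, n=10):
    """Return the n most frequent words in an iterable of text lines.

    Hypothetical helper: mirrors a typical word-count pipeline in plain Python.
    """
    # Analogue of flatMap: split every line into whitespace-separated words.
    words = (word for line in lines for word in line.split())
    # Analogue of map: pair every word with the count 1.
    pairs = ((word, 1) for word in words)
    # Analogue of reduceByKey: sum the counts for each distinct word.
    counts = defaultdict(int)
    for word, one in pairs:
        counts[word] += one
    # Analogue of sortBy: highest count first, ties broken alphabetically
    # so that the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    # Analogue of take: keep only the first n entries.
    return ranked[:n]


if __name__ == "__main__":
    sample = ["a b a", "b a c"]  # made-up input, not Wikipedia data
    print(top_words(sample, 2))
```

In Spark the same stages would run distributed over the partitions of the input file; the sketch only shows the shape of the computation.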
- | * run interactive shell inside '' | + | * run interactive shell using existing Spark cluster (i.e., |
< | < | ||
* run interactive shell with local Spark cluster using one thread: | * run interactive shell with local Spark cluster using one thread: | ||
Line 27: | Line 36: | ||
===== Running Python Spark Scripts ===== | ===== Running Python Spark Scripts ===== | ||
- | Python Spark scripts can be started using '' | + | Python Spark scripts can be started using: |
+ | < | ||
As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], environment variable '' | As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], environment variable '' | ||
Line 44: | Line 54: | ||
from pyspark import SparkContext | from pyspark import SparkContext | ||
+ | |||
sc = SparkContext() | sc = SparkContext() | ||
- | (sc.textFile(input) | + | (sc.textFile(input, 3*sc.defaultParallelism) |
| | ||
| | ||
| | ||
+ | | ||
| | ||
sc.stop() | sc.stop() | ||
</ | </ | ||
- | * run '' | + | * run '' |
- | < | + | < |
* run '' | * run '' | ||
- | < | + | < |
* start Spark cluster (10 machines, 1GB RAM each) on SGE and run '' | * start Spark cluster (10 machines, 1GB RAM each) on SGE and run '' | ||
- | < | + | < |