The interactive Python Spark shell can be started with:
<code>pyspark</code>
Better interactive shell with code completion using ''ipython3'' can be started with:
<code>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</code>
As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], the environment variable ''MASTER'' specifies which Spark master to use (and whether to start a local one).
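For illustration, the usual values are shown below; the ''spark://'' address is a hypothetical example of an already running master:
<code>MASTER=local pyspark                    # local cluster with a single thread
MASTER=local[*] pyspark                 # local cluster with one thread per core
MASTER=spark://10.10.25.1:7077 pyspark  # connect to an existing master</code>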
==== Usage Examples ====
Consider the following simple script computing the 10 most frequent words of Czech Wikipedia (the dataset path below is a placeholder, the original being elided):
<file python>
(sc.textFile("/path/to/wiki-cs")                               # plain-text dump; placeholder path
   .flatMap(lambda line: line.split())                         # split every line into words
   .map(lambda word: (word, 1))                                # count each occurrence once
   .reduceByKey(lambda c1, c2: c1 + c2)                        # sum the counts of every word
   .sortBy(lambda word_count: word_count[1], ascending=False)  # order by decreasing frequency
   .take(10))                                                  # retrieve the 10 most frequent
</file>
  * run interactive shell using an existing Spark cluster (i.e., inside ''spark-srun'', or with ''MASTER'' already set):
<code>PYSPARK_DRIVER_PYTHON=ipython3 pyspark</code>
  * run interactive shell with local Spark cluster using one thread:
<code>MASTER=local PYSPARK_DRIVER_PYTHON=ipython3 pyspark</code>
  * start Spark cluster (10 machines, 2GB RAM each) using Slurm and run interactive shell inside it:
<code>PYSPARK_DRIVER_PYTHON=ipython3 spark-srun 10 2G pyspark</code>
Note that ''PYSPARK_DRIVER_PYTHON=ipython3'' can be left out if the plain Python shell is sufficient.
===== Running Python Spark Scripts =====

Python Spark scripts can be started using:
<code>spark-submit script.py [arguments...]</code>
| + | |||
| + | As described in [[running-spark-on-single-machine-or-on-cluster|Running Spark on Single Machine or on Cluster]], environmental variable '' | ||
| + | |||
==== Usage Examples ====
Consider the following simple word-count script ''word_count.py'':
<file python>
#!/usr/bin/python3

import sys
if len(sys.argv) < 3:
    print("Usage: %s input output" % sys.argv[0], file=sys.stderr)
    exit(1)
input = sys.argv[1]
output = sys.argv[2]

from pyspark import SparkContext

sc = SparkContext()
(sc.textFile(input, 3 * sc.defaultParallelism)                 # partition count is a guess, original elided
   .flatMap(lambda line: line.split())                         # split every line into words
   .map(lambda word: (word, 1))                                # count each occurrence once
   .reduceByKey(lambda c1, c2: c1 + c2)                        # sum the counts of every word
   .sortBy(lambda word_count: word_count[1], ascending=False)  # order by decreasing frequency
   .saveAsTextFile(output))                                    # write the result to the output path
sc.stop()
</file>
| + | |||
| + | * run '' | ||
| + | < | ||
  * run ''word_count.py'' with local Spark cluster using one thread:
<code>MASTER=local spark-submit word_count.py input output</code>
  * start Spark cluster (10 machines, 2GB RAM each) using Slurm and run ''word_count.py'' inside it:
<code>spark-sbatch 10 2G spark-submit word_count.py input output</code>
| + | |||
| + | ===== Using Virtual Environments ===== | ||
| + | |||
| + | If you want to use specific virtual environment in your Spark job, use | ||
| + | < | ||
