Differences

This shows you the differences between two versions of the page.

--- spark:recipes:using-perl-via-pipes [2014/11/07 11:12] straka
+++ spark:recipes:using-perl-via-pipes [2014/11/07 14:16] straka
@@ Line 43: / Line 43: @@
 </file>
-==== Complete Example using Simple Perl Tokenizer ====
+==== Complete Example using Simple Perl Tokenizer and Python ====
 Suppose we want to write a program which uses the Perl Tokenizer and then produces token counts.
@@ Line 84: / Line 84: @@
 sc = SparkContext()
 (sc.textFile(input)
-   .map(json.dumps).pipe("perl tokenize.pl", os.environ).map(json.loads)
+   .map(json.dumps).pipe("env perl tokenize.pl", os.environ).map(json.loads)
    .flatMap(lambda tokens: map(lambda x: (x, 1), tokens))
    .reduceByKey(lambda x,y: x + y)
    .saveAsTextFile(output))
+sc.stop()
 </file>
-It can be executed using ''spark-submit perl_integration.py input output''.
+It can be executed using ''spark-submit --files tokenize.pl perl_integration.py input output''. Note that the Perl script has to be added to the list of files used by the job.
 ===== Using Scala and JSON =====
@@ Line 111: / Line 112: @@
 // let rdd be an RDD we want to process, creating ''RDD[ProcessedType]''
-rdd.map(encodeJson).pipe("perl script").map(decodeJson[ProcessedType])
+rdd.map(encodeJson).pipe("perl script.pl").map(decodeJson[ProcessedType])
 </file>
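
The ''map(json.dumps).pipe(...).map(json.loads)'' pattern in the hunks above can be tried locally without a Spark cluster. The following is a minimal sketch, not the page's own code: ''pipe_lines'' and ''token_counts'' are hypothetical helper names, and the inline Python ''TOKENIZER'' script is an assumed stand-in for ''tokenize.pl'' (it simply splits on whitespace). It only illustrates that each element travels as one JSON-encoded line to the external process and comes back as one JSON-encoded line.

```python
import json
import subprocess
import sys
from collections import Counter

# Hypothetical stand-in for tokenize.pl (assumption, not the wiki's script):
# reads one JSON string per line, writes one JSON list of tokens per line.
TOKENIZER = (
    "import sys, json\n"
    "for line in sys.stdin:\n"
    "    print(json.dumps(json.loads(line).split()))\n"
)


def pipe_lines(lines, command):
    """Write each element as one line to the command's stdin and return
    the lines of its stdout, roughly what RDD.pipe does per partition."""
    result = subprocess.run(
        command,
        input="".join(line + "\n" for line in lines),
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.splitlines()


def token_counts(texts):
    # Corresponds to .map(json.dumps): JSON escapes embedded newlines,
    # so every element stays on a single line.
    encoded = [json.dumps(text) for text in texts]
    # Corresponds to .pipe("env perl tokenize.pl", os.environ).
    piped = pipe_lines(encoded, [sys.executable, "-c", TOKENIZER])
    # Corresponds to .map(json.loads).
    decoded = [json.loads(line) for line in piped]
    # Corresponds to flatMap + reduceByKey: count every token.
    return Counter(token for tokens in decoded for token in tokens)


counts = token_counts(["a b a", "b c\nd"])
```

The second input contains a newline on purpose: because the text is JSON-encoded before it is piped, the external process still sees it as a single line, which is the reason for wrapping the pipe in ''json.dumps'' and ''json.loads''.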

Institute of Formal and Applied Linguistics Wiki