==== Complete Example using a Simple Perl Tokenizer and Python ====
  
Suppose we want to write a program which uses a Perl tokenizer and then produces token counts.

<file python perl_integration.py>
import sys, os, json
from pyspark import SparkContext

input, output = sys.argv[1], sys.argv[2]
sc = SparkContext()
(sc.textFile(input)
   .map(json.dumps).pipe("env perl tokenize.pl", os.environ).map(json.loads)
   .flatMap(lambda tokens: map(lambda x: (x, 1), tokens))
   .reduceByKey(lambda x,y: x + y)
   .saveAsTextFile(output))
</file>
  
It can be executed using ''spark-submit perl_integration.py input output''.
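
For completeness, here is a minimal sketch of what ''tokenize.pl'' can look like, assuming a trivial whitespace tokenizer and the ''JSON'' module from CPAN (the real tokenizer may of course be more elaborate):

<file perl tokenize.pl>
#!/usr/bin/perl
use strict;
use warnings;
use JSON;

# allow_nonref lets us decode a bare JSON string, which is what
# the Python driver sends for every input line
my $json = JSON->new->utf8(1)->allow_nonref;

while (<>) {
    chomp;
    my $line = $json->decode($_);          # JSON string in ...
    my @tokens = split /\s+/, $line;       # trivial whitespace tokenization
    print $json->encode(\@tokens), "\n";   # ... JSON array of tokens out
}
</file>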
  
===== Using Scala and JSON =====
The same JSON-over-pipe pattern can be used from Scala:

<file scala>
rdd.map(encodeJson).pipe("perl script.pl").map(decodeJson[ProcessedType])
</file>
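
The ''encodeJson'' and ''decodeJson'' helpers used above are assumed, not defined here. A minimal sketch of one possible implementation using the json4s library (the library choice, the signatures and the ''ProcessedType'' case class are only illustrative assumptions):

<file scala>
import org.json4s.NoTypeHints
import org.json4s.jackson.Serialization

implicit val formats = Serialization.formats(NoTypeHints)

// serialize a case class (or any other AnyRef) to a single JSON line
def encodeJson[T <: AnyRef](src: T): String =
  Serialization.write(src)

// parse a JSON line back into the requested type
def decodeJson[T: Manifest](src: String): T =
  Serialization.read[T](src)

// example of a type the Perl script might produce, one JSON object per line
case class ProcessedType(tokens: List[String])
</file>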