
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

spark:recipes:using-perl-via-pipes [2014/11/07 10:38]
straka
spark:recipes:using-perl-via-pipes [2014/11/07 10:59]
straka
Line 6: Line 6:
 Still, this functionality is useful when libraries available only in Perl have to be used.
  
-Here we show how data can be passed from Python to Perl and back using JSON format, which allows preserving data types -- ''RDD'' elements can be strings, numbers and array (note that Perl has no native tuples).
 +Here we show how data can be passed from Python/Scala to Perl and back using the JSON format, which preserves data types -- ''RDD'' elements can be strings, numbers and arrays (note that Perl has no native tuples). The JSON format has the following advantages:
 +  - It allows serializing numbers, strings and arrays. 
 +  - The serialized JSON string contains no newlines, which fits the line-oriented Spark piping. 
 +  - Libraries for JSON serialization/deserialization are available.
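
The properties listed above can be checked on the Python side with the standard ''json'' module alone. The following is a minimal sketch; the sample record is made up for illustration. It shows that a newline inside a string is escaped, so the serialized record stays on one line, and that a Python tuple comes back as a list.

```python
import json

# Hypothetical sample record: a tuple holding a string with an embedded
# newline, an integer and a nested list.
record = ("a\nb", 3, [1.5, "x"])

# Serialize to a single line of JSON; the newline is escaped as \n.
line = json.dumps(record)

# Deserialize again; the outer tuple is restored as a list.
restored = json.loads(line)

print(line)
print(restored)
```

Because JSON has no tuple type, code that receives the data back from Perl should expect lists wherever tuples were sent.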
  
 ===== Using Python and JSON =====
  
-Using JSON format, we can easily serialize and deserialize the data we want to pass from Python to Perl and back. JSON format is used because:
-  - It allows serializing numbers, strings and arrays.
-  - The serialized JSON string contains no newlines, which fits the line-oriented Spark piping.
-  - Libraries for JSON serialization/deserialization are available in both languages.
 +We start with the Perl script, which reads lines of JSON from stdin, decodes them, processes them and optionally produces output:
 +<file perl>
 +#!/usr/bin/perl
 +use warnings;
 +use strict; 
 + 
 +use JSON; 
 +my $json = JSON->new->utf8(1)->allow_nonref(1); 
 + 
 +while (<>) { 
 +  my $data = $json->decode($_); 
 + 
 +  # process the data, which can be string, int or an array 
 + 
 +  # for every $output, which can be string, int or an array ref: 
 +  # print $json->encode($output) . "\n"; 
 +}
 +</file> 
 + 
 +On the Python side, the Perl script is used in the following way: 
 +<file python> 
 +... 
 +import json 
 +import os 
 +... 
 +# let rdd be an RDD we want to process 
 +rdd.map(json.dumps).pipe("perl script.pl", os.environ).map(json.loads) 
 +</file> 
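
The line-oriented exchange can also be tried locally without Spark. The sketch below is only an approximation of what ''pipe'' does: it writes one JSON line per record to the stdin of a child process and decodes each stdout line. The helper ''pipe_json'' and the inline child script are hypothetical names introduced for illustration; the child is a Python stand-in for ''perl script.pl'', so the sketch runs without Perl installed.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for "perl script.pl": decodes each JSON line and
# emits the record together with its length, again as one JSON line.
CHILD = r"""
import json, sys
for line in sys.stdin:
    data = json.loads(line)
    print(json.dumps([data, len(data)]))
"""

def pipe_json(records, command):
    """Send records as JSON lines to command and decode its output lines."""
    payload = "".join(json.dumps(r) + "\n" for r in records)
    result = subprocess.run(command, input=payload, capture_output=True,
                            text=True, check=True)
    return [json.loads(line) for line in result.stdout.splitlines()]

results = pipe_json(["ab", [1, 2, 3]], [sys.executable, "-c", CHILD])
print(results)
```

Replacing the command with ''["perl", "script.pl"]'' should exercise the real script in the same way, assuming Perl and its JSON module are installed.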
 + 
 +==== Complete Example using Simple Perl Tokenizer ==== 
 + 
 +Suppose we want to write a program which uses a Perl tokenizer and then produces token counts.
 + 
 +File ''tokenize.pl'' implements a trivial tokenizer (for every input record, it produces one output record for each sentence found, and each output record is an array of tokens):
 +<file perl> 
 +#!/usr/bin/perl 
 +use warnings; 
 +use strict; 
 + 
 +use JSON; 
 +my $json = JSON->new->utf8(1)->allow_nonref(1); 
 + 
 +while (<>) { 
 +  my $data = $json->decode($_); 
 + 
 +  foreach my $sentence (split(/\s*[.?!]\s*/, $data)) {
 +    my @tokens = split(/\s+/, $sentence);
 +
 +    print $json->encode(\@tokens) . "\n";
 +  }
 +}
 +</file> 
 + 
 +File ''perl_integration.py'', which is given input and output paths, uses the ''tokenize.pl'' script from the current directory and produces token counts:
 +<file python> 
 +#!/usr/bin/python 
 + 
 +import sys 
 +if len(sys.argv) < 3: 
 +    sys.stderr.write("Usage: %s input output\n" % sys.argv[0])
 +    sys.exit(1)
 +input = sys.argv[1] 
 +output = sys.argv[2] 
 + 
 +import json 
 +import os 
 +from pyspark import SparkContext 
 + 
 +sc = SparkContext() 
 +(sc.textFile(input) 
 +   .map(json.dumps).pipe("perl tokenize.pl", os.environ).map(json.loads) 
 +   .flatMap(lambda tokens: map(lambda x: (x, 1), tokens)) 
 +   .reduceByKey(lambda x,y: x + y) 
 +   .saveAsTextFile(output)) 
 +</file> 
 + 
 +It can be executed using ''spark-submit perl_integration.py input output''.
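
To sanity-check the counts on a small input, the same computation can be approximated in plain Python without Spark or Perl. This is a sketch only: ''tokenize'' and ''token_counts'' are hypothetical helpers that mimic ''tokenize.pl'' with Python regular expressions, and the sample records are made up. Perl's ''split'' drops trailing empty fields while ''re.split'' keeps them, so the sketch removes them explicitly.

```python
import re
from collections import Counter

def split_like_perl(pattern, text):
    """Split text on pattern and drop trailing empty fields, as Perl does."""
    fields = re.split(pattern, text)
    while fields and fields[-1] == "":
        fields.pop()
    return fields

def tokenize(record):
    """Approximate tokenize.pl: return one token list per sentence."""
    sentences = split_like_perl(r"\s*[.?!]\s*", record)
    return [split_like_perl(r"\s+", sentence) for sentence in sentences]

def token_counts(records):
    """Count tokens over all records, like the flatMap/reduceByKey steps."""
    counts = Counter()
    for record in records:
        for tokens in tokenize(record):
            counts.update(tokens)
    return dict(counts)

sample = ["Hello world. How are you?", "Hello again!"]
counts = token_counts(sample)
print(counts)
```

The result can be compared with the files written by ''saveAsTextFile'' for the same input; differences may still arise for unusual whitespace or Unicode input, since the two regular expression engines are not identical.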
  
-===== Using Scala and Java =====
 +===== Using Scala and JSON =====
  
-Scala and Java can be used in similar way as Python to communicate with Perl scripts via pipes. Nevertheless, available JSON libraries 
  
