Using Perl via Pipes

Perl can be used to process RDD elements using pipes. Although this makes Perl libraries for tokenization, parsing, etc. available, it provides only limited Perl integration in Spark. Notably:

A driver program in Python (or Scala or Java) still has to exist.
Perl programs can operate only on individual RDD elements, so more complex operations (reduceByKey, union, join, sortBy, i.e., operations that order or combine multiple elements) can be implemented only in the driver program.
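
The division of labour described above can be sketched as follows. This is only an illustration under stated assumptions: `tokenize.pl` is a hypothetical Perl script that reads text lines from standard input and prints one token per output line, and `count_tokens` is a name made up for this example. The Perl script handles each element on its own, while the aggregation across elements (`reduceByKey`) stays in the Python driver.

```python
def count_tokens(lines_rdd, command="perl tokenize.pl"):
    """Count tokens produced by an external Perl tokenizer.

    `tokenize.pl` is a hypothetical script: it is assumed to read text
    lines from stdin and to print one token per line to stdout.
    """
    # Per-element work is delegated to Perl through the pipe.
    tokens = lines_rdd.pipe(command)
    # Combining multiple elements has to happen in the driver program.
    return tokens.map(lambda token: (token, 1)).reduceByKey(lambda a, b: a + b)
```

With a real SparkContext `sc`, this might be called as `count_tokens(sc.textFile("input.txt"))`, where the input path is again only a placeholder.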

Still, this functionality is useful when libraries available only in Perl have to be used.

Here we show how data can be passed from Python to Perl and back using the JSON format, which preserves data types: RDD elements can be strings, numbers and arrays (note that Perl has no native tuples).

Using Python and JSON

Using the JSON format, we can easily serialize and deserialize the data we want to pass from Python to Perl and back. JSON is used because:

It allows serializing numbers, strings and arrays.
The serialized JSON string contains no newlines, which fits the line-oriented Spark piping.
Libraries for JSON serialization/deserialization are available in both languages.
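
A minimal sketch of the Python side of such a round trip is shown below. It assumes a hypothetical Perl script `process.pl` that reads one JSON value per line from standard input and prints one JSON value per line to standard output; the helper names are also made up for this example. Only the standard `json` module and the RDD methods `map` and `pipe` are used.

```python
import json


def to_json_line(element):
    """Serialize one RDD element to a single line of JSON."""
    # json.dumps escapes newlines inside strings, so the result is one line.
    return json.dumps(element)


def from_json_line(line):
    """Deserialize one line printed by the Perl script into a Python value."""
    return json.loads(line)


def pipe_through_perl(rdd, command="perl process.pl"):
    """Send every element of `rdd` through an external Perl script.

    `process.pl` is a hypothetical script, assumed to read one JSON value
    per line from stdin and to print one JSON value per line to stdout.
    """
    return rdd.map(to_json_line).pipe(command).map(from_json_line)
```

Note that JSON has no tuple type either, so a Python tuple sent through this pipeline comes back as a list.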

Using Scala and Java

Scala and Java can be used in a similar way to Python to communicate with Perl scripts via pipes. Nevertheless, available JSON libraries

Institute of Formal and Applied Linguistics Wiki
