
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

spark:recipes:using-perl-via-pipes [2014/11/07 10:59]
straka
spark:recipes:using-perl-via-pipes [2014/11/07 13:17]
straka
Line 34: Line 34:
 On the Python side, the Perl script is used in the following way:
 <file python>
-...
 import json
 import os
+
 ...
+
 # let rdd be an RDD we want to process
 rdd.map(json.dumps).pipe("perl script.pl", os.environ).map(json.loads)
Line 89: Line 90:
 </file>
 
-It can be executed using ''spark-submit perl_integration.py input output''.
+It can be executed using ''spark-submit --files tokenize.pl perl_integration.py input output''.
 
 ===== Using Scala and JSON =====
  
 +The Perl side is the same as in [[#using-python-and-json|Using Python and JSON]].
  
 +The Scala side is a bit more complicated than the Python side, because in Scala ''RDD''s are statically typed. This means that when deserializing JSON, the result type must be specified explicitly. Using JSON serialization libraries is also more verbose, which is why we create wrapper methods for them:
 +<file scala>
 +def encodeJson[T <: AnyRef](src: T): String = {
 +  implicit val formats = org.json4s.jackson.Serialization.formats(org.json4s.NoTypeHints)
 +  return org.json4s.jackson.Serialization.write[T](src)
 +}
 +
 +def decodeJson[T: Manifest](src: String): T = {
 +  implicit val formats = org.json4s.jackson.Serialization.formats(org.json4s.NoTypeHints)
 +  return org.json4s.jackson.Serialization.read[T](src)
 +}
 +
 +...
 +
 +// let rdd be an RDD we want to process, creating RDD[ProcessedType]
 +rdd.map(encodeJson).pipe("perl script.pl").map(decodeJson[ProcessedType])
 +</file>
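
The JSON-over-pipes pattern used on both the Python and Scala sides can be tried locally without Spark. The following is a minimal sketch, not part of the recipe itself: ''pipe_through'' and ''CHILD_SCRIPT'' are hypothetical names, and a small Python child process stands in for the Perl script so that the example runs without Perl installed. It assumes the same protocol as above: one JSON document per line on standard input, one JSON document per line on standard output.

```python
import json
import subprocess
import sys

# Hypothetical stand-in for the Perl script: it reads one JSON-encoded string
# per input line and prints one JSON-encoded list of whitespace-separated
# tokens per output line.
CHILD_SCRIPT = r"""
import json, sys
for line in sys.stdin:
    text = json.loads(line)
    print(json.dumps(text.split()))
"""


def pipe_through(records, command):
    """Local imitation of rdd.map(json.dumps).pipe(command).map(json.loads).

    Unlike Spark, which starts the command once per partition, this sketch
    sends all records to a single child process.
    """
    # json.dumps escapes embedded newlines, so every record stays on one line.
    payload = "".join(json.dumps(record) + "\n" for record in records)
    result = subprocess.run(
        command, input=payload, capture_output=True, text=True, check=True
    )
    return [json.loads(line) for line in result.stdout.splitlines()]


records = ["hello world", "line one\nline two"]
tokens = pipe_through(records, [sys.executable, "-c", CHILD_SCRIPT])
print(tokens)
```

Encoding each record as JSON is what makes the line-based pipe safe: a record containing a newline would otherwise be split into two input lines for the external script.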
