Institute of Formal and Applied Linguistics Wiki

spark:recipes:reading-text-files
<file python>
conll_lines = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>
  
===== Reading Text Files by Paragraphs =====
  
Although ''sc'' offers no method for reading files by paragraphs (blocks of lines separated by an empty line), one is easy to write.
Python version:
<file python>
def paragraphFile(sc, path):
    # Use TextInputFormat with a custom record delimiter, so that every record
    # is a whole paragraph; the final map keeps only the record value (the
    # paragraph text), dropping the byte-offset key.
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>

Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
    // Use TextInputFormat with a custom record delimiter, so that every record
    // is a whole paragraph; the final map keeps only the record value (the
    // paragraph text), dropping the byte-offset key.
    val conf = new org.apache.hadoop.conf.Configuration()
    conf.set("textinputformat.record.delimiter", "\n\n")
    sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf).map(_._2.toString)
}
</file>

Compressed files are supported; because a compressed file cannot be split, each one is read into a single partition. Uncompressed files are split into 32 MB chunks.

To control the number of partitions, ''repartition'' or ''coalesce'' can be used.

For example, to read the compressed HamleDT Czech CoNLL files so that every sentence becomes one element of the resulting ''RDD'', the following can be used:
<file python>
conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>
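
As a quick check (an illustrative sketch, not part of the original recipe), the resulting partitioning can be inspected with ''getNumPartitions'', and each element can be processed as one whole sentence block, e.g. to count tokens per sentence:
<file python>
# illustrative only: conlls comes from the previous example
print(conlls.getNumPartitions())   # 3 * sc.defaultParallelism

# each element is one CoNLL sentence block; count its non-empty lines
token_counts = conlls.map(lambda sentence: len([line for line in sentence.split("\n") if line]))
print(token_counts.take(5))
</file>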
  
===== Reading Whole Text Files =====
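''sc.wholeTextFiles'' reads a directory of text files, returning an ''RDD'' of ''(filename, content)'' pairs, with every file read as a single element: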
  
<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
</file>
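
For illustration (a sketch with invented variable names), the file names and document sizes can be listed as follows:
<file python>
# each element is a (filename, content) pair
sizes = whole_wiki.map(lambda file_content: (file_content[0], len(file_content[1])))
print(sizes.take(3))
</file>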
