Institute of Formal and Applied Linguistics Wiki

spark:recipes:reading-text-files
<file python>
conll_lines = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>
  
===== Reading Text Files by Paragraphs =====
  
Although ''sc'' offers no method for reading files by paragraphs (blocks of lines separated by an empty line), one is easy to write.
Python version:
<file python>
def paragraphFile(sc, path):
    # Use TextInputFormat with a custom record delimiter, so that every record
    # is a whole paragraph; the final map keeps only the record value (the
    # paragraph text), dropping the byte-offset key.
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>

Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
    // Use TextInputFormat with a custom record delimiter, so that every record
    // is a whole paragraph; the final map keeps only the record value (the
    // paragraph text), dropping the byte-offset key.
    val conf = new org.apache.hadoop.conf.Configuration()
    conf.set("textinputformat.record.delimiter", "\n\n")
    sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf).map(_._2.toString)
}
</file>

Compressed files are supported; because a compressed file cannot be split, each one is read into a single partition. Uncompressed files are split into 32 MB chunks.

To control the number of partitions, ''repartition'' or ''coalesce'' can be used.

For example, to read the compressed HamleDT Czech CoNLL files so that every sentence becomes one element of the resulting ''RDD'', the following can be used:
<file python>
conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>
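
As a quick check (an illustrative sketch, not part of the original recipe), the resulting partitioning can be inspected with ''getNumPartitions'', and each element can be processed as one whole sentence block, e.g. to count tokens per sentence:
<file python>
# illustrative only: conlls comes from the previous example
print(conlls.getNumPartitions())   # 3 * sc.defaultParallelism

# each element is one CoNLL sentence block; count its non-empty lines
token_counts = conlls.map(lambda sentence: len([line for line in sentence.split("\n") if line]))
print(token_counts.take(5))
</file>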
  
===== Reading Whole Text Files =====
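''sc.wholeTextFiles'' reads a directory of text files, returning an ''RDD'' of ''(filename, content)'' pairs, with every file read as a single element: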
  
<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
</file>
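
For illustration (a sketch with invented variable names), the file names and document sizes can be listed as follows:
<file python>
# each element is a (filename, content) pair
sizes = whole_wiki.map(lambda file_content: (file_content[0], len(file_content[1])))
print(sizes.take(3))
</file>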
