==== Number of Partitions: Compressed File ====
If the input file is compressed with a non-splittable codec like ''gzip'', it is always read as 1 partition, because the compressed stream can only be decompressed sequentially from the beginning. On the other hand, files compressed with the splittable **''bzip2''** codec can be read into multiple partitions directly. To create multiple partitions from a non-splittable file, ''repartition'' can be used:
<file python>
lines = sc.textFile(compressed_file).repartition(3*sc.defaultParallelism)
</file>
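
The number of partitions of an ''RDD'' can be inspected with ''getNumPartitions'', so the effect of compression and of ''repartition'' is easy to verify (a minimal sketch; the input file name is hypothetical):
<file python>
lines = sc.textFile("input.txt.gz")   # gzip is not splittable, so the file is read as 1 partition
print(lines.getNumPartitions())       # 1

lines = lines.repartition(3 * sc.defaultParallelism)
print(lines.getNumPartitions())       # 3 * sc.defaultParallelism
</file>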

==== Number of Partitions: Multiple Files in a Directory ====
When the input file is a directory, each file is read in separate partitions, so the resulting ''RDD'' has at least as many partitions as there are files; the minimum number of partitions can still be given as the second argument of ''textFile''.

Note that when there are many files (thousands or more, as for example in ''/...''), the number of partitions of the resulting ''RDD'' is correspondingly large:
<file python>
conll_lines = sc.textFile("/...")
</file>
| + | |||

===== Reading Text Files by Paragraphs =====
Although there is no method of ''sc'' which reads files by paragraphs, it can be implemented easily: the Hadoop ''TextInputFormat'' allows changing the record delimiter through the ''textinputformat.record.delimiter'' configuration option, so setting the delimiter to an empty line makes it return paragraphs instead of lines.

Python version:
<file python>
def paragraphFile(sc, path):
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                               "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
                               conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>
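
A quick usage sketch (the input file name is hypothetical):
<file python>
paragraphs = paragraphFile(sc, "input.txt")   # RDD with one element per paragraph
print(paragraphs.first())                     # the first paragraph of the file
</file>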

Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
    val conf = new org.apache.hadoop.conf.Configuration()
    conf.set("textinputformat.record.delimiter", "\n\n")
    return sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf).map(_._2.toString)
}
</file>

Compressed files are supported by this method too, but as before, a compressed file is decompressed into a single partition. To control the number of partitions, ''repartition'' can be used.

For example, to read compressed HamleDT Czech CoNLL files so that every sentence is one element of the resulting ''RDD'':
<file python>
conlls = paragraphFile(sc, "/...").repartition(3*sc.defaultParallelism)
</file>
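
Every element of ''conlls'' is now one sentence in CoNLL format, so the sentences can, for example, be counted directly (a trivial usage sketch):
<file python>
print(conlls.count())   # number of sentences in the input
</file>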
| + | |||
| + | ===== Reading Whole Text Files ===== | ||
| + | |||

To read a whole text file, or all text files in a given directory, with every file becoming a single element, ''wholeTextFiles'' can be used; it returns an ''RDD'' of (file name, file content) pairs.

<file python>
whole_wiki = sc.wholeTextFiles("/...")
</file>
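
Because the elements are (file name, file content) pairs, per-file processing is straightforward (a minimal sketch):
<file python>
# Compute the length of every file, keeping the file names as keys.
file_lengths = whole_wiki.mapValues(len)
</file>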
| + | |||
| + | By default, every file is read in separate partitions. To control the number of partitions, '' | ||
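
For example (a sketch with a hypothetical directory path and partition count):
<file python>
# Suggest at least 32 partitions for the whole directory.
whole = sc.wholeTextFiles("/path/to/directory", 32)
</file>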
