Institute of Formal and Applied Linguistics Wiki

spark:recipes:reading-text-files
<file python>
conll_lines = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>
  
===== Reading Text Files by Paragraphs =====
  
Although ''sc'' has no method for reading files by paragraphs, one can be written easily.
Python version:
<file python>
def paragraphFile(sc, path):
    # Use TextInputFormat with a blank line as the record delimiter, so every
    # record is one paragraph; keep only the text value (drop the offset key).
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
            "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
            conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>
  
Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
    // Use TextInputFormat with a blank line as the record delimiter, so every
    // record is one paragraph; keep only the text value (drop the offset key).
    val conf = new org.apache.hadoop.conf.Configuration()
    conf.set("textinputformat.record.delimiter", "\n\n")
    sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf).map(_._2.toString)
}
</file>
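As a quick check of the Python version (a minimal sketch; the HamleDT path is the example data used below), each element of the resulting ''RDD'' is one paragraph, i.e. one CoNLL sentence block:
<file python>
paragraphs = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll")
print(paragraphs.first())   # first paragraph, without the blank separator line
</file>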
  
Compressed files are supported, and each compressed file is read into a single partition. Uncompressed files are split into 32MB chunks.
  
To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
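A sketch of the difference (the partition counts here are illustrative): ''coalesce'' only merges existing partitions and avoids a shuffle, while ''repartition'' performs a full shuffle into the requested number of partitions.
<file python>
paragraphs = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll")
print(paragraphs.getNumPartitions())    # one partition per compressed file
merged = paragraphs.coalesce(4)         # at most 4 partitions, no shuffle
balanced = paragraphs.repartition(8)    # full shuffle into 8 partitions
</file>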
  
For example, to read the compressed HamleDT Czech CoNLL files so that every sentence is one element of the resulting ''RDD'', the following can be used:
<file python>
conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>

===== Reading Whole Text Files =====

To read whole text files as single elements, either one file or all files in a given directory, ''sc.wholeTextFiles'' can be used. It returns an ''RDD'' of (filename, content) pairs. Compressed files are supported.

<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
</file>

By default, every file is read into a separate partition. To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
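For example (a sketch mirroring the recipe above), to redistribute the files across the default parallelism:
<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs").repartition(3*sc.defaultParallelism)
</file>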
