
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


spark:recipes:reading-text-files [2014/11/04 14:11]
straka
spark:recipes:reading-text-files [2016/03/31 22:02] (current)
straka
Line 67: Line 67:
 To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
 
-For example, to read compressed HamleDT Czech CoNLL files, the following can be used:
+For example, to read compressed HamleDT Czech CoNLL files, so that every sentence is one element of the resulting ''RDD'', the following can be used:
 <file python>
 conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
Line 74: Line 74:
 ===== Reading Whole Text Files =====
 
-To read whole text file or whole text files in a given directory, ''sc.wholeTextFiles'' can be used.
-
-Unfortunately, ''sc.wholeTextFiles'' **does not** support compressed files.
+To read whole text file or whole text files in a given directory, ''sc.wholeTextFiles'' can be used. Compressed files are supported.
 
 <file python>
 whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
 </file>
+
+By default, every file is read in separate partitions. To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
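
As an illustration of the revised section, the following is a minimal sketch of combining ''sc.wholeTextFiles'' with ''coalesce''. The helper name ''read_whole_files'' and its ''partitions_per_core'' parameter are hypothetical and not part of the wiki page. The default factor of 3 only mirrors the ''coalesce(3*sc.defaultParallelism)'' call in the first hunk, and ''sc'' is assumed to be an already created ''SparkContext''.

```python
def read_whole_files(sc, path, partitions_per_core=3):
    """Read every file under `path` as one (filename, content) pair.

    Hypothetical helper for illustration; `sc` is assumed to be an
    existing pyspark SparkContext.
    """
    # Each element of the resulting RDD is a (filename, content) tuple.
    files = sc.wholeTextFiles(path)
    # coalesce merges existing partitions without a full shuffle, so it
    # can only lower the partition count; repartition would be needed
    # to raise it.
    return files.coalesce(partitions_per_core * sc.defaultParallelism)
```

With a running context, ''read_whole_files(sc, "/net/projects/spark-example-data/wiki-cs")'' would correspond to the ''whole_wiki'' example above, followed by a ''coalesce'' to three partitions per available core.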
