====== Reading Text Files ======

Text files can be read by lines using ''sc.textFile''.
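A minimal example (the input path is the example data used elsewhere on this page; either a single file or a whole directory can be given):
<file python>
# Read the input as an RDD of lines.
lines = sc.textFile("/net/projects/spark-example-data/wiki-cs")
</file>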
The elements of the resulting ''RDD'' are lines of the input file.
  
==== Number of Partitions: Uncompressed File ====
  
If the input file is not compressed, it is split into 32MB chunks, but in at least 2 partitions. The minimum number of partitions (instead of default 2) can be specified as the second argument of ''textFile''.
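
For instance (a sketch; the file name is hypothetical, and the resulting partition count also depends on the file size):
<file python>
# Request at least 3*sc.defaultParallelism partitions instead of the default 2.
lines = sc.textFile("/path/to/uncompressed-file.txt", 3*sc.defaultParallelism)
lines.getNumPartitions()   # at least 3*sc.defaultParallelism
</file>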
  
==== Number of Partitions: Compressed File ====
  
If the input file is compressed, it is always read as 1 partition, as splitting cannot be performed efficiently.
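
For instance (a sketch with a hypothetical gzip-compressed file):
<file python>
# A compressed file cannot be split, so it forms a single partition.
compressed = sc.textFile("/path/to/compressed-file.txt.gz")
compressed.getNumPartitions()   # 1
</file>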
  
==== Number of Partitions: Multiple Files in a Directory ====
  
When the input path is a directory, each file is read into separate partitions. The minimum number of partitions given as the second argument of ''textFile'' applies only to the first file (if it is not compressed). The other files are split into 32MB chunks if uncompressed, or read as 1 partition if compressed.
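
For instance (a sketch using the HamleDT example directory from below, whose files are compressed, so each yields one partition):
<file python>
# Each file in the directory is read into its own partition(s).
conll_dir = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll")
conll_dir.getNumPartitions()   # one partition per compressed file
</file>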
To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
  
For example, to read the compressed HamleDT Czech CoNLL files so that every sentence is one element of the resulting ''RDD'', the following can be used:
<file python>
# paragraphFile reads files by paragraphs (here, one sentence per element).
conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>
===== Reading Whole Text Files =====
  
To read a whole text file, or all text files in a given directory, ''sc.wholeTextFiles'' can be used. Each element of the resulting ''RDD'' is a (filename, content) pair. Compressed files are supported.
  
<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
</file>

By default, every file is read into separate partitions. To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
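
For example, to merge the per-file partitions into fewer ones (a sketch mirroring the ''coalesce'' usage above):
<file python>
# Coalesce the per-file partitions into at most 3*sc.defaultParallelism partitions.
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs").coalesce(3*sc.defaultParallelism)
</file>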
