Differences

This shows you the differences between two versions of the page.

--- spark:recipes:reading-text-files [2014/11/04 14:11] straka
+++ spark:recipes:reading-text-files [2025/10/15 20:13] (current) straka [Number of Partitions: Multiple Files in a Directory]
@@ Line 25: / Line 25: @@
 ==== Number of Partitions: Compressed File ====
-If the input file is compressed, it is always read as 1 partition, as splitting cannot be performed efficiently.
+If the input file is compressed with **''gzip''** or **''zip''**, it is always read as 1 partition, because these formats do not allow decompression to start in the middle of the data. Even a very large file is therefore read sequentially by a single task, so avoid large files compressed in these formats.
-To create multiple partitions, ''repartition'' can be used in the following way:
+In contrast, files compressed with **''bzip2''** can be **split effectively** (bzip2 compresses blocks of at most 900k independently), which allows parallel processing of very large files.
-<file python>
-lines = sc.textFile(compressed_file).repartition(3*sc.defaultParallelism)
-</file>
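The format rule above can be sketched as a small helper that repartitions only when the format forces a single partition. This is an illustration under assumptions, not the page's recommended recipe: ''read_lines'' and ''SEQUENTIAL_SUFFIXES'' are hypothetical names, ''sc'' is assumed to be an existing ''SparkContext'' as in the other examples, and the format is guessed from the file name suffix.

```python
# Suffixes of the formats that, per the text above, are read as 1 partition.
# Guessing the format from the file name is an assumption of this sketch.
SEQUENTIAL_SUFFIXES = (".gz", ".zip")


def read_lines(sc, path, partitions_per_core=3):
    """Read a text file and repartition it if it was read as 1 partition.

    `sc` is an existing SparkContext; nothing is imported from pyspark here.
    """
    lines = sc.textFile(path)
    if path.lower().endswith(SEQUENTIAL_SUFFIXES):
        # The file is still decompressed sequentially; repartition only
        # spreads the resulting lines for the stages that follow.
        lines = lines.repartition(partitions_per_core * sc.defaultParallelism)
    return lines
```

Repartitioning does not speed up the read itself, so for very large inputs a splittable format such as ''bzip2'' remains preferable.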
 ==== Number of Partitions: Multiple Files in a Directory ====
@@ Line 36: / Line 33: @@
 When the input is a directory, each file is read into separate partitions. The minimum number of partitions given as the second argument to ''textFile'' is applied only to the first file (if it is not compressed). The other files are split into 32MB chunks if uncompressed, or read as 1 partition if compressed.
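As a rough worked example of this rule, the following sketch estimates how many partitions a directory yields. It assumes the 32MB chunk size quoted above, counts every compressed file as 1 partition as the paragraph states, and ignores the minimum-partitions argument applied to the first file. ''estimate_partitions'' is a hypothetical helper, not part of Spark, and the real count may differ slightly.

```python
import math

CHUNK_BYTES = 32 * 1024 * 1024  # 32MB chunk size quoted in the text above
MB = 1024 * 1024


def estimate_partitions(files):
    """Estimate the partition count for (size_in_bytes, is_compressed) pairs."""
    total = 0
    for size, is_compressed in files:
        if is_compressed:
            total += 1  # a compressed file is counted as one partition
        else:
            # Assumption: even an empty or tiny file occupies one partition.
            total += max(1, math.ceil(size / CHUNK_BYTES))
    return total


# A 100MB plain file (4 chunks), a 10MB plain file (1) and a 500MB
# compressed file (1) give 6 partitions in total.
print(estimate_partitions([(100 * MB, False), (10 * MB, False), (500 * MB, True)]))
```

With thousands of small files the estimate is simply the number of files, which is why the partition count grows so quickly for such directories.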
-Note that when there are many files (as for example in ''/net/projects/spark-example-data/hamledt-cs-conll''), the number of partitions can be quite large, which slows down the computation. In that case, ''coalesce'' can be used to decrease the number of partitions efficiently (by merging existing partitions without running the ''repartition''):
+Note that when there are many files (thousands or more, for example in ''/net/projects/spark-example-data/hamledt-cs-conll''), the number of partitions can be quite large, which slows down the computation. In that case, ''coalesce'' can be used to decrease the number of partitions efficiently (by merging existing partitions without the full shuffle that ''repartition'' performs):
 <file python>
 conll_lines = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
@@ Line 67: / Line 64: @@
 To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
-For example, to read compressed HamleDT Czech CoNLL files, the following can be used:
+For example, to read compressed HamleDT Czech CoNLL files, so that every sentence is one element of the resulting ''RDD'', the following can be used:
 <file python>
 conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
@@ Line 74: / Line 71: @@
 ===== Reading Whole Text Files =====
-To read whole text file or whole text files in a given directory, ''sc.wholeTextFiles'' can be used.
+To read a whole text file or all text files in a given directory, ''sc.wholeTextFiles'' can be used. Compressed files are supported.
-Unfortunately, ''sc.wholeTextFiles'' **does not** support compressed files.
 <file python>
 whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
 </file>
+By default, every file is read in separate partitions. To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
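A minimal sketch of this last point, assuming ''sc'' is an existing ''SparkContext'' as in the example above; ''read_whole_files'' is a hypothetical helper name. Because ''coalesce'' can only merge partitions, the helper calls it only when there are more partitions than requested.

```python
def read_whole_files(sc, path, partitions_per_core=3):
    """Read (file name, file content) pairs with a bounded partition count.

    `sc` is an existing SparkContext; nothing is imported from pyspark here.
    """
    files = sc.wholeTextFiles(path)
    target = partitions_per_core * sc.defaultParallelism
    # coalesce merges existing partitions without a full shuffle; it cannot
    # increase their number, so it is applied only when there are too many.
    if files.getNumPartitions() > target:
        files = files.coalesce(target)
    return files
```

For example, ''read_whole_files(sc, "/net/projects/spark-example-data/wiki-cs")'' would return the same pairs as the call above, in at most ''3*sc.defaultParallelism'' partitions.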

Institute of Formal and Applied Linguistics Wiki
