Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

spark:recipes:reading-text-files [2025/10/15 20:12]
straka [Number of Partitions: Multiple Files in a Directory]
spark:recipes:reading-text-files [2025/10/15 20:13] (current)
straka [Number of Partitions: Multiple Files in a Directory]
Line 33: Line 33:
 When the input file is a directory, each file is read into its own partitions. The minimum number of partitions given as the second argument to ''textFile'' is applied only to the first file (if it is not compressed). The remaining files are split only into 32MB chunks if they are uncompressed, or read as a single partition if they are compressed.
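
The rule above can be turned into a rough estimate of how many partitions a directory will produce. This is only a sketch of the rule as stated here, not Spark's actual split logic: the helper ''estimate_partitions'' and the file sizes are hypothetical, and it assumes that the first uncompressed file gets at least the requested minimum and that every file yields at least one partition.

```python
import math

CHUNK_BYTES = 32 * 1024 * 1024  # 32MB chunk size quoted in the text above


def estimate_partitions(files, min_partitions=1):
    """Estimate the partition count for a directory of input files.

    files: list of (size_in_bytes, is_compressed) pairs, in reading order.
    Hypothetical helper; it follows the rule described above, not Spark's code.
    """
    total = 0
    for index, (size, compressed) in enumerate(files):
        if compressed:
            # A compressed file is read as a single partition.
            total += 1
            continue
        # Uncompressed files are split into 32MB chunks
        # (assumption: at least one partition per file).
        chunks = max(1, math.ceil(size / CHUNK_BYTES))
        if index == 0:
            # Assumption: the requested minimum only affects the first file.
            chunks = max(chunks, min_partitions)
        total += chunks
    return total


MB = 1024 * 1024
example = [(100 * MB, False), (10 * MB, False), (500 * MB, True)]
print(estimate_partitions(example, min_partitions=8))
print(estimate_partitions([(1024, False)] * 5000))
```

Under these assumptions, a directory of thousands of small files yields roughly one partition per file, whatever minimum is requested.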
  
-Note that when there are many files (thousands or more), the number of partitions can be quite large, which slows down the computation. In that case, ''coalesce'' can be used to decrease the number of partitions efficiently (by merging existing partitions without running the ''repartition''):
+Note that when there are many files (thousands or more, as for example in ''/net/projects/spark-example-data/hamledt-cs-conll''), the number of partitions can be quite large, which slows down the computation. In that case, ''coalesce'' can be used to decrease the number of partitions efficiently (by merging existing partitions without running the ''repartition''):
 <file python>
 conll_lines = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
