Differences
This shows you the differences between two versions of the page.
| | spark:recipes:reading-text-files [2014/11/04 14:11] straka | | spark:recipes:reading-text-files [2025/10/15 20:13] (current) straka [Number of Partitions: Multiple Files in a Directory] |
|---|---|---|---|
| Line 25: | Line 25: | ||
| ==== Number of Partitions: Compressed File ==== | ==== Number of Partitions: Compressed File ==== | ||
| - | If the input file is compressed, it is always read as 1 partition, | + | If the input file is compressed |
| - | To create multiple partitions, '' | + | On the other hand, files compressed with **'' |
| - | <file python> | + | |
| - | lines = sc.textFile(compressed_file).repartition(3*sc.defaultParallelism) | + | |
| - | </file> | + | |
| ==== Number of Partitions: Multiple Files in a Directory ==== | ==== Number of Partitions: Multiple Files in a Directory ==== | ||
| Line 36: | Line 33: | ||
| When the input file is a directory, each file is read in separate partitions. The minimum number of partitions given as the second argument to '' | When the input file is a directory, each file is read in separate partitions. The minimum number of partitions given as the second argument to '' | ||
| - | Note that when there are many files (as for example in ''/ | + | Note that when there are many files (thousands or more, as for example in ''/ |
| <file python> | <file python> | ||
| conll_lines = sc.textFile("/ | conll_lines = sc.textFile("/ | ||
| Line 67: | Line 64: | ||
| To control the number of partitions, '' | To control the number of partitions, '' | ||
| - | For example, to read compressed HamleDT Czech CoNLL files, the following can be used: | + | For example, to read compressed HamleDT Czech CoNLL files, so that every sentence is one element of the resulting '' |
| <file python> | <file python> | ||
| conlls = paragraphFile(sc, | conlls = paragraphFile(sc, | ||
| Line 74: | Line 71: | ||
| ===== Reading Whole Text Files ===== | ===== Reading Whole Text Files ===== | ||
| - | To read a whole text file or whole text files in a given directory, '' | + | To read a whole text file or whole text files in a given directory, '' |
| - | + | ||
| - | Unfortunately, | + | |
| <file python> | <file python> | ||
| whole_wiki = sc.wholeTextFiles("/ | whole_wiki = sc.wholeTextFiles("/ | ||
| </file> | </file> | ||
| + | |||
| + | By default, every file is read in separate partitions. To control the number of partitions, '' | ||
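
The revised text describes reading CoNLL files so that every sentence becomes one element of the result. As a rough illustration of that idea only, the plain-Python sketch below splits a text into blocks separated by blank lines, which is how sentences are delimited in CoNLL files. The function name `split_paragraphs` and the sample data are hypothetical; this is not the wiki's `paragraphFile` helper, whose implementation is not shown in this diff.

```python
def split_paragraphs(text):
    """Split text into blocks separated by one or more blank lines.

    Hypothetical helper for illustration; not the wiki's paragraphFile.
    """
    paragraphs = []
    current = []
    for line in text.splitlines():
        if line.strip():
            # Non-blank line: it belongs to the block being collected.
            current.append(line)
        elif current:
            # Blank line after some content: close the current block.
            paragraphs.append("\n".join(current))
            current = []
    if current:
        # The last block may not be followed by a blank line.
        paragraphs.append("\n".join(current))
    return paragraphs


# Made-up two-sentence sample in a CoNLL-like layout (one token per line).
sample = "1\tDobry\n2\tden\n\n1\tAhoj\n"
sentences = split_paragraphs(sample)
print(len(sentences))
```

With Spark, a function like this could in principle be applied via `flatMap` to the file contents returned by `sc.wholeTextFiles`, which yields (path, content) pairs. That would load each file as a single string, so it is only a sketch of the idea and not a substitute for the approach the page describes.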
