==== Number of Partitions: Compressed File ====
If the input file is compressed with a non-splittable codec like ''gzip'', it is always read as 1 partition, because the compressed stream can only be decompressed sequentially from the beginning. On the other hand, files compressed with the splittable **''bzip2''** codec can be read into multiple partitions directly. To create multiple partitions from a non-splittable file, ''repartition'' can be used:
<file python>
lines = sc.textFile(compressed_file).repartition(3*sc.defaultParallelism)
</file>
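
The number of partitions of an ''RDD'' can be inspected with ''getNumPartitions'', so the effect of compression and of ''repartition'' is easy to verify (a minimal sketch; the input file name is hypothetical):
<file python>
lines = sc.textFile("input.txt.gz")   # gzip is not splittable, so the file is read as 1 partition
print(lines.getNumPartitions())       # 1

lines = lines.repartition(3 * sc.defaultParallelism)
print(lines.getNumPartitions())       # 3 * sc.defaultParallelism
</file>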

==== Number of Partitions: Multiple Files in a Directory ====
When the input file is a directory, each file is read in separate partitions, so the resulting ''RDD'' has at least as many partitions as there are files; the minimum number of partitions can still be given as the second argument of ''textFile''.

Note that when there are many files (thousands or more, as for example in ''/...''), the number of partitions of the resulting ''RDD'' is correspondingly large:
<file python>
conll_lines = sc.textFile("/...")
</file>
| + | |||

===== Reading Text Files by Paragraphs =====
Although there is no method of ''sc'' which reads files by paragraphs, it can be implemented easily: the Hadoop ''TextInputFormat'' allows changing the record delimiter through the ''textinputformat.record.delimiter'' configuration option, so setting the delimiter to an empty line makes it return paragraphs instead of lines.

Python version:
<file python>
def paragraphFile(sc, path):
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                               "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
                               conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>
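
A quick usage sketch (the input file name is hypothetical):
<file python>
paragraphs = paragraphFile(sc, "input.txt")   # RDD with one element per paragraph
print(paragraphs.first())                     # the first paragraph of the file
</file>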

Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
    val conf = new org.apache.hadoop.conf.Configuration()
    conf.set("textinputformat.record.delimiter", "\n\n")
    return sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
        classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf).map(_._2.toString)
}
</file>

Compressed files are supported by this method too, but as before, a compressed file is decompressed into a single partition. To control the number of partitions, ''repartition'' can be used.

For example, to read compressed HamleDT Czech CoNLL files so that every sentence is one element of the resulting ''RDD'':
<file python>
conlls = paragraphFile(sc, "/...").repartition(3*sc.defaultParallelism)
</file>
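
Every element of ''conlls'' is now one sentence in CoNLL format, so the sentences can, for example, be counted directly (a trivial usage sketch):
<file python>
print(conlls.count())   # number of sentences in the input
</file>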
| + | |||
| + | ===== Reading Whole Text Files ===== | ||
| + | |||

To read a whole text file, or all text files in a given directory, with every file becoming a single element, ''wholeTextFiles'' can be used; it returns an ''RDD'' of (file name, file content) pairs.

<file python>
whole_wiki = sc.wholeTextFiles("/...")
</file>
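
Because the elements are (file name, file content) pairs, per-file processing is straightforward (a minimal sketch):
<file python>
# Compute the length of every file, keeping the file names as keys.
file_lengths = whole_wiki.mapValues(len)
</file>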
| + | |||
| + | By default, every file is read in separate partitions. To control the number of partitions, '' | ||
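
For example (a sketch with a hypothetical directory path and partition count):
<file python>
# Suggest at least 32 partitions for the whole directory.
whole = sc.wholeTextFiles("/path/to/directory", 32)
</file>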
