====== Reading Text Files ======

===== Reading Text Files by Lines =====
To read text file(s) line by line, ''sc.textFile'' can be used. Its argument is either a single file or a directory.

<file python>
lines = sc.textFile("/net/projects/spark-example-data/wiki-cs")
</file>

The elements of the resulting ''RDD'' are the lines of the input file(s).
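
Such an ''RDD'' can then be processed with the usual transformations and actions; a minimal sketch (the computed statistic is only illustrative):
<file python>
# Keep non-empty lines and count them.
nonempty_lines = lines.filter(lambda line: len(line.strip()) > 0)
print(nonempty_lines.count())
</file>
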
==== Number of Partitions: Uncompressed File ====
If the input file is not compressed, it is split into 32MB chunks, but into at least 2 partitions. The minimum number of partitions (instead of the default 2) can be specified as the second argument of ''textFile''.

Note that the number of ''RDD'' partitions greatly affects parallelization possibilities, because there are usually as many concurrently running tasks as there are partitions. For example, to read the input in at least ''3*sc.defaultParallelism'' partitions:
<file python>
lines = sc.textFile("/net/projects/spark-example-data/wiki-cs", 3*sc.defaultParallelism)
</file>
==== Number of Partitions: Compressed File ====

If the input file is compressed, it is always read as 1 partition, because splitting cannot be performed efficiently.

To create multiple partitions, ''repartition'' can be used in the following way:
<file python>
lines = sc.textFile(compressed_file).repartition(3*sc.defaultParallelism)
</file>

==== Number of Partitions: Multiple Files in a Directory ====

When the input is a directory, every file is read in separate partitions; the minimum number of partitions can again be specified as the second argument of ''textFile''.

Note that when there are many files (as for example in ''/net/projects/spark-example-data/hamledt-cs-conll''), the number of partitions can be quite large, which slows down the computation. In that case, ''coalesce'' can be used to decrease the number of partitions efficiently (unlike ''repartition'', it only merges existing partitions instead of reshuffling the data):
<file python>
conll_lines = sc.textFile("/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>

===== Reading Text Files by Paragraphs =====

Although there is no method of ''sc'' which reads files by paragraphs, one can be written easily.
Python version:
<file python>
def paragraphFile(sc, path):
    # Records are delimited by empty lines ("\n\n"); keep only the text of every record.
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                               "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
                               conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>

Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
  // Records are delimited by empty lines ("\n\n"); keep only the text of every record.
  val conf = new org.apache.hadoop.conf.Configuration()
  conf.set("textinputformat.record.delimiter", "\n\n")
  return sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
    classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text], conf).map(_._2.toString)
}
</file>

Compressed files are supported and each compressed file is read into 1 partition. Uncompressed files are split into 32MB chunks.
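
To see how a given input ends up partitioned, the number of partitions of the resulting ''RDD'' can be inspected; a minimal sketch (the input path is only illustrative):
<file python>
paragraphs = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll")

# One partition per compressed file; 32MB chunks for uncompressed files.
print(paragraphs.getNumPartitions())
</file>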

To control the number of partitions, ''repartition'' or ''coalesce'' can be used.

For example, to read compressed HamleDT Czech CoNLL files, so that every sentence is one element of the resulting ''RDD'', the following can be used:
<file python>
conlls = paragraphFile(sc, "/net/projects/spark-example-data/hamledt-cs-conll").coalesce(3*sc.defaultParallelism)
</file>

===== Reading Whole Text Files =====

To read a whole text file, or all text files in a given directory, as single elements, ''sc.wholeTextFiles'' can be used. The elements of the resulting ''RDD'' are pairs of (file name, file content).

<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs")
</file>

By default, every file is read in a separate partition. To control the number of partitions, ''repartition'' or ''coalesce'' can be used.
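
For example, a minimal sketch of lowering the number of partitions after ''wholeTextFiles'' (the input path and the target partition count are only illustrative):
<file python>
whole_wiki = sc.wholeTextFiles("/net/projects/spark-example-data/wiki-cs").coalesce(3*sc.defaultParallelism)
</file>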