Differences
This shows you the differences between two versions of the page.
spark:recipes:reading-text-files [2014/11/04 10:50] straka |
spark:recipes:reading-text-files [2014/11/04 14:13] straka |
||
---|---|---|---|
Line 13: | Line 13: | ||
The elements of the resulting '' | The elements of the resulting '' | ||
- | === Number of Partitions: Uncompressed File === | + | ==== Number of Partitions: Uncompressed File ==== |
If the input file is not compressed, it is split into chunks of at most 32MB, and always into at least 2 partitions. A minimum number of partitions other than the default of 2 can be specified as the second argument of '' | If the input file is not compressed, it is split into chunks of at most 32MB, and always into at least 2 partitions. A minimum number of partitions other than the default of 2 can be specified as the second argument of '' | ||
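The rule above can be approximated with a few lines of plain Python. This is only a rough sketch: the function name ''estimate_partitions'' and its parameters are hypothetical, the 32MB chunk size is taken from the text, and the rounding details of the real input splitter are ignored. It assumes the call being described is PySpark's ''sc.textFile'', whose second parameter is ''minPartitions''.

```python
import math

MB = 1024 * 1024


def estimate_partitions(file_size, min_partitions=2, chunk_size=32 * MB):
    """Approximate partition count for an uncompressed input file.

    Simplified model (an assumption, not the actual splitter logic):
    one partition per started chunk, but never fewer than min_partitions.
    """
    if file_size < 0:
        raise ValueError("file_size must be non-negative")
    by_size = math.ceil(file_size / chunk_size)
    return max(min_partitions, by_size)


# A small file still gets the default minimum of 2 partitions.
small = estimate_partitions(10 * MB)
# A 100MB file is cut into four chunks of at most 32MB.
medium = estimate_partitions(100 * MB)
# Asking for a higher minimum overrides the chunk-based count.
forced = estimate_partitions(100 * MB, min_partitions=10)
```

Under this model the minimum only matters when it exceeds the chunk-based count, which is why raising it is mainly useful for small or medium-sized files.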
Line 23: | Line 23: | ||
</ | </ | ||
- | === Number of Partitions: Compressed File === | + | ==== Number of Partitions: Compressed File ==== |
If the input file is compressed, it is always read as 1 partition, as splitting cannot be performed efficiently. | If the input file is compressed, it is always read as 1 partition, as splitting cannot be performed efficiently. | ||
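Why splitting is not possible can be illustrated outside of Spark. The sketch below assumes gzip compression (the text only says "compressed") and uses a hypothetical helper ''can_decode_from''. It shows that a gzip stream can be decoded from its beginning, but not from an arbitrary byte offset in the middle, which is what a second partition would have to do.

```python
import gzip
import zlib


def can_decode_from(blob, offset):
    """Return True if the gzip data in blob can be decoded starting at offset."""
    try:
        gzip.decompress(blob[offset:])
        return True
    except (OSError, EOFError, zlib.error):
        # No gzip header at this position, or the stream is not decodable.
        return False


# Build a sample text file in memory and compress it.
lines = "".join(f"line {i}\n" for i in range(10000)).encode("utf-8")
blob = gzip.compress(lines)

from_start = can_decode_from(blob, 0)
from_middle = can_decode_from(blob, len(blob) // 2)
```

Here ''from_start'' is true while ''from_middle'' is false, so a reader cannot begin at a chunk boundary the way it can with plain text.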
Line 32: | Line 32: | ||
</ | </ | ||
- | === Number of Partitions: Multiple Files in a Directory === | + | ==== Number of Partitions: Multiple Files in a Directory ==== |
When the input path is a directory, each file is read into separate partitions. The minimum number of partitions given as the second argument to '' | When the input path is a directory, each file is read into separate partitions. The minimum number of partitions given as the second argument to '' | ||
Line 67: | Line 67: | ||
To control the number of partitions, '' | To control the number of partitions, '' | ||
- | For example, to read compressed HamleDT Czech CoNLL files, the following can be used: | + | For example, to read compressed HamleDT Czech CoNLL files, so that every sentence is one element of the resulting '' |
<file python> | <file python> | ||
conlls = paragraphFile(sc, | conlls = paragraphFile(sc, |