spark:recipes:reading-text-files [2014/11/04 14:11] straka
conll_lines = sc.textFile("/
</file>

===== Reading Text Files by Paragraphs =====

Although there is no built-in method for reading text files by paragraphs, it can be implemented by setting the Hadoop ''textinputformat.record.delimiter'' option to an empty line:

Python version:
<file python>
def paragraphFile(sc, path):
    return sc.newAPIHadoopFile(path, "org.apache.hadoop.mapreduce.lib.input.TextInputFormat",
                               "org.apache.hadoop.io.LongWritable", "org.apache.hadoop.io.Text",
                               conf={"textinputformat.record.delimiter": "\n\n"}).map(lambda num_line: num_line[1])
</file>

Scala version:
<file scala>
def paragraphFile(sc: org.apache.spark.SparkContext, path: String): org.apache.spark.rdd.RDD[String] = {
  val conf = new org.apache.hadoop.conf.Configuration()
  conf.set("textinputformat.record.delimiter", "\n\n")
  return sc.newAPIHadoopFile(path, classOf[org.apache.hadoop.mapreduce.lib.input.TextInputFormat],
                             classOf[org.apache.hadoop.io.LongWritable], classOf[org.apache.hadoop.io.Text],
                             conf).map(_._2.toString)
}
</file>

Compressed files are supported, but each compressed file is read into a single partition. Uncompressed files are split into 32 MB chunks.
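Given the 32 MB split size stated above, the number of partitions an uncompressed file produces can be estimated with simple arithmetic (a back-of-the-envelope sketch; the helper is hypothetical, not a Spark API):

```python
import math

def estimated_partitions(file_size_bytes, split_size=32 * 1024 * 1024):
    """Estimate input splits for an uncompressed file at a 32 MB split size."""
    return max(1, math.ceil(file_size_bytes / split_size))

print(estimated_partitions(100 * 1024 * 1024))  # → 4 (100 MB / 32 MB, rounded up)
print(estimated_partitions(1024))               # → 1 (small files still get one split)
```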

To control the number of partitions, the resulting RDD can be repartitioned with ''repartition''.

For example, to read compressed HamleDT Czech CoNLL files, the following can be used:
<file python>
conlls = paragraphFile(sc,
</file>
===== Reading Whole Text Files =====
<file python>
whole_wiki
</file>