courses:mapreduce-tutorial:step-29 [2012/01/30 00:40] straka
We start by creating ''FileAsPathInputFormat'', which reads any file, splits it, and for each split returns exactly one input pair (file_path, start-length) with types (''Text'', ''Text''), where ''file_path'' is the path to the file and ''start-length'' is a string containing two dash-separated numbers: the start offset of the split and the length of the split.
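A mapper consuming this format has to take the ''start-length'' value apart again. A minimal self-contained sketch of that parsing (the class and method names below are our own, not part of the tutorial code):

<code java>
public class StartLengthDemo {
  // Parse a dash-separated "start-length" string into its two numbers.
  static long[] parseStartLength(String value) {
    int dash = value.indexOf('-');
    long start = Long.parseLong(value.substring(0, dash));
    long length = Long.parseLong(value.substring(dash + 1));
    return new long[]{start, length};
  }

  public static void main(String[] args) {
    // e.g. a split starting at offset 0 with length 64MB
    long[] sl = parseStartLength("0-67108864");
    System.out.println(sl[0] + " " + sl[1]);
  }
}
</code>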

When implementing a new input format, we must
  * decide whether the input files are splittable. Usually uncompressed files are splittable and compressed files are not, with the exception of ''SequenceFile'', which is always splittable.
  * implement [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/RecordReader.html|RecordReader<K, V>]]. The ''RecordReader'' does the real work -- it is given a file split and reads (key, value) pairs of types (K, V) for as long as there are any.

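For a splittable file, the splitting itself is just arithmetic: the file is cut into consecutive (start, length) ranges. A self-contained sketch of that computation, under simplifying assumptions (''splitSize'' is a fixed parameter here; the real Hadoop ''FileInputFormat'' also consults block locations and min/max split size settings, and represents each range as a ''FileSplit''):

<code java>
import java.util.ArrayList;
import java.util.List;

public class SplitDemo {
  // Cut a file of fileLength bytes into ranges of at most splitSize bytes.
  // Each range is returned as {start, length}.
  static List<long[]> computeSplits(long fileLength, long splitSize) {
    List<long[]> splits = new ArrayList<>();
    for (long start = 0; start < fileLength; start += splitSize) {
      splits.add(new long[]{start, Math.min(splitSize, fileLength - start)});
    }
    return splits;
  }

  public static void main(String[] args) {
    for (long[] s : computeSplits(150, 64)) {
      // printed in the same "start-length" format the input pairs use
      System.out.println(s[0] + "-" + s[1]);
    }
  }
}
</code>

Note that the last split is simply shorter than the others; nothing forces splits to align with record boundaries, which is why the ''RecordReader'' must cope with a split starting mid-record.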
Our ''FileAsPathInputFormat'' is simple -- we allow splitting of uncompressed files, and the ''RecordReader'' reads exactly one input pair.
<code java>
public class FileAsPathInputFormat extends FileInputFormat<Text, Text> {
  public static class FileAsPathRecordReader extends RecordReader<Text, Text> {
    private Path file;