MapReduce Tutorial : Exercise - sorting

You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format:

Path	Size
/net/projects/hadoop/examples/inputs/numbers-small	3MB
/net/projects/hadoop/examples/inputs/numbers-medium	184MB
/net/projects/hadoop/examples/inputs/numbers-large	916MB

You can assume that the integers are uniformly distributed.

Your task is to sort these data, comparing the keys numerically, not lexicographically. The lines in the output must be the same as in the input, only in a different order.
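To see why the comparison must be numeric, here is a minimal Python sketch. It is not written against the tutorial's Hadoop API, and it assumes each line has the form "key<TAB>value", which the text above does not state explicitly. The sample lines and the helper name numeric_key are made up for illustration.

```python
# Hypothetical sample lines; the tab-separated "key<TAB>value" layout is an assumption.
lines = ["10\tb", "9\ta", "100\tc"]

def numeric_key(line):
    # Split off the key (everything before the first tab) and compare it as an integer.
    return int(line.split("\t", 1)[0])

# Plain string comparison orders the keys as text, so "9" ends up after "100".
lexicographic = sorted(lines)

# Comparing the parsed integers gives the order the exercise asks for.
numeric = sorted(lines, key=numeric_key)

print(lexicographic)
print(numeric)
```

Note that only the order changes: each line is emitted exactly as it was read.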

Your solution should work for TBs of data, so you must use multiple reducers. If your job is executed using r reducers, the output consists of r files which, when concatenated, produce sorted (key, value) pairs. In other words, each output file contains sorted (integer, data) pairs, and all keys in one file are either all smaller or all larger than the keys in any other file. Your solution should work for any value of r; this value is given to the partitioner as its fourth argument.
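One possible way to meet this requirement for uniformly distributed keys is a range partitioner that cuts the key space into r equally wide intervals. The sketch below is plain Python rather than the tutorial's partitioner interface, and it assumes that a "31-bit integer" means 0 <= key < 2**31. The names KEY_SPACE and uniform_partition are illustrative only.

```python
KEY_SPACE = 1 << 31  # assumption: keys are non-negative and smaller than 2**31

def uniform_partition(key, r):
    """Return a reducer index in [0, r) that never decreases as the key grows."""
    # Scale the key from [0, KEY_SPACE) to [0, r) using integer arithmetic only.
    return key * r // KEY_SPACE

# With r = 4, the smallest key goes to reducer 0 and the largest to reducer 3.
print(uniform_partition(0, 4), uniform_partition(KEY_SPACE - 1, 4))
```

Because the mapping is monotonic in the key, every key sent to reducer i is no larger than any key sent to reducer i+1, so concatenating the sorted reducer outputs in order yields a globally sorted result.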

Nonuniform data

Assuming that the integer keys are uniformly distributed is a strong assumption. Try improving your solution so that it sorts keys with any distribution. You can use the exponentially distributed data available here:

Path	Size
/net/projects/hadoop/examples/inputs/nonuniform-small	3MB
/net/projects/hadoop/examples/inputs/nonuniform-medium	160MB
/net/projects/hadoop/examples/inputs/nonuniform-large	797MB

Assume we want to produce r output files. One solution is to run two Hadoop jobs:

1. Go through the data and sample only a small fraction of the keys. Because the sample is small, it fits in a single reducer.
2. Find the best r-1 integer separators using the sampled data.
3. Run the second pass, using the separators to guide the partitioning.
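The sampling approach above can be sketched in plain Python as follows. This is only an illustration of the idea, not the tutorial's reference solution and not Hadoop code. The function names, the fixed random seed, and the choice of evenly spaced quantiles as the "best" separators are all assumptions.

```python
import bisect
import random

def sample_keys(keys, fraction, seed=0):
    """First pass: keep each key independently with the given probability."""
    rng = random.Random(seed)
    return [k for k in keys if rng.random() < fraction]

def find_separators(sample, r):
    """Pick up to r-1 separators at evenly spaced quantiles of the sorted sample."""
    ordered = sorted(sample)
    n = len(ordered)
    if n == 0:
        return []  # no sample available: every key falls into partition 0
    return [ordered[i * n // r] for i in range(1, r)]

def partition(key, separators):
    """Second pass: partition i receives keys with separators[i-1] <= key < separators[i]."""
    return bisect.bisect_right(separators, key)

# Hypothetical usage on skewed keys: sample, derive separators, then route each key.
keys = [int(random.Random(1).expovariate(1.0) * 1000) for _ in range(10000)]
separators = find_separators(sample_keys(keys, 0.1), 4)
counts = [0] * 4
for k in keys:
    counts[partition(k, separators)] += 1
print(separators, counts)
```

The partition sizes are only as balanced as the sample is representative, so the sampling fraction trades the cost of the first pass against the evenness of the second.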
