Differences
This shows you the differences between two versions of the page.
courses:mapreduce-tutorial:step-13 [2012/01/25 22:53] straka |
courses:mapreduce-tutorial:step-13 [2012/01/26 23:13] straka |
||
---|---|---|---|
Line 3: | Line 3: | ||
You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format: | You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format: | ||
^ Path ^ Size ^ | ^ Path ^ Size ^ | ||
- | | /home/straka/hadoop/example-inputs/ | + | | /net/projects/hadoop/examples/inputs/ |
- | | /home/straka/hadoop/example-inputs/ | + | | /net/projects/hadoop/examples/inputs/ |
- | | /home/straka/hadoop/example-inputs/ | + | | /net/projects/hadoop/examples/inputs/ |
You can assume that the integers are uniformly distributed. | You can assume that the integers are uniformly distributed. | ||
Line 12: | Line 12: | ||
===== Nonuniform data ===== | ===== Nonuniform data ===== | ||
- | The | + | Assuming that the integer keys are uniformly distributed is a strong assumption. Try improving your solution so that it sorts keys with any distribution. You can use the exponentially distributed data available here:
- | After solving | + | |
^ Path ^ Size ^ | ^ Path ^ Size ^ | ||
- | | /home/straka/hadoop/example-inputs/ | + | | /net/projects/hadoop/examples/inputs/ |
- | | /home/straka/hadoop/example-inputs/ | + | | /net/projects/hadoop/examples/inputs/ |
- | | /home/straka/hadoop/example-inputs/ | + | | /net/projects/hadoop/examples/inputs/ |
- | After you | + | |
+ | Assume we want to produce //r// output files. One possible solution is to run two Hadoop jobs: | ||
+ | - Go through the data and sample only a small fraction of the keys. Because the sample is small, it fits in a single reducer.
+ | - Find the best //r-1// integer separators using the sampled data.
+ | - Run the second pass, using the separators to guide the partitioning.
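
The three steps above can be sketched outside Hadoop as plain Python. This is only an illustration of the idea, not the tutorial's reference solution: the function names `sample_keys`, `find_separators` and `partition` are hypothetical, the sampling is a simple per-key coin flip, and the "best" separators are assumed to be the evenly spaced quantiles of the sorted sample.

```python
import bisect
import random


def sample_keys(keys, fraction, seed=0):
    """Step 1 (hypothetical helper): keep each key with probability `fraction`."""
    rng = random.Random(seed)
    return [k for k in keys if rng.random() < fraction]


def find_separators(sampled, r):
    """Step 2: pick r-1 separators as evenly spaced quantiles of the sample."""
    s = sorted(sampled)
    if r <= 1 or not s:
        return []
    return [s[(i * len(s)) // r] for i in range(1, r)]


def partition(key, separators):
    """Step 3: map a key to an output file index in 0..r-1.

    Keys below the first separator go to file 0; a key equal to a
    separator goes to the file on its right. The mapping is monotonic,
    so concatenating the sorted files yields a globally sorted result.
    """
    return bisect.bisect_right(separators, key)


if __name__ == "__main__":
    # Exponentially distributed 31-bit keys, generated here for illustration only.
    rng = random.Random(1)
    keys = [min(int(rng.expovariate(1e-6)), 2**31 - 1) for _ in range(20000)]
    seps = find_separators(sample_keys(keys, 0.05), r=4)
    counts = [0] * 4
    for k in keys:
        counts[partition(k, seps)] += 1
    print(seps, counts)
```

In an actual Hadoop job the separators would be computed by the single reducer of the first job and then made available to the partitioner of the second job; how they are distributed depends on the API used and is not shown here.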