Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-13 [2012/01/25 22:53]
straka
+++ courses:mapreduce-tutorial:step-13 [2012/01/25 23:00]
straka
@@ Line 12: / Line 12: @@
 ===== Nonuniform data =====
-The
+Assuming that the integer keys are uniformly distributed is a strong assumption. Try improving your solution so that it sorts keys with any distribution. You can use the exponentially distributed data available here:
-After solving
 ^ Path ^ Size ^
@@ Line 19: / Line 18: @@
 | /home/straka/hadoop/example-inputs/nonuniform-medium | 160MB |
 | /home/straka/hadoop/example-inputs/nonuniform-large | 797MB |
-After you
+Assume we want to produce //r// output files. One solution is to perform two Hadoop jobs:
+  - Go through the data and sample only a small fraction of the keys. Because the sample is small, it fits in a single reducer.
+  - Find the best //r-1// integer separators using the sampled data.
+  - Run the second pass, using the separators to guide the partitioning.
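
The three steps above can be sketched outside Hadoop in plain Python. This is only an illustration under stated assumptions, not the tutorial's reference solution: the function names ''sample_keys'', ''find_separators'' and ''partition'' are hypothetical, sampling is done by keeping each key with a fixed probability, and the separators are taken as evenly spaced quantiles of the sorted sample. In an actual Hadoop job, the sampling would presumably happen in the mappers of the first job, the separator search in its single reducer, and the lookup in the partitioner of the second job.

```python
import bisect
import random


def sample_keys(keys, fraction, seed=0):
    """Keep each key independently with probability `fraction` (assumed sampling scheme)."""
    rng = random.Random(seed)
    return [k for k in keys if rng.random() < fraction]


def find_separators(sample, r):
    """Return r-1 separators chosen as evenly spaced quantiles of the sorted sample."""
    if not sample:
        raise ValueError("the sample is empty; increase the sampling fraction")
    ordered = sorted(sample)
    n = len(ordered)
    # The i-th separator sits roughly at the i/r quantile of the sample.
    return [ordered[(i * n) // r] for i in range(1, r)]


def partition(key, separators):
    """Map a key to an output file index in 0..r-1 using binary search.

    Keys below the first separator go to file 0, keys at or above the
    last separator go to file r-1, so the files are ordered by key range.
    """
    return bisect.bisect_right(separators, key)
```

Because every key in file //i// is smaller than every key in file //i+1//, sorting each file independently and concatenating the files in order yields a globally sorted output. With quantile-based separators the files should also have roughly equal sizes even for skewed (for example exponentially distributed) keys, provided the sample is representative.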
