Differences

This shows you the differences between two versions of the page.

--- courses:mapreduce-tutorial:step-13 [2012/01/25 22:53]
straka
+++ courses:mapreduce-tutorial:step-13 [2012/01/26 23:13]
straka
@@ Line 3: / Line 3: @@
 You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format:
 ^ Path ^ Size ^
-| /home/straka/hadoop/example-inputs/numbers-small | 3MB |
+| /net/projects/hadoop/examples/inputs/numbers-small | 3MB |
-| /home/straka/hadoop/example-inputs/numbers-medium | 184MB |
+| /net/projects/hadoop/examples/inputs/numbers-medium | 184MB |
-| /home/straka/hadoop/example-inputs/numbers-large | 916MB |
+| /net/projects/hadoop/examples/inputs/numbers-large | 916MB |
 You can assume that the integers are uniformly distributed.
@@ Line 12: / Line 12: @@
 ===== Nonuniform data =====
-The
+Assuming that the integer keys are uniformly distributed is a strong assumption. Try improving your solution so that it sorts keys with any distribution. You can use the exponentially distributed data available here:
-After solving
 ^ Path ^ Size ^
-| /home/straka/hadoop/example-inputs/nonuniform-small | 3MB |
+| /net/projects/hadoop/examples/inputs/nonuniform-small | 3MB |
-| /home/straka/hadoop/example-inputs/nonuniform-medium | 160MB |
+| /net/projects/hadoop/examples/inputs/nonuniform-medium | 160MB |
-| /home/straka/hadoop/example-inputs/nonuniform-large | 797MB |
+| /net/projects/hadoop/examples/inputs/nonuniform-large | 797MB |
-After you
+Assume we want to produce //r// output files. One possible solution is to run two Hadoop jobs:
+  - Go through the data and sample only a small fraction of the keys. Because the sample is small, it fits in a single reducer.
+  - Find the best //r-1// integer separators using the sampled data.
+  - Run the second pass, using the separators to guide the partitioning.
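
The steps above can be sketched in plain Python, outside of Hadoop. This is only an illustration under stated assumptions: the function names `sample_keys`, `find_separators` and `partition` are hypothetical and not part of any Hadoop API, and "best" separators are taken here to mean evenly spaced quantiles of the sorted sample, which is one reasonable reading rather than the only one.

```python
import bisect
import random


def sample_keys(keys, fraction, seed=0):
    """Step 1 (hypothetical helper): keep each key with probability `fraction`."""
    rng = random.Random(seed)
    return [k for k in keys if rng.random() < fraction]


def find_separators(sample, r):
    """Step 2 (hypothetical helper): pick r-1 separators from the sample.

    Assumption: the separators are the evenly spaced quantiles of the
    sorted sample, so each of the r ranges holds about the same number
    of sampled keys.
    """
    if r < 1:
        raise ValueError("r must be at least 1")
    ordered = sorted(sample)
    if not ordered and r > 1:
        raise ValueError("cannot choose separators from an empty sample")
    n = len(ordered)
    return [ordered[(i * n) // r] for i in range(1, r)]


def partition(key, separators):
    """Step 3 (hypothetical helper): index (0..r-1) of the output file for `key`.

    Keys smaller than separators[0] go to file 0, keys in
    [separators[0], separators[1]) go to file 1, and so on.
    """
    return bisect.bisect_right(separators, key)
```

Because the partition index grows with the key, concatenating the sorted output files in order gives a globally sorted result. In a real job, the logic of `partition` would live in the partitioner of the second pass, with the separators passed to it through the job configuration or a side file.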


Institute of Formal and Applied Linguistics Wiki
