This is an old revision of the document!

MapReduce Tutorial: Implementing iterative MapReduce jobs faster using All-Reduce

Implementing an iterative computation by running a separate Hadoop job for every iteration is usually not very efficient (although it is fault tolerant).

If we have enough machines that all input data fits into memory, we can implement iterative computation like this:

start several machines, each reads a portion of input data
when all the machines are ready, for each iteration
1. locally compute as much as is possible
2. communicate with other machines and combine the results to complete this iteration

There are many ways how the machines communicate with one another to finish an iteration. On of the possible implementations is AllReduce operation.

An AllReduce operation does the following:

every machine provides a value
values from all the machines are combined using specified operation (e.g., addition, minimum, maximum) into one resulting value
the resulting value is distributed to all the machines

It is possible for each machine to provide more than value. In that case, all machines must provide the same number of values, the corresponding values get reduced in parallel and every machine gets the result of all the reductions.

Hadoop AllReduce implementation

An implementation of AllReduce operation is provided in our Hadoop framework. The implementation is in theory independent of Hadoop (it could be implemented using e.g. an SGE array job), but we use Hadoop framework as it provides features we would have to implement ourselves.

A Hadoop AllReduce job is implemented using a mapper only. The mapper should read the input data during map operation without providing any output. After all input data is read, cooperate function is called. As the name suggests, it is executed by all the machines in a synchronized fashion. In this method, the following allReduce methods can be called:

double allReduce(Context, double value, int reduce_op) – each machine provides one value, which is reduced across all the machines and the result is returned to all machines.
void allReduce(Context, double[] values, int reduce_op) – each machine provides multiple values. The reduced values are stored in the original array in every machine.
void allReduce(Context, double[][] values, int reduce_op) – each machine provides multiple values. The reduced values are stored in the original array in every machine.

The reduce_op can currently be one on:

REDUCE_ADD
REDUCE_MIN
REDUCE_MAX

[ Back to the navigation ] [ Back to the content ]

Institute of Formal and Applied Linguistics Wiki

MapReduce Tutorial: Implementing iterative MapReduce jobs faster using All-Reduce

Hadoop AllReduce implementation