Large data processing using MapReduce
For an introduction, it is best to read the original MapReduce paper (Dean and Ghemawat, OSDI 2004).
There are also Czech slides (up to slide 45).
There are nice slides from the three-day course available at http://sites.google.com/site/mriap2008/lectures.
I would suggest starting with http://sites.google.com/site/mriap2008/intro_to_mapreduce.pdf .
Now is a good time to solve the following exercises:
- create a list of unique words present in a given text
- count all bigrams present in a given text
- count all n-grams for all n ≤ N in a given text
- compute the probability with which each word is capitalized
- given a large corpus, find all undiacritized forms of words present in the corpus, and for every such form compute the most probable diacritization
- create an index: given many URLs and their texts, produce for each word the list of URLs whose text contains this word; for each such URL, produce an ascending list of positions of this word in the document
- implement the iterative k-means algorithm
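To see how the first two exercises decompose into map and reduce steps, here is a minimal sketch in plain Python. The `run_mapreduce` helper only simulates the map-shuffle-reduce cycle locally; it is an illustration of the programming model, not the Hadoop API, and all names in it are this sketch's own.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal local simulation of MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values)       # reduce phase
            for key, values in sorted(groups.items())}

# Exercise 1: unique words -- the mapper emits each word as a key;
# the set of output keys is the answer, so the reducer does nothing.
def unique_word_mapper(line):
    for word in line.split():
        yield word, None

def unique_word_reducer(word, values):
    return None

# Exercise 2: bigram counts -- the mapper emits (bigram, 1),
# the reducer sums the counts for each bigram.
def bigram_mapper(line):
    words = line.split()
    for pair in zip(words, words[1:]):
        yield pair, 1

def count_reducer(key, values):
    return sum(values)

lines = ["to be or not to be", "to be is to do"]
print(sorted(run_mapreduce(lines, unique_word_mapper, unique_word_reducer)))
print(run_mapreduce(lines, bigram_mapper, count_reducer))
```

The same shape carries over to the other exercises: only the mapper and reducer change, while the framework handles grouping by key.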
The following slides discuss solutions to various problems using MR:
- http://sites.google.com/site/mriap2008/not_everything_is_nail.pdf (problems difficult for MR)
There is also a paper about implementing various machine learning algorithms (SVM, EM, naive Bayes, etc.) using MapReduce on multicore machines, which also applies to distributed computation: http://fox.auryn.cz/mr/machine_learning_using_mr_nips06.pdf.
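As a concrete example of how an iterative machine learning algorithm (such as the k-means exercise above) maps onto this model, here is a hedged sketch of one k-means iteration in plain Python: the mapper assigns each point to its nearest centroid, the reducer averages the points of each cluster, and an outer driver loop would repeat the iteration until the centroids stop moving. The function names and the local simulation are this sketch's own; in a real Hadoop job the current centroids would be distributed to all mappers as shared read-only data.

```python
import math
from collections import defaultdict

def assign_mapper(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    best = min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))
    yield best, point

def mean_reducer(index, points):
    """Reduce step: the new centroid is the mean of its assigned points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centroids):
    """One MapReduce iteration of k-means, simulated locally."""
    groups = defaultdict(list)
    for p in points:
        for key, value in assign_mapper(p, centroids):
            groups[key].append(value)
    return [mean_reducer(key, values) for key, values in sorted(groups.items())]

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_iteration(points, centroids))
```

The driver loop between iterations is the part that MapReduce does not express directly, which is why iterative algorithms like k-means need one job per iteration.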
Code Template
A code template for installing and running Hadoop on UFAL workstations is also available.
wget 'http://ufallab.ms.mff.cuni.cz/~majlis/mapreduce-tutorial.tar.gz'
tar -xzf mapreduce-tutorial.tar.gz
cd mapreduce-tutorial
make
This template has been tested in the UFAL environment. If you plan to use it elsewhere, adjust the first two lines of the Makefile. The template also contains an Eclipse project with the paths already set up.