Large data processing using MapReduce
For an introduction, it is best to read the original MapReduce paper (Dean and Ghemawat, OSDI 2004).
There are also Czech slides (up to slide 45).
There are nice slides from the three-day course available at http://sites.google.com/site/mriap2008/lectures.
I would suggest starting with http://sites.google.com/site/mriap2008/intro_to_mapreduce.pdf .
Now is a good time to solve the following exercises:
- create a list of unique words present in a given text
- count all bigrams present in a given text
- count all n-grams for all n ≤ N in a given text
- compute the probability with which each word is capitalized
- given a large corpus, find all undiacritized forms of words present in the corpus, and for every such form compute the most probable diacritization
- create an index: given many URLs and their texts, produce for each word the list of URLs whose text contains this word; for each such URL, produce an ascending list of positions of this word in the document
- implement the iterative k-means algorithm
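To see how the first two exercises decompose into map and reduce steps, here is a minimal sketch in plain Python. The `run_mapreduce` helper only simulates the map-shuffle-reduce cycle locally; it is an illustration of the programming model, not the Hadoop API, and all names in it are this sketch's own.

```python
from collections import defaultdict

def run_mapreduce(records, mapper, reducer):
    """Minimal local simulation of MapReduce: map, group by key, reduce."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):   # map phase
            groups[key].append(value)       # shuffle: group values by key
    return {key: reducer(key, values)       # reduce phase
            for key, values in sorted(groups.items())}

# Exercise 1: unique words -- the mapper emits each word as a key;
# the set of output keys is the answer, so the reducer does nothing.
def unique_word_mapper(line):
    for word in line.split():
        yield word, None

def unique_word_reducer(word, values):
    return None

# Exercise 2: bigram counts -- the mapper emits (bigram, 1),
# the reducer sums the counts for each bigram.
def bigram_mapper(line):
    words = line.split()
    for pair in zip(words, words[1:]):
        yield pair, 1

def count_reducer(key, values):
    return sum(values)

lines = ["to be or not to be", "to be is to do"]
print(sorted(run_mapreduce(lines, unique_word_mapper, unique_word_reducer)))
print(run_mapreduce(lines, bigram_mapper, count_reducer))
```

The same shape carries over to the other exercises: only the mapper and reducer change, while the framework handles grouping by key.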
The following slides discuss solutions to various problems using MR:
- http://sites.google.com/site/mriap2008/not_everything_is_nail.pdf (problems difficult for MR)
There is also a paper about implementing various machine learning algorithms (SVM, EM, naive Bayes, etc.) using MapReduce on multicore machines, which also applies to distributed computation: http://fox.auryn.cz/mr/machine_learning_using_mr_nips06.pdf.
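As a concrete example of how an iterative machine learning algorithm (such as the k-means exercise above) maps onto this model, here is a hedged sketch of one k-means iteration in plain Python: the mapper assigns each point to its nearest centroid, the reducer averages the points of each cluster, and an outer driver loop would repeat the iteration until the centroids stop moving. The function names and the local simulation are this sketch's own; in a real Hadoop job the current centroids would be distributed to all mappers as shared read-only data.

```python
import math
from collections import defaultdict

def assign_mapper(point, centroids):
    """Map step: emit (index of nearest centroid, point)."""
    best = min(range(len(centroids)),
               key=lambda i: math.dist(point, centroids[i]))
    yield best, point

def mean_reducer(index, points):
    """Reduce step: the new centroid is the mean of its assigned points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans_iteration(points, centroids):
    """One MapReduce iteration of k-means, simulated locally."""
    groups = defaultdict(list)
    for p in points:
        for key, value in assign_mapper(p, centroids):
            groups[key].append(value)
    return [mean_reducer(key, values) for key, values in sorted(groups.items())]

points = [(0.0, 0.0), (0.0, 1.0), (9.0, 9.0), (10.0, 10.0)]
centroids = [(0.0, 0.0), (10.0, 10.0)]
print(kmeans_iteration(points, centroids))
```

The driver loop between iterations is the part that MapReduce does not express directly, which is why iterative algorithms like k-means need one job per iteration.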
Code Template
A code template for installing and running Hadoop on UFAL workstations is also available.
wget 'http://ufallab.ms.mff.cuni.cz/~majlis/mapreduce-tutorial.tar.gz'
tar -xzf mapreduce-tutorial.tar.gz
cd mapreduce-tutorial
make
This template has been tested in the UFAL environment. If you plan to use it elsewhere, adjust the first two lines of the Makefile. The template also contains an Eclipse project with the paths already set up.