====== Large data processing using MapReduce ======

For an introduction, it is best to read the [[http://fox.auryn.cz/mr/original_paper_dean04.pdf|original paper]]. There are also Czech [[http://fox.auryn.cz/mr/slides_czech_2009.pdf|slides]] (up to slide 45).

There are nice slides from a three-day course available at [[http://sites.google.com/site/mriap2008/lectures]]. I would suggest starting with http://sites.google.com/site/mriap2008/intro_to_mapreduce.pdf .

Now is a good time to solve the following exercises:
  * create a list of unique words present in a given text
  * count all bigrams present in a given text
  * count all n-grams for all n <= N in a given text
  * determine with what probability a word is capitalized
  * given a large corpus, find all undiacritized forms of words present in the corpus and, for every such form, compute the most probable diacritization
  * create an index: given many URLs and their texts, create for each word a list of URLs whose text contains this word. For each such URL, produce an ascending list of positions of this word in the document.
  * implement the iterative k-means algorithm

The following slides discuss solutions to various problems using MR:
  * http://sites.google.com/site/mriap2008/what_is_mapreduce.pdf
  * http://sites.google.com/site/mriap2008/word_context_enthropy.pdf
  * http://sites.google.com/site/mriap2008/hadoop_and_k_means.pdf pages 23-30
  * http://sites.google.com/site/mriap2008/not_everything_is_nail.pdf (problems difficult for MR)

There is also a paper about implementing various machine learning algorithms (SVM, EM, Bayes, etc.) using MapReduce on multicore machines, which is also applicable to distributed computations: [[http://fox.auryn.cz/mr/machine_learning_using_mr_nips06.pdf]].

----

===== Code Template =====

A code template for installing and running Hadoop on UFAL workstations is also available.
''
wget 'http://ufallab.ms.mff.cuni.cz/~majlis/mapreduce-tutorial.tar.gz'
tar -xzf mapreduce-tutorial.tar.gz
cd mapreduce-tutorial
make
''

This template has been tested in the UFAL environment. If you plan to use it elsewhere, you should adjust the first two lines of the Makefile. It also contains an Eclipse project with the paths already set up.
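
Before running anything on Hadoop, it can help to try the map, shuffle and reduce phases locally. The following is only a minimal single-process sketch in Python, not part of the template and not Hadoop's API. It uses the word count example from the original paper, and the names ''map_fn'', ''reduce_fn'' and ''run_mapreduce'' are made up for illustration. Tokenization by whitespace and lowercasing are simplifying assumptions.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, Iterator, List, Tuple


def map_fn(document: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (word, 1) for every whitespace-separated token."""
    for word in document.split():
        yield word.lower(), 1


def reduce_fn(word: str, counts: Iterable[int]) -> Tuple[str, int]:
    """Reduce phase: sum all counts emitted for a single word."""
    return word, sum(counts)


def run_mapreduce(
    documents: Iterable[str],
    mapper: Callable[[str], Iterator[Tuple[str, int]]],
    reducer: Callable[[str, Iterable[int]], Tuple[str, int]],
) -> List[Tuple[str, int]]:
    """Hypothetical local driver: map, group by key (shuffle), then reduce."""
    # Shuffle: collect all values emitted by the mappers under their key.
    groups: Dict[str, List[int]] = defaultdict(list)
    for document in documents:
        for key, value in mapper(document):
            groups[key].append(value)
    # Reduce every key once; keys are sorted to keep the output deterministic.
    return [reducer(key, groups[key]) for key in sorted(groups)]


if __name__ == "__main__":
    docs = ["the quick brown fox", "the lazy dog", "The end"]
    for word, count in run_mapreduce(docs, map_fn, reduce_fn):
        print(f"{word}\t{count}")
```

Most of the exercises above fit the same skeleton: only the key and value emitted by the mapper and the aggregation done in the reducer change. In a real cluster the grouping step is performed by the framework across machines, so the mapper and reducer must not rely on shared state.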