Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-5 [2012/01/24 19:04]
straka created
courses:mapreduce-tutorial:step-5 [2012/01/31 15:56] (current)
straka
Line 1: Line 1:
-====== MapReduce Tutorial : ======
+====== MapReduce Tutorial : Basic reducer ======
 + 
 +The interesting part of a Hadoop job is the //reducer//. After all mappers have produced their (key, value) pairs, the ''reduce'' function is called once for every unique key, together with all values associated with that key. The ''reduce'' function can output (key, value) pairs, which are written to disk.
 +
 +The ''reduce'' method is similar to ''map'', but instead of a single value it receives an iterator (an instance of ''Hadoop::Runner::ValueIterator'') that enumerates all values associated with the key:
 + 
 +<file perl> 
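 +package My::Mapper;
 +use Moose;
 +with 'Hadoop::Mapper';
 +
 +# Identity mapper: passes every (key, value) pair through unchanged.
 +sub map {
 +  my ($self, $key, $value, $context) = @_;
 +
 +  $context->write($key, $value);
 +}
 +
 +package My::Reducer;
 +use Moose;
 +with 'Hadoop::Reducer';
 +
 +# Called once per unique key; $values iterates over all values of that key.
 +sub reduce {
 +  my ($self, $key, $values, $context) = @_;
 +
 +  # Identity reducer: writes out every value together with its key.
 +  while ($values->next) {
 +    $context->write($key, $values->value);
 +  }
 +}
 +
 +package main;
 +use Hadoop::Runner;
 +
 +# Create the job from the mapper and the reducer, then run it.
 +my $runner = Hadoop::Runner->new(
 +  mapper => My::Mapper->new(),
 +  reducer => My::Reducer->new());
 +
 +$runner->run();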
 +</file> 
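 +
 +To make the calling convention concrete, here is a minimal sketch in plain Perl that imitates what happens between the mappers and the reducer. It does not use ''Hadoop::Runner'' at all: the ''@mapped'' list, the ''%grouped'' hash and the summing ''reduce'' function are hypothetical stand-ins, and a plain array reference takes the place of the value iterator.
 +
 +<file perl>
 +use strict;
 +use warnings;
 +
 +# Hypothetical stand-in for the (key, value) pairs emitted by all mappers.
 +my @mapped = (['a', 1], ['b', 2], ['b', 3], ['a', 1]);
 +
 +# Group the values by key, as the framework does before calling reduce.
 +my %grouped;
 +push @{ $grouped{$_->[0]} }, $_->[1] for @mapped;
 +
 +# Hypothetical reduce: sums all values of one key.
 +sub reduce {
 +  my ($key, $values) = @_;
 +  my $sum = 0;
 +  $sum += $_ for @$values;
 +  return "$key\t$sum";
 +}
 +
 +# Call reduce once per unique key (keys are visited in sorted order here).
 +my @output = map { reduce($_, $grouped{$_}) } sort keys %grouped;
 +print "$_\n" for @output;
 +</file>
 +
 +Each key reaches ''reduce'' exactly once, no matter how many pairs with that key were emitted.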
 + 
 +As before, Hadoop silently handles failures. Even a mapper that finished successfully may have to be executed again if the machine storing its output data gets disconnected from the network.
 + 
 +===== Types of keys and values ===== 
 + 
 +In the Perl API, keys and values are currently both strings, which are stored and loaded in UTF-8 and compared lexicographically. If you need more complex structures, you have to serialize and deserialize them yourself.
 + 
 +The Java API offers a wide range of types, including user-defined types, to be used for keys and values. 
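 +
 +For the Perl API, the two consequences of string-only keys and values can be sketched in plain Perl. This is only an illustration under stated assumptions: the padding width of 6, the tab separator and the sample article name are hypothetical choices, and the tab separator only works if the fields themselves contain no tab.
 +
 +<file perl>
 +use strict;
 +use warnings;
 +
 +# String comparison is lexicographic, so "10" and "100" sort before "9".
 +my @as_strings = sort { $a cmp $b } (9, 10, 100);
 +
 +# Zero-padding (hypothetical width 6) makes string order match numeric order.
 +my @padded = sort { $a cmp $b } map { sprintf('%06d', $_) } (9, 10, 100);
 +
 +# Hypothetical serialization of an (article, position) pair into one string...
 +my $value = join "\t", 'Praha', 42;
 +
 +# ...and the matching deserialization on the receiving side.
 +my ($article, $position) = split /\t/, $value;
 +
 +print "@as_strings\n@padded\n$article at $position\n";
 +</file>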
 + 
 +===== Exercise 1 ===== 
 + 
 +Run a Hadoop job on ''/home/straka/wiki/cs-text-small'' that counts the occurrences of every word in the article texts. You can download the template {{:courses:mapreduce-tutorial:step-5-exercise1.txt|step-5-exercise1.pl}} and execute it.
 +  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-exercise1.txt' -O 'step-5-exercise1.pl' 
 +  # NOW EDIT THE FILE 
 +  # $EDITOR step-5-exercise1.pl 
 +  rm -rf step-5-out-ex1; perl step-5-exercise1.pl /home/straka/wiki/cs-text-medium/ step-5-out-ex1 
 +  less step-5-out-ex1/part-* 
 + 
 +==== Solution ==== 
 +You can also download the solution {{:courses:mapreduce-tutorial:step-5-solution1.txt|step-5-solution1.pl}} and check the correct output. 
 +  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-5-solution1.pl' 
 +  # NOW VIEW THE FILE 
 +  # $EDITOR step-5-solution1.pl 
 +  rm -rf step-5-out-sol1; perl step-5-solution1.pl /home/straka/wiki/cs-text-medium/ step-5-out-sol1 
 +  less step-5-out-sol1/part-* 
 + 
 + 
 +===== Exercise 2 ===== 
 + 
 +Run a Hadoop job on ''/home/straka/wiki/cs-text-small'' that generates an inverted index. An inverted index contains, for each word, all its //occurrences//, where each occurrence is a pair (article of occurrence, position of occurrence). You can download the template {{:courses:mapreduce-tutorial:step-5-exercise2.txt|step-5-exercise2.pl}} and execute it.
 +  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-exercise2.txt' -O 'step-5-exercise2.pl' 
 +  # NOW EDIT THE FILE 
 +  # $EDITOR step-5-exercise2.pl 
 +  rm -rf step-5-out-ex2; perl step-5-exercise2.pl /home/straka/wiki/cs-text-small/ step-5-out-ex2 
 +  less step-5-out-ex2/part-* 
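 +
 +The shape of an inverted index can be illustrated on a toy input in plain Perl, without Hadoop. Everything here is an assumption made for illustration: the two sample articles are invented, a position is taken to be the zero-based index of the word within its article, and the ''article:position'' output format is arbitrary.
 +
 +<file perl>
 +use strict;
 +use warnings;
 +
 +# Hypothetical toy input: article name => article text.
 +my %articles = (
 +  A => 'to be or not to be',
 +  B => 'to do',
 +);
 +
 +# The index maps each word to a list of [article, position] pairs.
 +my %index;
 +for my $article (sort keys %articles) {
 +  my @words = split ' ', $articles{$article};
 +  for my $position (0 .. $#words) {
 +    push @{ $index{$words[$position]} }, [$article, $position];
 +  }
 +}
 +
 +# Print one line per word, listing all its occurrences.
 +for my $word (sort keys %index) {
 +  my @occurrences = map { join(':', @$_) } @{ $index{$word} };
 +  print "$word\t@occurrences\n";
 +}
 +</file>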
 + 
 +==== Solution ==== 
 +You can also download the solution {{:courses:mapreduce-tutorial:step-5-solution2.txt|step-5-solution2.pl}} and check the correct output. 
 +  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution2.txt' -O 'step-5-solution2.pl' 
 +  # NOW VIEW THE FILE 
 +  # $EDITOR step-5-solution2.pl 
 +  rm -rf step-5-out-sol2; perl step-5-solution2.pl /home/straka/wiki/cs-text-small/ step-5-out-sol2 
 +  less step-5-out-sol2/part-* 
 + 
 +---- 
 + 
 +<html> 
 +<table style="width:100%"> 
 +<tr> 
 +<td style="text-align:left; width: 33%; "></html>[[step-4|Step 4]]: Counters.<html></td> 
 +<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td> 
 +<td style="text-align:right; width: 33%; "></html>[[step-6|Step 6]]: Running on cluster.<html></td> 
 +</tr> 
 +</table> 
 +</html>
