Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-5 [2012/01/28 12:52]
majlis Added links to previous and next chapter.
courses:mapreduce-tutorial:step-5 [2012/01/31 15:56] (current)
straka
Line 3: Line 3:
 The interesting part of a Hadoop job is the //reducer//: after all mappers have produced their (key, value) pairs, the ''reduce'' function is called once for every unique key, together with all values of that key. The ''reduce'' function can output (key, value) pairs, which are written to disk.
 
-The ''reduce'' is similar to ''map'', but instead of one value it gets an iterator, which enumerates all values associated with the key:
+The ''reduce'' function is similar to ''map'', but instead of a single value it gets an iterator (an instance of ''Hadoop::Runner::ValueIterator''), which enumerates all values associated with the key:
  
 <file perl>
-package Mapper;
+package My::Mapper;
 use Moose;
 with 'Hadoop::Mapper';
Line 16: Line 16:
 }
 
-package Reducer;
+package My::Reducer;
 use Moose;
 with 'Hadoop::Reducer';
Line 28: Line 28:
 }
 
-package Main;
+package main;
 use Hadoop::Runner;
 
 my $runner = Hadoop::Runner->new(
-  mapper => Mapper->new(),
-  reducer => Reducer->new());
+  mapper => My::Mapper->new(),
+  reducer => My::Reducer->new());
 
 $runner->run();
Line 39: Line 39:
  
 As before, Hadoop silently handles failures. Even a mapper that finished successfully may need to be executed again if the machine storing its output data gets disconnected from the network.
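
To make the data flow concrete, here is a minimal plain-Perl sketch that runs the map, group-by-key and reduce phases in memory. It does not use the ''Hadoop::Runner'' API: ''make_iterator'', ''map_line'', ''reduce_key'' and ''run_job'' are hypothetical names, and the closure-based iterator only mimics the idea of a value iterator, not the actual ''Hadoop::Runner::ValueIterator'' interface.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical stand-in for a value iterator: a closure that returns the
# next value on each call and undef once the values are exhausted.
# This is NOT the Hadoop::Runner::ValueIterator interface.
sub make_iterator {
    my @values = @_;
    return sub { return shift @values };
}

# Toy map function: emit (word length, word) for every word of a line.
sub map_line {
    my ($line, $emit) = @_;
    $emit->(length($_), $_) for split ' ', $line;
}

# Toy reduce function: called once per unique key with an iterator over
# all values of that key; returns one output (key, value) pair.
sub reduce_key {
    my ($key, $values) = @_;
    my @words;
    while (defined(my $word = $values->())) {
        push @words, $word;
    }
    return ($key, join(',', sort @words));
}

# Run the whole pipeline in memory: map, group by key, reduce.
sub run_job {
    my @lines = @_;
    my %groups;
    map_line($_, sub { my ($k, $v) = @_; push @{ $groups{$k} }, $v }) for @lines;
    # Keys are plain strings and are visited in lexicographic order.
    return map { [ reduce_key($_, make_iterator(@{ $groups{$_} })) ] } sort keys %groups;
}

print join("\t", @$_), "\n" for run_job('to be or not', 'to see a sea');
```

Each printed line holds one key and the reduced value for that key. On a real cluster the grouping step is performed by Hadoop between the map and reduce phases, so only the two functions have to be written.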
+
+===== Types of keys and values =====
+
+Currently in the Perl API, both keys and values are strings, which are stored and loaded in UTF-8 and compared lexicographically. If you need more complex structures, you have to serialize and deserialize them yourself.
+
+The Java API offers a wide range of types, including user-defined types, to be used for keys and values.
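
One possible way to keep a complex structure in a string value is to encode it as JSON. The sketch below uses the core ''JSON::PP'' module; ''encode_value'', ''decode_value'' and ''encode_numeric_key'' are hypothetical helpers, the 8-digit key width is an arbitrary assumption, and it has not been checked against the Perl API described here. It also shows why numeric keys may need zero-padding when they are compared as strings.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use JSON::PP;

# canonical() sorts hash keys, so equal structures give equal strings.
my $json = JSON::PP->new->canonical;

# Turn a Perl data structure into a string value.
sub encode_value {
    my ($data) = @_;
    return $json->encode($data);
}

# Restore the data structure from its string form.
sub decode_value {
    my ($string) = @_;
    return $json->decode($string);
}

# Keys are compared as strings, so "10" sorts before "9". Zero-padding
# numeric keys to a fixed width (8 digits here, an arbitrary choice)
# makes the lexicographic order agree with the numeric one.
sub encode_numeric_key {
    my ($number) = @_;
    return sprintf('%08d', $number);
}

my $value = encode_value({ article => 'Praha', positions => [3, 17] });
print "$value\n";

my $back = decode_value($value);
print "second position: $back->{positions}[1]\n";

my @keys = map { encode_numeric_key($_) } 9, 10;
print join(' ', sort @keys), "\n";
```

Any other reversible string encoding would serve the same purpose; JSON is used here only because it ships with Perl and is easy to read in the job output.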
  
 ===== Exercise 1 =====
Line 44: Line 50:
 Run a Hadoop job on ''/home/straka/wiki/cs-text-small'' that counts the occurrences of every word in the article texts. You can download the template {{:courses:mapreduce-tutorial:step-5-exercise1.txt|step-5-exercise1.pl}} and execute it.
   wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-exercise1.txt' -O 'step-5-exercise1.pl'
-  rm -rf step-5-out-ex1; perl step-5-exercise1.pl run /home/straka/wiki/cs-text-medium/ step-5-out-ex1
+  # NOW EDIT THE FILE
+  # $EDITOR step-5-exercise1.pl
+  rm -rf step-5-out-ex1; perl step-5-exercise1.pl /home/straka/wiki/cs-text-medium/ step-5-out-ex1
   less step-5-out-ex1/part-*
  
Line 50: Line 58:
 You can also download the solution {{:courses:mapreduce-tutorial:step-5-solution1.txt|step-5-solution1.pl}} and check the correct output.
   wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution1.txt' -O 'step-5-solution1.pl'
-  rm -rf step-5-out-sol1; perl step-5-solution1.pl run /home/straka/wiki/cs-text-medium/ step-5-out-sol1
+  # NOW VIEW THE FILE
+  # $EDITOR step-5-solution1.pl
+  rm -rf step-5-out-sol1; perl step-5-solution1.pl /home/straka/wiki/cs-text-medium/ step-5-out-sol1
   less step-5-out-sol1/part-*
  
Line 58: Line 68:
 Run a Hadoop job on ''/home/straka/wiki/cs-text-small'' that generates an inverted index. An inverted index contains, for each word, all its //occurrences//, where each occurrence is a pair (article of occurrence, position of occurrence). You can download the template {{:courses:mapreduce-tutorial:step-5-exercise2.txt|step-5-exercise2.pl}} and execute it.
   wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-exercise2.txt' -O 'step-5-exercise2.pl'
-  rm -rf step-5-out-ex2; perl step-5-exercise2.pl run /home/straka/wiki/cs-text-tiny/ step-5-out-ex2
+  # NOW EDIT THE FILE
+  # $EDITOR step-5-exercise2.pl
+  rm -rf step-5-out-ex2; perl step-5-exercise2.pl /home/straka/wiki/cs-text-small/ step-5-out-ex2
   less step-5-out-ex2/part-*
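
The definition of an inverted index can be illustrated by a small in-memory sketch in plain Perl. It is not the exercise solution and does not use the ''Hadoop::Runner'' API; the function names, the ''article:position'' value encoding and the 0-based word positions are assumptions made for illustration only.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Map step: emit (word, "article:position") for every word of an article.
# Positions are 0-based word offsets, an assumption of this sketch.
sub map_article {
    my ($article, $text) = @_;
    my @words = split ' ', $text;
    return map { [ $words[$_], "$article:$_" ] } 0 .. $#words;
}

# Reduce step: join all occurrences of one word into a single string value.
sub reduce_word {
    my ($word, @occurrences) = @_;
    return ($word, join(' ', @occurrences));
}

# Group the mapped pairs by word and reduce every group.
sub build_index {
    my %articles = @_;
    my %occurrences;
    for my $article (sort keys %articles) {
        for my $pair (map_article($article, $articles{$article})) {
            my ($word, $occurrence) = @$pair;
            push @{ $occurrences{$word} }, $occurrence;
        }
    }
    my %index;
    for my $word (keys %occurrences) {
        my ($key, $value) = reduce_word($word, @{ $occurrences{$word} });
        $index{$key} = $value;
    }
    return %index;
}

my %index = build_index(A => 'new york new', B => 'york city');
print "$_\t$index{$_}\n" for sort keys %index;
```

Because values are plain strings, each (article, position) pair has to be flattened into text; the colon-separated form used here is just one option.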
  
Line 64: Line 76:
 You can also download the solution {{:courses:mapreduce-tutorial:step-5-solution2.txt|step-5-solution2.pl}} and check the correct output.
   wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-5-solution2.txt' -O 'step-5-solution2.pl'
-  rm -rf step-5-out-sol2; perl step-5-solution2.pl run /home/straka/wiki/cs-text-tiny/ step-5-out-sol2
+  # NOW VIEW THE FILE
+  # $EDITOR step-5-solution2.pl
+  rm -rf step-5-out-sol2; perl step-5-solution2.pl /home/straka/wiki/cs-text-small/ step-5-out-sol2
   less step-5-out-sol2/part-*
  
