Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-24 [2012/01/27 22:16]
straka
courses:mapreduce-tutorial:step-24 [2012/01/30 15:38]
majlis
Line 3: Line 3:
 We start by going through a simple Hadoop job with Mapper only.
  
-A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is subclass of ''Mapper<Text, Text, Text, Text>''.
+A //mapper// which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Mapper.html|Mapper<Kin, Vin, Kout, Vout>]]. In our case, the mapper is a subclass of ''Mapper<Text, Text, Text, Text>''.
  
 The mapper must define a ''map'' method and may provide ''setup'' and ''cleanup'' methods:
Line 80: Line 80:
 </file> </file>
  
-Remarks:
-  * The filename //must// be the same as the name of the class -- this is enforced by Java compiler.
+==== Remarks ====
+  * The filename //must// be the same as the name of the public top-level class -- this is enforced by the Java compiler. The top-level class can, however, contain any number of nested classes.
   * Multiple jobs can be submitted from one class, either in sequence or in parallel.
   * A mismatch of types is usually detected by the compiler, but sometimes it is detected only at runtime. If that happens, an exception is raised and the program crashes. For example, the default key output class is ''LongWritable'' -- if ''Text'' was not specified, the program would crash.
 +  * **VIM users**: The code completion plugin does not complete the ''context'' variable. That is because it does not understand that ''Context'' is used as an abbreviation for ''MapContext<Text, Text, Text, Text>''. If the type ''MapContext<Text, Text, Text, Text>'' is used instead of ''Context'', the code compiles and code completion on ''context'' works.
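
The remarks about nested classes and about ''Context'' being an abbreviation can be illustrated without Hadoop. The following is only a sketch in plain Java: ''MapContext'' and ''Mapper'' here are simplified stand-ins written for this example, not the Hadoop classes, and the names ''ContextAbbreviationSketch'', ''TheMapper'' and ''run'' are hypothetical. It shows one top-level class holding several nested classes, and a nested ''Context'' class that is merely a subclass of the fully parameterized ''MapContext''.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java sketch: MapContext and Mapper below are simplified stand-ins,
// NOT the Hadoop classes of the same name.
class ContextAbbreviationSketch {

    /** Stand-in for a map context: it only records what was written. */
    static class MapContext<KIN, VIN, KOUT, VOUT> {
        final List<String> written = new ArrayList<>();

        void write(KOUT key, VOUT value) {
            written.add(key + "\t" + value);
        }
    }

    /** Stand-in for a mapper base class with a nested Context class. */
    static class Mapper<KIN, VIN, KOUT, VOUT> {
        /** Context is a shorter name for MapContext<KIN, VIN, KOUT, VOUT>. */
        class Context extends MapContext<KIN, VIN, KOUT, VOUT> {}

        void map(KIN key, VIN value, Context context) {}
    }

    /** A nested mapper class living inside the single top-level class. */
    static class TheMapper extends Mapper<String, String, String, String> {
        @Override
        void map(String key, String value, Context context) {
            // A Context can be used wherever the fully spelled-out type is expected.
            MapContext<String, String, String, String> explicit = context;
            explicit.write(key, value);
        }
    }

    /** Runs the mapper on one pair and returns what it wrote. */
    static List<String> run(String key, String value) {
        Mapper<String, String, String, String> mapper = new TheMapper();
        Mapper<String, String, String, String>.Context context = mapper.new Context();
        mapper.map(key, value, context);
        return context.written;
    }

    public static void main(String[] args) {
        System.out.println(run("article", "text"));
    }
}
```

A tool that does not resolve the inherited nested class sees only the name ''Context'' and cannot tell which methods it offers, whereas the spelled-out generic type names them directly.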
  
 ===== Running the job =====
Line 93: Line 94:
 ===== Exercise =====
 Download the ''MapperOnlyHadoopJob.java'', compile it and run it using
-  /net/projects/hadoop/bin/hadoop -r 0 MapperOnlyHadoopJob.jar /home/straka/wiki/cs-text-small outdir
+  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-24?codeblock=1' -O 'MapperOnlyHadoopJob.java'
 +  make -f /net/projects/hadoop/java/Makefile MapperOnlyHadoopJob.jar 
 +  rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop -r 0 MapperOnlyHadoopJob.jar /home/straka/wiki/cs-text-small step-24-out-sol 
 +  less step-24-out-sol/part-*
  
-Mind the ''-r 0'' switch -- specifying ''-r 0'' disable the reducer. If the switch ''-r 0'' was not given, one default reducer ''IdentityReducer'' would be used. The ''IdentityReducer'' outputs every (key, value) pair it is given.
-  * When using ''-r 0', the job runs faster, as the mappers write the output directly to disk. Buth there are as many output files as mappers and the (key, value) pairs are stored in no special order.
-  * When not specifying ''-r 0'' (i.e., using ''-r 1'' with ''IdentityReducer''), the job produces the same (key, value) pairs. But this time they are in one output file, sorted by the key. Of course, the job runs slower in this case.
+Mind the ''-r 0'' switch -- specifying ''-r 0'' disables the reducer. If the switch ''-r 0'' was not given, one reducer of the default type ''IdentityReducer'' would be used. The ''IdentityReducer'' outputs every (key, value) pair it is given.
+  * When using ''-r 0'', the job runs faster, as the mappers write the output directly to disk. But there are as many output files as mappers and the (key, value) pairs are stored in no special order.
+  * When not specifying ''-r 0'' (i.e., using ''-r 1'' with ''IdentityReducer''), the job produces the same (key, value) pairs. But this time they are in one output file, sorted according to the key. Of course, the job runs slower in this case.
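
The difference between the two output layouts can be simulated in plain Java. This is only a sketch of the layout, with no Hadoop classes involved: the class name ''ReducerSwitchSketch'' and its methods are hypothetical, each inner list stands for the pairs emitted by one mapper, and keys are compared as Java strings, which is an assumption that need not match the ordering Hadoop uses for ''Text''.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

// Plain-Java simulation of the output layout only; no Hadoop classes are used.
class ReducerSwitchSketch {

    /** Builds one (key, value) pair. */
    static Map.Entry<String, String> pair(String key, String value) {
        return new SimpleEntry<>(key, value);
    }

    /** Like -r 0: each mapper's pairs form their own output part, in emission order. */
    static List<List<Map.Entry<String, String>>> withoutReducer(
            List<List<Map.Entry<String, String>>> mapperOutputs) {
        List<List<Map.Entry<String, String>>> parts = new ArrayList<>();
        for (List<Map.Entry<String, String>> output : mapperOutputs) {
            parts.add(new ArrayList<>(output));
        }
        return parts;
    }

    /** Like one identity reducer: the same pairs, in a single part sorted by key. */
    static List<Map.Entry<String, String>> withOneIdentityReducer(
            List<List<Map.Entry<String, String>>> mapperOutputs) {
        List<Map.Entry<String, String>> part = new ArrayList<>();
        for (List<Map.Entry<String, String>> output : mapperOutputs) {
            part.addAll(output);
        }
        // Sorting stands in for the shuffle phase that precedes the reducer.
        part.sort(Map.Entry.comparingByKey());
        return part;
    }

    public static void main(String[] args) {
        List<List<Map.Entry<String, String>>> mapperOutputs = Arrays.asList(
                Arrays.asList(pair("c", "3"), pair("a", "1")),
                Arrays.asList(pair("b", "2")));
        System.out.println("without reducer: " + withoutReducer(mapperOutputs));
        System.out.println("one identity reducer: " + withOneIdentityReducer(mapperOutputs));
    }
}
```

The extra merge and sort step in ''withOneIdentityReducer'' corresponds to the work that makes the job with a reducer slower than the mapper-only job.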
 + 
 + 
 + 
 +---- 
 + 
 +<html> 
 +<table style="width:100%"> 
 +<tr> 
 +<td style="text-align:left; width: 33%; "></html>[[step-23|Step 23]]: Predefined formats and types.<html></td> 
 +<td style="text-align:center; width: 33%; "></html>[[.|Overview]]<html></td> 
 +<td style="text-align:right; width: 33%; "></html>[[step-25|Step 25]]: Reducers, combiners and partitioners.<html></td> 
 +</tr> 
 +</table> 
 +</html>
  
