====== MapReduce Tutorial : Mappers, running Java Hadoop jobs, counters ======

We start by going through a simple Hadoop job with a Mapper only.
===== Running the job =====
The official way of running Hadoop jobs is to use the ''/SGE/HADOOP/active/bin/hadoop'' script. Jobs submitted through this script can be configured using Hadoop properties only. Therefore a wrapper script is provided, with options similar to the Perl API runner:
  * ''/net/projects/hadoop/bin/hadoop job.jar [-Dname=value -Dname=value ...] input_path output_path'' -- executes the given job locally in a single thread, which is useful for debugging.
  * ''/net/projects/hadoop/bin/hadoop job.jar -jt cluster_master [-r number_of_reducers] [-Dname=value -Dname=value ...] input_path output_path'' -- submits the job to the given ''cluster_master''.
  * ''/net/projects/hadoop/bin/hadoop job.jar -c number_of_machines [-w secs_to_wait_after_job_finishes] [-r number_of_reducers] [-Dname=value -Dname=value ...] input_path output_path'' -- creates a new cluster with the specified number of machines, executes the given job on it, and then waits the specified number of seconds before stopping the cluster.

===== Exercise 1 =====
Download ''MapperOnlyHadoopJob.java'', compile it and run it using
  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_export/code/courses:mapreduce-tutorial:step-24?codeblock=1' -O 'MapperOnlyHadoopJob.java'
  make -f /net/projects/hadoop/java/Makefile MapperOnlyHadoopJob.jar
  rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop MapperOnlyHadoopJob.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
  less step-24-out-sol/part-*

  * When ''-r 0'' is not specified (i.e., ''-r 1'' with an ''IdentityReducer'' is used), the job produces the same (key, value) pairs, but this time they end up in one output file, sorted by key. Of course, the job runs slower in this case.

===== Counters =====

As in the Perl API, a mapper (or a reducer) can increment various counters by using ''context.getCounter("Group", "Name").increment(value)'':
<code java>
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  ...
  context.getCounter("Group", "Name").increment(1);
  ...
}
</code>
The ''getCounter'' method returns a [[http://hadoop.apache.org/common/docs/r1.0.0/api/org/apache/hadoop/mapreduce/Counter.html|Counter]] object, so if a counter is incremented frequently, ''getCounter'' needs to be called only once:
<code java>
public void map(Text key, Text value, Context context) throws IOException, InterruptedException {
  ...
  Counter words = context.getCounter("Mapper", "Number of words");
  for (String word : value.toString().split("\\W+")) {
    ...
    words.increment(1);
  }
}
</code>
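
The counting pattern above can be tried outside Hadoop, since the tokenization is plain Java. A minimal sketch (the class name is illustrative, not part of the tutorial sources) that counts words with the same ''split("\\W+")'' call, using an ordinary ''long'' where the mapper would use a ''Counter'':
<code java>
public class WordCountSketch {
  // Count the non-empty tokens produced by the same regex the mapper uses.
  static long countWords(String line) {
    long words = 0;
    for (String word : line.split("\\W+")) {
      if (!word.isEmpty()) words++;   // split can emit one empty leading token
    }
    return words;
  }

  public static void main(String[] args) {
    System.out.println(countWords("Hello, world of MapReduce!"));   // prints 4
  }
}
</code>
Note the empty-token check: when the input starts with a non-word character, ''split("\\W+")'' yields a leading empty string, so a counter incremented unconditionally would overcount.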

===== Exercise 2 =====

Run a Hadoop job on ''/home/straka/wiki/cs-text-small'' which filters the documents so that only three-letter words remain. Also use counters to compute the histogram of word lengths and the percentage of three-letter words in the documents. You can download the template {{:courses:mapreduce-tutorial:step-24.txt|ThreeLetterWords.java}} and execute it.

  wget --no-check-certificate 'https://wiki.ufal.ms.mff.cuni.cz/_media/courses:mapreduce-tutorial:step-24.txt' -O 'ThreeLetterWords.java'
  # NOW VIEW THE FILE
  # $EDITOR ThreeLetterWords.java
  make -f /net/projects/hadoop/java/Makefile ThreeLetterWords.jar
  rm -rf step-24-out-sol; /net/projects/hadoop/bin/hadoop ThreeLetterWords.jar -r 0 /home/straka/wiki/cs-text-small step-24-out-sol
  less step-24-out-sol/part-*

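The counting part of the exercise can be prototyped without Hadoop. A plain-Java sketch (class and method names are illustrative, not taken from ''ThreeLetterWords.java'') of the word-length histogram and the derived percentage of three-letter words:
<code java>
import java.util.Map;
import java.util.TreeMap;

public class WordLengthStats {
  // Histogram of word lengths, using the same \W+ tokenization as the mapper;
  // in the real job each length would be a separate counter in one group.
  static Map<Integer, Long> histogram(String text) {
    Map<Integer, Long> hist = new TreeMap<>();
    for (String word : text.split("\\W+")) {
      if (word.isEmpty()) continue;              // skip the empty leading token
      hist.merge(word.length(), 1L, Long::sum);
    }
    return hist;
  }

  // Percentage of three-letter words, computed from the histogram totals.
  static double threeLetterPercentage(Map<Integer, Long> hist) {
    long total = 0;
    for (long count : hist.values()) total += count;
    return total == 0 ? 0.0 : 100.0 * hist.getOrDefault(3, 0L) / total;
  }

  public static void main(String[] args) {
    Map<Integer, Long> hist = histogram("the cat sat on a mat");
    System.out.println(hist);                        // prints {1=1, 2=1, 3=4}
    System.out.println(threeLetterPercentage(hist)); // 4 of 6 words, i.e. about 66.7
  }
}
</code>
In the Hadoop job itself, the histogram would live in counters, e.g. incrementing ''context.getCounter("Histogram", Integer.toString(word.length()))'' for every word, and the percentage would be computed from the final counter values after the job finishes.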
----