<?xml version="1.0" encoding="UTF-8"?>
<!-- generator="FeedCreator 1.8" -->
<?xml-stylesheet href="https://wiki.ufal.ms.mff.cuni.cz/lib/exe/css.php?s=feed" type="text/css"?>
<rdf:RDF
    xmlns="http://purl.org/rss/1.0/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:slash="http://purl.org/rss/1.0/modules/slash/"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
    <channel rdf:about="https://wiki.ufal.ms.mff.cuni.cz/feed.php">
        <title>ufal wiki courses:mapreduce-tutorial</title>
        <description></description>
        <link>https://wiki.ufal.ms.mff.cuni.cz/</link>
        <image rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/lib/tpl/ufal/images/favicon.ico" />
        <dc:date>2026-04-20T06:25:45+00:00</dc:date>
        <items>
            <rdf:Seq>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:hadoop-job-overview?rev=1328505109&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:if-things-go-wrong?rev=1328532923&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:introduction?rev=1326661331&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:making-your-job-configurable?rev=1328474721&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:managing-a-hadoop-cluster?rev=1360333511&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:perl-api?rev=1327999094&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:running-jobs?rev=1360330413&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-1?rev=1327933553&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-2?rev=1327849433&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-3?rev=1327999229&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-4?rev=1327999239&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-5?rev=1328021763&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-6?rev=1328532937&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-7?rev=1360330601&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-8?rev=1328021744&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-9?rev=1327999326&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-10?rev=1327999119&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-11?rev=1327999143&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-12?rev=1327999154&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-13?rev=1328021679&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-14?rev=1328022505&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-15?rev=1327851612&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-16?rev=1328531390&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-21?rev=1328005139&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-22?rev=1328003187&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-23?rev=1328016817&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-24?rev=1328023508&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-25?rev=1328019157&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-26?rev=1328017277&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-27?rev=1328017146&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-28?rev=1328465421&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-29?rev=1328465656&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-30?rev=1328010077&amp;do=diff"/>
                <rdf:li rdf:resource="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-31?rev=1328536369&amp;do=diff"/>
            </rdf:Seq>
        </items>
    </channel>
    <image rdf:about="https://wiki.ufal.ms.mff.cuni.cz/lib/tpl/ufal/images/favicon.ico">
        <title>ufal wiki</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/</link>
        <url>https://wiki.ufal.ms.mff.cuni.cz/lib/tpl/ufal/images/favicon.ico</url>
    </image>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:hadoop-job-overview?rev=1328505109&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-06T06:11:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:hadoop-job-overview</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:hadoop-job-overview?rev=1328505109&amp;do=diff</link>
        <description>MapReduce Tutorial : Hadoop job overview

A regular Hadoop job consists of:

	*  [required] a mapper -- processes input (key, value) pairs, produces (key, value) pairs. There can be multiple mappers: each file is divided into (by default 32MB) splits and each split is processed by one mapper. Script</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:if-things-go-wrong?rev=1328532923&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-06T13:55:23+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:if-things-go-wrong</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:if-things-go-wrong?rev=1328532923&amp;do=diff</link>
        <description>MapReduce Tutorial : If things go wrong

A lot can go wrong in the process of creating cluster and submitting the Hadoop job:

	*  Hadoop::Runner.pm module not found: The Perl Hadoop package is not configured, see Setting the environment.
	*  ipc.Client: Retrying connect to server: IP_ADDRESS:PORT. Already tried ? time(s)</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:introduction?rev=1326661331&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-15T22:02:11+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:introduction</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:introduction?rev=1326661331&amp;do=diff</link>
        <description>Large data processing using MapReduce

For an introduction, it is best to read the original paper.
There are also Czech slides (up to slide 45).

There are nice slides from the three-day course available at &lt;http://sites.google.com/site/mriap2008/lectures&gt;.
I would suggest to start with &lt;http://sites.google.com/site/mriap2008/intro_to_mapreduce.pdf&gt; .

Now is good time to solve the following exercises:</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:making-your-job-configurable?rev=1328474721&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-05T21:45:21+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:making-your-job-configurable</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:making-your-job-configurable?rev=1328474721&amp;do=diff</link>
        <description>MapReduce Tutorial : Making your job configurable

Sometimes it is desirable for a Hadoop job to be configurable without recompiling/rewriting the source. This can be achieved:

	*  Java: use Hadoop properties:
		*  when running the job, use /net/projects/hadoop/bin/hadoop job.jar -Dname1=value1 -Dname2=value2</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:managing-a-hadoop-cluster?rev=1360333511&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2013-02-08T15:25:11+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:managing-a-hadoop-cluster</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:managing-a-hadoop-cluster?rev=1360333511&amp;do=diff</link>
        <description>MapReduce Tutorial : Managing a Hadoop cluster

Hadoop clusters can be created and stopped dynamically, using the SGE cluster. A Hadoop cluster consists of one jobtracker (master of the cluster) and multiple tasktrackers. The cluster is identified by its jobtracker. The jobtracker listens on two ports</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:perl-api?rev=1327999094&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:38:14+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:perl-api</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:perl-api?rev=1327999094&amp;do=diff</link>
        <description>MapReduce Tutorial - Perl API

Hadoop::Runner


package Hadoop::Runner;
use Moose;

has 'mapper' =&gt; (does =&gt; 'Hadoop::Mapper', required =&gt; 1);
has 'reducer' =&gt; (does =&gt; 'Hadoop::Reducer');
has 'combiner' =&gt; (does =&gt; 'Hadoop::Reducer');
has 'partitioner' =&gt; (does =&gt; 'Hadoop::Partitioner');

has 'input_format' =&gt; (isa =&gt; 'InputFormat', default =&gt; 'TextInputFormat');
has 'output_format' =&gt; (isa =&gt; 'OutputFormat', default =&gt; 'TextOutputFormat');
has 'output_compression' =&gt; (isa =&gt; 'Bool', default =&gt;…</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:running-jobs?rev=1360330413&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2013-02-08T14:33:33+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:running-jobs</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:running-jobs?rev=1360330413&amp;do=diff</link>
        <description>MapReduce Tutorial : Running jobs

The input of a Hadoop job is either a file, or a directory. In the latter case all files in the directory are processed.

The output of a Hadoop job must be a directory, which does not exist.

Running jobs
  Command  Run</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-1?rev=1327933553&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-30T15:25:53+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-1</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-1?rev=1327933553&amp;do=diff</link>
        <description>MapReduce Tutorial : Setting the environment

Requirements

The tutorial expects you to be logged to a computer in the UFAL cluster and be able to submit jobs using SGE. In this environment, Hadoop is installed in /SGE/HADOOP/active.

To use the Perl</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-2?rev=1327849433&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-29T16:03:53+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-2</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-2?rev=1327849433&amp;do=diff</link>
        <description>MapReduce tutorial : Input and output format, testing data.

The MapReduce framework is frequently using (key, value) pairs. These pairs can be read from a file and written to a file and there are several formats available.

Input formats

	*  TextInputFormat</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-3?rev=1327999229&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:40:29+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-3</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-3?rev=1327999229&amp;do=diff</link>
        <description>MapReduce Tutorial : Basic mapper

The simplest Hadoop job consists of a mapper only.  The input data is divided in several parts, each processed by an independent mapper, and the results are collected in one directory, one file per mapper.

The Hadoop framework silently handles failures. If a mapper task fails, another is executed and the input of the failed attempt is discarded.</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-4?rev=1327999239&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:40:39+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-4</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-4?rev=1327999239&amp;do=diff</link>
        <description>MapReduce Tutorial : Counters

Sometimes it is useful to count events differently than outputting them as (key, value) pairs. For that reason Hadoop offers simple counter framework.

Hadoop maintains a collection of pre-defined and user-defined counters. Every counter is identified by its group name and counter name. The group name and counter name is an arbitrary string</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-5?rev=1328021763&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T15:56:03+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-5</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-5?rev=1328021763&amp;do=diff</link>
        <description>MapReduce Tutorial : Basic reducer

The interesting part of a Hadoop job is the reducer -- after all mappers produce the (key, value) pairs, for every unique key and all its values a reduce function is called. The reduce function can output (key, value) pairs, which are written to disk.</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-6?rev=1328532937&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-06T13:55:37+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-6</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-6?rev=1328532937&amp;do=diff</link>
        <description>MapReduce Tutorial : Running on cluster

Probably the most important feature of MapReduce is to run computations distributively.

So far all our Hadoop jobs were executed locally. But all of them can be executed on multiple machines. It suffices to add parameter</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-7?rev=1360330601&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2013-02-08T14:36:41+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-7</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-7?rev=1360330601&amp;do=diff</link>
        <description>MapReduce Tutorial : Dynamic Hadoop cluster for several computations

When multiple Hadoop jobs should be executed, it is better to reuse the cluster instead of allocating a new one for every computation.

A cluster can be created using
/net/projects/hadoop/bin/hadoop-cluster -c number_of_machines -w sec_to_wait_after_all_jobs_completed</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-8?rev=1328021744&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T15:55:44+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-8</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-8?rev=1328021744&amp;do=diff</link>
        <description>MapReduce Tutorial : Multiple mappers, reducers and partitioning

A Hadoop job, which is expected to run on many computers at the same time, needs to use multiple mappers and reducers. It is possible to control these numbers to some degree.

Multiple mappers</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-9?rev=1327999326&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:42:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-9</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-9?rev=1327999326&amp;do=diff</link>
        <description>MapReduce Tutorial : Hadoop properties

We have controlled the Hadoop jobs using the Perl API so far, which is quite limited.

The Hadoop itself uses many configuration options. The options can be set on command line using the -Dname=value syntax:
perl script.pl [-jt cluster_master | -c cluster_size [-w sec_to_wait]] [-r number_of_reducers] [-Dname=value -Dname=value ...] input output_path</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-10?rev=1327999119&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:38:39+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-10</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-10?rev=1327999119&amp;do=diff</link>
        <description>MapReduce Tutorial : Combiners

Sometimes the reduce is a binary operation, which is associative and commutative, e.g. +. In that case it is inefficient to produce all the (key, value) pairs in the mappers and send them through the network.

Instead, reducer can be executed right after the map, on</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-11?rev=1327999143&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:39:03+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-11</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-11?rev=1327999143&amp;do=diff</link>
        <description>MapReduce Tutorial : Initialization and cleanup of MR tasks, performance of combiners

During the mapper or reducer task execution the following steps take place:

	*  Perl script is executed in the current directory, ie. in the directory where the job was executed / submitted from.</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-12?rev=1327999154&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T09:39:14+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-12</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-12?rev=1327999154&amp;do=diff</link>
        <description>MapReduce Tutorial : Additional output from mappers and reducers

Sometimes it would be useful to create output files manually in reducers -- either multiple files are needed per reducer, or a specific file format is desired.

Problem is that Hadoop framework can spawn several task attempts for the same reducer task</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-13?rev=1328021679&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T15:54:39+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-13</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-13?rev=1328021679&amp;do=diff</link>
        <description>MapReduce Tutorial : Exercise - sorting

You are given data consisting of (31-bit integer, string data) pairs. These are available in plain text format:
 Path  Size  /net/projects/hadoop/examples/inputs/numbers-small  3MB  /net/projects/hadoop/examples/inputs/numbers-medium</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-14?rev=1328022505&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T16:08:25+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-14</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-14?rev=1328022505&amp;do=diff</link>
        <description>MapReduce Tutorial : Exercise - N-gram language model

For a given N create a simple N-gram language model. You can start experimenting on the following data:
 Path  Size  /home/straka/wiki/cs-seq-medium  8MB  /home/straka/wiki/cs-seq  82MB  /home/straka/wiki/en-seq</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-15?rev=1327851612&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-29T16:40:12+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-15</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-15?rev=1327851612&amp;do=diff</link>
        <description>MapReduce Tutorial : K-means clustering

Implement the K-means clustering algorithm. You can use the following data:
 Path  Number of points  Number of dimensions  Number of clusters  /net/projects/hadoop/examples/inputs/points-small  10000  50  50  /net/projects/hadoop/examples/inputs/points-medium</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-16?rev=1328531390&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-06T13:29:50+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-16</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-16?rev=1328531390&amp;do=diff</link>
        <description>MapReduce Tutorial: Implementing iterative MapReduce jobs faster using All-Reduce

Implementing an iterative computation by running a separate Hadoop job for every iteration is usually not very efficient (although it is fault tolerant).

If we have enough machines that all input data fits into memory, we can implement iterative computation like this:</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-21?rev=1328005139&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T11:18:59+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-21</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-21?rev=1328005139&amp;do=diff</link>
        <description>MapReduce Tutorial : Preparing the environment

To use the Hadoop Java API, you must be able to compile the Java sources with the Hadoop library. An easy way is to use a prepared Makefile:

	*  Create a directory for the Java sources.
	*  Create a Makefile</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-22?rev=1328003187&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T10:46:27+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-22</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-22?rev=1328003187&amp;do=diff</link>
        <description>MapReduce Tutorial : Setting Eclipse

This is not well tested.

If you do not like VIM, you can try using Eclipse as a Java editor. You should

	*  Download /SGE/HADOOP/active/hadoop-core-1.0.1-SNAPSHOT.jar
	*  Download the directory /net/projects/hadoop/java-extensions/classes</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-23?rev=1328016817&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T14:33:37+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-23</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-23?rev=1328016817&amp;do=diff</link>
        <description>MapReduce Tutorial : Predefined formats and types

Currently there are two different Java APIs:

	*  org.apache.hadoop.mapred: This is the original API, which is currently deprecated.
	*  org.apache.hadoop.mapreduce: This is the new API, which we will be using in this tutorial. The only problem is that some library classes have not yet been converted to use the new</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-24?rev=1328023508&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T16:25:08+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-24</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-24?rev=1328023508&amp;do=diff</link>
        <description>MapReduce Tutorial : Mappers, running Java Hadoop jobs, counters

We start by going through a simple Hadoop job with Mapper only.

A mapper which processes (key, value) pairs of types (Kin, Vin) and produces (key, value) pairs of types (Kout, Vout) must be a subclass of</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-25?rev=1328019157&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T15:12:37+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-25</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-25?rev=1328019157&amp;do=diff</link>
        <description>MapReduce Tutorial : Reducers, combiners and partitioners.

A reducer in a Hadoop job must be a subclass of Reducer&lt;Kin, Vin, Kout, Vout&gt;.

As in the Perl API, any reducer can be used as a combiner.

Here is a Hadoop job computing the number of occurrences of all words:


import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.map…</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-26?rev=1328017277&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T14:41:17+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-26</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-26?rev=1328017277&amp;do=diff</link>
        <description>MapReduce Tutorial : Compression and job configuration

Compression

The output files can be compressed using


  FileOutputFormat.setCompressOutput(job, true);


The default compression format is deflate -- raw Zlib compression. Several other compression formats can be selected:</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-27?rev=1328017146&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T14:39:06+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-27</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-27?rev=1328017146&amp;do=diff</link>
        <description>MapReduce Tutorial : Running multiple Hadoop jobs in one source file

The Java API offers the possibility to submit multiple Hadoop jobs in one source file. A job can be submitted either using

	*  job.waitForCompletion -- the job is submitted and the method waits for it to finish (successfully or not).</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-28?rev=1328465421&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-05T19:10:21+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-28</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-28?rev=1328465421&amp;do=diff</link>
        <description>MapReduce Tutorial : Custom data types

An important feature of the Java API is that custom data and format types can be provided. In this step we implement two custom data types.

BERIntWritable

We want to implement BERIntWritable, which is an int stored in the format of</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-29?rev=1328465656&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-05T19:14:16+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-29</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-29?rev=1328465656&amp;do=diff</link>
        <description>MapReduce Tutorial : Custom sorting and grouping comparators.

Custom sorting comparator

The keys are sorted before processed by a reducer, using a
Raw comparator. The default comparator uses the compareTo method provided by the key type, which is a subclass of WritableComparable. Consider for example the following</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-30?rev=1328010077&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-01-31T12:41:17+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-30</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-30?rev=1328010077&amp;do=diff</link>
        <description>MapReduce Tutorial : Custom input formats

Every custom format reading keys of type K and values of type V must subclass InputFormat&lt;K, V&gt;. Usually it is easier to subclass FileInputFormat&lt;K, V&gt; -- the file listing and splitting is then solved by the FileInputFormat itself.

FileAsPathInputFormat</description>
    </item>
    <item rdf:about="https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-31?rev=1328536369&amp;do=diff">
        <dc:format>text/html</dc:format>
        <dc:date>2012-02-06T14:52:49+00:00</dc:date>
        <dc:creator>Anonymous (anonymous@undisclosed.example.com)</dc:creator>
        <title>courses:mapreduce-tutorial:step-31</title>
        <link>https://wiki.ufal.ms.mff.cuni.cz/courses:mapreduce-tutorial:step-31?rev=1328536369&amp;do=diff</link>
        <description>MapReduce Tutorial: Implementing iterative MapReduce jobs faster using All-Reduce

Implementing an iterative computation by running a separate Hadoop job for every iteration is usually not very efficient (although it is fault tolerant).

If we have enough machines that all input data fits into memory, we can implement iterative computation like this:</description>
    </item>
</rdf:RDF>
