Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:mapreduce-tutorial:step-28 [2012/02/05 19:08]
straka
courses:mapreduce-tutorial:step-28 [2012/02/05 19:10] (current)
straka
Line 120: Line 120:
 Imagine you want to create an inverted index. In the index, for each word and document containing the word, all positions of the word in the document have to be stored.
 
-Create a type ''DocWithOccurences<Doctype extends WritableComparable>'' implementing ''WritableComparable''. The type:
+Create a type ''DocWithOccurrences<Doctype extends WritableComparable>'' implementing ''WritableComparable''. The type:
   * stores a document of type ''Doctype''.
   * stores a list of positions of occurrence. The sequence of length //N// should be stored on disk as number //N// followed by //N// numbers -- positions of occurrence. Type ''BERIntWritable'' should be used.
   * is comparable, comparing using the ''Comparable'' interface of ''Doctype''.
-  * has methods ''getDoc'', ''setDoc'', ''getOccurrences'', ''addOccurence'', ''toString''.
+  * has methods ''getDoc'', ''setDoc'', ''getOccurrences'', ''addOccurrence'', ''toString''.
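The type described in the list above could be sketched roughly as follows. This is only an illustration under stated assumptions, not the course's reference solution: it uses plain ''java.io'' streams instead of Hadoop's ''WritableComparable'', fixed-width ''writeInt'' as a stand-in for ''BERIntWritable'', and the helper names ''occurrencesToBytes'' and ''occurrencesFromBytes'' are hypothetical. Serialization of the document itself is left out; in a Hadoop version the document would presumably serialize itself before the positions.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

class DocWithOccurrencesSketch {
    // Stand-in for the exercise's type: D only needs Comparable here,
    // whereas the exercise asks for Hadoop's WritableComparable.
    static class DocWithOccurrences<D extends Comparable<D>>
            implements Comparable<DocWithOccurrences<D>> {
        private D doc;
        private final List<Integer> occurrences = new ArrayList<>();

        DocWithOccurrences(D doc) { this.doc = doc; }

        D getDoc() { return doc; }
        void setDoc(D doc) { this.doc = doc; }
        List<Integer> getOccurrences() { return occurrences; }
        void addOccurrence(int position) { occurrences.add(position); }

        // Hypothetical helper. Layout: the count N, followed by N positions.
        // writeInt (4 bytes each) is an assumption replacing BERIntWritable.
        byte[] occurrencesToBytes() {
            try {
                ByteArrayOutputStream buffer = new ByteArrayOutputStream();
                DataOutputStream out = new DataOutputStream(buffer);
                out.writeInt(occurrences.size());
                for (int position : occurrences) {
                    out.writeInt(position);
                }
                out.flush();
                return buffer.toByteArray();
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Hypothetical helper. Reads the count N first, then exactly N positions.
        void occurrencesFromBytes(byte[] data) {
            try {
                DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
                occurrences.clear();
                int n = in.readInt();
                for (int i = 0; i < n; i++) {
                    occurrences.add(in.readInt());
                }
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        }

        // Ordering is delegated entirely to the document, as the exercise requires.
        @Override
        public int compareTo(DocWithOccurrences<D> other) {
            return doc.compareTo(other.doc);
        }

        @Override
        public String toString() {
            return doc + "\t" + occurrences;
        }
    }

    public static void main(String[] args) {
        DocWithOccurrences<String> entry = new DocWithOccurrences<>("a.txt");
        entry.addOccurrence(3);
        entry.addOccurrence(17);

        // Round trip through the N-then-positions layout.
        DocWithOccurrences<String> copy = new DocWithOccurrences<>("a.txt");
        copy.occurrencesFromBytes(entry.occurrencesToBytes());
        System.out.println(copy);
    }
}
```

Writing the count first lets the reader know how many positions to consume without any terminator value, which is why the exercise prescribes that layout.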
  
-Using this type, create an inverted index -- implement a Hadoop job, that for each word creates a list of ''DocWithOccurences<Text>'' containing the documents containing this word, including the occurrences.
+Using this type, create an inverted index -- implement a Hadoop job, that for each word creates a list of ''DocWithOccurrences<Text>'' containing the documents containing this word, including the occurrences.
 
 ===== Exercise 2 =====
 
-Optional. Improve the solution to identify the documents by their ids instead of names, i.e., create for each word a sequence of ''DocWithOccurences<IntWritable>''. Your solution should use two Hadoop jobs:
+Optional. Improve the solution to identify the documents by their ids instead of names, i.e., create for each word a sequence of ''DocWithOccurrences<IntWritable>''. Your solution should use two Hadoop jobs:
   - in the first job, create a list of unique document names. Number the documents using the order in this list.
-  - in the second job, create for each word a list of ''DocWithOccurences<IntWritable>'', where the document is identified by its number (contrary to the previous exercise, where ''Text'' was used to identify the document).
+  - in the second job, create for each word a list of ''DocWithOccurrences<IntWritable>'', where the document is identified by its number (contrary to the previous exercise, where ''Text'' was used to identify the document).
 
 ----
Line 145: Line 145:
 </​table>​ </​table>​
 </​html>​ </​html>​
 +