
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

pub-company:icon2009 [2009/10/21 17:22]
stranak
pub-company:icon2009 [2009/10/22 15:36]
stranak
Line 32: Line 32:
   * Link to the PDF version of the paper; link to Biblio?
  
-==== Out of Vocabulary ====
+===== Data =====
 + 
 +==== IIIT Tides ==== 
 + 
 +A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy has the same number of sentences and tokens as listed below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.
 + 
 +| **Part** | **Sentences** | **Tokens en** | **Tokens hi** | 
 +| train |  50,000 |  1,195,436 |  1,287,174 | 
 +| dev |  1,000 |  21,842 |  23,851 | 
 +| test |  1,000 |  26,537 |  27,979 | 
 + 
 +The test data contain 1 reference translation per sentence. 
 + 
 +===== Out of Vocabulary =====
 No data have been lemmatised, so all the numbers refer to word forms.
 ^           Vocabulary Size               ^^^
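The check suggested in the Data section (same number of sentences and tokens) and the vocabulary size over surface forms could be reproduced with a sketch like the one below. It assumes one sentence per line and whitespace-separated tokens, which the page does not state; the function name `corpus_stats` and the file name in the comment are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (sentences, tokens, types) for one side of a corpus.

    Assumption: one sentence per line, tokens separated by whitespace.
    Types are distinct surface forms, since nothing is lemmatised.
    """
    sentences = 0
    forms = Counter()
    for line in lines:
        sentences += 1
        forms.update(line.split())
    return sentences, sum(forms.values()), len(forms)


# Toy data standing in for a real file. With the actual corpus one might
# pass open("train.hi", encoding="utf-8") instead (hypothetical file name).
sample = ["the cat sat", "the dog sat down"]
print(corpus_stats(sample))  # (2, 7, 5)
```

If the tokenisation used for the figures on this page differs from plain whitespace splitting, the token and type counts will not match exactly.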
Line 47: Line 60:
  
 ^         Coverage               ^^^^^^^^^
-|                   | **tokens seen in train**  ||||  **types seen in train**  ||||
-|                   |  //Tides//  ||  //Tides+DP//  ||  //Tides//  || //Tides+DP// ||
-|                   | abs |  OOV  | abs |    OOV   | abs |  OOV  | abs |    OOV   |
-| **Tides-test-en** |     |       |     |          |     |       |     |          |
-| **Tides-test-hi** |     |       |     |          |     |       |     |          |
-| **Tides-dev-en**  |     |       |     |          |     |       |     |          |
-| **Tides-dev-hi**  |     |       |     |          |     |       |     |          |
+|                   | **tokens unseen in train**  ||  **types unseen in train**  ||
+|                   |  //Tides//  |  //Tides+DP//  |  //Tides//  | //Tides+DP//   |
+| **Tides-test-en** |   369  |   348  |   363  |   343  |
+| **Tides-test-hi** |   839  |   830  |   642  |   633  |
+| **Tides-dev-en**  |   464  |   421  |   459  |   418  |
+| **Tides-dev-hi**  |   619  |   607  |   580  |   568  |
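The "tokens unseen in train" and "types unseen in train" columns can be understood through a minimal sketch like the following. It is an illustration under assumptions, not the script used to produce the figures: tokens are taken to be whitespace-separated surface forms, and the function name `oov_counts` is hypothetical.

```python
from typing import Iterable, Tuple


def oov_counts(train: Iterable[str], test: Iterable[str]) -> Tuple[int, int]:
    """Return (unseen_tokens, unseen_types) of `test` relative to `train`.

    Assumption: whitespace tokenisation, surface forms only.
    """
    # Every form that occurs anywhere in the training data.
    train_vocab = set()
    for line in train:
        train_vocab.update(line.split())

    unseen_tokens = 0      # every occurrence of an unseen form counts
    unseen_types = set()   # each unseen form counts once
    for line in test:
        for token in line.split():
            if token not in train_vocab:
                unseen_tokens += 1
                unseen_types.add(token)
    return unseen_tokens, len(unseen_types)


# Toy example: "e" occurs twice and "f" once, neither is in train.
print(oov_counts(["a b c", "a d"], ["a e e", "f b"]))  # (3, 2)
```

Under the same assumptions, a larger training set such as the Tides+DP columns would be handled by passing the concatenated training lines as `train`.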
  
