Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


pub-company:icon2009 [2009/10/21 17:22]
stranak
pub-company:icon2009 [2009/10/22 14:22]
zeman Tides statistics.
Line 32: Line 32:
   * Link to the PDF version of the paper; link to Biblio?
  
-==== Out of Vocabulary ====
 +===== Data =====
 + 
 +==== IIIT Tides ==== 
 + 
 +A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy has the same numbers of sentences and tokens as listed in the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.
 + 
 +| **Part** | **Sentences** | **Tokens en** | **Tokens hi** | 
 +| train |  50,000 |  1,195,436 |  1,287,174 | 
 +| dev |  1,000 |  21,842 |  23,851 | 
 +| test |  1,000 |  26,537 |  27,979 | 
 + 
 +The test data contain 1 reference translation per sentence. 
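
The check against the table above could be sketched as follows. This is only a minimal sketch under stated assumptions: it assumes one sentence per line and whitespace-separated tokens (the page does not say how tokens were counted), and the file paths passed to the hypothetical helper `check_part` are whatever names your local copy uses.

```python
from typing import Iterable, Tuple

# Counts copied from the table above: part -> (sentences, tokens en, tokens hi).
EXPECTED = {
    "train": (50_000, 1_195_436, 1_287_174),
    "dev": (1_000, 21_842, 23_851),
    "test": (1_000, 26_537, 27_979),
}


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Count lines as sentences and whitespace-separated items as tokens.

    Assumption: one sentence per line, tokens separated by whitespace.
    """
    sentences = 0
    tokens = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens


def check_part(part: str, en_path: str, hi_path: str) -> bool:
    """Return True if a local copy of one part matches the table.

    The paths are hypothetical; the Hindi file is read as UTF-8.
    """
    with open(en_path, encoding="utf-8") as en_file:
        en_sentences, en_tokens = count_sentences_and_tokens(en_file)
    with open(hi_path, encoding="utf-8") as hi_file:
        hi_sentences, hi_tokens = count_sentences_and_tokens(hi_file)
    expected_sentences, expected_en, expected_hi = EXPECTED[part]
    return (
        en_sentences == hi_sentences == expected_sentences
        and en_tokens == expected_en
        and hi_tokens == expected_hi
    )
```

If the counts differ, a different tokenization or a stray empty line at the end of a file is a more likely cause than a different dataset, so it may be worth inspecting the raw counts before concluding that the data do not match.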
 + 
 +===== Out of Vocabulary =====
 No data have been lemmatised, so all the numbers refer to word forms.
 ^           Vocabulary Size               ^^^
