Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

pub-company:icon2009 [2009/10/22 16:55]
stranak
pub-company:icon2009 [2009/10/22 22:05]
zeman Numbers of bytes to make the identification of the version of data more precise.
Line 33: Line 33:
  
 ===== Data =====
+

 ==== IIIT Tides ====
Line 38: Line 39:
 A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at the IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that you have the same number of sentences and tokens. The Hindi side has been provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8.
  
-| **Part** | **Sentences** | **Tokens en** | **Tokens hi** |
-| train |  50,000 |  1,195,436 |  1,287,174 |
-| dev |  1,000 |  21,842 |  23,851 |
-| test |  1,000 |  26,537 |  27,979 |
+| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
+| train |  50,000 |  1,195,436 |  6,496,995 |  1,287,174 |  15,917,598 |
+| dev |  1,000 |  21,842 |  118,239 |  23,851 |  291,147 |
+| test |  1,000 |  26,537 |  145,376 |  27,979 |  348,221 |
  
 The test data contain 1 reference translation per sentence.
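
The page asks readers to check that their copy of the data has the same number of sentences and tokens, and the new revision adds byte counts for the same purpose. A minimal sketch of such a check follows. It assumes one sentence per line, whitespace-separated tokens and UTF-8 files; the page states none of this beyond the UTF-8 encoding, and the function names are hypothetical.

```python
import sys
from pathlib import Path


def corpus_stats(path):
    """Return (sentences, tokens, bytes) for a one-sentence-per-line text file.

    Assumptions (not stated on the wiki page): each line is one sentence and
    tokens are separated by ASCII whitespace.
    """
    data = Path(path).read_bytes()
    data.decode("utf-8")  # raises UnicodeDecodeError if the file is not valid UTF-8
    sentences = data.count(b"\n")
    if data and not data.endswith(b"\n"):
        sentences += 1  # count a final line that lacks a trailing newline
    tokens = len(data.split())  # bytes.split() splits on runs of ASCII whitespace
    return sentences, tokens, len(data)


def matches(path, expected):
    """Compare a file's statistics against an expected (sentences, tokens, bytes) triple."""
    return corpus_stats(path) == tuple(expected)


if __name__ == "__main__":
    # Usage: python corpus_stats.py FILE [FILE ...]
    for name in sys.argv[1:]:
        print(name, *corpus_stats(name))
```

For example, the English training file would be expected to give `(50000, 1195436, 6496995)` if the assumptions above match how the table was produced. A mismatch in bytes alone may only indicate different line endings or whitespace.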