Differences
This shows you the differences between two versions of the page.
pub-company:icon2009 [2009/10/20 23:24] stranak |
pub-company:icon2009 [2009/10/22 14:22] zeman Tides statistics. |
||
---|---|---|---|
Line 32: | Line 32: | ||
* Link to the PDF version of the paper; link to Biblio? | * Link to the PDF version of the paper; link to Biblio? | ||
- | ==== Out of Vocabulary ==== | + | ===== Data ===== |
+ | |||
+ | ==== IIIT Tides ==== | ||
+ | |||
+ | A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Tokens hi** | | ||
+ | | train | 50,000 | 1,195,436 | 1,287,174 | | ||
+ | | dev | 1,000 | 21,842 | 23,851 | | ||
+ | | test | 1,000 | 26,537 | 27,979 | | ||
+ | |||
+ | The test data contain 1 reference translation per sentence. | ||
+ | |||
+ | ===== Out of Vocabulary ===== | ||
No data have been lemmatised, so all the numbers refer to word forms. | No data have been lemmatised, so all the numbers refer to word forms. | ||
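Since the counts are over surface forms, an out-of-vocabulary figure of this kind can be approximated with a few lines of Python. This is only a minimal sketch under stated assumptions: the page does not say how the numbers were produced, the function name `oov_stats` is hypothetical, and tokens are assumed to be whitespace-separated forms. It reports both token-level and type-level figures because the page does not make clear which one the tables use.

```python
from collections import Counter


def oov_stats(train_tokens, test_tokens):
    """Count test forms that never occur in the training data.

    Hypothetical helper, not taken from the page. Returns a tuple:
    (unseen token count, token OOV rate, unseen type count, type OOV rate).
    """
    train_vocab = set(train_tokens)      # surface forms only, no lemmatisation
    test_counts = Counter(test_tokens)   # form -> frequency in the test data

    unseen_types = [w for w in test_counts if w not in train_vocab]
    unseen_tokens = sum(test_counts[w] for w in unseen_types)

    total_tokens = sum(test_counts.values())
    total_types = len(test_counts)

    token_rate = unseen_tokens / total_tokens if total_tokens else 0.0
    type_rate = len(unseen_types) / total_types if total_types else 0.0
    return unseen_tokens, token_rate, len(unseen_types), type_rate


# Toy data; whitespace tokenisation is an assumption made for illustration.
train = "the cat sat on the mat".split()
test = "the dog sat on the log".split()
print(oov_stats(train, test))
```

In the toy example, "dog" and "log" are the only test forms missing from the training data, so 2 of 6 test tokens and 2 of 5 test types are out of vocabulary.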
+ | ^ | ||
^ data ^ tokens | ^ data ^ tokens | ||
| **Tides-train-en** | | **Tides-train-en** | ||
Line 45: | Line 59: | ||
- | ^ | + | ^ |
- | | | + | | | **tokens seen in train** |
- | | | + | | |
+ | | | abs | OOV | abs | OOV | abs | OOV | abs | OOV | | ||
| **Tides-test-en** | | | **Tides-test-en** | | ||
| **Tides-test-hi** | | | **Tides-test-hi** | |