Differences
This shows you the differences between two versions of the page.
pub-company:icon2009 [2009/10/21 17:22] stranak |
pub-company:icon2009 [2009/10/22 15:36] stranak |
||
---|---|---|---|
Line 32: | Line 32: | ||
* Link to the PDF version of the paper; link to Biblio? | * Link to the PDF version of the paper; link to Biblio? | ||
- | ==== Out of Vocabulary ==== | + | ===== Data ===== |
+ | |||
+ | ==== IIIT Tides ==== | ||
+ | |||
+ | A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Tokens hi** | | ||
+ | | train | 50,000 | 1,195,436 | 1,287,174 | | ||
+ | | dev | 1,000 | 21,842 | 23,851 | | ||
+ | | test | 1,000 | 26,537 | 27,979 | | ||
+ | |||
+ | The test data contain one reference translation per sentence. | ||
+ | |||
+ | ===== Out of Vocabulary ===== | ||
No data have been lemmatised, so all counts refer to word forms. | No data have been lemmatised, so all counts refer to word forms. | ||
^ | ^ | ||
Line 47: | Line 60: | ||
^ | ^ | ||
- | | | **tokens | + | | | **tokens |
- | | | + | | |
- | | | abs | OOV | abs | OOV | abs | OOV | abs | OOV | | + | | **Tides-test-en** | 369 | 348 |
- | | **Tides-test-en** | | + | | **Tides-test-hi** | 839 | 830 |
- | | **Tides-test-hi** | | + | | **Tides-dev-en** |
- | | **Tides-dev-en** | + | | **Tides-dev-hi** |
- | | **Tides-dev-hi** | + | |
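The "abs" and "OOV" columns above could be produced along the following lines. This is only a minimal sketch, not the procedure actually used for the page: it assumes whitespace tokenisation, counts unlemmatised word forms, and reports both running tokens and distinct forms (types). The function name `oov_stats`, the token/type split, and the toy sentences are illustrative assumptions, since the table header is truncated in this revision.

```python
from collections import Counter


def oov_stats(train_tokens, test_tokens):
    """Count test-set word forms that never occur in the training data.

    Returns absolute OOV counts and OOV rates for running tokens and for
    distinct forms (types). Forms are compared as-is, without lemmatisation.
    """
    vocab = set(train_tokens)
    test_counts = Counter(test_tokens)

    # Distinct test forms that are missing from the training vocabulary.
    oov_types = [form for form in test_counts if form not in vocab]
    # Running tokens covered by those forms.
    oov_token_count = sum(test_counts[form] for form in oov_types)

    n_tokens = sum(test_counts.values())
    n_types = len(test_counts)
    return {
        "tokens_abs": oov_token_count,
        "tokens_rate": oov_token_count / n_tokens if n_tokens else 0.0,
        "types_abs": len(oov_types),
        "types_rate": len(oov_types) / n_types if n_types else 0.0,
    }


# Toy data standing in for the train and test parts (hypothetical sentences).
train = "the cat sat on the mat".split()
test = "the dog sat on the log log".split()
stats = oov_stats(train, test)
print(stats)
```

In the toy example, "dog" and "log" are unseen, so two of the five distinct test forms and three of the seven running tokens are out of vocabulary. Running the same function over the train and test parts of a corpus, one language at a time, would give one table row per test set.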