Differences
This shows you the differences between two versions of the page.
pub-company:icon2009 [2009/10/20 23:24] stranak |
pub-company:icon2009 [2009/10/22 16:54] stranak |
||
---|---|---|---|
Line 32: | Line 32: | ||
* Link to the PDF version of the paper; link to Biblio? | * Link to the PDF version of the paper; link to Biblio? | ||
- | ==== Out of Vocabulary ==== | + | ===== Data ===== |
+ | |||
+ | ==== IIIT Tides ==== | ||
+ | |||
+ | A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Tokens hi** | | ||
+ | | train | 50,000 | 1,195,436 | 1,287,174 | | ||
+ | | dev | 1,000 | 21,842 | 23,851 | | ||
+ | | test | 1,000 | 26,537 | 27,979 | | ||
+ | |||
+ | The test data contain one reference translation per sentence. | ||
+ | |||
+ | ===== Out of Vocabulary | ||
No data have been lemmatised, so all counts refer to word forms, not lemmas. | No data have been lemmatised, so all counts refer to word forms, not lemmas. |
+ | ^ | ||
^ data ^ tokens | ^ data ^ tokens | ||
| **Tides-train-en** | | **Tides-train-en** | ||
Line 39: | Line 53: | ||
| **Tides+DP11-train-en** | 1402536 | 52947 | | | **Tides+DP11-train-en** | 1402536 | 52947 | | ||
| **Tides+DP11-train-hi** | 1434543 | 57131 | | | **Tides+DP11-train-hi** | 1434543 | 57131 | | ||
+ | | **set1-en** | ||
+ | | **set1-hi** | ||
+ | | **set2-en** | ||
| **Tides-dev-en** | | **Tides-dev-en** | ||
| **Tides-dev-hi** | | **Tides-dev-hi** | ||
Line 44: | Line 61: | ||
| **Tides-test-hi** | | **Tides-test-hi** | ||
+ | - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
+ | - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
+ | - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
- | ^ | + | ^ |
- | | | + | | | **tokens |
- | | | + | | |
- | | **Tides-test-en** | | + | | **Tides-test-en** | 369 | 348 | 2524 (9.290%) |
- | | **Tides-test-hi** | | + | | **Tides-test-hi** | 839 | 830 | 3480 (12.179%) |
- | | **Tides-dev-en** | + | | **Tides-dev-en** |
- | | **Tides-dev-hi** | + | | **Tides-dev-hi** |
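
The column headers of the out-of-vocabulary table are truncated in this revision, so the exact definitions behind its figures are not recoverable here. As a rough sketch only, assuming the usual definition (a test form is out of vocabulary if it never occurs in the training data, with no lemmatisation), the counts could be computed along these lines. The function name ''oov_stats'' and the toy sentences are hypothetical and are not taken from the page.

```python
from collections import Counter


def oov_stats(train_tokens, test_tokens):
    """Return (OOV types, OOV tokens, OOV token rate in percent).

    Assumption: a test form is OOV if it does not occur anywhere in the
    training tokens. Forms are compared as-is, without lemmatisation.
    """
    vocab = set(train_tokens)
    test_counts = Counter(test_tokens)
    # Distinct test forms that were never seen in training.
    oov_types = [form for form in test_counts if form not in vocab]
    # Running-text occurrences of those unseen forms.
    oov_tokens = sum(test_counts[form] for form in oov_types)
    total = sum(test_counts.values())
    rate = 100.0 * oov_tokens / total if total else 0.0
    return len(oov_types), oov_tokens, rate


# Toy data for illustration only; real input would be the tokenised corpora.
train = "the cat sat on the mat".split()
test = "the dog sat on the log".split()
types, tokens, rate = oov_stats(train, test)
print(f"{tokens} OOV tokens ({rate:.3f}%), {types} OOV types")
```

Adding further training sets (such as set1 to set3 listed above) would simply extend the list passed as ''train_tokens'' before the vocabulary is built.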