This page is an add-on to the following paper:
Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.
The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.
Tides is a dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that your copy has the same number of sentences and tokens as listed below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8.
| Part | Sentences | Tokens en | Bytes en | Tokens hi | Bytes hi |
|---|---|---|---|---|---|
| train | 50,000 | 1,195,436 | 6,496,995 | 1,287,174 | 15,917,598 |
| dev | 1,000 | 21,842 | 118,239 | 23,851 | 291,147 |
| test | 1,000 | 26,537 | 145,376 | 27,979 | 348,221 |
The test data contain one reference translation per sentence.
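To check a local copy against the table above, a short script along the following lines may help. It is only a sketch under assumptions that the page does not state explicitly: one sentence per line, tokens separated by whitespace, and byte counts taken from the UTF-8 file as stored. The function name `corpus_stats` is made up for this illustration, and the file names are whatever your copy of the data uses.

```python
import sys
from pathlib import Path


def corpus_stats(path):
    """Return (sentences, tokens, bytes) for a one-sentence-per-line UTF-8 file.

    Assumptions (not confirmed by the page): tokens are whitespace-separated
    and the byte count is the raw file size.
    """
    data = Path(path).read_bytes()
    lines = data.decode("utf-8").split("\n")
    # A file that ends with a newline yields one empty trailing element.
    if lines and lines[-1] == "":
        lines.pop()
    tokens = sum(len(line.split()) for line in lines)
    return len(lines), tokens, len(data)


if __name__ == "__main__":
    # Usage: python corpus_stats.py FILE [FILE ...]
    for name in sys.argv[1:]:
        sentences, tokens, size = corpus_stats(name)
        print(f"{name}\t{sentences}\t{tokens}\t{size}")
```

If the counts differ from the table, the cause may simply be a different tokenization or line-ending convention rather than different data.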
Our preprocessing of the data included the following steps:
We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:
| Part | Sentences | Tokens en | Bytes en | Tokens hi | Bytes hi |
|---|---|---|---|---|---|
| train | 50,000 | 1,226,144 | 6,499,532 | 1,312,435 | 15,648,541 |
| dev | 1,000 | 22,485 | 118,242 | 24,363 | 288,062 |
| test | 1,000 | 27,169 | 145,528 | 28,574 | 343,288 |
No data have been lemmatised, so all counts refer to word forms.
| Vocabulary Size | ||
|---|---|---|
| data | tokens | types |
| Tides-train-en | 1226144 | 48048 |
| Tides-train-hi | 1312435 | 53451 |
| Tides+DP11-train-en | 1402536 | 52947 |
| Tides+DP11-train-hi | 1434543 | 57131 |
| tides.train+dictfilt-en | 1227614 | 48349 |
| tides.train+dictfilt-hi | 1313857 | 53700 |
| tides.train+DP11+dictfilt-en | 1404006 | 53219 |
| tides.train+DP11+dictfilt-hi | 1435965 | 57366 |
| set1-en | 247399 | 20869 |
| set1-hi | 201266 | 16442 |
| set2-en | 368732 | 23316 |
| set2-hi | 328553 | 20769 |
| set3-en | 303059 | 21819 |
| set3-hi | 272276 | 18178 |
| Tides-dev-en | 22485 | 5596 |
| Tides-dev-hi | 24363 | 5642 |
| Tides-test-en | 27169 | 5939 |
| Tides-test-hi | 28574 | 5872 |
- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- dictfilt = Shabdanjali from the web (with many errors, probably introduced by WX-to-UTF-8 conversion). We filtered it to get rid of the errors, then expanded entries with multiple meanings into separate entries, then filtered it again to keep only words that occur in the large Hindi monolingual corpus.
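The last two dictfilt steps (expanding multi-meaning entries and keeping only words attested in the monolingual corpus) could be sketched as follows. This is not the script actually used. The entry format (an English headword with a list of Hindi translations), the function names, and the decision to require every word of a multi-word Hindi translation to be attested are all assumptions made for illustration. The initial error-filtering step is not specified on this page and is left out.

```python
def expand_entries(entries):
    """Yield one (english, hindi) pair per meaning of each dictionary entry.

    `entries` is assumed to be an iterable of (english, [hindi, ...]) tuples.
    """
    for english, meanings in entries:
        for hindi in meanings:
            hindi = hindi.strip()
            if hindi:  # skip empty meanings
                yield english, hindi


def filter_by_corpus(pairs, corpus_vocab):
    """Keep pairs whose Hindi side contains only words seen in the corpus.

    `corpus_vocab` is assumed to be a set of word forms collected from the
    Hindi monolingual corpus.
    """
    return [
        (english, hindi)
        for english, hindi in pairs
        if all(word in corpus_vocab for word in hindi.split())
    ]


if __name__ == "__main__":
    # Toy data, invented for this example only.
    toy_entries = [("water", ["पानी", "जल"]), ("book", ["किताब", "पुस्तक"])]
    toy_vocab = {"पानी", "जल", "किताब"}
    for pair in filter_by_corpus(expand_entries(toy_entries), toy_vocab):
        print(pair)
```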
| Coverage | ||||||||||||||
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| | tokens unseen in train | | | | | | | types unseen in train | | | | | | |
| data | Tides | Tides+DP | Tides+dict | Tides+DP+dict | set1 | set2 | set3 | Tides | Tides+DP | Tides+dict | Tides+DP+dict | set1 | set2 | set3 |
| Tides-test-en | 369 | 348 | 363 (1.336%) | 343 (1.262%) | 2524 (9.290%) | 2330 (8.576%) | 2429 (8.940%) | 363 | 343 | 357 (6.011%) | 338 (5.691%) | 1974 (33.238%) | 1824 (30.712%) | 1901 (32.009%) |
| Tides-test-hi | 839 | 830 | 836 (2.926%) | 828 (2.898%) | 3480 (12.179%) | 3233 (11.314%) | 3310 (11.584%) | 642 | 633 | 639 (10.882%) | 631 (10.746%) | 2569 (43.750%) | 2412 (41.076%) | 2465 (41.979%) |
| Tides-dev-en | 464 | 421 | 462 (2.055%) | 419 (1.863%) | 2072 (9.215%) | 1732 (7.703%) | 1873 (8.330%) | 459 | 418 | 457 (8.167%) | 416 (7.434%) | 1750 (31.272%) | 1498 (26.769%) | 1608 (28.735%) |
| Tides-dev-hi | 619 | 607 | 618 (2.537%) | 606 (2.487%) | 2946 (12.092%) | 2546 (10.450%) | 2661 (10.922%) | 580 | 568 | 579 (10.262%) | 567 (10.050%) | 2325 (41.209%) | 2037 (36.104%) | 2129 (37.735%) |
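The vocabulary and coverage figures above can be recomputed with a sketch like the one below. It assumes whitespace tokenization and one sentence per line, and the function names are hypothetical. The percentages in the table are consistent with dividing by the size of the evaluation set itself: for example, 363 unseen tokens out of 27,169 tokens in Tides-test-en gives 1.336%, and 357 unseen types out of 5,939 gives 6.011%.

```python
from collections import Counter


def vocabulary(lines):
    """Count word forms (whitespace-separated tokens) in an iterable of lines."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def coverage(train_lines, eval_lines):
    """Count tokens and types of the evaluation data that are unseen in training."""
    train_vocab = vocabulary(train_lines)
    eval_counts = vocabulary(eval_lines)
    unseen = [word for word in eval_counts if word not in train_vocab]
    tokens = sum(eval_counts.values())
    types = len(eval_counts)
    unseen_tokens = sum(eval_counts[word] for word in unseen)
    return {
        "tokens": tokens,
        "types": types,
        "unseen_tokens": unseen_tokens,
        "unseen_types": len(unseen),
        # Percentages are relative to the evaluation set, not the training set.
        "unseen_tokens_pct": 100.0 * unseen_tokens / tokens if tokens else 0.0,
        "unseen_types_pct": 100.0 * len(unseen) / types if types else 0.0,
    }


if __name__ == "__main__":
    # Toy data, invented for this example only.
    print(coverage(["a b c", "a d"], ["a e e", "f b"]))
```

Applying `vocabulary` to a training file and taking the sum of its counts and the number of its keys gives the token and type figures of the vocabulary table, under the same tokenization assumption.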