English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models
UNDER CONSTRUCTION!
This page is an add-on to the following paper:
Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.
The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.
- Data
- IIIT-TIDES
- Daniel Pipes
- EMILLE
- Agrocorpus
- Shabdanjali
- Wikipedia Named Entities
- Tools and their settings
- Tokenization and normalization of the data
- Hunalign
- GIZA++
- makecls
- SRILM
- Moses
- Joshua
- Mumbai Tagger
- Affisix
- Hindomor
- HiTBSuf
- BLEU score calculator
- Link tables from the paper to concrete settings
- Link to the PDF version of the paper; link to Biblio?
Data
IIIT-TIDES
This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy has the same numbers of sentences and tokens as listed in the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with the UTF-8 version.
Part | Sentences | Tokens en | Bytes en | Tokens hi | Bytes hi |
---|---|---|---|---|---|
train | 50,000 | 1,195,436 | 6,496,995 | 1,287,174 | 15,917,598 |
dev | 1,000 | 21,842 | 118,239 | 23,851 | 291,147 |
test | 1,000 | 26,537 | 145,376 | 27,979 | 348,221 |
The test data contain 1 reference translation per sentence.
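To compare a local copy against the counts in the table above, something like the following may help. It is only a minimal sketch: it assumes one sentence per line and whitespace-separated tokens, which may not match exactly how the original counts were produced. The function name `corpus_stats` is illustrative and not part of any tool used in the paper.

```python
import sys


def corpus_stats(path):
    """Return (sentences, tokens, bytes) for a one-sentence-per-line file.

    Assumption: tokens are separated by ASCII whitespace.
    The file is read in binary mode so that the byte count is exact.
    """
    sentences = tokens = size = 0
    with open(path, "rb") as f:
        for raw in f:  # binary iteration splits on b"\n" only
            sentences += 1
            tokens += len(raw.split())
            size += len(raw)
    return sentences, tokens, size


if __name__ == "__main__":
    # Usage: python corpus_stats.py FILE [FILE ...]
    for path in sys.argv[1:]:
        print(path, *corpus_stats(path))
```

Run it once on each side of each part (train, dev, test) and compare the three numbers with the corresponding row of the table.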
Out of Vocabulary
None of the data have been lemmatized, so all counts refer to word forms.
Vocabulary Size | ||
---|---|---|
data | tokens | types |
Tides-train-en | 1226144 | 48048 |
Tides-train-hi | 1312435 | 53451 |
Tides+DP11-train-en | 1402536 | 52947 |
Tides+DP11-train-hi | 1434543 | 57131 |
set1-en | 247399 | 20869 |
set1-hi | 201266 | 16442 |
set2-en | 247399 | 20869 |
Tides-dev-en | 22485 | 5596 |
Tides-dev-hi | 24363 | 5642 |
Tides-test-en | 27169 | 5939 |
Tides-test-hi | 28574 | 5872 |
- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
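Token and type counts like those in the Vocabulary Size table can be recomputed roughly as follows. This is a sketch under two assumptions that this section does not confirm: tokens are whitespace-separated, and a combined set such as set1 is the plain concatenation of its parts. The function names are illustrative.

```python
from collections import Counter


def vocabulary(lines):
    """Count word forms (no lemmatization) in an iterable of sentences."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())  # assumption: whitespace tokenization
    return counts


def vocabulary_size(counts):
    """Return (tokens, types) for a Counter of word forms."""
    return sum(counts.values()), len(counts)


# Toy data standing in for two corpora that are combined into one set.
part_a = vocabulary(["the farmer sows wheat", "the wheat grows"])
part_b = vocabulary(["the article mentions wheat"])

# Adding Counters merges the counts, which corresponds to
# concatenating the underlying corpora.
combined = part_a + part_b
print(vocabulary_size(combined))
```

Because counts are merged rather than recomputed, the parts can be counted once and then combined into different sets.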
Coverage | ||||||
---|---|---|---|---|---|---|
data | tokens unseen in train: Tides | tokens unseen in train: Tides+DP | tokens unseen in train: set1 | types unseen in train: Tides | types unseen in train: Tides+DP | types unseen in train: set1 |
Tides-test-en | 369 | 348 | 2524 (9.290%) | 363 | 343 | 1974 (33.238%) |
Tides-test-hi | 839 | 830 | 3480 (12.179%) | 642 | 633 | 2569 (43.750%) |
Tides-dev-en | 464 | 421 | 2072 (9.215%) | 459 | 418 | 1750 (31.272%) |
Tides-dev-hi | 619 | 607 | 2946 (12.092%) | 580 | 568 | 2325 (41.209%) |
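The percentages in the coverage table are relative to the token and type totals of the evaluated file in the Vocabulary Size table. For example, 2524 unseen tokens out of the 27169 tokens of Tides-test-en give 9.290%. A minimal sketch of such a coverage count is shown below. It assumes whitespace-separated word forms; the function name `unseen_in_train` is illustrative, not taken from the tools used for the paper.

```python
from collections import Counter


def unseen_in_train(train_lines, eval_lines):
    """Count eval tokens and types whose word form never occurs in train.

    Assumption: tokens are whitespace-separated, unlemmatized word forms.
    """
    train_vocab = {tok for line in train_lines for tok in line.split()}
    eval_counts = Counter(tok for line in eval_lines for tok in line.split())

    unseen_types = [tok for tok in eval_counts if tok not in train_vocab]
    unseen_tokens = sum(eval_counts[tok] for tok in unseen_types)

    total_tokens = sum(eval_counts.values())
    total_types = len(eval_counts)
    return {
        "unseen_tokens": unseen_tokens,
        "unseen_tokens_pct": 100.0 * unseen_tokens / total_tokens if total_tokens else 0.0,
        "unseen_types": len(unseen_types),
        "unseen_types_pct": 100.0 * len(unseen_types) / total_types if total_types else 0.0,
    }


# Toy example: "d" (twice) and "e" (once) do not occur in the training data.
result = unseen_in_train(["a b c"], ["a d d", "e b"])
print(result)
```

An unseen type is counted once no matter how often it occurs, while every occurrence adds to the unseen token count. This is why the type percentages in the table are much higher than the token percentages.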