English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models
UNDER CONSTRUCTION!
This page is an add-on to the following paper:
Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.
This add-on page documents the data, tools and settings in detail so that other researchers can reproduce the results.
- Data
- IIIT-TIDES
- Daniel Pipes
- EMILLE
- Agrocorpus
- Shabdanjali
- Wikipedia Named Entities
- Tools and their settings
- Tokenization and normalization of the data
- Hunalign
- GIZA++
- makecls
- SRILM
- Moses
- Joshua
- Mumbai Tagger
- Affisix
- Hindomor
- HiTBSuf
- BLEU score calculator
- Link tables from the paper to concrete settings
- Link to the PDF version of the paper; link to Biblio?
Out of Vocabulary
None of the data has been lemmatized, so all counts below refer to word forms.
data | tokens | types |
---|---|---|
Tides-train-en | 1226144 | 48048 |
Tides-train-hi | 1312435 | 53451 |
Tides+DP11-train-en | 1402536 | 52947 |
Tides+DP11-train-hi | 1434543 | 57131 |
Tides-dev-en | 22485 | 5596 |
Tides-dev-hi | 24363 | 5642 |
Tides-test-en | 27169 | 5939 |
Tides-test-hi | 28574 | 5872 |
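Counts of this kind can be obtained with a few lines of code. The following is a minimal sketch, assuming the text is already tokenized with one sentence per line and tokens separated by whitespace; the function name `count_tokens_and_types` is hypothetical, and the tokenization used for the paper may differ.

```python
from collections import Counter


def count_tokens_and_types(lines):
    """Return (tokens, types) for an iterable of whitespace-tokenized lines.

    Assumption: splitting on whitespace yields word forms; no
    lemmatization or case folding is applied.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # tokens = total number of occurrences, types = number of distinct forms
    return sum(counts.values()), len(counts)


# Toy input for illustration only, not taken from the corpora above.
sample = ["the cat sat on the mat", "the dog sat"]
tokens, types = count_tokens_and_types(sample)
print(tokens, types)
```

For a real corpus, the same function can be given an open file object, since iterating over a file yields its lines.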
Coverage
data | tokens seen in train (Tides) | tokens seen in train (Tides+DP) | types seen in train (Tides) | types seen in train (Tides+DP) |
---|---|---|---|---|
Tides-test-en | | | | |
Tides-test-hi | | | | |
Tides-dev-en | | | | |
Tides-dev-hi | | | | |
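The coverage figures could be computed along the following lines. This is a sketch only: it assumes that coverage means the share of dev or test tokens (and types) whose word form also occurs in the training data, and that all files are whitespace-tokenized. The function name `coverage` is hypothetical.

```python
def coverage(train_lines, eval_lines):
    """Return (token_coverage, type_coverage) of eval_lines w.r.t. train_lines.

    Assumption: a token or type counts as "seen" if exactly the same
    word form occurs anywhere in the training data.
    """
    train_vocab = {w for line in train_lines for w in line.split()}
    eval_tokens = [w for line in eval_lines for w in line.split()]
    if not eval_tokens:
        return 0.0, 0.0
    eval_types = set(eval_tokens)
    seen_tokens = sum(1 for w in eval_tokens if w in train_vocab)
    seen_types = len(eval_types & train_vocab)
    return seen_tokens / len(eval_tokens), seen_types / len(eval_types)


# Toy data for illustration only.
train = ["a b c", "a d"]
test = ["a b e", "a f"]
token_cov, type_cov = coverage(train, test)
print(token_cov, type_cov)
```

The out-of-vocabulary rate is then one minus the respective coverage value.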