English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models
UNDER CONSTRUCTION!
This page is an add-on to the following paper:
Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.
This add-on page documents the data, tools and settings in detail so that other researchers can reproduce the results.
- Data
- IIIT-TIDES
- Daniel Pipes
- EMILLE
- Agrocorpus
- Shabdanjali
- Wikipedia Named Entities
- Tools and their settings
- Tokenization and normalization of the data
- Hunalign
- GIZA++
- makecls
- SRILM
- Moses
- Joshua
- Mumbai Tagger
- Affisix
- Hindomor
- HiTBSuf
- BLEU score calculator
- Link tables from the paper to concrete settings
- Link to the PDF version of the paper; link to Biblio?
Out of Vocabulary
data | tokens | types | tokens in train | types in train |
Tides-train-en | 1226144 | 48048 | | |
Tides-train-hi | 1312435 | 53451 | | |
Tides+DP11-train-en | 1402536 | 52947 | | |
Tides+DP11-train-hi | 1434543 | 57131 | | |
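
The "tokens in train" and "types in train" columns are not filled in yet. As a rough illustration of how such counts could be obtained, here is a minimal Python sketch. It assumes the data are already tokenized and that tokens are separated by whitespace; the function names `vocab_counts` and `coverage` are hypothetical and are not taken from the tools listed above. It also assumes the two columns mean the number of tokens (and types) of a data set whose word form also occurs in the training data.

```python
from collections import Counter


def vocab_counts(lines):
    """Count occurrences of each whitespace-separated token (hypothetical helper)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def coverage(data_lines, train_lines):
    """Return token/type totals for the data and how many are seen in training.

    Assumption: a token is "in train" if its exact surface form
    occurs at least once in the training data.
    """
    data = vocab_counts(data_lines)
    train_vocab = set(vocab_counts(train_lines))
    return {
        "tokens": sum(data.values()),
        "types": len(data),
        "tokens_in_train": sum(c for w, c in data.items() if w in train_vocab),
        "types_in_train": sum(1 for w in data if w in train_vocab),
    }


if __name__ == "__main__":
    # Toy data for illustration only; not taken from any of the corpora above.
    train = ["the cat sat", "the dog ran"]
    test = ["the cat ran away", "the dog"]
    print(coverage(test, train))
```

The out-of-vocabulary counts are then the differences `tokens - tokens_in_train` and `types - types_in_train`. Applied to a training set against itself, both differences are zero by construction, so the columns are informative mainly for held-out data.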