====== English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models ======

====== UNDER CONSTRUCTION! ======

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

  * Data
    * IIIT-TIDES
    * Daniel Pipes
    * EMILLE
    * Agrocorpus
    * Shabdanjali
    * Wikipedia Named Entities
  * Tools and their settings
    * Tokenization and normalization of the data
    * Hunalign
    * GIZA++
    * makecls
    * SRILM
    * Moses
    * Joshua
    * Mumbai Tagger
    * Affisix
    * Hindomor
    * HiTBSuf
    * BLEU score calculator
  * Link tables from the paper to concrete settings
  * Link to the PDF version of the paper; link to Biblio?

===== Data =====


==== IIIT Tides ====

A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your sentence and token counts match the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8.

| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
| train |  50,000 |  1,195,436 |  6,496,995 |  1,287,174 |  15,917,598 |
| dev |  1,000 |  21,842 |  118,239 |  23,851 |  291,147 |
| test |  1,000 |  26,537 |  145,376 |  27,979 |  348,221 |

The test data contain 1 reference translation per sentence.
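The counts in the table above can be checked against a local copy with a short script. The following is a minimal sketch, assuming one sentence per line, whitespace-separated tokens, and UTF-8 files; the exact counting method used for the table is not documented here, and the function names are hypothetical.

```python
def corpus_stats(lines):
    """Return (sentences, tokens, bytes) for an iterable of text lines.

    Assumptions: one sentence per line, tokens separated by whitespace,
    bytes counted in UTF-8 including the line terminator.
    """
    sentences = tokens = nbytes = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
        nbytes += len(line.encode("utf-8"))
    return sentences, tokens, nbytes


def file_stats(path):
    """Compute the same statistics for a UTF-8 text file."""
    # newline="" keeps the original line endings, so the byte count is not altered.
    with open(path, encoding="utf-8", newline="") as handle:
        return corpus_stats(handle)
```

Calling ''file_stats'' on each side of each part (train, dev, test) should give numbers comparable to the table, provided the assumptions above match how the table was produced.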

Our preprocessing of the data included the following steps:

  * Further tokenization. Although Tides as we received it is roughly tokenized, some tokens (such as "anglo-american") needed to be split into smaller tokens.
  * Unicode normalization. For example, the Devanagari letter "z" was rewritten as "j" + "nukta".
  * Cleaning. Parts of sentences were sometimes removed, but never a whole sentence, so the number of sentences is unchanged.
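The first two steps can be illustrated with a small sketch. It is not the original cleaning script: the hyphen rule (keeping the hyphen as a separate token) and the use of standard Unicode NFC normalization are assumptions made for illustration. NFC happens to produce the rewrite described above, because the precomposed Devanagari letter "z" (U+095B) is excluded from composition and therefore stays decomposed as "j" (U+091C) followed by the nukta sign (U+093C).

```python
import re
import unicodedata


def split_hyphens(line):
    """Separate hyphenated words into smaller tokens.

    Assumption: the hyphen itself is kept as a token of its own.
    """
    return re.sub(r"(?<=\w)-(?=\w)", " - ", line)


def normalize_unicode(line):
    """Apply Unicode NFC normalization (an assumed stand-in for the original step)."""
    return unicodedata.normalize("NFC", line)


def preprocess(line):
    """Normalize first, then tokenize further."""
    return split_hyphens(normalize_unicode(line))
```

Neither function adds or removes lines, so the sentence count stays unchanged, in line with the cleaning step described above.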

We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:

| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
| train |  50,000 |  1,226,144 |  6,499,532 |  1,312,435 |  15,648,541 |
| dev |  1,000 |  22,485 |  118,242 |  24,363 |  288,062 |
| test |  1,000 |  27,169 |  145,528 |  28,574 |  343,288 |

===== Out of Vocabulary =====
No data have been lemmatized, so all counts refer to word forms.
^           Vocabulary Size               ^^^
^ data                ^ tokens  ^ types ^
| **Tides-train-en**      | 1226144 | 48048 |
| **Tides-train-hi**      | 1312435 | 53451 |
| **Tides+DP11-train-en** | 1402536 | 52947 |
| **Tides+DP11-train-hi** | 1434543 | 57131 |
| **tides.train+dictfilt-en**  |  1227614  |  48349  |
| **tides.train+dictfilt-hi**  |  1313857  |  53700  |
| **tides.train+DP11+dictfilt-en**  |  1404006  |  53219  |
| **tides.train+DP11+dictfilt-hi**  |  1435965  |  57366  |
| **set1-en**             |  247399 | 20869 |
| **set1-hi**             |  201266 | 16442 |
| **set2-en**             |  368732 | 23316 |
| **set2-hi**             |  328553 | 20769 |
| **set3-en**             |  303059 | 21819 |
| **set3-hi**             |  272276 | 18178 |
| **Tides-dev-en**        |   22485 |  5596 |
| **Tides-dev-hi**        |   24363 |  5642 |
| **Tides-test-en**       |   27169 |  5939 |
| **Tides-test-hi**       |   28574 |  5872 |


 - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
 - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
 - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
 - dictfilt = Shabdanjali from the web (with many errors, probably introduced by WX-to-UTF-8 conversion). It was filtered to remove the errors, then entries with multiple meanings were expanded into separate entries, and finally the result was filtered to keep only words that occur in the large Hindi monolingual corpus.
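
The expansion and corpus-based filtering behind dictfilt can be sketched roughly as follows. The in-memory entry format (an English headword paired with a list of Hindi translations), the function names, and the treatment of multi-word translations are all assumptions made for illustration; the real Shabdanjali format and the error-filtering step are not reproduced here.

```python
def expand_entries(entries):
    """Turn (english, [hindi_1, hindi_2, ...]) into one pair per translation."""
    for english, translations in entries:
        for hindi in translations:
            yield english, hindi


def filter_by_corpus(pairs, hindi_vocabulary):
    """Keep pairs whose Hindi side occurs in the monolingual corpus vocabulary.

    Assumption: a multi-word translation is kept only if every word is known.
    """
    return [
        (english, hindi)
        for english, hindi in pairs
        if hindi.split() and all(word in hindi_vocabulary for word in hindi.split())
    ]
```

Here ''hindi_vocabulary'' would be the set of word forms collected from the Hindi monolingual corpus.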

^         Coverage               ^^^^^^^^^^^^^^^
|                   | **tokens unseen in train**  |||||||  **types unseen in train**  |||||||
|                   |  //Tides//  |  //Tides+DP// |  //Tides+dict// |  //Tides+DP+dict// |   //set1//       |  //set2//        |  //set3//        |  //Tides//      | //Tides+DP//    |  //Tides+dict// |  //Tides+DP+dict// |  //set1//        |  //set2 //       |  //set3//        |
| **Tides-test-en** |   369       |     348       |  363 (1.336%)   |  343 (1.262%)      |  2524 (9.290%)   |  2330 (8.576%)   |  2429 (8.940%)   |    363          |       343       |  357 (6.011%)   |  338 (5.691%)      |  1974 (33.238%)  |  1824 (30.712%)  |  1901 (32.009%)  |
| **Tides-test-hi** |   839       |     830       |  836 (2.926%)   |  828 (2.898%)      |  3480 (12.179%)  |  3233 (11.314%)  |  3310 (11.584%)  |   642           |       633       |  639 (10.882%)  |  631 (10.746%)     |  2569 (43.750%)  |  2412 (41.076%)  |  2465 (41.979%)  |
| **Tides-dev-en**  |   464       |     421       |  462 (2.055%)   |  419 (1.863%)      |  2072 (9.215%)   |  1732 (7.703%)   |  1873 (8.330%)   |   459           |       418       |  457 (8.167%)   |  416 (7.434%)      |  1750 (31.272%)  |  1498 (26.769%)  |  1608 (28.735%)  |
| **Tides-dev-hi**  |   619       |     607       |  618 (2.537%)   |  606 (2.487%)      | 2946 (12.092%)   |  2546 (10.450%)  |  2661 (10.922%)  |   580           |       568       |  579 (10.262%)  |  567 (10.050%)     |  2325 (41.209%)  |  2037 (36.104%)  |  2129 (37.735%)  |
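
The figures in the coverage table can be recomputed from tokenized text. The sketch below assumes whitespace tokenization and case-sensitive comparison of word forms; the function name is hypothetical. The percentages in the table are consistent with dividing by the size of the test or dev set, e.g. 363 / 27,169 ≈ 1.336% of tokens and 357 / 5,939 ≈ 6.011% of types for Tides-test-en against Tides+dict.

```python
from collections import Counter


def oov_stats(train_tokens, test_tokens):
    """Return (unseen_tokens, token_pct, unseen_types, type_pct).

    A token or type is "unseen" if its word form never occurs in training.
    Percentages are relative to the test set, not to the training set.
    """
    train_types = set(train_tokens)
    test_counts = Counter(test_tokens)
    unseen = {form: n for form, n in test_counts.items() if form not in train_types}

    unseen_tokens = sum(unseen.values())
    unseen_types = len(unseen)
    total_tokens = sum(test_counts.values())
    total_types = len(test_counts)

    token_pct = 100.0 * unseen_tokens / total_tokens if total_tokens else 0.0
    type_pct = 100.0 * unseen_types / total_types if total_types else 0.0
    return unseen_tokens, token_pct, unseen_types, type_pct
```

The token and type totals computed along the way (''total_tokens'', ''total_types'') correspond to the columns of the vocabulary size table.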