English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models

UNDER CONSTRUCTION!

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Data

IIIT Tides

This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad, and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that your copy has the same number of sentences and tokens as listed in the table below. The Hindi side was provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.

Part   Sentences  Tokens (en)  Bytes (en)  Tokens (hi)  Bytes (hi)
train     50,000    1,195,436   6,496,995    1,287,174  15,917,598
dev        1,000       21,842     118,239       23,851     291,147
test       1,000       26,537     145,376       27,979     348,221
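
The sentence, token and byte counts in the table above can be checked against a local copy of the data with a few lines of Python. This is only a sketch: it assumes one sentence per line, whitespace-separated tokens and UTF-8 input, which may not match exactly how the counts on this page were obtained. The function names and the file name in the usage note are hypothetical.

```python
def corpus_stats(data: bytes) -> tuple[int, int, int]:
    """Return (sentences, tokens, bytes) for UTF-8 text, one sentence per line.

    Assumption: tokens are separated by whitespace.
    """
    text = data.decode("utf-8")
    lines = text.split("\n")
    # A trailing newline yields one empty final element; it is not a sentence.
    if lines and lines[-1] == "":
        lines.pop()
    tokens = sum(len(line.split()) for line in lines)
    return len(lines), tokens, len(data)


def file_stats(path: str) -> tuple[int, int, int]:
    """Read a file in binary mode so that the byte count is exact."""
    with open(path, "rb") as handle:
        return corpus_stats(handle.read())
```

Calling `file_stats("train.en")` (a hypothetical file name) on the English training side should give numbers comparable to the `train` row, provided the counting conventions agree.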

The test data contain 1 reference translation per sentence.

Our preprocessing of the data included the following steps:

We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:

Part   Sentences  Tokens (en)  Bytes (en)  Tokens (hi)  Bytes (hi)
train     50,000    1,226,144   6,499,532    1,312,435  15,648,541
dev        1,000       22,485     118,242       24,363     288,062
test       1,000       27,169     145,528       28,574     343,288

Out of Vocabulary

No data have been lemmatised, so all the counts refer to word forms, not lemmas.

Vocabulary Size
data tokens types
Tides-train-en 1226144 48048
Tides-train-hi 1312435 53451
Tides+DP11-train-en 1402536 52947
Tides+DP11-train-hi 1434543 57131
tides.train+dictfilt-en 1227614 48349
tides.train+dictfilt-hi 1313857 53700
tides.train+DP11+dictfilt-en 1404006 53219
tides.train+DP11+dictfilt-hi 1435965 57366
set1-en 247399 20869
set1-hi 201266 16442
set2-en 368732 23316
set2-hi 328553 20769
set3-en 303059 21819
set3-hi 272276 18178
Tides-dev-en 22485 5596
Tides-dev-hi 24363 5642
Tides-test-en 27169 5939
Tides-test-hi 28574 5872

- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- dictfilt = Shabdanjali dictionary from the web (with many errors, probably introduced by the wx-to-utf8 conversion). We filtered out the erroneous entries, expanded entries with multiple meanings into separate entries, and kept only words that occur in the large Hindi monolingual corpus.
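
The token and type counts in the Vocabulary Size table can be approximated as follows. The sketch assumes whitespace tokenization and treats every distinct word form as one type (no lemmatisation, no case folding). The function names are illustrative and are not taken from the scripts used for the paper.

```python
from collections import Counter
from typing import Iterable


def vocabulary(lines: Iterable[str]) -> Counter:
    """Count how often each word form occurs in the given sentences."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocab_size(lines: Iterable[str]) -> tuple[int, int]:
    """Return (tokens, types): total occurrences and distinct word forms."""
    counts = vocabulary(lines)
    return sum(counts.values()), len(counts)
```

For a combined corpus such as Tides plus another source, the lines of all parts would be passed together so that types shared between the parts are counted once.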

Coverage

Tokens unseen in train
data           Tides  Tides+DP  Tides+dict     Tides+DP+dict  set1            set2            set3
Tides-test-en  369    348       363 (1.336%)   343 (1.262%)   2524 (9.290%)   2330 (8.576%)   2429 (8.940%)
Tides-test-hi  839    830       836 (2.926%)   828 (2.898%)   3480 (12.179%)  3233 (11.314%)  3310 (11.584%)
Tides-dev-en   464    421       462 (2.055%)   419 (1.863%)   2072 (9.215%)   1732 (7.703%)   1873 (8.330%)
Tides-dev-hi   619    607       618 (2.537%)   606 (2.487%)   2946 (12.092%)  2546 (10.450%)  2661 (10.922%)

Types unseen in train
data           Tides  Tides+DP  Tides+dict     Tides+DP+dict  set1            set2            set3
Tides-test-en  363    343       357 (6.011%)   338 (5.691%)   1974 (33.238%)  1824 (30.712%)  1901 (32.009%)
Tides-test-hi  642    633       639 (10.882%)  631 (10.746%)  2569 (43.750%)  2412 (41.076%)  2465 (41.979%)
Tides-dev-en   459    418       457 (8.167%)   416 (7.434%)   1750 (31.272%)  1498 (26.769%)  1608 (28.735%)
Tides-dev-hi   580    568       579 (10.262%)  567 (10.050%)  2325 (41.209%)  2037 (36.104%)  2129 (37.735%)
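
The unseen-token and unseen-type figures in the Coverage table can be reproduced along the following lines. The percentages appear to be relative to the total number of tokens or types in the evaluated set: for Tides-test-en with Tides+dict, 363 / 27169 ≈ 1.336% of tokens and 357 / 5939 ≈ 6.011% of types. The code is a minimal sketch under the assumption of whitespace tokenization; the function name and the dictionary keys are made up for illustration.

```python
from typing import Iterable


def oov_stats(train_lines: Iterable[str], test_lines: Iterable[str]) -> dict:
    """Count test tokens and types that never occur in the training data."""
    train_vocab = {tok for line in train_lines for tok in line.split()}
    test_tokens = [tok for line in test_lines for tok in line.split()]
    test_types = set(test_tokens)

    # Every occurrence of an unknown word form counts as an unseen token.
    unseen_tokens = sum(1 for tok in test_tokens if tok not in train_vocab)
    # Each unknown word form counts once as an unseen type.
    unseen_types = len(test_types - train_vocab)

    def percent(part: int, whole: int) -> float:
        # Guard against an empty test set.
        return 100.0 * part / whole if whole else 0.0

    return {
        "unseen_tokens": unseen_tokens,
        "unseen_token_pct": percent(unseen_tokens, len(test_tokens)),
        "unseen_types": unseen_types,
        "unseen_type_pct": percent(unseen_types, len(test_types)),
    }
```

To evaluate a configuration such as Tides+dict, the training lines would presumably include both the Tides training sentences and the dictionary entries for the language in question.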