Institute of Formal and Applied Linguistics Wiki


English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models

UNDER CONSTRUCTION!

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Data

IIIT Tides

This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at the IIIT (http://ltrc.iiit.ac.in/). If you have the data, verify that your copy has the same number of sentences and tokens as listed in the table below. The Hindi side has been provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.

Part   Sentences  Tokens (en)  Tokens (hi)
train     50,000    1,195,436    1,287,174
dev        1,000       21,842       23,851
test       1,000       26,537       27,979

The test data contain one reference translation per sentence.
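
The sanity check recommended above (same number of sentences and tokens) could be sketched roughly as follows. This is only an illustration under stated assumptions: one sentence per line, UTF-8 files, and tokens separated by whitespace. The function names and the file name in the usage comment are hypothetical, and the tokenization behind the published counts may differ from plain whitespace splitting.

```python
from typing import Iterable, Tuple


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Count lines (taken as sentences) and whitespace-separated tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())  # assumption: tokens are whitespace-delimited
    return sentences, tokens


def count_file(path: str) -> Tuple[int, int]:
    """Apply the count to a UTF-8 text file with one sentence per line."""
    with open(path, encoding="utf-8") as handle:
        return count_sentences_and_tokens(handle)


# Hypothetical usage; "train.en" is an assumed file name, not the real layout:
# sentences, tokens = count_file("train.en")
# print(sentences, tokens)  # compare against the train row of the table
```

If the numbers disagree with the table, a different tokenization or a different release of the data are the first things to suspect.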

Out of Vocabulary

No data have been lemmatised, so all counts below refer to word forms, not lemmas.

Vocabulary Size
Data                  Tokens    Types
Tides-train-en       1226144    48048
Tides-train-hi       1312435    53451
Tides+DP11-train-en  1402536    52947
Tides+DP11-train-hi  1434543    57131
set1-en               247399    20869
set1-hi               201266    16442
set2-en               247399    20869
Tides-dev-en           22485     5596
Tides-dev-hi           24363     5642
Tides-test-en          27169     5939
Tides-test-hi          28574     5872

- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
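
For readers who want to recompute token and type counts of this kind, a minimal sketch is given here. It assumes whitespace-tokenized text with one sentence per line and treats every distinct surface form as a type (no lemmatisation, no case folding). The function name is hypothetical, and this is not necessarily the exact procedure used to produce the table.

```python
from collections import Counter
from typing import Iterable, Tuple


def vocabulary_size(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens, types) for whitespace-tokenized lines.

    Tokens are all running word forms; types are the distinct forms.
    """
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return sum(counts.values()), len(counts)


# Hypothetical usage with an assumed file name:
# with open("train.hi", encoding="utf-8") as handle:
#     tokens, types = vocabulary_size(handle)
```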

Coverage
               Tokens unseen in train           Types unseen in train
Data           Tides  Tides+DP  set1            Tides  Tides+DP  set1
Tides-test-en    369       348  2524 (9.290%)     363       343  1974 (33.238%)
Tides-test-hi    839       830  3480 (12.179%)    642       633  2569 (43.750%)
Tides-dev-en     464       421  2072 (9.215%)     459       418  1750 (31.272%)
Tides-dev-hi     619       607  2946 (12.092%)    580       568  2325 (41.209%)
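
The out-of-vocabulary counts can be approximated with a sketch like the one below. It again assumes whitespace-tokenized text and exact string matching on word forms; the function name and the keys of the returned dictionary are hypothetical. The percentages printed in the set1 columns agree with the unseen count divided by the corresponding total in the Vocabulary Size table (for example, 2524 / 27169 ≈ 9.290% for Tides-test-en tokens), and the sketch computes its rates the same way.

```python
from collections import Counter
from typing import Dict, Iterable


def unseen_in_train(train_lines: Iterable[str],
                    test_lines: Iterable[str]) -> Dict[str, float]:
    """Count test tokens and types whose word form never occurs in train."""
    train_vocab = set()
    for line in train_lines:
        train_vocab.update(line.split())

    test_counts: Counter = Counter()
    for line in test_lines:
        test_counts.update(line.split())

    unseen_forms = [form for form in test_counts if form not in train_vocab]
    unseen_tokens = sum(test_counts[form] for form in unseen_forms)
    total_tokens = sum(test_counts.values())
    total_types = len(test_counts)

    return {
        "unseen_tokens": unseen_tokens,
        "unseen_types": len(unseen_forms),
        # Percentages are relative to the test (or dev) set totals.
        "token_pct": 100.0 * unseen_tokens / total_tokens if total_tokens else 0.0,
        "type_pct": 100.0 * len(unseen_forms) / total_types if total_types else 0.0,
    }
```

The type-level rate is much higher than the token-level rate because unseen forms tend to be rare words that each occur only once or twice in the test data.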