
Institute of Formal and Applied Linguistics Wiki






English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models

UNDER CONSTRUCTION!

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Data

IIIT Tides

This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad, and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your sentence and token counts match the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.

Part    Sentences    Tokens (en)    Tokens (hi)
train      50,000      1,195,436      1,287,174
dev         1,000         21,842         23,851
test        1,000         26,537         27,979
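
To compare a local copy of the data against the counts above, a minimal sketch such as the following may help. It assumes one sentence per line and whitespace-separated tokens; the page does not state how the tokens were counted, so a different tokenization can produce different numbers. The function names and the file path in the usage note are hypothetical.

```python
def count_sentences_and_tokens(lines):
    """Return (sentences, tokens) for an iterable of text lines.

    Assumption: one sentence per line, tokens separated by whitespace.
    """
    sentences = 0
    tokens = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens


def count_file(path):
    """Count sentences and tokens in a UTF-8 text file."""
    with open(path, encoding="utf-8") as f:
        return count_sentences_and_tokens(f)
```

For example, `count_file("train.en")` (a placeholder file name) returns a pair that can be compared with the "train" row of the table.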

The test data contain one reference translation per sentence.

Out of Vocabulary

None of the data have been lemmatised, so all counts refer to word forms.

Vocabulary Size

data                    tokens     types
Tides-train-en         1226144     48048
Tides-train-hi         1312435     53451
Tides+DP11-train-en    1402536     52947
Tides+DP11-train-hi    1434543     57131
Tides-dev-en             22485      5596
Tides-dev-hi             24363      5642
Tides-test-en            27169      5939
Tides-test-hi            28574      5872

Coverage

                 tokens seen in train        types seen in train
                 Tides        Tides+DP       Tides        Tides+DP
data             abs   OOV    abs   OOV      abs   OOV    abs   OOV
Tides-test-en
Tides-test-hi
Tides-dev-en
Tides-dev-hi
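
The quantities in the vocabulary and coverage tables could be computed along the following lines. This is only a sketch under stated assumptions: tokens are whitespace-separated word forms, "abs" is read as the absolute number of dev/test tokens (or types) that also occur in the training data, and "OOV" as the share that do not. The page does not define these columns, and the function names are hypothetical.

```python
from collections import Counter


def vocabulary(lines):
    """Count word forms in an iterable of lines (whitespace tokenization assumed)."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def coverage(train_counts, eval_counts):
    """Return seen counts and OOV rates of eval data with respect to train data."""
    total_tokens = sum(eval_counts.values())
    total_types = len(eval_counts)
    # Tokens: every occurrence of a form that appears in the training data.
    seen_tokens = sum(c for w, c in eval_counts.items() if w in train_counts)
    # Types: every distinct form that appears in the training data.
    seen_types = sum(1 for w in eval_counts if w in train_counts)
    return {
        "tokens_seen": seen_tokens,
        "token_oov_rate": 1 - seen_tokens / total_tokens if total_tokens else 0.0,
        "types_seen": seen_types,
        "type_oov_rate": 1 - seen_types / total_types if total_types else 0.0,
    }
```

Here `sum(vocabulary(lines).values())` gives the token count and `len(vocabulary(lines))` the type count, as in the Vocabulary Size table; `coverage` would be called once per test or dev file and training set.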
