
Institute of Formal and Applied Linguistics Wiki






English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models

UNDER CONSTRUCTION!

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Out of Vocabulary

None of the data have been lemmatised, so all counts below refer to word forms, not lemmas.

Vocabulary Size

data                     tokens    types
Tides-train-en          1226144    48048
Tides-train-hi          1312435    53451
Tides+DP11-train-en     1402536    52947
Tides+DP11-train-hi     1434543    57131
Tides-dev-en              22485     5596
Tides-dev-hi              24363     5642
Tides-test-en             27169     5939
Tides-test-hi             28574     5872
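
Counts of this kind can be obtained with a few lines of code. The following is a minimal sketch, not the script used for the paper. It assumes that each corpus is already tokenised, with tokens separated by whitespace (the tokenisation actually used is not described on this page), and the function name `vocabulary_size` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def vocabulary_size(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (tokens, types) for an iterable of pre-tokenised lines.

    Assumption: tokens are separated by whitespace. No lowercasing or
    lemmatisation is applied, so types are distinct word forms.
    """
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    # tokens = running words, types = distinct word forms
    return sum(counts.values()), len(counts)


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the Tides corpus.
    sample = ["the cat sat on the mat", "the dog sat"]
    tokens, types = vocabulary_size(sample)
    print(tokens, types)
```

To apply it to a real corpus, pass an open file object (for example one opened with `encoding="utf-8"`) instead of the toy list.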
Coverage

data              tokens seen in train       types seen in train
                  Tides       Tides+DP       Tides       Tides+DP
Tides-test-en
Tides-test-hi
Tides-dev-en
Tides-dev-hi
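
The coverage figures for this table could be computed along the following lines. This is a minimal sketch under stated assumptions, not the authors' procedure: it assumes whitespace-tokenised text, treats a test token or type as "seen" if exactly the same word form occurs anywhere in the training data, and reports coverage as a fraction between 0 and 1 (the out-of-vocabulary rate is then one minus this value). The function name `coverage` is hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def coverage(train_lines: Iterable[str],
             test_lines: Iterable[str]) -> Tuple[float, float]:
    """Return (token_coverage, type_coverage) of test data by training data.

    Assumption: tokens are separated by whitespace and compared as exact
    word forms (no lowercasing, no lemmatisation).
    """
    train_vocab = {tok for line in train_lines for tok in line.split()}
    test_counts = Counter(tok for line in test_lines for tok in line.split())

    total_tokens = sum(test_counts.values())
    total_types = len(test_counts)
    if total_tokens == 0:
        # Empty test set: coverage is undefined; return zeros by convention.
        return 0.0, 0.0

    seen_tokens = sum(n for tok, n in test_counts.items() if tok in train_vocab)
    seen_types = sum(1 for tok in test_counts if tok in train_vocab)
    return seen_tokens / total_tokens, seen_types / total_types


if __name__ == "__main__":
    # Toy data for illustration only; not taken from the Tides corpus.
    train = ["a b c", "a d"]
    test = ["a a b e"]
    token_cov, type_cov = coverage(train, test)
    print(f"tokens seen in train: {token_cov:.2%}")
    print(f"types seen in train:  {type_cov:.2%}")
```

Token coverage weights each word form by its frequency in the test data, so it is typically higher than type coverage, which counts every distinct form once.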
