English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models

UNDER CONSTRUCTION!

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Data
- IIIT-TIDES
- Daniel Pipes
- EMILLE
- Agrocorpus
- Shabdanjali
- Wikipedia Named Entities
Tools and their settings
- Tokenization and normalization of the data
- Hunalign
- GIZA++
- makecls
- SRILM
- Moses
- Joshua
- Mumbai Tagger
- Affisix
- Hindomor
- HiTBSuf
- BLEU score calculator
Link tables from the paper to concrete settings
Link to the PDF version of the paper; link to Biblio?

Data

IIIT Tides

A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy has the same number of sentences and tokens as listed in the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8.

Part	Sentences	Tokens en	Bytes en	Tokens hi	Bytes hi
train	50,000	1,195,436	6,496,995	1,287,174	15,917,598
dev	1,000	21,842	118,239	23,851	291,147
test	1,000	26,537	145,376	27,979	348,221

The test data contain 1 reference translation per sentence.
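To compare a local copy against the sizes in the table above, a small script along the following lines may help. It is only a sketch: it assumes one sentence per line, whitespace-separated tokens and UTF-8 encoding, and the function name `corpus_size` is ours, not part of any tool used in the paper. The counts in the table may have been obtained differently (e.g. with `wc`), so small discrepancies in token counts are possible.

```python
from pathlib import Path


def corpus_size(path):
    """Return (sentences, tokens, bytes) for a one-sentence-per-line UTF-8 file."""
    data = Path(path).read_bytes()
    lines = data.decode("utf-8").split("\n")
    # A file that ends with a newline yields one empty trailing element.
    if lines and lines[-1] == "":
        lines.pop()
    # Assumption: tokens are separated by whitespace.
    tokens = sum(len(line.split()) for line in lines)
    return len(lines), tokens, len(data)
```

For the original English training part, the result would be expected to match the first row of the table, i.e. 50,000 sentences, 1,195,436 tokens and 6,496,995 bytes.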

Our preprocessing of the data included the following steps:

- Further tokenization. Although Tides as we received it is roughly tokenized, some tokens (such as “anglo-american”) remained that we wished to split into smaller tokens.
- Unicode normalization. For example, the Devanagari letter “z” was rewritten as “j” + “nukta”.
- Cleaning. In some cases part of a sentence was removed, but never a whole sentence, so the number of sentences is unchanged.
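The first two steps can be illustrated with a short sketch. This is not the actual cleaning script. It assumes that standard Unicode normalization produces the intended result (the precomposed letter U+095B is decomposed into U+091C followed by the nukta U+093C) and that hyphenated tokens are split with the hyphen kept as a separate token. Both function names are hypothetical.

```python
import re
import unicodedata


def normalize_unicode(text):
    # The precomposed Devanagari nukta letters (U+0958..U+095F) are
    # composition exclusions, so NFC leaves them in decomposed form:
    # base consonant followed by the nukta sign U+093C.
    return unicodedata.normalize("NFC", text)


def split_hyphens(line):
    # Assumption: a hyphen between two word characters becomes its own token.
    # Note that this also splits numeric ranges such as "1990-2000".
    return re.sub(r"(?<=\w)-(?=\w)", " - ", line)


print(split_hyphens("anglo-american relations"))
print([hex(ord(c)) for c in normalize_unicode("\u095b")])
```

Whether the hyphen itself should survive as a token is a design choice that the description above leaves open. Keeping it makes the step reversible.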

We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:

Part	Sentences	Tokens en	Bytes en	Tokens hi	Bytes hi
train	50,000	1,226,144	6,499,532	1,312,435	15,648,541
dev	1,000	22,485	118,242	24,363	288,062
test	1,000	27,169	145,528	28,574	343,288

Out of Vocabulary

No data have been lemmatised, so all counts refer to word forms.

Vocabulary Size
data	tokens	types
Tides-train-en	1226144	48048
Tides-train-hi	1312435	53451
Tides+DP11-train-en	1402536	52947
Tides+DP11-train-hi	1434543	57131
tides.train+dictfilt-en	1227614	48349
tides.train+dictfilt-hi	1313857	53700
tides.train+DP11+dictfilt-en	1404006	53219
tides.train+DP11+dictfilt-hi	1435965	57366
set1-en	247399	20869
set1-hi	201266	16442
set2-en	368732	23316
set2-hi	328553	20769
set3-en	303059	21819
set3-hi	272276	18178
Tides-dev-en	22485	5596
Tides-dev-hi	24363	5642
Tides-test-en	27169	5939
Tides-test-hi	28574	5872

- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- dictfilt = Shabdanjali from the web (with many errors, probably introduced by WX-to-UTF-8 conversion), filtered to remove the errors.
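Token and type counts like those in the vocabulary table can be obtained roughly as follows. This is a sketch under the assumption that tokens are whitespace-separated word forms in already preprocessed text. The helper names are ours.

```python
from collections import Counter


def vocabulary(lines):
    """Count occurrences of every word form in an iterable of sentences."""
    counts = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def vocabulary_size(counts):
    """Return (number of tokens, number of types)."""
    return sum(counts.values()), len(counts)


counts = vocabulary(["the cat saw the dog", "the dog slept"])
print(vocabulary_size(counts))
```

For a combined data set such as set1, the lines of all component corpora would be passed in together, so that a form occurring in several corpora is counted as one type.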

Coverage
	tokens unseen in train							types unseen in train
	Tides	Tides+DP	Tides+dict	Tides+DP+dict	set1	set2	set3	Tides	Tides+DP	Tides+dict	Tides+DP+dict	set1	set2	set3
Tides-test-en	369	348	363 (1.336%)	343 (1.262%)	2524 (9.290%)	2330 (8.576%)	2429 (8.940%)	363	343	357 (6.011%)	338 (5.691%)	1974 (33.238%)	1824 (30.712%)	1901 (32.009%)
Tides-test-hi	839	830	836 (2.926%)	828 (2.898%)	3480 (12.179%)	3233 (11.314%)	3310 (11.584%)	642	633	639 (10.882%)	631 (10.746%)	2569 (43.750%)	2412 (41.076%)	2465 (41.979%)
Tides-dev-en	464	421			2072 (9.215%)	1732 (7.703%)	1873 (8.330%)	459	418			1750 (31.272%)	1498 (26.769%)	1608 (28.735%)
Tides-dev-hi	619	607			2946 (12.092%)	2546 (10.450%)	2661 (10.922%)	580	568			2325 (41.209%)	2037 (36.104%)	2129 (37.735%)
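The percentages in the coverage table are consistent with dividing the unseen counts by the total number of tokens or types in the evaluated set (e.g. 363 of 27,169 English test tokens is about 1.336%). A minimal sketch of such a computation, assuming whitespace tokenization and using a hypothetical function name, could look like this:

```python
def unseen_in_train(train_lines, test_lines):
    """Count test tokens and types that never occur in the training data."""
    train_vocab = {tok for line in train_lines for tok in line.split()}
    test_tokens = [tok for line in test_lines for tok in line.split()]
    test_types = set(test_tokens)

    unseen_tokens = sum(1 for tok in test_tokens if tok not in train_vocab)
    unseen_types = len(test_types - train_vocab)

    return {
        "unseen_tokens": unseen_tokens,
        "unseen_tokens_pct": 100.0 * unseen_tokens / len(test_tokens),
        "unseen_types": unseen_types,
        "unseen_types_pct": 100.0 * unseen_types / len(test_types),
    }
```

The function assumes a non-empty test set. An empty one would cause a division by zero.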
