English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models

UNDER CONSTRUCTION!

This page is an add-on to the following paper:

Ondřej Bojar, Pavel Straňák, Daniel Zeman, Gaurav Jain, Michal Hrušecký, Michal Richter, Jan Hajič: English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models. In: Proceedings of ICON 2009, Hyderabad, December.

The purpose of the add-on page is to provide detailed documentation of the data, tools and settings used so that the results can be reproduced by other researchers.

Data
- IIIT-TIDES
- Daniel Pipes
- EMILLE
- Agrocorpus
- Shabdanjali
- Wikipedia Named Entities
Tools and their settings
- Tokenization and normalization of the data
- Hunalign
- GIZA++
- makecls
- SRILM
- Moses
- Joshua
- Mumbai Tagger
- Affisix
- Hindomor
- HiTBSuf
- BLEU score counter
Link tables from the paper to concrete settings
Link to the PDF version of the paper; link to Biblio?

Data

IIIT Tides

A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that your copy has the same number of sentences and tokens as listed in the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with the UTF-8 version.

Part	Sentences	Tokens en	Bytes en	Tokens hi	Bytes hi
train	50,000	1,195,436	6,496,995	1,287,174	15,917,598
dev	1,000	21,842	118,239	23,851	291,147
test	1,000	26,537	145,376	27,979	348,221

The test data contain 1 reference translation per sentence.
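A quick way to compare a local copy against the table is to count lines, whitespace-separated tokens and bytes per file. The following is a minimal sketch, not the script used for the paper: the function name corpus_stats, the one-sentence-per-line layout and the whitespace tokenization are all assumptions, so token counts may differ from the table if the published counts were taken after a different tokenization.

```python
from pathlib import Path
import tempfile


def corpus_stats(path):
    """Return (sentences, tokens, bytes) for a UTF-8 file with one sentence per line."""
    data = Path(path).read_bytes()            # raw bytes, so the size matches the file on disk
    lines = data.decode("utf-8").split("\n")
    if lines and lines[-1] == "":             # ignore the empty piece after a final newline
        lines.pop()
    # Assumption: tokens are separated by whitespace.
    tokens = sum(len(line.split()) for line in lines)
    return len(lines), tokens, len(data)


# Tiny self-check on a throwaway file (hypothetical content, not the real corpus).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "sample.en"
    sample.write_bytes(b"a b c\nd e\n")
    print(corpus_stats(sample))  # (2, 5, 10)
```

Reading the file as bytes matters for the Hindi side: a Devanagari character takes several bytes in UTF-8, which is why the byte counts for Hindi are much larger than for English at a similar token count.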

Out of Vocabulary

None of the data have been lemmatised, so all counts below refer to word forms.

Vocabulary Size
data	tokens	types
Tides-train-en	1226144	48048
Tides-train-hi	1312435	53451
Tides+DP11-train-en	1402536	52947
Tides+DP11-train-hi	1434543	57131
set1-en	247399	20869
set1-hi	201266	16442
set2-en	247399	20869
Tides-dev-en	22485	5596
Tides-dev-hi	24363	5642
Tides-test-en	27169	5939
Tides-test-hi	28574	5872

- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
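Token and type counts of the kind shown in the Vocabulary Size table can be obtained along the following lines. This is only a sketch under stated assumptions: the function name vocabulary_size is hypothetical, tokens are taken to be whitespace-separated, and word forms are compared case-sensitively without any normalization. The actual counts on this page may have been produced after a different tokenization.

```python
from collections import Counter


def vocabulary_size(sentences):
    """Return (tokens, types) for an iterable of sentence strings."""
    # Assumption: whitespace tokenization, no lowercasing, no lemmatization.
    counts = Counter(tok for sent in sentences for tok in sent.split())
    tokens = sum(counts.values())   # running words
    types = len(counts)             # distinct word forms
    return tokens, types


# Toy example (hypothetical sentences, not corpus data).
print(vocabulary_size(["the cat saw the dog", "the dog ran"]))  # (8, 5)
```

For a combined set such as set1, the sentences of all member corpora would be passed in together, so that a form occurring in several corpora is counted as one type.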

Coverage
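Test set	Tokens unseen in train (Tides)	Tokens unseen in train (Tides+DP)	Tokens unseen in train (set1)	Types unseen in train (Tides)	Types unseen in train (Tides+DP)	Types unseen in train (set1)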
Tides-test-en	369	348	2524 (9.290%)	363	343	1974 (33.238%)
Tides-test-hi	839	830	3480 (12.179%)	642	633	2569 (43.750%)
Tides-dev-en	464	421	2072 (9.215%)	459	418	1750 (31.272%)
Tides-dev-hi	619	607	2946 (12.092%)	580	568	2325 (41.209%)
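The coverage figures count test-side tokens and types that never occur in the given training set. The percentages in the set1 columns are consistent with dividing the unseen count by the corresponding total from the Vocabulary Size table, for example 2524 / 27169 for the tokens of Tides-test-en. A minimal sketch of that computation follows; the function names are hypothetical and whitespace tokenization is an assumption, so this is not necessarily how the numbers above were produced.

```python
def unseen_in_train(train_tokens, test_tokens):
    """Return (unseen_tokens, unseen_types) of test_tokens with respect to train_tokens."""
    train_vocab = set(train_tokens)
    unseen = [tok for tok in test_tokens if tok not in train_vocab]
    return len(unseen), len(set(unseen))


def percentage(part, total):
    """Share of part in total, in percent."""
    return 100.0 * part / total


# Toy example (hypothetical data): "bird" occurs twice and "a" once, none seen in train.
train = "the cat saw the dog".split()
test = "the bird saw a bird".split()
print(unseen_in_train(train, test))  # (3, 2)

# Rechecking one table cell: unseen tokens of Tides-test-en with respect to set1.
print(f"{percentage(2524, 27169):.3f}")  # 9.290
```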
