Differences
This shows you the differences between two versions of the page.
pub-company:icon2009 [2009/10/16 10:09] zeman created |
pub-company:icon2009 [2010/03/24 18:11] stranak |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models ====== | ====== English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models ====== | ||
+ | |||
====== UNDER CONSTRUCTION! ====== | ====== UNDER CONSTRUCTION! ====== | ||
Line 30: | Line 31: | ||
* Link tables from the paper to concrete settings | * Link tables from the paper to concrete settings | ||
* Link to the PDF version of the paper; link to Biblio? | * Link to the PDF version of the paper; link to Biblio? | ||
+ | |||
+ | ===== Data ===== | ||
+ | |||
+ | |||
+ | ==== IIIT Tides ==== | ||
+ | |||
+ | A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** | | ||
+ | | train | 50,000 | 1,195,436 | 6,496,995 | 1,287,174 | 15,917,598 | | ||
+ | | dev | 1,000 | 21,842 | 118,239 | 23,851 | 291,147 | | ||
+ | | test | 1,000 | 26,537 | 145,376 | 27,979 | 348,221 | | ||
+ | |||
+ | The test data contain one reference translation per sentence. | ||
+ | |||
+ | Our preprocessing of the data included the following steps: | ||
+ | |||
+ | * Further tokenization. Although Tides in the form we got it is roughly tokenized, there were tokens (like " | ||
+ | * Unicode normalization. E.g., devanagari " | ||
+ | * Cleaning. In some sentences a part was removed, but never a whole sentence, so the number of sentences is unchanged. | ||
+ | |||
+ | We plan to make all the cleaning scripts available here. After preprocessing, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** | | ||
+ | | train | 50,000 | 1,226,144 | 6,499,532 | 1,312,435 | 15,648,541 | | ||
+ | | dev | 1,000 | 22,485 | 118,242 | 24,363 | 288,062 | | ||
+ | | test | 1,000 | 27,169 | 145,528 | 28,574 | 343,288 | | ||
+ | |||
+ | ===== Out of Vocabulary ===== | ||
+ | No data have been lemmatised, so all the numbers count word forms, not lemmas. | ||
+ | ^ | ||
+ | ^ data ^ tokens | ||
+ | | **Tides-train-en** | ||
+ | | **Tides-train-hi** | ||
+ | | **Tides+DP11-train-en** | 1402536 | 52947 | | ||
+ | | **Tides+DP11-train-hi** | 1434543 | 57131 | | ||
+ | | **tides.train+dictfilt-en** | ||
+ | | **tides.train+dictfilt-hi** | ||
+ | | **tides.train+DP11+dictfilt-en** | ||
+ | | **tides.train+DP11+dictfilt-hi** | ||
+ | | **set1-en** | ||
+ | | **set1-hi** | ||
+ | | **set2-en** | ||
+ | | **set2-hi** | ||
+ | | **set3-en** | ||
+ | | **set3-hi** | ||
+ | | **Tides-dev-en** | ||
+ | | **Tides-dev-hi** | ||
+ | | **Tides-test-en** | ||
+ | | **Tides-test-hi** | ||
+ | |||
+ | |||
+ | - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
+ | - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
+ | - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
+ | - dictfilt = Shabdanjali from the web (with many errors, probably from wx-to-utf8), | ||
+ | |||
+ | ^ | ||
+ | | | **tokens unseen in train** | ||
+ | | | ||
+ | | **Tides-test-en** | | ||
+ | | **Tides-test-hi** | | ||
+ | | **Tides-dev-en** | ||
+ | | **Tides-dev-hi** | ||
+ | |||
+ | |||
+ |
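Counts such as "tokens unseen in train" can be obtained along the following lines. This is only a minimal sketch, not the script used for the tables above: it assumes the data are already tokenized with one sentence per line, that tokens are separated by whitespace, and that comparison is done on unlemmatised surface forms. The function names `read_tokens` and `oov_stats` and the toy sentences are hypothetical and chosen for illustration.

```python
from collections import Counter


def read_tokens(lines):
    """Yield whitespace-separated surface forms from an iterable of lines."""
    for line in lines:
        yield from line.split()


def oov_stats(train_lines, eval_lines):
    """Count eval tokens and types (distinct forms) that never occur in train."""
    train_vocab = set(read_tokens(train_lines))
    eval_counts = Counter(read_tokens(eval_lines))
    # Forms of the evaluation set that are absent from the training vocabulary.
    unseen = {form: n for form, n in eval_counts.items() if form not in train_vocab}
    return {
        "tokens": sum(eval_counts.values()),
        "types": len(eval_counts),
        "unseen_tokens": sum(unseen.values()),
        "unseen_types": len(unseen),
    }


# Toy data, invented for illustration only (not taken from the Tides corpus).
train = ["the cat sat", "the dog ran"]
test = ["the cat ran away", "a dog ran away"]
stats = oov_stats(train, test)
print(stats)
```

With real files, the two arguments could be open file handles (for example `open(path, encoding="utf-8")`), since the functions only iterate over lines. Counting both tokens and types separately matters because a single frequent unseen form can dominate the token-level figure.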