Differences
This shows you the differences between two versions of the page.
pub-company:icon2009 [2009/10/22 22:05] zeman Numbers of bytes to make the identification of the version of data more precise. |
pub-company:icon2009 [2010/03/25 10:01] (current) stranak |
||
---|---|---|---|
Line 45: | Line 45: | ||
The test data contain 1 reference translation per sentence. | The test data contain 1 reference translation per sentence. | ||
+ | |||
+ | Our preprocessing of the data included the following steps: | ||
+ | |||
+ | * Further tokenization. Although Tides in the form we got it is roughly tokenized, there were tokens (like " | ||
+ | * Unicode normalization. E.g., devanagari " | ||
+ | * Cleaning. Sometimes part of a sentence was removed, but never a whole sentence, so the number of sentences is unchanged. | ||
+ | |||
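The tokenization and Unicode normalization steps listed above could be sketched in Python as follows. This is only an illustration, not the authors' cleaning scripts: the choice of NFC as the normalization form, the punctuation set `PUNCT`, and the function name `preprocess` are all assumptions.

```python
import re
import unicodedata

# Assumed punctuation set: a few ASCII marks plus the Devanagari
# danda (U+0964) and double danda (U+0965).
PUNCT = ".,;:!?()[]\"'\u0964\u0965"
_PUNCT_RE = re.compile("([" + re.escape(PUNCT) + "])")


def preprocess(line: str) -> str:
    """Normalize one sentence and separate punctuation from words."""
    # Assumption: NFC, so that equivalent Devanagari sequences
    # end up with a single code point representation.
    line = unicodedata.normalize("NFC", line)
    # Surround each punctuation mark with spaces.
    line = _PUNCT_RE.sub(r" \1 ", line)
    # Collapse runs of whitespace; the sentence stays on one line,
    # so the number of sentences does not change.
    return " ".join(line.split())
```

For example, `preprocess("Hello, world.")` returns `"Hello , world ."`. A character set is used instead of a `\w`-based pattern because Devanagari vowel signs are combining marks, which a word-character pattern would treat as token boundaries.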
+ | We plan to make all the cleaning scripts available here. After preprocessing, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** | | ||
+ | | train | 50,000 | 1,226,144 | 6,499,532 | 1,312,435 | 15,648,541 | | ||
+ | | dev | 1,000 | 22,485 | 118,242 | 24,363 | 288,062 | | ||
+ | | test | 1,000 | 27,169 | 145,528 | 28,574 | 343,288 | | ||
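Counts like those in the table above can be reproduced with a few lines of Python. The sketch below assumes one sentence per line, whitespace-separated tokens, UTF-8 encoding, and one newline byte per sentence; the page does not state how the bytes were counted, so these are assumptions, and `corpus_stats` is a hypothetical helper name.

```python
from typing import Iterable, Tuple


def corpus_stats(lines: Iterable[str]) -> Tuple[int, int, int]:
    """Return (sentences, tokens, bytes) for an iterable of lines."""
    sentences = tokens = n_bytes = 0
    for line in lines:
        line = line.rstrip("\n")
        sentences += 1
        # Tokens are whitespace-separated strings.
        tokens += len(line.split())
        # UTF-8 size of the sentence plus one byte for the newline.
        n_bytes += len(line.encode("utf-8")) + 1
    return sentences, tokens, n_bytes
```

If the files are UTF-8, each Devanagari code point takes three bytes, which is consistent with the Hindi byte counts being much larger than the English ones at similar token counts.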
===== Out of Vocabulary ===== | ===== Out of Vocabulary ===== | ||
Line 54: | Line 67: | ||
| **Tides+DP11-train-en** | 1402536 | 52947 | | | **Tides+DP11-train-en** | 1402536 | 52947 | | ||
| **Tides+DP11-train-hi** | 1434543 | 57131 | | | **Tides+DP11-train-hi** | 1434543 | 57131 | | ||
+ | | **tides.train+dictfilt-en** | ||
+ | | **tides.train+dictfilt-hi** | ||
+ | | **tides.train+DP11+dictfilt-en** | ||
+ | | **tides.train+DP11+dictfilt-hi** | ||
| **set1-en** | | **set1-en** | ||
| **set1-hi** | | **set1-hi** | ||
- | | **set2-en** | + | | **set2-en** |
+ | | **set2-hi** | ||
+ | | **set3-en** | ||
+ | | **set3-hi** | ||
| **Tides-dev-en** | | **Tides-dev-en** | ||
| **Tides-dev-hi** | | **Tides-dev-hi** | ||
| **Tides-test-en** | | **Tides-test-en** | ||
| **Tides-test-hi** | | **Tides-test-hi** | ||
+ | |||
- set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
- set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
- set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005 | ||
+ | - dictfilt = Shabdanjali from the web (with many errors, probably from wx-to-utf8 conversion). We filtered it to remove the errors, expanded entries with multiple meanings into separate entries, then kept only words that occur in the large Hindi monolingual corpus. | ||
+ | |||
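The last two dictfilt steps (expanding entries with multiple meanings and keeping only words attested in the monolingual corpus) might look roughly like the sketch below. The entry format, the function name `expand_and_filter`, and the rule that every word of a multi-word translation must be attested are assumptions made for illustration; the actual Shabdanjali format and filtering criteria are not described on this page.

```python
from typing import Iterable, List, Set, Tuple


def expand_and_filter(
    entries: Iterable[Tuple[str, List[str]]],
    hindi_vocab: Set[str],
) -> List[Tuple[str, str]]:
    """Turn (english, [hindi meanings]) entries into filtered pairs."""
    pairs = []
    for english, meanings in entries:
        # One output entry per meaning.
        for hindi in meanings:
            words = hindi.split()
            # Assumption: keep a meaning only if all of its words
            # occur in the Hindi monolingual vocabulary.
            if words and all(w in hindi_vocab for w in words):
                pairs.append((english, hindi))
    return pairs
```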
+ | ^ | ||
+ | | | **tokens unseen in train** | ||
+ | | | ||
+ | | **Tides-test-en** | | ||
+ | | **Tides-test-hi** | | ||
+ | | **Tides-dev-en** | ||
+ | | **Tides-dev-hi** | ||
+ | |||
- | ^ | ||
- | | | **tokens unseen in train** | ||
- | | | ||
- | | **Tides-test-en** | | ||
- | | **Tides-test-hi** | | ||
- | | **Tides-dev-en** | ||
- | | **Tides-dev-hi** | ||
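The "tokens unseen in train" figures can be computed along these lines. This is a minimal sketch assuming whitespace tokenization and exact string matching; `oov_stats` is a hypothetical name, and reporting unseen types alongside unseen tokens is an assumption, since the column headers are cut off here.

```python
from typing import Iterable, Tuple


def oov_stats(
    train_tokens: Iterable[str], test_tokens: Iterable[str]
) -> Tuple[int, int]:
    """Return (unseen token count, unseen type count) for a test set."""
    train_vocab = set(train_tokens)
    # Every occurrence of a test token that never appears in training.
    unseen = [tok for tok in test_tokens if tok not in train_vocab]
    return len(unseen), len(set(unseen))
```

For example, with training tokens `a b c` and test tokens `a d d e`, the function returns `(3, 2)`: three unseen token occurrences covering two distinct types.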