Differences
This shows you the differences between two versions of the page.
pub-company:icon2009 [2009/10/22 22:05] zeman Numbers of bytes to make the identification of the version of data more precise. |
pub-company:icon2009 [2009/10/22 22:14] zeman After cleaning. |
||
---|---|---|---|
Line 45: | Line 45: | ||
The test data contain 1 reference translation per sentence. | The test data contain 1 reference translation per sentence. | ||
+ | |||
+ | Our preprocessing of the data included the following steps: | ||
+ | |||
+ | * Further tokenization. Although Tides in the form we got it is roughly tokenized, there were tokens (like " | ||
+ | * Unicode normalization. E.g., Devanagari " | ||
+ | * Cleaning. In some cases part of a sentence was removed, but never a whole sentence, so the number of sentences is unchanged. | ||
+ | |||
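The first two steps listed above can be illustrated with a minimal sketch. It is not the authors' cleaning script, which the page does not show. It assumes NFC as the normalization form and assumes that "further tokenization" means detaching leading and trailing punctuation from whitespace-delimited tokens. The function names `split_token` and `preprocess` are hypothetical.

```python
import unicodedata


def is_punct(ch):
    # Unicode general categories starting with "P" are punctuation;
    # this includes the Devanagari danda (U+0964).
    return unicodedata.category(ch).startswith("P")


def split_token(token):
    """Detach leading and trailing punctuation from one token (assumed rule)."""
    start, end = 0, len(token)
    while start < end and is_punct(token[start]):
        start += 1
    while end > start and is_punct(token[end - 1]):
        end -= 1
    core = [token[start:end]] if start < end else []
    return list(token[:start]) + core + list(token[end:])


def preprocess(line):
    """Normalize to NFC (assumed form), then re-tokenize one sentence."""
    line = unicodedata.normalize("NFC", line)
    tokens = []
    for tok in line.split():
        tokens.extend(split_token(tok))
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("(Delhi)."))
    print(preprocess("है।"))
```

Each input line yields exactly one output line, so the sentence count is preserved, as the cleaning step requires. Punctuation is detected by Unicode category rather than by a `\w` regular expression because Devanagari vowel signs are combining marks, which `\w` in Python's `re` module does not match.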
+ | We plan to make all the cleaning scripts available here. After preprocessing, | ||
+ | |||
+ | | **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** | | ||
+ | | train | 50,000 | 1,226,144 | 6,499,532 | 1,312,435 | 15,648,541 | | ||
+ | | dev | 1,000 | 22,485 | 118,242 | 24,363 | 288,062 | | ||
+ | | test | 1,000 | 27,169 | 145,528 | 28,574 | 343,288 | | ||
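Counts like those in the table can be reproduced with a short sketch such as the one below. It assumes one sentence per line, whitespace-delimited tokens, and UTF-8 encoding, and it counts the line terminator in the byte total. The page does not state which of these conventions were actually used, and the function name `corpus_stats` is hypothetical.

```python
def corpus_stats(lines):
    """Return (sentences, tokens, bytes) for an iterable of text lines.

    Assumptions: one sentence per line, tokens separated by whitespace,
    bytes measured in UTF-8 and including the trailing newline.
    """
    sentences = tokens = nbytes = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
        nbytes += len(line.encode("utf-8"))
    return sentences, tokens, nbytes


if __name__ == "__main__":
    sample = ["a b\n", "है ।\n"]
    print(corpus_stats(sample))
```

In the training row, the English side averages roughly 5.3 bytes per token and the Hindi side roughly 11.9. This is consistent with UTF-8 using three bytes for each Devanagari code point and one byte for each ASCII character.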
===== Out of Vocabulary ===== | ===== Out of Vocabulary ===== |