Differences

This shows you the differences between two versions of the page.

--- pub-company:icon2009 [2009/10/22 22:05]
zeman Numbers of bytes to make the identification of the version of data more precise.
+++ pub-company:icon2009 [2009/10/22 22:14]
zeman After cleaning.
@@ Line 45: / Line 45: @@
 The test data contain 1 reference translation per sentence.
+Our preprocessing of the data included the following steps:
+  * Further tokenization. Although Tides, in the form we received it, is already roughly tokenized, it contained tokens (such as "anglo-american") that we wanted to split into smaller tokens.
+  * Unicode normalization. For example, the Devanagari letter "z" was rewritten as "j" + "nukta".
+  * Cleaning. In some cases a part of a sentence was removed, but never a whole sentence, so the number of sentences is unchanged.
+We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:
+| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
+| train |  50,000 |  1,226,144 |  6,499,532 |  1,312,435 |  15,648,541 |
+| dev |  1,000 |  22,485 |  118,242 |  24,363 |  288,062 |
+| test |  1,000 |  27,169 |  145,528 |  28,574 |  343,288 |
 ===== Out of Vocabulary =====
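
The preprocessing steps and the size table added in the revision above can be illustrated with a minimal sketch. It is not the authors' cleaning script, which the page only promises to publish. The function names, the choice of NFC as the normalization form, and the policy of keeping each hyphen as a separate token are all assumptions made for illustration. NFC is used because it happens to reproduce the rewrite described above: the precomposed Devanagari letter U+095B is a composition exclusion, so it comes out as U+091C followed by the nukta U+093C.

```python
import re
import unicodedata


def split_hyphenated(token):
    """Split a token at hyphens, keeping each hyphen as its own token.

    Assumed policy: the source only says such tokens were split further.
    """
    return [part for part in re.split(r"(-)", token) if part]


def preprocess_line(line):
    """Normalize to NFC (an assumption), then re-tokenize on whitespace and hyphens."""
    normalized = unicodedata.normalize("NFC", line)
    tokens = []
    for token in normalized.split():
        tokens.extend(split_hyphenated(token))
    return " ".join(tokens)


def corpus_stats(lines):
    """Return (sentences, tokens, bytes) for one-sentence-per-line text.

    Bytes are counted in UTF-8 and include any newline present in each line;
    whether the table's byte counts include newlines is not stated.
    """
    sentences = tokens = n_bytes = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
        n_bytes += len(line.encode("utf-8"))
    return sentences, tokens, n_bytes
```

Calling `corpus_stats` on the lines of each part (train, dev, test) would yield counts in the same shape as the table columns. Devanagari code points take three bytes each in UTF-8, which is consistent with the Hindi byte counts being much larger than the English ones for a similar number of tokens.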
