Differences

This shows you the differences between two versions of the page.

--- pub-company:icon2009 [2009/10/22 16:55]
stranak
+++ pub-company:icon2009 [2009/10/22 22:14]
zeman After cleaning.
@@ Line 33: / Line 33: @@
 ===== Data =====
 ==== IIIT Tides ====
@@ Line 38: / Line 39: @@
 A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy has the same number of sentences and tokens as listed below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8.
-| **Part** | **Sentences** | **Tokens en** | **Tokens hi** |
+| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
-| train |  50,000 |  1,195,436 |  1,287,174 |
+| train |  50,000 |  1,195,436 |  6,496,995 |  1,287,174 |  15,917,598 |
-| dev |  1,000 |  21,842 |  23,851 |
+| dev |  1,000 |  21,842 |  118,239 |  23,851 |  291,147 |
-| test |  1,000 |  26,537 |  27,979 |
+| test |  1,000 |  26,537 |  145,376 |  27,979 |  348,221 |
 The test data contain 1 reference translation per sentence.
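
To check a local copy against the figures in the table, something like the following sketch may help. It assumes one sentence per line, whitespace-separated tokens, and byte counts taken over the UTF-8 encoding including newlines; the page does not state exactly how the counts were obtained, so small discrepancies are possible. The function name ''corpus_stats'' is ours, not part of any released script.

```python
def corpus_stats(text: str) -> dict:
    """Count sentences, tokens and UTF-8 bytes in a corpus given as one string.

    Assumptions (not confirmed by the page): one sentence per line,
    tokens separated by whitespace, bytes counted including newlines.
    """
    lines = text.split("\n")
    # A trailing newline produces one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    return {
        "sentences": len(lines),
        "tokens": sum(len(line.split()) for line in lines),
        "bytes": len(text.encode("utf-8")),
    }


if __name__ == "__main__":
    import sys

    # Usage: python corpus_stats.py train.en
    with open(sys.argv[1], encoding="utf-8") as handle:
        print(corpus_stats(handle.read()))
```

Running it once on the English side and once on the Hindi side of each part gives the numbers to compare with one table row.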
+Our preprocessing of the data included the following steps:
+  * Further tokenization. Although Tides as we received it is roughly tokenized, it contained tokens (like "anglo-american") that we wanted to split into smaller tokens.
+  * Unicode normalization. E.g., the precomposed Devanagari letter "z" was rewritten as "j" + "nukta".
+  * Cleaning. Parts of sentences were sometimes removed, but never a whole sentence, so the number of sentences remains the same.
+We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:
+| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
+| train |  50,000 |  1,226,144 |  6,499,532 |  1,312,435 |  15,648,541 |
+| dev |  1,000 |  22,485 |  118,242 |  24,363 |  288,062 |
+| test |  1,000 |  27,169 |  145,528 |  28,574 |  343,288 |
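
Until the actual cleaning scripts are published, the first two preprocessing steps can be illustrated with a minimal sketch. It is not the authors' code: the function names are hypothetical, the hyphen rule (putting spaces around a hyphen between two word characters) is only one possible way to split "anglo-american", and NFC is assumed as the normalization form. In Unicode, the precomposed letter ZA (U+095B) canonically decomposes to JA (U+091C) followed by NUKTA (U+093C) and is excluded from recomposition, so NFC yields the two-character form.

```python
import re
import unicodedata


def split_hyphens(line: str) -> str:
    """Separate a hyphen that sits between two word characters.

    Illustrative rule only; the real tokenizer may split differently.
    """
    return re.sub(r"(?<=\w)-(?=\w)", " - ", line)


def normalize_unicode(line: str) -> str:
    """Apply NFC; precomposed nukta letters such as U+095B stay decomposed."""
    return unicodedata.normalize("NFC", line)


def preprocess(line: str) -> str:
    """Normalize first, then retokenize (order chosen for this sketch)."""
    return split_hyphens(normalize_unicode(line))


english = preprocess("anglo-american relations")
hindi = preprocess("\u095b")  # precomposed DEVANAGARI LETTER ZA
```

Splitting tokens this way raises the token count without removing text, which is in line with the higher token figures in the table above. The cleaning step, which removed parts of sentences, is not sketched here.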
 ===== Out of Vocabulary =====


Institute of Formal and Applied Linguistics Wiki
