
Institute of Formal and Applied Linguistics Wiki


  
===== Data =====

==== IIIT Tides ====
A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that your copy has the same numbers of sentences and tokens as listed in the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.
  
| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
| train |  50,000 |  1,195,436 |  6,496,995 |  1,287,174 |  15,917,598 |
| dev |  1,000 |  21,842 |  118,239 |  23,851 |  291,147 |
| test |  1,000 |  26,537 |  145,376 |  27,979 |  348,221 |
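
A quick check along these lines (a minimal sketch; the file names are only placeholders for your own copy of the data) should reproduce the sentence and token counts above:

<code python>
# Count sentences (lines) and whitespace-separated tokens in a corpus file.
# Example usage (placeholder names): python corpus_stats.py train.en train.hi
import sys

def corpus_stats(path):
    sentences = 0
    tokens = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            sentences += 1
            tokens += len(line.split())
    return sentences, tokens

for path in sys.argv[1:]:
    s, t = corpus_stats(path)
    print("%s: %d sentences, %d tokens" % (path, s, t))
</code>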
  
The test data contain 1 reference translation per sentence.

Our preprocessing of the data included the following steps:

  * Further tokenization. Although Tides in the form we received it is roughly tokenized, there were tokens (like "anglo-american") that we wished to split into smaller tokens.
  * Unicode normalization. E.g., devanagari "z" was rewritten as "j" + "nukta" (see the sketch after this list).
  * Cleaning. Sometimes part of a sentence was removed, but never a whole sentence, so the number of sentences stays the same.

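As a rough illustration of the two mechanical steps, here is a minimal sketch, assuming hyphens are kept as separate tokens and standard Unicode normalization is used; the actual scripts may differ:

<code python>
# Illustration only -- not the actual preprocessing scripts.
import re
import unicodedata

def split_hyphens(token):
    # Assumed hyphen treatment: "anglo-american" -> ["anglo", "-", "american"]
    return [part for part in re.split(r"(-)", token) if part]

def normalize_devanagari(text):
    # Precomposed nukta letters (U+0958..U+095F) are Unicode composition
    # exclusions, so even NFC leaves them decomposed:
    # "za" (U+095B) -> "ja" (U+091C) + nukta (U+093C).
    return unicodedata.normalize("NFC", text)

assert split_hyphens("anglo-american") == ["anglo", "-", "american"]
assert normalize_devanagari("\u095B") == "\u091C\u093C"
</code>
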
We are planning on making all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:

| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
| train |  50,000 |  1,226,144 |  6,499,532 |  1,312,435 |  15,648,541 |
| dev |  1,000 |  22,485 |  118,242 |  24,363 |  288,062 |
| test |  1,000 |  27,169 |  145,528 |  28,574 |  343,288 |
  
===== Out of Vocabulary =====
| **Tides+DP11-train-en**          |  1402536  |  52947  |
| **Tides+DP11-train-hi**          |  1434543  |  57131  |
| **tides.train+dictfilt-en**      |  1227614  |  48349  |
| **tides.train+dictfilt-hi**      |  1313857  |  53700  |
| **tides.train+DP11+dictfilt-en** |  1404006  |  53219  |
| **tides.train+DP11+dictfilt-hi** |  1435965  |  57366  |
| **set1-en**                      |  247399   |  20869  |
| **set1-hi**                      |  201266   |  16442  |
| **set2-en**                      |  368732   |  23316  |
| **set2-hi**                      |  328553   |  20769  |
| **set3-en**                      |  303059   |  21819  |
| **set3-hi**                      |  272276   |  18178  |
| **Tides-dev-en**                 |  22485    |  5596   |
| **Tides-dev-hi**                 |  24363    |  5642   |
  
  
  - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
  - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
  - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
  - dictfilt = Shabdanjali dictionary from the web (with many errors, probably from the WX-to-UTF-8 conversion), filtered to get rid of the errors.

^         Coverage               ^^^^^^^^^^^^^^^
|                   |  **tokens unseen in train**  |||||||  **types unseen in train**  |||||||
|                   |  //Tides//  |  //Tides+DP//  |  //Tides+dict//  |  //Tides+DP+dict//  |  //set1//  |  //set2//  |  //set3//  |  //Tides//  |  //Tides+DP//  |  //Tides+dict//  |  //Tides+DP+dict//  |  //set1//  |  //set2//  |  //set3//  |
| **Tides-test-en** |  369  |  348  |  363 (1.336%)  |  343 (1.262%)  |  2524 (9.290%)   |  2330 (8.576%)   |  2429 (8.940%)   |  363  |  343  |  357 (6.011%)   |  338 (5.691%)   |  1974 (33.238%)  |  1824 (30.712%)  |  1901 (32.009%)  |
| **Tides-test-hi** |  839  |  830  |  836 (2.926%)  |  828 (2.898%)  |  3480 (12.179%)  |  3233 (11.314%)  |  3310 (11.584%)  |  642  |  633  |  639 (10.882%)  |  631 (10.746%)  |  2569 (43.750%)  |  2412 (41.076%)  |  2465 (41.979%)  |
| **Tides-dev-en**  |  464  |  421  |  462 (2.055%)  |  419 (1.863%)  |  2072 (9.215%)   |  1732 (7.703%)   |  1873 (8.330%)   |  459  |  418  |  457 (8.167%)   |  416 (7.434%)   |  1750 (31.272%)  |  1498 (26.769%)  |  1608 (28.735%)  |
| **Tides-dev-hi**  |  619  |  607  |  618 (2.537%)  |  606 (2.487%)  |  2946 (12.092%)  |  2546 (10.450%)  |  2661 (10.922%)  |  580  |  568  |  579 (10.262%)  |  567 (10.050%)  |  2325 (41.209%)  |  2037 (36.104%)  |  2129 (37.735%)  |
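
The unseen counts above (the percentages appear to be relative to the total number of tokens/types in the respective dev or test set) can be reproduced roughly as follows; a minimal sketch with placeholder file names:

<code python>
# Sketch: tokens/types of a test (or dev) set that never occur in the training data.
def read_tokens(path):
    with open(path, encoding="utf-8") as f:
        return [tok for line in f for tok in line.split()]

train_vocab = set(read_tokens("tides.train.en"))   # placeholder path
test_tokens = read_tokens("tides.test.en")          # placeholder path
test_types = set(test_tokens)

unseen_tokens = [t for t in test_tokens if t not in train_vocab]
unseen_types = test_types - train_vocab

print("unseen tokens: %d (%.3f%%)" % (len(unseen_tokens), 100.0 * len(unseen_tokens) / len(test_tokens)))
print("unseen types:  %d (%.3f%%)" % (len(unseen_types), 100.0 * len(unseen_types) / len(test_types)))
</code>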
  
