
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


pub-company:icon2009 [2009/10/22 16:53]
stranak
pub-company:icon2009 [2010/03/25 10:01] (current)
stranak
Line 33: Line 33:
  
 ===== Data ===== ===== Data =====
 +
  
 ==== IIIT Tides ==== ==== IIIT Tides ====
Line 38: Line 39:
 A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at the IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that you have the same number of sentences and tokens. The Hindi side has been provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8. A dataset originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at the IIIT (http://ltrc.iiit.ac.in/). If you have the data, make sure that you have the same number of sentences and tokens. The Hindi side has been provided in two encodings, WX romanization and UTF-8. Although they should be equivalent, we always work with UTF-8.
  
-| **Part** | **Sentences** | **Tokens en** | **Tokens hi** | +| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** | 
-| train |  50,000 |  1,195,436 |  1,287,174 | +| train |  50,000 |  1,195,436 |  6,496,995 |  1,287,174 |  15,917,598 |
-| dev |  1,000 |  21,842 |  23,851 | +| dev |  1,000 |  21,842 |  118,239 |  23,851 |  291,147 |
-| test |  1,000 |  26,537 |  27,979 |+| test |  1,000 |  26,537 |  145,376 |  27,979 |  348,221 |
  
 The test data contain 1 reference translation per sentence. The test data contain 1 reference translation per sentence.
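A minimal sketch of how the sentence, token, and byte counts in the table above could be checked against a local copy of the data. It assumes one sentence per line, whitespace-separated tokens, and UTF-8 files; the function names and the file path in the usage note are hypothetical, not taken from the original scripts.

```python
from pathlib import Path


def stats_from_bytes(data):
    """Return (sentences, tokens, bytes) for raw UTF-8 corpus data.

    Assumes one sentence per line and whitespace-separated tokens.
    """
    lines = data.decode("utf-8").split("\n")
    # A trailing newline produces one empty final element; drop it.
    if lines and lines[-1] == "":
        lines.pop()
    tokens = sum(len(line.split()) for line in lines)
    return len(lines), tokens, len(data)


def corpus_stats(path):
    """Same counts, read from a file on disk."""
    return stats_from_bytes(Path(path).read_bytes())
```

For example, `corpus_stats("tides.train.hi")` (a hypothetical file name) should return the values in the train row if the local copy matches the one described here.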
 +
 +Our preprocessing of the data included the following steps:
 +
 +  * Further tokenization. Although Tides as we received it is roughly tokenized, some tokens (like "anglo-american") remained that we wished to split into smaller tokens.
 +  * Unicode normalization. E.g., Devanagari "z" was rewritten as "j"+"nukta".
 +  * Cleaning. Sometimes part of a sentence was removed, but never a whole sentence, so the number of sentences is unchanged.
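The first two steps can be illustrated with a short sketch. It is not the original cleaning code: the hyphen-splitting rule (and the choice to keep the hyphen as its own token) is an assumption shown on the English side only, and the use of standard Unicode normalization is likewise an assumption. Standard normalization does happen to produce the rewrite mentioned above, because U+095B (DEVANAGARI LETTER ZA) decomposes canonically to U+091C (JA) + U+093C (NUKTA) and is excluded from recomposition.

```python
import re
import unicodedata

# A hyphen with a word character on both sides, as in "anglo-american".
HYPHEN_INSIDE_WORD = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphens(sentence):
    """Split hyphenated tokens into parts, keeping the hyphen as a token (assumed rule)."""
    return HYPHEN_INSIDE_WORD.sub(" - ", sentence)


def normalize_devanagari(text):
    """Apply Unicode NFC; precomposed nukta letters come out as base letter + nukta."""
    return unicodedata.normalize("NFC", text)
```

Splitting changes the token counts but not the sentence counts, which is consistent with the sizes reported after preprocessing.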
 +
 +We plan to make all the cleaning scripts available here. After preprocessing, the size of Tides changed as follows:
 +
 +| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
 +| train |  50,000 |  1,226,144 |  6,499,532 |  1,312,435 |  15,648,541 |
 +| dev |  1,000 |  22,485 |  118,242 |  24,363 |  288,062 |
 +| test |  1,000 |  27,169 |  145,528 |  28,574 |  343,288 |
  
 ===== Out of Vocabulary ===== ===== Out of Vocabulary =====
Line 53: Line 67:
 | **Tides+DP11-train-en** | 1402536 | 52947 | | **Tides+DP11-train-en** | 1402536 | 52947 |
 | **Tides+DP11-train-hi** | 1434543 | 57131 | | **Tides+DP11-train-hi** | 1434543 | 57131 |
 +| **tides.train+dictfilt-en**  |  1227614  |  48349  |
 +| **tides.train+dictfilt-hi**  |  1313857  |  53700  |
 +| **tides.train+DP11+dictfilt-en**  |  1404006  |  53219  |
 +| **tides.train+DP11+dictfilt-hi**  |  1435965  |  57366  |
 | **set1-en**             |  247399 | 20869 | | **set1-en**             |  247399 | 20869 |
 | **set1-hi**             |  201266 | 16442 | | **set1-hi**             |  201266 | 16442 |
-| **set2-en**             |  247399 | 20869 | +| **set2-en**             |  368732 | 23316 |
 +| **set2-hi**             |  328553 | 20769 |
 +| **set3-en**             |  303059 | 21819 |
 +| **set3-hi**             |  272276 | 18178 |
 | **Tides-dev-en**        |   22485 |  5596 | | **Tides-dev-en**        |   22485 |  5596 |
 | **Tides-dev-hi**        |   24363 |  5642 | | **Tides-dev-hi**        |   24363 |  5642 |
 | **Tides-test-en**       |   27169 |  5939 | | **Tides-test-en**       |   27169 |  5939 |
 | **Tides-test-hi**       |   28574 |  5872 | | **Tides-test-hi**       |   28574 |  5872 |
 +
  
  - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005  - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
  - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005  - set2 = emille-11+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
  - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005  - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
 + - dictfilt = Shabdanjali from the web (with many errors, probably from wx-to-utf8 conversion). We filtered it to remove the errors, then expanded entries with multiple meanings into separate entries, then filtered again to keep only words that occur in the large Hindi monolingual corpus.
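The last two dictfilt steps (expanding multi-meaning entries and keeping only attested words) might look roughly like the sketch below. The in-memory entry format, the function names, and the rule that every word of a Hindi translation must occur in the monolingual vocabulary are all assumptions made for illustration; the actual Shabdanjali file format and filtering criteria are not described here.

```python
def expand_entries(entries):
    """Turn (english, [hindi_1, hindi_2, ...]) into one (english, hindi) pair per meaning."""
    return [(en, hi) for en, meanings in entries for hi in meanings]


def keep_attested(pairs, hindi_vocab):
    """Keep pairs whose Hindi side consists only of words seen in the monolingual corpus."""
    kept = []
    for en, hi in pairs:
        words = hi.split()
        # Assumed criterion: non-empty and every word is in the vocabulary.
        if words and all(w in hindi_vocab for w in words):
            kept.append((en, hi))
    return kept
```

Here `hindi_vocab` would be a set of word forms collected from the Hindi monolingual corpus.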
 +
 +^         Coverage               ^^^^^^^^^^^^^^^
 +|                   | **tokens unseen in train**  |||||||  **types unseen in train**  |||||||
 +|                   |  //Tides//  |  //Tides+DP//  |  //Tides+dict//  |  //Tides+DP+dict//  |  //set1//  |  //set2//  |  //set3//  |  //Tides//  |  //Tides+DP//  |  //Tides+dict//  |  //Tides+DP+dict//  |  //set1//  |  //set2//  |  //set3//  |
 +| **Tides-test-en** |  369  |  348  |  363 (1.336%)  |  343 (1.262%)  |  2524 (9.290%)  |  2330 (8.576%)  |  2429 (8.940%)  |  363  |  343  |  357 (6.011%)  |  338 (5.691%)  |  1974 (33.238%)  |  1824 (30.712%)  |  1901 (32.009%)  |
 +| **Tides-test-hi** |  839  |  830  |  836 (2.926%)  |  828 (2.898%)  |  3480 (12.179%)  |  3233 (11.314%)  |  3310 (11.584%)  |  642  |  633  |  639 (10.882%)  |  631 (10.746%)  |  2569 (43.750%)  |  2412 (41.076%)  |  2465 (41.979%)  |
 +| **Tides-dev-en**  |  464  |  421  |  462 (2.055%)  |  419 (1.863%)  |  2072 (9.215%)  |  1732 (7.703%)  |  1873 (8.330%)  |  459  |  418  |  457 (8.167%)  |  416 (7.434%)  |  1750 (31.272%)  |  1498 (26.769%)  |  1608 (28.735%)  |
 +| **Tides-dev-hi**  |  619  |  607  |  618 (2.537%)  |  606 (2.487%)  |  2946 (12.092%)  |  2546 (10.450%)  |  2661 (10.922%)  |  580  |  568  |  579 (10.262%)  |  567 (10.050%)  |  2325 (41.209%)  |  2037 (36.104%)  |  2129 (37.735%)  |
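A minimal sketch of how the "unseen in train" figures could be computed. It assumes both corpora are already tokenized into lists of strings and that the test list is non-empty; the function name is hypothetical. The percentages in the table appear to be relative to the test-side totals: for Tides-test-en with Tides+dict, 363 / 27169 ≈ 1.336% of tokens and 357 / 5939 ≈ 6.011% of types.

```python
from collections import Counter


def unseen_stats(train_tokens, test_tokens):
    """Count test tokens and test types that never occur in the training data."""
    train_vocab = set(train_tokens)
    test_counts = Counter(test_tokens)
    unseen_types = [t for t in test_counts if t not in train_vocab]
    unseen_tokens = sum(test_counts[t] for t in unseen_types)
    return {
        "tokens_unseen": unseen_tokens,
        "tokens_pct": 100.0 * unseen_tokens / len(test_tokens),
        "types_unseen": len(unseen_types),
        "types_pct": 100.0 * len(unseen_types) / len(test_counts),
    }
```

Adding a dictionary or another corpus to the training side only enlarges `train_vocab`, so the unseen counts can stay the same or drop, as the Tides, Tides+DP, and Tides+dict columns show.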
 +
  
-^         Coverage               ^^^^^^^^^ 
-|                   | **tokens unseen in train**  |||  **types unseen in train**  ||| 
-|                   |  //Tides//  |  //Tides+DP//  |  //set1//  |  //Tides//  |  //Tides+DP//  |  //set1//  |
-| **Tides-test-en** |  369  |  348  |  2524 (9.290%)  |  363  |  343  |  1974 (33.238%)  |
-| **Tides-test-hi** |  839  |  830  |  3480 (12.179%)  |  642  |  633  |  2569 (43.750%)  |
-| **Tides-dev-en**  |  464  |  421  |  2072 (9.215%)  |  459  |  418  |  1750 (31.272%)  |
-| **Tides-dev-hi**  |  619  |  607  |  2946 (12.092%)  |  580  |  568  |  2325 (41.209%)  |
  
