
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

pub-company:icon2009 [2009/10/22 22:05]
zeman Numbers of bytes to make the identification of the version of data more precise.
pub-company:icon2009 [2010/03/24 17:54]
stranak added OOV data for Tides+dictfilt (Shabdanjali)
Line 45: Line 45:
  
 The test data contain 1 reference translation per sentence.
 +
 +Our preprocessing of the data included the following steps:
 +
 +  * Further tokenization. Although Tides as we received it is roughly tokenized, it contained tokens (such as "anglo-american") that we wanted to split into smaller tokens.
 +  * Unicode normalization. For example, Devanagari "z" was rewritten as "j" + "nukta".
 +  * Cleaning. In some cases part of a sentence was removed, but never a whole sentence, so the number of sentences is unchanged.
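
The first two preprocessing steps can be sketched in Python as follows. This is only an illustration under stated assumptions, not the cleaning scripts mentioned on this page: the function names are hypothetical, the choice to keep the hyphen as a separate token is a guess, and Unicode NFC normalization is assumed to match the rewriting of "z" as "j" + "nukta" (U+095B becomes U+091C followed by U+093C).

```python
import re
import unicodedata


def split_hyphenated(token):
    """Split a token at hyphens, keeping each hyphen as its own token.

    Assumption: the page only says such tokens were split into smaller
    tokens; whether the hyphen was kept is not stated.
    """
    return [part for part in re.split(r"(-)", token) if part]


def normalize_devanagari(text):
    """Apply Unicode NFC normalization.

    Precomposed nukta letters such as U+095B are excluded from
    composition, so NFC rewrites them as base letter + nukta.
    """
    return unicodedata.normalize("NFC", text)


def preprocess(sentence):
    """Normalize a whitespace-tokenized sentence, then re-split its tokens."""
    tokens = []
    for token in normalize_devanagari(sentence).split():
        tokens.extend(split_hyphenated(token))
    return " ".join(tokens)


if __name__ == "__main__":
    print(preprocess("anglo-american \u095b"))
```

The cleaning step is not sketched because the page does not say what was removed.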
 +
 +We plan to make all the cleaning scripts available here. After preprocessing, Tides has the following size:
 +
 +| **Part** | **Sentences** | **Tokens en** | **Bytes en** | **Tokens hi** | **Bytes hi** |
 +| train |  50,000 |  1,226,144 |  6,499,532 |  1,312,435 |  15,648,541 |
 +| dev |  1,000 |  22,485 |  118,242 |  24,363 |  288,062 |
 +| test |  1,000 |  27,169 |  145,528 |  28,574 |  343,288 |
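
Counts like those in the table above could be obtained roughly as follows. This is a minimal sketch, assuming one sentence per line, whitespace-separated tokens, and UTF-8 encoding with line terminators included in the byte count; the page does not state how the bytes were counted, and `corpus_size` is a hypothetical name.

```python
def corpus_size(lines):
    """Return (sentences, tokens, bytes) for an iterable of text lines.

    Assumptions: one sentence per line, tokens separated by whitespace,
    bytes measured on the UTF-8 encoding of each line as given.
    """
    sentences = 0
    tokens = 0
    size_bytes = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
        size_bytes += len(line.encode("utf-8"))
    return sentences, tokens, size_bytes


if __name__ == "__main__":
    sample = ["a b\n", "\u091c\u093c\n"]
    print(corpus_size(sample))
```

Devanagari characters take three bytes each in UTF-8, which is why the Hindi side has far more bytes than the English side for a similar number of tokens.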
  
 ===== Out of Vocabulary =====
Line 54: Line 67:
 | **Tides+DP11-train-en** | 1402536 | 52947 |
 | **Tides+DP11-train-hi** | 1434543 | 57131 |
 +| **tides.train+dictfilt-en**  |  1227614  |  48349  |
 +| **tides.train+dictfilt-hi**  |  1313857  |  53700  |
 +| **tides.train+DP11+dictfilt-en**  |  1404006  |  53219  |
 +| **tides.train+DP11+dictfilt-hi**  |  1435965  |  57366  |
 | **set1-en**             |  247399 | 20869 |
 | **set1-hi**             |  201266 | 16442 |
-| **set2-en**             |  247399 | 20869 |
 +| **set2-en**             |  368732 | 23316 |
 +| **set2-hi**             |  328553 | 20769 |
 +| **set3-en**             |  303059 | 21819 |
 +| **set3-hi**             |  272276 | 18178 |
 | **Tides-dev-en**        |   22485 |  5596 |
 | **Tides-dev-hi**        |   24363 |  5642 |
 | **Tides-test-en**       |   27169 |  5939 |
 | **Tides-test-hi**       |   28574 |  5872 |
 +
  
  - set1 = danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
Line 66: Line 87:
  - set3 = emille-om+danielpipes-11+agrocorp-11+wikiner2008+wikiner2009+acl2005
  
-^         Coverage               ^^^^^^^^^
-|                   | **tokens unseen in train**  |||  **types unseen in train**  |||
-|                   |  //Tides//  |  //Tides+DP//  |  //set1//  |  //Tides//  |  //Tides+DP//  |  //set1//  |
-| **Tides-test-en** |  369  |  348  |  2524 (9.290%)  |  363  |  343  |  1974 (33.238%)  |
-| **Tides-test-hi** |  839  |  830  |  3480 (12.179%)  |  642  |  633  |  2569 (43.750%)  |
-| **Tides-dev-en**  |  464  |  421  |  2072 (9.215%)  |  459  |  418  |  1750 (31.272%)  |
-| **Tides-dev-hi**  |  619  |  607  |  2946 (12.092%)  |  580  |  568  |  2325 (41.209%)  |
 +^         Coverage               ^^^^^^^^^^^
 +|                   | **tokens unseen in train**  |||||  **types unseen in train**  |||||
 +|                   |  //Tides//  |  //Tides+DP//  |  //Tides+dict//  |  //Tides+DP+dict//  |  //set1//  |  //set2//  |  //set3//  |  //Tides//  |  //Tides+DP//  |  //Tides+dict//  |  //Tides+DP+dict//  |  //set1//  |  //set2//  |  //set3//  |
 +| **Tides-test-en** |  369  |  348  |  363 (1.336%)  |  343 (1.262%)  |  2524 (9.290%)  |  2330 (8.576%)  |  2429 (8.940%)  |  363  |  343  |  357 (6.011%)  |  338 (5.691%)  |  1974 (33.238%)  |  1824 (30.712%)  |  1901 (32.009%)  |
 +| **Tides-test-hi** |  839  |  830  |  836 (2.926%)  |  828 (2.898%)  |  3480 (12.179%)  |  3233 (11.314%)  |  3310 (11.584%)  |  642  |  633  |  639 (10.882%)  |  631 (10.746%)  |  2569 (43.750%)  |  2412 (41.076%)  |  2465 (41.979%)  |
 +| **Tides-dev-en**  |  464  |  421  |  |  |  2072 (9.215%)  |  1732 (7.703%)  |  1873 (8.330%)  |  459  |  418  |  |  |  1750 (31.272%)  |  1498 (26.769%)  |  1608 (28.735%)  |
 +| **Tides-dev-hi**  |  619  |  607  |  |  |  2946 (12.092%)  |  2546 (10.450%)  |  2661 (10.922%)  |  580  |  568  |  |  |  2325 (41.209%)  |  2037 (36.104%)  |  2129 (37.735%)  |
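
The coverage figures above can be read as counts of evaluation tokens and types that never occur in the training data, with percentages taken over all tokens or all types of the evaluation set. A minimal sketch of that computation, assuming both corpora are already tokenized into lists of strings (the function name `unseen_in_train` and the result keys are hypothetical, not taken from the scripts used for this page):

```python
def unseen_in_train(train_tokens, eval_tokens):
    """Count evaluation tokens and types that do not occur in training."""
    train_vocab = set(train_tokens)
    eval_tokens = list(eval_tokens)
    eval_types = set(eval_tokens)

    # Every occurrence counts for tokens; each distinct form counts once for types.
    unseen_tokens = sum(1 for tok in eval_tokens if tok not in train_vocab)
    unseen_types = len(eval_types - train_vocab)

    def pct(part, whole):
        # Guard against an empty evaluation set.
        return 100.0 * part / whole if whole else 0.0

    return {
        "unseen_tokens": unseen_tokens,
        "unseen_tokens_pct": pct(unseen_tokens, len(eval_tokens)),
        "unseen_types": unseen_types,
        "unseen_types_pct": pct(unseen_types, len(eval_types)),
    }


if __name__ == "__main__":
    stats = unseen_in_train("a b c".split(), "a d d e".split())
    print(stats)
```

The percentages on this page are consistent with this definition: for Tides-test-en against set1, 2524 of 27,169 tokens is 9.290% and 1974 of 5,939 types is 33.238%.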
  
