
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

pub-company:icon2009 [2009/10/16 10:09]
zeman created
pub-company:icon2009 [2009/10/22 14:22]
zeman Tides statistics.
Line 1: Line 1:
 ====== English-Hindi Translation – Obtaining Mediocre Results with Bad Data and Fancy Models ======
 +
 ====== UNDER CONSTRUCTION! ======
  
Line 30: Line 31:
   * Link tables from the paper to concrete settings
   * Link to the PDF version of the paper; link to Biblio?
 +
 +===== Data =====
 +
 +==== IIIT Tides ====
 +
 +This dataset was originally collected for the DARPA-TIDES surprise-language contest in 2002, later refined at IIIT Hyderabad, and provided for the NLP Tools Contest at ICON 2008. For availability, enquire at IIIT (http://ltrc.iiit.ac.in/). If you have the data, check that your copy contains the same numbers of sentences and tokens as listed in the table below. The Hindi side is provided in two encodings, WX romanization and UTF-8. Although the two should be equivalent, we always work with UTF-8.
 +
 +| **Part** | **Sentences** | **Tokens en** | **Tokens hi** |
 +| train |  50,000 |  1,195,436 |  1,287,174 |
 +| dev |  1,000 |  21,842 |  23,851 |
 +| test |  1,000 |  26,537 |  27,979 |
 +
 +The test data contain 1 reference translation per sentence.
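
A minimal sketch of how the sentence and token counts could be checked against a local copy of the data. It assumes one sentence per line, UTF-8 encoding, and tokens separated by whitespace; the helper names and the file paths passed on the command line are hypothetical, and the tokenization behind the table above may differ, so small deviations in the token counts are possible.

```python
import sys
from typing import Iterable, Tuple


def count_sentences_and_tokens(lines: Iterable[str]) -> Tuple[int, int]:
    """Count lines (assumed: one sentence per line) and whitespace-separated tokens."""
    sentences = 0
    tokens = 0
    for line in lines:
        sentences += 1
        tokens += len(line.split())
    return sentences, tokens


def count_file(path: str) -> Tuple[int, int]:
    """Count sentences and tokens in a UTF-8 text file."""
    with open(path, encoding="utf-8") as handle:
        return count_sentences_and_tokens(handle)


if __name__ == "__main__":
    # Usage (hypothetical file names): python count.py train.en train.hi
    for path in sys.argv[1:]:
        n_sentences, n_tokens = count_file(path)
        print(f"{path}\t{n_sentences}\t{n_tokens}")
```

Both sides of a parallel part (for example the English and Hindi training files) should report the same number of sentences.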
 +
 +===== Out of Vocabulary =====
 +No data have been lemmatised, so all counts refer to word forms.
 +^           Vocabulary Size               ^^^
 +^ data                ^ tokens  ^ types ^
 +| **Tides-train-en**      | 1226144 | 48048 |
 +| **Tides-train-hi**      | 1312435 | 53451 |
 +| **Tides+DP11-train-en** | 1402536 | 52947 |
 +| **Tides+DP11-train-hi** | 1434543 | 57131 |
 +| **Tides-dev-en**        |   22485 |  5596 |
 +| **Tides-dev-hi**        |   24363 |  5642 |
 +| **Tides-test-en**       |   27169 |  5939 |
 +| **Tides-test-hi**       |   28574 |  5872 |
 +
 +
 +^         Coverage               ^^^^^^^^^
 +|                   | **tokens seen in train**  ||||  **types seen in train**  ||||
 +|                   |  //Tides//  ||  //Tides+DP//  ||  //Tides//  ||  //Tides+DP//  ||
 +|                   | abs |  OOV  | abs |    OOV   | abs |  OOV  | abs |    OOV   |
 +| **Tides-test-en** |                            |                            |
 +| **Tides-test-hi** |                            |                            |
 +| **Tides-dev-en**  |                            |                            |
 +| **Tides-dev-hi**  |                            |                            |
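
The coverage figures could be computed along the following lines. This is only a sketch under stated assumptions: tokens are whitespace-separated word forms, "abs" is read as the absolute number of test or dev tokens (or types) also seen in the training data, and "OOV" as the share that is not seen. The function names are hypothetical and this is not necessarily the script used for the paper.

```python
from collections import Counter
from typing import Dict, Iterable


def vocabulary(lines: Iterable[str]) -> Counter:
    """Map each word form to its frequency (whitespace tokenization assumed)."""
    counts: Counter = Counter()
    for line in lines:
        counts.update(line.split())
    return counts


def coverage(train_vocab: Counter, eval_vocab: Counter) -> Dict[str, float]:
    """Report how many tokens and types of eval_vocab occur in train_vocab."""
    tokens = sum(eval_vocab.values())
    types = len(eval_vocab)
    tokens_seen = sum(c for w, c in eval_vocab.items() if w in train_vocab)
    types_seen = sum(1 for w in eval_vocab if w in train_vocab)
    return {
        "tokens": tokens,
        "types": types,
        "tokens_seen": tokens_seen,  # assumed meaning of the "abs" column
        "types_seen": types_seen,
        # Assumed meaning of the "OOV" column: share of unseen items.
        "token_oov_rate": 1 - tokens_seen / tokens if tokens else 0.0,
        "type_oov_rate": 1 - types_seen / types if types else 0.0,
    }
```

For example, `coverage(vocabulary(train_lines), vocabulary(test_lines))` yields the token and type counts of the test set together with the seen counts and OOV rates, where `train_lines` and `test_lines` are any iterables of sentences.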
 +
