Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks [2011/11/19 23:23] zeman Greek parsing. |
user:zeman:treebanks [2011/11/20 18:53] zeman Somehow it no longer fits here. |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Treebanks for Various Languages ====== | ====== Treebanks for Various Languages ====== | ||
+ | |||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
+ | * [[user: | ||
===== Arabic (ar) ===== | ===== Arabic (ar) ===== | ||
Line 1661: | Line 1671: | ||
The two Malt parser results of 2007 (single malt and blended) are described in [[http:// | The two Malt parser results of 2007 (single malt and blended) are described in [[http:// | ||
+ | |||
+ | ===== English (en) ===== | ||
+ | |||
+ | [[http:// | ||
+ | |||
+ | ==== Versions ==== | ||
+ | |||
+ | * Penn Treebank 2 (1995) | ||
+ | * Penn Treebank 3 (1999) | ||
+ | * CoNLL 2007 | ||
+ | * CoNLL 2008 | ||
+ | * CoNLL 2009 | ||
+ | |||
+ | ==== Obtaining and License ==== | ||
+ | |||
+ | The original Penn Treebank is distributed by the LDC under the catalogue number [[http:// | ||
+ | |||
+ | * non-commercial education and research usage | ||
+ | * no redistribution | ||
+ | * citation in publications is not explicitly required, but it is common decency | ||
+ | |||
+ | The CoNLL 2007, 2008 and 2009 versions are also licensed by the LDC, and LDC members can keep them after the shared task. Those who have not participated in the shared task may inquire at the LDC about the availability of the datasets. Their republication by the LDC is planned but has not happened yet. | ||
+ | |||
+ | The Penn Treebank was created by members of the [[http:// | ||
+ | |||
+ | ==== References ==== | ||
+ | |||
+ | * Website | ||
+ | * http:// | ||
+ | * Data | ||
+ | * Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, | ||
+ | * Principal publications | ||
+ | * Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz: | ||
+ | * Documentation | ||
+ | * [[http:// | ||
+ | * Beatrice Santorini: [[ftp:// | ||
+ | * Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre: [[ftp:// | ||
+ | * Robert MacIntyre: [[ftp:// | ||
+ | * Richard Johansson, Pierre Nugues: [[http:// | ||
+ | |||
+ | ==== Domain ==== | ||
+ | |||
+ | Financial news from the Wall Street Journal (1989). The constituent-based Treebank-3 also contains parsed versions of ATIS-3 and of the Brown Corpus. Only WSJ texts have been converted to dependencies for the CoNLL shared tasks. | ||
+ | |||
+ | ==== Size ==== | ||
+ | |||
+ | The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train the larger models. Sections 2-11 of the Wall Street Journal part of the treebank were used for training and a subset of section 23 was used for testing. | ||
+ | |||
+ | ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Avg. Sentence Length (tokens) ^ | ||
+ | | CoNLL 2007 | 18,577 | 446,573 | 214 | 5,003 | | | 18,791 | 451,576 | 24.03 | | ||
+ | | CoNLL 2009 | 39,279 | 958,167 | 1,334 | 33,368 | 2,399 | 57,676 | 43,012 | 1,049,211 | 24.39 | | ||
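The totals and the sentence-length column in the table above can be rechecked from the per-split counts. The following is a minimal sketch, not part of the original page; the names `Split` and `totals` are hypothetical helpers introduced only for this illustration, and the sentence length is assumed to be total tokens divided by total sentences, rounded to two decimals.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Split:
    """Sentence and token counts of one data split (hypothetical helper)."""
    sentences: int
    tokens: int


def totals(*splits):
    """Return (total sentences, total tokens, average tokens per sentence)."""
    sentences = sum(s.sentences for s in splits)
    tokens = sum(s.tokens for s in splits)
    return sentences, tokens, round(tokens / sentences, 2)


# Counts copied from the table: train and D-test for CoNLL 2007;
# train, D-test and E-test for CoNLL 2009.
conll2007 = totals(Split(18577, 446573), Split(214, 5003))
conll2009 = totals(Split(39279, 958167), Split(1334, 33368), Split(2399, 57676))

print(conll2007)
print(conll2009)
```

Under that assumption, both rows reproduce the totals and averages listed in the table.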
+ | |||
+ | ==== Inside ==== | ||
+ | |||
+ | The original Penn Treebank uses the [[: | ||
+ | |||
+ | Conversion for CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in [[http:// | ||
+ | |||
+ | The original Penn Treebank contains non-terminal labels, function tags and part-of-speech tags, all assigned manually. The CoNLL 2009 version contains manual and automatic disambiguation. See above for documentation of the part-of-speech tags. Use [[http:// | ||