Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks [2011/11/20 18:02] zeman English domain. |
user:zeman:treebanks [2011/11/20 18:25] zeman English inside. |
||
---|---|---|---|
Line 1707: | Line 1707: | ||
==== Size ==== | ==== Size ==== | ||
- | CoNLL 2007: Wall Street Journal part of the Penn Treebank, sections 2-11 used for training, a subset of section 23 for testing. | + | Size of CoNLL 2007 data was limited because some teams of CoNLL 2006 complained that they did not have enough time and resources to train the larger models. |
- | + | ||
- | All distributions of PDT are officially split to training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data that are annotated only morphologically (M-layer), those that are annotated both morphologically and analytically (surface syntax; M+A layers), and the smallest subset is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing. | + | |
- | + | ||
- | Size of CoNLL 2007 data was limited because some teams of CoNLL 2006 complained that they did not have enough time and resources to train the larger models. | + | |
- | + | ||
- | Parts of the following table have been taken from [[http:// | + | |
^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ | ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ | ||
- | | PDT 0.5 | 19126 | 327,597 | 3697 | 63718 | 3787 | 65390 | 26610 | 456,705 | 17.16 | | + | | CoNLL 2007 | |
- | | PDT 1.0 | 73088 | 1,255,590 | 7319 | 126,030 | 7507 | 125,713 | 87914 | 1,489,748 | 16.95 | | + | | CoNLL 2009 | |
- | | PDT 2.0 | 68562 | 1,172,299 | 9270 | 158,962 | 10148 | 173,586 | 87980 | 1,504,847 | 17.10 | | + | |
- | | CoNLL 2006 | 72703 | 1,249,408 | 365 | 5853 | | | 73068 | 1,255,261 | 17.18 | | + | |
- | | CoNLL 2007 | | + | |
- | | CoNLL 2009 | | + | |
==== Inside ==== | ==== Inside ==== | ||
- | CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. | + | The original Penn Treebank uses the [[:format-penn|Penn MRG (" |
- | + | ||
- | PDT 1.0 is distributed in the [[:: | + | |
- | + | ||
- | The CSTS format (PDT 0.5 and 1.0) contains morphological annotation (lemmas and tags) both manual and by two taggers. The CoNLL 2009 version contains manual and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology. | + | |
- | + | ||
- | The original PDT uses 15-character positional morphological tags. The CoNLL versions convert the tags to the two/three CoNLL columns, CPOS, POS and FEAT. In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to lemma in PDT (see [[http:// | + | |
- | See above for documentation of the morphological | + | Conversion |
- | The guidelines | + | The original Penn Treebank contains non-terminal labels, function tags and part-of-speech tags, all assigned manually. The CoNLL 2009 version contains manual and automatic disambiguation. See above for documentation of the part-of-speech tags. Use [[http://quest.ms.mff.cuni.cz/ |
==== Sample ==== | ==== Sample ==== |