
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


user:zeman:treebanks [2011/11/20 18:02]
zeman English domain.
user:zeman:treebanks [2011/11/20 18:25]
zeman English inside.
Line 1707: Line 1707:
 ==== Size ==== ==== Size ====
  
-CoNLL 2007: Wall Street Journal part of the Penn Treebank, sections 2-11 used for training, a subset of section 23 for testing. +Size of CoNLL 2007 data was limited because some teams of CoNLL 2006 complained that they did not have enough time and resources to train the larger models. Sections 2-11 of the Wall Street Journal part of the treebank were used for training and a subset of section 23 was used for testing.
- +
-All distributions of PDT are officially split to training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data that are annotated only morphologically (M-layer), those that are annotated both morphologically and analytically (surface syntax; M+A layers), and the smallest subset is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing. +
- +
-Size of CoNLL 2007 data was limited because some teams of CoNLL 2006 complained that they did not have enough time and resources to train the larger models. For CoNLL 2009, only that part of PDT was selected that contained also tectogrammatical annotation, because the 2009 task included semantic learning. +
- +
-Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).+
  
 ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^
-| PDT 0.5 |     19126 |    327,597 |  3697 |    63718 |   3787 |    65390 |  26610 |    456,705 |  17.16 | +| CoNLL 2007 |  18577 |    446,573 |   214 |     5003 |        |          |  18791 |    451,576 |  24.03 |
-| PDT 1.0 |     73088 |  1,255,590 |  7319 |  126,030 |   7507 |  125,713 |  87914 |  1,489,748 |  16.95 | +| CoNLL 2009 |  39279 |    958,167 |  1334 |    33368 |   2399 |    57676 |  43012 |  1,049,211 |  24.39 |
-| PDT 2.0 |     68562 |  1,172,299 |  9270 |  158,962 |  10148 |  173,586 |  87980 |  1,504,847 |  17.10 | +
-| CoNLL 2006 |  72703 |  1,249,408 |   365 |     5853 |        |          |  73068 |  1,255,261 |  17.18 | +
-| CoNLL 2007 |  25364 |    432,296 |   286 |     4724 |        |          |  25650 |    437,020 |  17.04 | +
-| CoNLL 2009 |  38727 |    652,544 |  5228 |    87988 |   4213 |    70348 |  48168 |    810,880 |  16.83 |+
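
The Total and Sentence Length columns of the added rows can be re-derived from the per-split counts. The following is a minimal sketch, assuming that Sentence Length means total tokens divided by total sentences, rounded to two decimals; the names `SPLITS` and `totals` are illustrative and not part of any treebank tooling.

```python
# Figures copied from the "+" rows of the table above: (sentences, tokens) per split.
SPLITS = {
    "CoNLL 2007": {"train": (18577, 446573), "d-test": (214, 5003)},
    "CoNLL 2009": {
        "train": (39279, 958167),
        "d-test": (1334, 33368),
        "e-test": (2399, 57676),
    },
}


def totals(splits):
    """Sum sentences and tokens over all splits and derive the mean sentence length.

    Assumption: "Sentence Length" is tokens / sentences, rounded to 2 decimals.
    """
    sentences = sum(s for s, _ in splits.values())
    tokens = sum(t for _, t in splits.values())
    return sentences, tokens, round(tokens / sentences, 2)


for version, splits in SPLITS.items():
    print(version, *totals(splits))
```

Under that assumption the sketch reproduces the table: 18791 sentences, 451,576 tokens and 24.03 for CoNLL 2007; 43012 sentences, 1,049,211 tokens and 24.39 for CoNLL 2009.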
  
 ==== Inside ==== ==== Inside ====
  
-CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in Richard Johansson, Pierre Nugues: [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|Extended constituent-to-dependency conversion for English]]. In: Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA), pp. 105-112, Tartu, Estonia, 2007. +The original Penn Treebank uses the [[:format-penn|Penn MRG ("merged") bracketing format]]. CoNLL 2007 uses the [[:format-conll|CoNLL-X format]]; CoNLL 2008 and 2009 format is slightly different (number and meaning of columns).
- +
-PDT 1.0 is distributed in the [[::format-csts|CSTS format]]. PDT 2.0 uses the [[::format-pml|PML format]]. CoNLL 2006 and 2007 uses the [[:format-conll|CoNLL-X format]]; CoNLL 2009 format is slightly different (number and meaning of columns). Unlike the other formats, the CSTS format used the ISO-8859-2 character encoding. +
- +
-The CSTS format (PDT 0.5 and 1.0) contains morphological annotation (lemmas and tags) both manual and by two taggers. The CoNLL 2009 version contains manual and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology. +
- +
-The original PDT uses 15-character positional morphological tags. The CoNLL versions convert the tags to the two/three CoNLL columns, CPOS, POS and FEAT. In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to lemma in PDT (see [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Hana and Zeman, 2005]]).+
  
-See above for documentation of the morphological tags. All CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=cs::pdt|DZ Interset]] to inspect the PDT and the CoNLL tagsets.+Conversion for CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|(Johansson and Nugues, 2007)]].
  
-The guidelines for syntactic annotation are documented in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|PDT annotation manual]].+The original Penn Treebank contains non-terminal labels, function tags and part-of-speech tags, all assigned manually. The CoNLL 2009 version contains manual and automatic disambiguation. See above for documentation of the part-of-speech tags. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=en::penn|DZ Interset]] to inspect the tagset.
  
 ==== Sample ==== ==== Sample ====
