
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


Old revision: user:zeman:treebanks [2011/11/20 18:14], zeman: English size.
New revision: user:zeman:treebanks [2011/11/20 18:25], zeman: English inside.
Line 1715:
 ==== Inside ====
  
-CoNLL 2007. Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in Richard Johansson, Pierre Nugues: [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|Extended constituent-to-dependency conversion for English]]. In: Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA), pp. 105-112, Tartu, Estonia, 2007.
+The original Penn Treebank uses the [[:format-penn|Penn MRG ("merged") bracketing format]]. CoNLL 2007 uses the [[:format-conll|CoNLL-X format]]; CoNLL 2008 and 2009 format is slightly different (number and meaning of columns).
  
-PDT 1.0 is distributed in the [[::format-csts|CSTS format]]. PDT 2.0 uses the [[::format-pml|PML format]]. CoNLL 2006 and 2007 uses the [[:format-conll|CoNLL-X format]]. CoNLL 2009 format is slightly different (number and meaning of columns). Unlike the other formats, the CSTS format used the ISO-8859-2 character encoding.
+Conversion for CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|(Johansson and Nugues, 2007)]].
  
-The CSTS format (PDT 0.5 and 1.0) contains morphological annotation (lemmas and tags) both manual and by two taggers. The CoNLL 2009 version contains manual and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology.
+The original Penn Treebank contains non-terminal labels, function tags and part-of-speech tags, all assigned manually. The CoNLL 2009 version contains manual and automatic disambiguation. See above for documentation of the part-of-speech tags. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=en::penn|DZ Interset]] to inspect the tagset.
-
-The original PDT uses 15-character positional morphological tags. The CoNLL versions convert the tags to the two/three CoNLL columns, CPOS, POS and FEAT. In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to lemma in PDT (see [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Hana and Zeman, 2005]]).
-
-See above for documentation of the morphological tags. All CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=cs::pdt|DZ Interset]] to inspect the PDT and the CoNLL tagsets.
-
-The guidelines for syntactic annotation are documented in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|PDT annotation manual]].
  
 ==== Sample ====
