===== Czech (cs) =====
[[http://ufal.mff.cuni.cz/pdt2.0/|Prague Dependency Treebank]] (PDT)
==== Versions ====
* PDT 0.5 (1998)
* PDT 1.0 (2001)
* PDT 2.0 (2006)
* PDT 2.5 (2011)
* PDT 3.0 (2013)
* CoNLL 2006
* CoNLL 2007
* CoNLL 2009
The CoNLL 2006 version is based on PDT 1.0. The CoNLL 2007 and 2009 versions are based on PDT 2.0.
==== Obtaining and License ====
The original PDT 1.0 and 2.0 are distributed by the LDC under the catalogue numbers [[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10|LDC2001T10]] and [[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01|LDC2006T01]], respectively. They are free for 2001 and 2006 LDC members, respectively; the price for non-members is unknown (contact the LDC). The [[http://ufal.mff.cuni.cz/corp-lic/pdt20-ord.html|license]] in short:
* non-commercial research usage
* no redistribution
* include in publications: “The Prague Dependency Treebank, version 2.0 has been developed by the Institute of Formal and Applied Linguistics, http://ufal.mff.cuni.cz/.”
The CoNLL 2006, 2007 and 2009 versions are obtainable upon request under similar license terms. Their publication in the LDC together with the other CoNLL treebanks is being prepared.
The PDT 2.5 version is available for download from the [[https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0006-DB11-8|LINDAT CLARIN repository]] under the [[http://creativecommons.org/licenses/by-nc-sa/3.0/|Creative Commons BY-NC-SA 3.0 license]], i.e.:
* non-commercial usage
* redistribution possible under the same license
* cite in publications
PDT was created by members of the [[http://ufal.mff.cuni.cz/|Institute of Formal and Applied Linguistics]] (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia. The CoNLL 2006 conversion of the treebank was prepared by Yuval Krymolowski; the CoNLL 2007 and 2009 conversions were prepared by ÚFAL (Zdeněk Žabokrtský and Jan Štěpánek).
==== References ====
* Website
  * http://ufal.mff.cuni.cz/pdt3.0/
  * http://ufal.mff.cuni.cz/pdt2.5/
  * http://ufal.mff.cuni.cz/pdt2.0/
  * http://ufal.mff.cuni.cz/pdt/
* Data
  * Jan Hajič, Eva Hajičová, Petr Pajas, Jarmila Panevová, Petr Sgall: //Prague Dependency Treebank 1.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10|LDC2001T10]]). Linguistic Data Consortium, Philadelphia, USA, 2001. ISBN 1-58563-212-0.
  * Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová: //Prague Dependency Treebank 2.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01|LDC2006T01]]). Linguistic Data Consortium, Philadelphia, USA, 2006. ISBN 1-58563-370-4.
  * Eduard Bejček, Jan Hajič, Jarmila Panevová, Jiří Mírovský, Johanka Spoustová, Jan Štěpánek, Pavel Straňák, Pavel Šidák, Pavlína Vimmrová, Eva Šťastná, Magda Ševčíková, Lenka Smejkalová, Petr Homola, Jan Popelka, Markéta Lopatková, Lucie Hrabalová, Natalia Klyueva, Zdeněk Žabokrtský: //Prague Dependency Treebank 2.5//. Univerzita Karlova v Praze, Praha, Czechia, 2011-12-06, http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
* Principal publications
  * Jan Hajič, Alena Böhmová, Eva Hajičová, Barbora Hladká: [[http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHajicovaAl2000.pdf|The Prague Dependency Treebank: A Three-Level Annotation Scenario]]. In: Anne Abeillé (ed.): //Treebanks: Building and Using Parsed Corpora//, pages 103-127, Kluwer, Amsterdam, The Netherlands, 2000.
  * Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, Zdeněk Žabokrtský: [[http://aclweb.org/anthology/C/C12/C12-1015.pdf|Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0]]. In: //Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012),// pages 231-246, Mumbai, India, 2012.
* Documentation
  * Jiří Hana, Daniel Zeman: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Manual for Morphological Annotation]], Revision for the Prague Dependency Treebank 2.0, ÚFAL Technical Report No. 2005-27, Praha, Czechia, 2005.
  * Jan Hajič, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, Alla Bémová: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Annotations at Analytical Level]], Instructions for annotators, ÚFAL MFF UK, Praha, Czechia, 1999.
* Wiki: [[internal:pdt30|Errors in PDT 3.0 (Chyby v PDT 3.0)]]
==== Domain ====
Newswire text (Lidové noviny, Mladá fronta Dnes), business weekly (Českomoravský Profit) and a scientific magazine (Vesmír).
==== Size ====
All distributions of PDT are officially split into training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data annotated only morphologically (M-layer), data annotated both morphologically and analytically (surface syntax; M+A layers), and a smallest subset that is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing.
The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train models on the larger data. For CoNLL 2009, only the part of PDT that also contains tectogrammatical annotation was selected, because the 2009 task included semantic learning.
Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences are counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).
PDT 3.0 also distinguishes d-test and e-test, but I currently only have the train and d-test counts summed up; the sum is listed in the train columns. To be updated.
^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Mean Sentence Length ^
| PDT 0.5 | 19126 | 327,597 | 3697 | 63,718 | 3787 | 65,390 | 26610 | 456,705 | 17.16 |
| PDT 1.0 | 73088 | 1,255,590 | 7319 | 126,030 | 7507 | 125,713 | 87914 | 1,489,748 | 16.95 |
| PDT 2.0 | 68562 | 1,172,299 | 9270 | 158,962 | 10148 | 173,586 | 87980 | 1,504,847 | 17.10 |
| PDT 3.0 | 77765 | 1,330,152 | (in train) | (in train) | 10148 | 173,586 | 87913 | 1,503,738 | 17.10 |
| CoNLL 2006 | 72703 | 1,249,408 | 365 | 5,853 | | | 73068 | 1,255,261 | 17.18 |
| CoNLL 2007 | 25364 | 432,296 | 286 | 4,724 | | | 25650 | 437,020 | 17.04 |
| CoNLL 2009 | 38727 | 652,544 | 5228 | 87,988 | 4213 | 70,348 | 48168 | 810,880 | 16.83 |
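The last column of the table can be re-derived from the totals: the mean sentence length is the total number of tokens divided by the total number of sentences. A minimal Python sketch using the totals from the table; the names ''TOTALS'' and ''mean_sentence_length'' are illustrative and not part of any PDT tool:

```python
# Totals copied from the table: version -> (total sentences, total tokens).
TOTALS = {
    "PDT 0.5": (26610, 456705),
    "PDT 1.0": (87914, 1489748),
    "PDT 2.0": (87980, 1504847),
    "PDT 3.0": (87913, 1503738),
    "CoNLL 2006": (73068, 1255261),
    "CoNLL 2007": (25650, 437020),
    "CoNLL 2009": (48168, 810880),
}


def mean_sentence_length(sentences, tokens):
    """Average number of tokens per sentence, rounded to two decimals."""
    return round(tokens / sentences, 2)


# Print one line per version in the same order as the table.
for version, (sentences, tokens) in TOTALS.items():
    print(f"{version}: {mean_sentence_length(sentences, tokens):.2f}")
```

The same function can be applied to the train, d-test or e-test columns separately if per-split averages are needed.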
==== Inside ====
PDT 1.0 is distributed in the [[::format-csts|CSTS format]]. PDT 2.0 uses the [[::format-pml|PML format]]. CoNLL 2006 and 2007 use the [[:format-conll|CoNLL-X format]]; the CoNLL 2009 format is slightly different (number and meaning of columns). Unlike the other formats, the CSTS format uses the ISO-8859-2 character encoding.
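For orientation, a CoNLL-X file has one token per line with ten tab-separated columns, and a blank line ends each sentence. The following Python sketch reads that layout. The function name ''read_conllx'' and the two-token sample are made up for illustration and are not taken from the treebank; the CoNLL 2009 files have a different column inventory and are not covered here.

```python
# Column names of the CoNLL-X (2006/2007) format, in file order.
CONLLX_COLUMNS = ("id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel")


def read_conllx(lines):
    """Yield sentences as lists of dicts, one dict per token.

    Each token line is tab-separated; a blank line ends a sentence.
    """
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(dict(zip(CONLLX_COLUMNS, line.split("\t"))))
    if sentence:  # file may end without a trailing blank line
        yield sentence


# Made-up two-token example, NOT copied from the treebank.
SAMPLE = (
    "1\tPes\tpes\tN\tN\tGen=M|Num=S|Cas=1|Neg=A\t2\tSb\t_\t_\n"
    "2\tspí\tspát\tV\tB\tNum=S|Per=3|Ten=P|Neg=A|Voi=A\t0\tPred\t_\t_\n"
    "\n"
)

for sent in read_conllx(SAMPLE.splitlines(keepends=True)):
    print([(tok["form"], tok["head"], tok["deprel"]) for tok in sent])
```

To read from disk, pass an open file object instead of the sample lines. CSTS files would need ''open(path, encoding="iso-8859-2")''; for the CoNLL files, UTF-8 is assumed here.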
The CSTS format (PDT 0.5 and 1.0) contains morphological annotation (lemmas and tags) assigned manually as well as the output of two taggers. The CoNLL 2009 version contains the manual disambiguation and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology.
The original PDT uses 15-character positional morphological tags. The CoNLL versions convert the tags to the two or three CoNLL columns (CPOS, POS and FEAT). In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to the lemma in PDT (see [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Hana and Zeman, 2005]]).
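The conversion of a positional tag into CoNLL-style columns can be sketched as follows. This is only an illustration, not the actual conversion script. The position names and short feature names in ''POSITIONS'' are assumptions, and so is the choice of taking CPOS from position 1 and POS from position 2. The Sem feature, which comes from the lemma, is not handled.

```python
# Assumed names for the 15 positions of a PDT positional tag; positions
# 3-15 double as assumed short feature names for the FEAT column.
POSITIONS = ("POS", "SubPOS", "Gen", "Num", "Cas", "PGe", "PNu",
             "Per", "Ten", "Gra", "Neg", "Voi", "Res1", "Res2", "Var")


def split_positional_tag(tag):
    """Split a 15-character positional tag into (CPOS, POS, FEAT).

    CPOS is taken from position 1 and POS from position 2 (assumption).
    FEAT joins the remaining positions whose value is not '-' as
    Name=Value pairs separated by '|', or '_' if none is set.
    """
    if len(tag) != len(POSITIONS):
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    cpos, pos = tag[0], tag[1]
    feats = [f"{name}={value}"
             for name, value in zip(POSITIONS[2:], tag[2:])
             if value != "-"]
    return cpos, pos, "|".join(feats) or "_"


# Hypothetical example tag (15 characters).
print(split_positional_tag("NNMS1-----A----"))
```

Keeping the position names in one tuple makes it easy to swap in the feature names from the README of a particular CoNLL distribution.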
See above for documentation of the morphological tags. All CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=cs::pdt|DZ Interset]] to inspect the PDT and the CoNLL tagsets.
The guidelines for syntactic annotation are documented in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|PDT annotation manual]].
==== Sample ====
The first sentence of the PDT 1.0 training data:
  Českomoravský profit
  js  1996-2000  Manual analytical annotation
  kk,lk  1996-2000  Manual morphological annotation
  s  inf  mix  j  1994  y  cmpr9406  001