user:zeman:treebanks:cs [2011/11/20 21:17] zeman created |
user:zeman:treebanks:cs [2014/04/04 14:54] (current) zeman PDT 3.0 size. |
  * PDT 1.0 (2001)
  * PDT 2.0 (2006)
  * PDT 2.5 (2011)
  * PDT 3.0 (2013)
  * CoNLL 2006
  * CoNLL 2007

The CoNLL 2006, 2007 and 2009 versions are obtainable upon request under similar license terms. Their publication through the LDC, together with the other CoNLL treebanks, is being prepared.

The PDT 2.5 version is available for download from the [[https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0006-DB11-8|LINDAT CLARIN repository]] under the [[http://creativecommons.org/licenses/by-nc-sa/3.0/|Creative Commons BY-NC-SA 3.0 license]], i.e.:

  * non-commercial use only
  * redistribution permitted under the same license
  * citation required in publications

PDT was created by members of the [[http://ufal.mff.cuni.cz/|Institute of Formal and Applied Linguistics]] (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia. The CoNLL 2006 conversion of the treebank was prepared by Yuval Krymolowski; the CoNLL 2007 and 2009 conversions were prepared by ÚFAL (Zdeněk Žabokrtský and Jan Štěpánek).

  * Website
    * http://ufal.mff.cuni.cz/pdt3.0/
    * http://ufal.mff.cuni.cz/pdt2.5/
    * http://ufal.mff.cuni.cz/pdt2.0/
    * http://ufal.mff.cuni.cz/pdt/
  * Data
    * Jan Hajič, Eva Hajičová, Petr Pajas, Jarmila Panevová, Petr Sgall: //Prague Dependency Treebank 1.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10|LDC2001T10]]). Linguistic Data Consortium, Philadelphia, USA, 2001. ISBN 1-58563-212-0.
    * Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová: //Prague Dependency Treebank 2.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01|LDC2006T01]]). Linguistic Data Consortium, Philadelphia, USA, 2006. ISBN 1-58563-370-4.
    * Eduard Bejček, Jan Hajič, Jarmila Panevová, Jiří Mírovský, Johanka Spoustová, Jan Štěpánek, Pavel Straňák, Pavel Šidák, Pavlína Vimmrová, Eva Šťastná, Magda Ševčíková, Lenka Smejkalová, Petr Homola, Jan Popelka, Markéta Lopatková, Lucie Hrabalová, Natalia Klyueva, Zdeněk Žabokrtský: //Prague Dependency Treebank 2.5//. Univerzita Karlova v Praze, Praha, Czechia, 2011-12-06, http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
  * Principal publications
    * Jan Hajič, Alena Böhmová, Eva Hajičová, Barbora Hladká: [[http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHajicovaAl2000.pdf|The Prague Dependency Treebank: A Three-Level Annotation Scenario]]. In: Anne Abeillé (ed.): //Treebanks: Building and Using Parsed Corpora//, pages 103-127, Kluwer, Amsterdam, The Netherlands, 2000.
    * Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, Zdeněk Žabokrtský: [[http://aclweb.org/anthology/C/C12/C12-1015.pdf|Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0]]. In: //Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012),// pages 231-246, Mumbai, India, 2012.
  * Documentation
    * Jiří Hana, Daniel Zeman: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Manual for Morphological Annotation]], Revision for the Prague Dependency Treebank 2.0, ÚFAL Technical Report No. 2005-27, Praha, Czechia, 2005.
    * Jan Hajič, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, Alla Bémová: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Annotations at Analytical Level]], Instructions for annotators, ÚFAL MFF UK, Praha, Czechia, 1999.
    * Wiki: [[internal:pdt30|Chyby v PDT 3.0 (Errors in PDT 3.0)]]

==== Domain ====

Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences are counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).

PDT 3.0 also distinguishes d-test and e-test, but I currently have only the combined counts for train and d-test, which are listed in the Train columns. To be updated.

^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^
| PDT 1.0 | 73088 | 1,255,590 | 7319 | 126,030 | 7507 | 125,713 | 87914 | 1,489,748 | 16.95 |
| PDT 2.0 | 68562 | 1,172,299 | 9270 | 158,962 | 10148 | 173,586 | 87980 | 1,504,847 | 17.10 |
| PDT 3.0 | 77765 | 1,330,152 | (in train) | (in train) | 10148 | 173,586 | 87913 | 1,503,738 | 17.10 |
| CoNLL 2006 | 72703 | 1,249,408 | 365 | 5853 | | | 73068 | 1,255,261 | 17.18 |
| CoNLL 2007 | 25364 | 432,296 | 286 | 4724 | | | 25650 | 437,020 | 17.04 |
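
The Sentence Length column appears to be the total token count divided by the total sentence count. A minimal Python sketch, assuming that definition, recomputes the column from the totals in the table. The names `TOTALS` and `mean_sentence_length` are illustrative only and are not part of any PDT tooling.

```python
# Total sentence and token counts copied from the table above.
TOTALS = {
    "PDT 1.0": (87914, 1_489_748),
    "PDT 2.0": (87980, 1_504_847),
    "PDT 3.0": (87913, 1_503_738),
    "CoNLL 2006": (73068, 1_255_261),
    "CoNLL 2007": (25650, 437_020),
}


def mean_sentence_length(sentences: int, tokens: int) -> float:
    """Average number of tokens per sentence, rounded to two decimals.

    Assumption: this is how the Sentence Length column was derived.
    """
    return round(tokens / sentences, 2)


if __name__ == "__main__":
    for version, (sentences, tokens) in TOTALS.items():
        print(f"{version}: {mean_sentence_length(sentences, tokens):.2f}")
```

Under this assumption the recomputed values agree with the Sentence Length column for all five rows.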