Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:cs [2011/11/20 21:17]
zeman created
user:zeman:treebanks:cs [2014/04/04 14:54] (current)
zeman PDT 3.0 size.
Line 8: Line 8:
   * PDT 1.0 (2001)
   * PDT 2.0 (2006)
 +  * PDT 2.5 (2011)
 +  * PDT 3.0 (2013)
   * CoNLL 2006
   * CoNLL 2007
Line 23: Line 25:
  
 The CoNLL 2006, 2007 and 2009 versions are obtainable upon request under similar license terms. Their publication in the LDC together with the other CoNLL treebanks is being prepared.
 +
 +The PDT 2.5 version is available for download from the [[https://ufal-point.mff.cuni.cz/xmlui/handle/11858/00-097C-0000-0006-DB11-8|LINDAT CLARIN repository]] under the [[http://creativecommons.org/licenses/by-nc-sa/3.0/|Creative Commons BY-NC-SA 3.0 license]], i.e.:
 +
 +  * non-commercial usage
 +  * redistribution possible under the same license
 +  * cite in publications
  
 PDT was created by members of the [[http://ufal.mff.cuni.cz/|Institute of Formal and Applied Linguistics]] (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia. The CoNLL 2006 conversion of the treebank was prepared by Yuval Krymolowski; the CoNLL 2007 and 2009 conversions were prepared by ÚFAL (Zdeněk Žabokrtský and Jan Štěpánek).
Line 29: Line 37:
  
   * Website
 +    * http://ufal.mff.cuni.cz/pdt3.0/
 +    * http://ufal.mff.cuni.cz/pdt2.5/
     * http://ufal.mff.cuni.cz/pdt2.0/
 +    * http://ufal.mff.cuni.cz/pdt/
   * Data
     * Jan Hajič, Eva Hajičová, Petr Pajas, Jarmila Panevová, Petr Sgall: //Prague Dependency Treebank 1.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10|LDC2001T10]]). Linguistic Data Consortium, Philadelphia, USA, 2001. ISBN 1-58563-212-0.
     * Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová: //Prague Dependency Treebank 2.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01|LDC2006T01]]). Linguistic Data Consortium, Philadelphia, USA, 2006. ISBN 1-58563-370-4.
 +    * Eduard Bejček, Jan Hajič, Jarmila Panevová, Jiří Mírovský, Johanka Spoustová, Jan Štěpánek, Pavel Straňák, Pavel Šidák, Pavlína Vimmrová, Eva Šťastná, Magda Ševčíková, Lenka Smejkalová, Petr Homola, Jan Popelka, Markéta Lopatková, Lucie Hrabalová, Natalia Klyueva, Zdeněk Žabokrtský: //Prague Dependency Treebank 2.5//. Univerzita Karlova v Praze, Praha, Czechia, 2011-12-06, http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
   * Principal publications
     * Jan Hajič, Alena Böhmová, Eva Hajičová, Barbora Hladká: [[http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHajicovaAl2000.pdf|The Prague Dependency Treebank: A Three-Level Annotation Scenario]]. In: Anne Abeillé (ed.): //Treebanks: Building and Using Parsed Corpora//, pages 103-127, Kluwer, Amsterdam, The Netherlands, 2000.
 +    * Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, Zdeněk Žabokrtský: [[http://aclweb.org/anthology/C/C12/C12-1015.pdf|Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0]]. In: //Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012),// pages 231-246, Mumbai, India, 2012.
   * Documentation
     * Jiří Hana, Daniel Zeman: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Manual for Morphological Annotation]], Revision for the Prague Dependency Treebank 2.0, ÚFAL Technical Report No. 2005-27, Praha, Czechia, 2005.
     * Jan Hajič, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, Alla Bémová: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Annotations at Analytical Level]], Instructions for annotators, ÚFAL MFF UK, Praha, Czechia, 1999.
 +    * Wiki: [[internal:pdt30|Errors in PDT 3.0]]
  
 ==== Domain ====
Line 50: Line 64:
  
 Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences are counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).
 +
 +PDT 3.0 also distinguishes d-test and e-test, but I currently only have the train and d-test counts summed together, so they appear jointly in the Train columns. To be updated...
  
 ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^
Line 55: Line 71:
 | PDT 1.0 |     73088 |  1,255,590 |  7319 |  126,030 |   7507 |  125,713 |  87914 |  1,489,748 |  16.95 |
 | PDT 2.0 |     68562 |  1,172,299 |  9270 |  158,962 |  10148 |  173,586 |  87980 |  1,504,847 |  17.10 |
 +| PDT 3.0 |     77765 |  1,330,152 | train |    train |  10148 |  173,586 |  87913 |  1,503,738 |  17.10 |
 | CoNLL 2006 |  72703 |  1,249,408 |   365 |     5853 |        |          |  73068 |  1,255,261 |  17.18 |
 | CoNLL 2007 |  25364 |    432,296 |   286 |     4724 |        |          |  25650 |    437,020 |  17.04 |
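
The Total and Sentence Length columns can be derived from the per-split counts: the totals are sums over the listed splits, and sentence length is total tokens divided by total sentences. The following is a minimal Python sketch of that arithmetic using the figures from the table. The ''Row'' class, its field names and the ''inconsistent_rows'' helper are illustrative assumptions and are not part of any PDT tooling.

```python
from dataclasses import dataclass


@dataclass
class Row:
    """One table row; names are illustrative, not from PDT tooling."""
    version: str
    splits: list          # (sentences, tokens) for each listed split
    total_sentences: int  # as listed in the table
    total_tokens: int     # as listed in the table

    def summed(self):
        """Sum sentences and tokens over the listed splits."""
        return (sum(s for s, _ in self.splits),
                sum(t for _, t in self.splits))

    def sentence_length(self):
        """Average tokens per sentence, rounded to two decimals."""
        return round(self.total_tokens / self.total_sentences, 2)


ROWS = [
    Row("PDT 1.0", [(73088, 1255590), (7319, 126030), (7507, 125713)],
        87914, 1489748),
    Row("PDT 2.0", [(68562, 1172299), (9270, 158962), (10148, 173586)],
        87980, 1504847),
    # PDT 3.0: train and d-test are reported together in the first pair.
    Row("PDT 3.0", [(77765, 1330152), (10148, 173586)], 87913, 1503738),
    Row("CoNLL 2006", [(72703, 1249408), (365, 5853)], 73068, 1255261),
    Row("CoNLL 2007", [(25364, 432296), (286, 4724)], 25650, 437020),
]


def inconsistent_rows(rows):
    """Return versions whose split sums differ from the listed totals."""
    return [r.version for r in rows
            if r.summed() != (r.total_sentences, r.total_tokens)]


if __name__ == "__main__":
    for row in ROWS:
        print(row.version, row.summed(), row.sentence_length())
    print("rows to double-check:", inconsistent_rows(ROWS))
```

With the figures as listed, every row is internally consistent except the PDT 1.0 token counts, whose three splits sum to 1,507,333 rather than the listed 1,489,748. The page does not say which of these figures is the one to trust.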