Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:cs [2013/07/11 17:40]
zeman PDT 2.5.
user:zeman:treebanks:cs [2014/04/04 14:54] (current)
zeman PDT 3.0 size.
Line 9: Line 9:
   * PDT 2.0 (2006)
   * PDT 2.5 (2011)
 +  * PDT 3.0 (2013)
   * CoNLL 2006
   * CoNLL 2007
Line 36: Line 37:
  
   * Website
 +    * http://ufal.mff.cuni.cz/pdt3.0/
     * http://ufal.mff.cuni.cz/pdt2.5/
     * http://ufal.mff.cuni.cz/pdt2.0/
Line 42: Line 44:
     * Jan Hajič, Eva Hajičová, Petr Pajas, Jarmila Panevová, Petr Sgall: //Prague Dependency Treebank 1.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2001T10|LDC2001T10]]). Linguistic Data Consortium, Philadelphia, USA, 2001. ISBN 1-58563-212-0.
     * Jan Hajič, Eva Hajičová, Jarmila Panevová, Petr Sgall, Petr Pajas, Jan Štěpánek, Jiří Havelka, Marie Mikulová: //Prague Dependency Treebank 2.0// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T01|LDC2006T01]]). Linguistic Data Consortium, Philadelphia, USA, 2006. ISBN 1-58563-370-4.
 +    * Eduard Bejček, Jan Hajič, Jarmila Panevová, Jiří Mírovský, Johanka Spoustová, Jan Štěpánek, Pavel Straňák, Pavel Šidák, Pavlína Vimmrová, Eva Šťastná, Magda Ševčíková, Lenka Smejkalová, Petr Homola, Jan Popelka, Markéta Lopatková, Lucie Hrabalová, Natalia Klyueva, Zdeněk Žabokrtský: //Prague Dependency Treebank 2.5//. Univerzita Karlova v Praze, Praha, Czechia, 2011-12-06, http://hdl.handle.net/11858/00-097C-0000-0006-DB11-8
   * Principal publications
     * Jan Hajič, Alena Böhmová, Eva Hajičová, Barbora Hladká: [[http://ufal.mff.cuni.cz/pdt2.0/publications/HajicHajicovaAl2000.pdf|The Prague Dependency Treebank: A Three-Level Annotation Scenario]]. In: Anne Abeillé (ed.): //Treebanks: Building and Using Parsed Corpora//, pages 103-127, Kluwer, Amsterdam, The Netherlands, 2000.
 -    * Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, Zdeněk Žabokrtský: Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0. In: //Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012),// pages 231-246, Mumbai, India, 2012.
 +    * Eduard Bejček, Jarmila Panevová, Jan Popelka, Pavel Straňák, Magda Ševčíková, Jan Štěpánek, Zdeněk Žabokrtský: [[http://aclweb.org/anthology/C/C12/C12-1015.pdf|Prague Dependency Treebank 2.5 – a Revisited Version of PDT 2.0]]. In: //Proceedings of the 24th International Conference on Computational Linguistics (Coling 2012),// pages 231-246, Mumbai, India, 2012.
   * Documentation
     * Jiří Hana, Daniel Zeman: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Manual for Morphological Annotation]], Revision for the Prague Dependency Treebank 2.0, ÚFAL Technical Report No. 2005-27, Praha, Czechia, 2005.
     * Jan Hajič, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, Alla Bémová: [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Annotations at Analytical Level]], Instructions for annotators, ÚFAL MFF UK, Praha, Czechia, 1999.
 +    * Wiki: [[internal:pdt30|Chyby v PDT 3.0]] (Errors in PDT 3.0)

 ==== Domain ====
Line 60: Line 64:
  
 Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).
 +
 +PDT 3.0 also distinguishes d-test and e-test, but I currently have only the combined counts for train and d-test. To be updated...
  
 ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^
Line 65: Line 71:
 | PDT 1.0 |     73088 |  1,255,590 |  7319 |  126,030 |   7507 |  125,713 |  87914 |  1,489,748 |  16.95 |
 | PDT 2.0 |     68562 |  1,172,299 |  9270 |  158,962 |  10148 |  173,586 |  87980 |  1,504,847 |  17.10 |
 +| PDT 3.0 |     77765 |  1,330,152 | train |    train |  10148 |  173,586 |  87913 |  1,503,738 |  17.10 |
 | CoNLL 2006 |  72703 |  1,249,408 |   365 |     5853 |        |          |  73068 |  1,255,261 |  17.18 |
 | CoNLL 2007 |  25364 |    432,296 |   286 |     4724 |        |          |  25650 |    437,020 |  17.04 |
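
The totals and the sentence-length column in the table can be recomputed from the per-portion counts. The following is a minimal sketch of that check, not part of any PDT tool: the `TABLE` layout and the function names `totals`, `sentence_length` and `mismatches` are hypothetical, and the numbers are copied from the table rows. It assumes that "Sentence Length" is total tokens divided by total sentences, and it folds the PDT 3.0 d-test portion into train, as the table does.

```python
# Per-portion (sentences, tokens) counts copied from the table above.
# PDT 3.0 lists d-test as "train", so only train and e-test appear here.
# The dictionary layout and function names are illustrative assumptions.
TABLE = {
    "PDT 1.0": {"parts": [(73088, 1255590), (7319, 126030), (7507, 125713)],
                "listed": (87914, 1489748)},
    "PDT 2.0": {"parts": [(68562, 1172299), (9270, 158962), (10148, 173586)],
                "listed": (87980, 1504847)},
    "PDT 3.0": {"parts": [(77765, 1330152), (10148, 173586)],
                "listed": (87913, 1503738)},
    "CoNLL 2006": {"parts": [(72703, 1249408), (365, 5853)],
                   "listed": (73068, 1255261)},
    "CoNLL 2007": {"parts": [(25364, 432296), (286, 4724)],
                   "listed": (25650, 437020)},
}


def totals(parts):
    """Sum sentences and tokens over the listed portions."""
    sentences = sum(s for s, _ in parts)
    tokens = sum(t for _, t in parts)
    return sentences, tokens


def sentence_length(sentences, tokens):
    """Average tokens per sentence, rounded to two decimals as in the table."""
    return round(tokens / sentences, 2)


def mismatches(row):
    """Return the names of listed totals that differ from the summed portions."""
    summed = totals(row["parts"])
    names = ("sentences", "tokens")
    return [n for n, got, want in zip(names, summed, row["listed"]) if got != want]


if __name__ == "__main__":
    for version, row in TABLE.items():
        sentences, tokens = row["listed"]
        print(version, sentence_length(sentences, tokens), mismatches(row))
```

With the numbers as listed, the PDT 2.0, PDT 3.0 and CoNLL rows are consistent. For PDT 1.0 the three token counts sum to 1,507,333 rather than the listed 1,489,748; the table does not indicate which figure is off, so the sketch only reports the difference.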
