Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks [2011/11/20 18:02]
zeman English domain.
user:zeman:treebanks [2011/11/20 18:14]
zeman English size.
Line 1707:
 ==== Size ====
  
-CoNLL 2007: Wall Street Journal part of the Penn Treebank, sections 2-11 used for training, a subset of section 23 for testing.
-
-All distributions of PDT are officially split to training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data that are annotated only morphologically (M-layer), those that are annotated both morphologically and analytically (surface syntax; M+A layers), and the smallest subset is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing.
-
-Size of CoNLL 2007 data was limited because some teams of CoNLL 2006 complained that they did not have enough time and resources to train the larger models. For CoNLL 2009, only that part of PDT was selected that contained also tectogrammatical annotation, because the 2009 task included semantic learning.
-
-Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).
+Size of CoNLL 2007 data was limited because some teams of CoNLL 2006 complained that they did not have enough time and resources to train the larger models. Sections 2-11 of the Wall Street Journal part of the treebank were used for training and a subset of section 23 was used for testing.
  
 ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^
-| PDT 0.5 |     19126 |    327,597 |  3697 |    63718 |   3787 |    65390 |  26610 |    456,705 |  17.16 |
-| PDT 1.0 |     73088 |  1,255,590 |  7319 |  126,030 |   7507 |  125,713 |  87914 |  1,489,748 |  16.95 |
-| PDT 2.0 |     68562 |  1,172,299 |  9270 |  158,962 |  10148 |  173,586 |  87980 |  1,504,847 |  17.10 |
-| CoNLL 2006 |  72703 |  1,249,408 |   365 |     5853 |        |          |  73068 |  1,255,261 |  17.18 |
-| CoNLL 2007 |  25364 |    432,296 |   286 |     4724 |        |          |  25650 |    437,020 |  17.04 |
-| CoNLL 2009 |  38727 |    652,544 |  5228 |    87988 |   4213 |    70348 |  48168 |    810,880 |  16.83 |
+| CoNLL 2007 |  18577 |    446,573 |   214 |     5003 |        |          |  18791 |    451,576 |  24.03 |
+| CoNLL 2009 |  39279 |    958,167 |  1334 |    33368 |   2399 |    57676 |  43012 |  1,049,211 |  24.39 |
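
The derived columns of the two added rows can be cross-checked with a few lines of Python. This is only a sketch: it assumes that "Sentence Length" means total tokens divided by total sentences (the page does not say so explicitly), and the names `ROWS` and `check_row` are made up for this illustration. The figures are copied from the added CoNLL 2007 and CoNLL 2009 rows.

```python
# Figures copied from the added CoNLL 2007 and CoNLL 2009 rows of the table.
# Each split is (sentences, tokens); the CoNLL 2007 row has no e-test columns.
ROWS = {
    "CoNLL 2007": {
        "splits": [(18577, 446573), (214, 5003)],
        "total": (18791, 451576),
        "sentence_length": 24.03,
    },
    "CoNLL 2009": {
        "splits": [(39279, 958167), (1334, 33368), (2399, 57676)],
        "total": (43012, 1049211),
        "sentence_length": 24.39,
    },
}


def check_row(row):
    """Recompute (total sentences, total tokens, tokens per sentence) from the splits."""
    sentences = sum(s for s, _ in row["splits"])
    tokens = sum(t for _, t in row["splits"])
    # Assumption: "Sentence Length" is the mean number of tokens per sentence.
    return sentences, tokens, round(tokens / sentences, 2)


for name, row in ROWS.items():
    sentences, tokens, avg = check_row(row)
    assert (sentences, tokens) == row["total"], name
    assert avg == row["sentence_length"], name
    print(f"{name}: {sentences} sentences, {tokens} tokens, {avg:.2f} tokens/sentence")
```

Under that assumption both rows are internally consistent: the split counts add up to the totals, and the rounded ratio matches the last column.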
  
 ==== Inside ====
