
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:eu [2011/11/29 09:38]
zeman License.
user:zeman:treebanks:eu [2011/11/29 10:20]
zeman Size.
Line 36:
 ==== Size ====
  
-The CoNLL 2007 version contains 70223 tokens in 2902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65419 tokens / 2705 sentences training, 4804 tokens / 197 sentences test).
+The CoNLL 2007 dataset was officially split into training and test parts. The data split of BDT-II was provided by Koldo Gojenola and should correspond to the data split used in parsing experiments published by the IXA Group.
+
+^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Avg. Sentence Length ^
+| CoNLL 2007 |  3190 |  50526 |  334 |  5390 |  |  |  3524 |  55916 |  15.87 |
+| BDT-II |  9094 |  124684 |  1010 |  12625 |  1122 |  14295 |  11226 |  151604 |  13.50 |
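
The totals and the average sentence length in the table can be re-derived from the per-part counts. The following is a minimal Python sketch, assuming the counts are copied by hand from the table; the names `SPLITS` and `summarize` are illustrative only and do not belong to any treebank tooling.

```python
# Per-part (sentences, tokens) counts copied by hand from the Size table.
# SPLITS and summarize are hypothetical names used only for this sketch.
SPLITS = {
    "CoNLL 2007": {"train": (3190, 50526), "test": (334, 5390)},
    "BDT-II": {
        "train": (9094, 124684),
        "d-test": (1010, 12625),
        "e-test": (1122, 14295),
    },
}


def summarize(parts):
    """Return (total sentences, total tokens, tokens per sentence to 2 places)."""
    sentences = sum(s for s, _ in parts.values())
    tokens = sum(t for _, t in parts.values())
    return sentences, tokens, round(tokens / sentences, 2)


for version, parts in SPLITS.items():
    print(version, *summarize(parts))
```

Running the sketch should reproduce the Total Sentences, Total Tokens and Sentence Length columns for both rows, which is a quick way to catch a mistyped count.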
  
 ==== Inside ====
