Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:ru [2012/01/13 17:49] zeman References. |
user:zeman:treebanks:ru [2012/01/13 18:00] zeman Domain and size. |
||
---|---|---|---|
Line 36: | Line 36: | ||
==== Domain ==== | ==== Domain ==== | ||
- | Newswire + unknown | + | Uppsala University Corpus of contemporary Russian prose (balanced fiction-journalistic, |
==== Size ==== | ==== Size ==== | ||
- | The CoNLL 2007 dataset | + | There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. |
- | + | ||
- | ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ | + | |
- | | CoNLL 2007 | 3190 | 50526 | 334 | 5390 | | + | |
- | | BDT-II | 9094 | 124,684 | 1010 | 12625 | 1122 | 14295 | 11226 | 151,604 | 13.50 | | + | |
==== Inside ==== | ==== Inside ==== | ||
+ | |||
+ | We have a Treex reader for the SynTagRus native format (.tgt). Note, however, that Dan converted the original Windows-1251 encoding to UTF-8. | ||
Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format. | Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format. |
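The encoding conversion mentioned under Inside (Windows-1251 to UTF-8) could be sketched roughly as follows. This is only an illustration, not the script that was actually used: the function name `recode_cp1251_to_utf8` and the file names are hypothetical, and it assumes that each source file is plain text that decodes cleanly as Windows-1251. If a file carries its own encoding declaration, that declaration would have to be updated separately, which this sketch does not do.

```python
from pathlib import Path


def recode_cp1251_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one text file from Windows-1251 to UTF-8.

    Hypothetical helper for illustration. Bytes are read and written
    directly so that line endings are left untouched.
    """
    # "cp1251" is Python's codec name for Windows-1251.
    text = src.read_bytes().decode("cp1251")
    dst.write_bytes(text.encode("utf-8"))


if __name__ == "__main__":
    # Hypothetical file names; adjust to the actual treebank layout.
    source = Path("example.tgt")
    if source.exists():
        recode_cp1251_to_utf8(source, Path("example.utf8.tgt"))
```

Writing to a separate output file keeps the original data intact, so the conversion can be repeated or checked later.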