Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:ru [2012/01/13 17:26] zeman created |
user:zeman:treebanks:ru [2012/01/13 18:00] zeman Domain and size. |
||
---|---|---|---|
Line 1: | Line 1: | ||
===== Russian (ru) ===== | ===== Russian (ru) ===== | ||
- | Russian Dependency Treebank (RDT, Syntagrus) | + | [[http:// |
==== Versions ==== | ==== Versions ==== | ||
Line 7: | Line 7: | ||
* 2006 (small part obtained by Dan Zeman by e-mail from Igor Boguslavsky) | * 2006 (small part obtained by Dan Zeman by e-mail from Igor Boguslavsky) | ||
* 2009 (newer and larger version obtained by Natalia Klyueva) | * 2009 (newer and larger version obtained by Natalia Klyueva) | ||
+ | * The version at the site of the [[http:// | ||
==== Obtaining and License ==== | ==== Obtaining and License ==== | ||
Line 26: | Line 27: | ||
* //no separate citation// | * //no separate citation// | ||
* Principal publications | * Principal publications | ||
- | * Itziar Aduriz, María Jesús Aranzabe, José María Arriola, Aitziber Atutxa, Arantza Díaz de Ilarraza, Aitzpea Garmendia, Maite Oronoz: [[http://w3.msi.vxu.se/ | + | * Igor Boguslavsky, Ivan Chardin, Svetlana Grigorieva, Nikolai Grigoriev, Leonid Iomdin, Leonid Kreidlin, Nadezhda Frid: [[http://cl.iitp.ru/bibitems/treebank_lrec.pdf|Development |
+ | * Other publications | ||
+ | * Joakim Nivre, Igor M. Boguslavsky, | ||
+ | * David Mareček, Natalia Kljueva: [[http:// | ||
* Documentation | * Documentation | ||
- | * Description of tags and feature values is hard to find; the '' | + | * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]]. |
- | * María Jesús Aranzabe, José Mari Arriola, Aitziber Atutxa, Irene Balza, Larraitz Uria: [[http:// | + | |
- | * [[http:// | + | |
- | * José Ignacio Hualde, Jon Ortiz de Urbina: [[http:// | + | |
==== Domain ==== | ==== Domain ==== | ||
- | Newswire + unknown | + | Uppsala University Corpus of contemporary Russian prose (balanced fiction-journalistic, |
==== Size ==== | ==== Size ==== | ||
- | The CoNLL 2007 dataset | + | There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. |
- | + | ||
- | ^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ | + | |
- | | CoNLL 2007 | 3190 | 50526 | 334 | 5390 | | + | |
- | | BDT-II | 9094 | 124,684 | 1010 | 12625 | 1122 | 14295 | 11226 | 151,604 | 13.50 | | + | |
==== Inside ==== | ==== Inside ==== | ||
+ | |||
+ | We have a Treex reader for the Syntagrus native format (.tgt). Note, however, that Dan converted the files from the original Windows-1251 encoding to UTF-8.
Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format. | Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format. |
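
The encoding conversion mentioned in the Inside section (Windows-1251 to UTF-8) can be sketched in a few lines of Python. This is only an illustration, not the script that was actually used: the function names `transcode_cp1251_to_utf8` and `convert_file` are hypothetical, and the sketch assumes the input really is valid Windows-1251. If a .tgt file carries an encoding declaration of its own, that declaration would presumably need to be updated separately.

```python
from pathlib import Path


def transcode_cp1251_to_utf8(data: bytes) -> bytes:
    """Decode Windows-1251 bytes and re-encode them as UTF-8.

    Raises UnicodeDecodeError if the input contains a byte that is
    undefined in Windows-1251, so bad input is not converted silently.
    """
    return data.decode("cp1251").encode("utf-8")


def convert_file(src: Path, dst: Path) -> None:
    """Write a UTF-8 copy of a Windows-1251 file (hypothetical helper)."""
    dst.write_bytes(transcode_cp1251_to_utf8(src.read_bytes()))


if __name__ == "__main__":
    # Round-trip a short Russian word to show that the text is preserved.
    sample = "синтаксис".encode("cp1251")
    print(transcode_cp1251_to_utf8(sample).decode("utf-8"))
```

Working on raw bytes rather than opening the file in text mode keeps line endings untouched, so only the character encoding changes.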