Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ru [2012/01/13 17:49]
zeman References.
user:zeman:treebanks:ru [2012/01/13 18:00]
zeman Domain and size.
Line 36: Line 36:
 ==== Domain ====
  
-Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).
+“Uppsala University Corpus of contemporary Russian prose (balanced fiction-journalistic, + small percentage of scientific and popular science). In addition, several hundred short texts published in 2001-2002 on various Internet news portals” (yandex.ru, rbc.ru, polit.ru, lenta.ru, strana.ru, news.ru etc.)
  
 ==== Size ====
  
-The CoNLL 2007 dataset was officially split into training and test part. The data split of BDT-II was provided by Koldo Gojenola and should correspond to data split used in parsing experiments published by the IXA Group.
+There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test parts. In our HamleDT experiments, we take one file (Vyzhivshij_kamikadze, 402 sentences, 3,458 tokens) as the test data, while the rest serves for training.
- +
-^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ +
-| CoNLL 2007 |  3190 |  50526 |  334 |  5390 |  |  |  3524 |  55916 |  15.87 | +
-| BDT-II |  9094 |  124,684 |  1010 |  12625 |  1122 |  14295 |  11226 |  151,604 |  13.50 |+
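
The size figures given in the newer revision can be cross-checked with a few lines of arithmetic. The following is a minimal Python sketch that uses only the numbers quoted in the text; the variable names are illustrative, and the training-part counts are derived here by subtraction rather than quoted from the page.

```python
# Figures quoted in the revision text.
total_tokens = 497_465
total_sentences = 34_895
test_tokens = 3_458       # the Vyzhivshij_kamikadze file
test_sentences = 402

# Average sentence length over the whole treebank.
avg_len = total_tokens / total_sentences

# Derived (not stated on the page): whatever is not test data is training data.
train_tokens = total_tokens - test_tokens
train_sentences = total_sentences - test_sentences

print(f"tokens per sentence: {avg_len:.2f}")
print(f"training part: {train_sentences} sentences, {train_tokens} tokens")
```

The rounded average agrees with the 14.26 tokens per sentence stated in the text.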
  
 ==== Inside ====
 +
 +We have a Treex reader for the native SynTagRus format (.tgt). Note, however, that Dan converted the data from the original Windows-1251 encoding to UTF-8.
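
For readers who start from the original distribution, the encoding conversion mentioned above could look roughly like the sketch below. This is not the procedure that was actually used; the function name `convert_cp1251_to_utf8` is hypothetical, and the sketch assumes the files are plain text encoded in Windows-1251 (Python codec name `cp1251`).

```python
from pathlib import Path


def convert_cp1251_to_utf8(src: Path, dst: Path) -> None:
    """Re-encode one Windows-1251 file as UTF-8 (hypothetical helper).

    Working on raw bytes keeps the original line endings untouched.
    """
    text = src.read_bytes().decode("cp1251")
    dst.write_bytes(text.encode("utf-8"))
```

If a file carries an explicit encoding declaration in its header, that declaration would have to be updated separately; the sketch does not touch file contents beyond re-encoding them.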
  
 Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
