Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ru [2012/01/13 18:04]
zeman Sample.
user:zeman:treebanks:ru [2012/01/13 21:33]
zeman Documentation.
Line 33: Line 33:
   * Documentation
     * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]].
 +    * Daniel Zeman: {{:user:zeman:treebanks:russian_dependency_treebank.pdf|Russian Dependency Treebank}} //(written based on a Russian document pamjatka_korpus.doc)//, College Park, Maryland, USA, 2006
  
 ==== Domain ====
Line 40: Line 41:
 ==== Size ====
  
-There are 497,465 tokens in 34895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split to training and test. In our HamleDT experiments, we take one file (Vyzhivshij_kamikadze, 402 sentences, 3458 tokens) as the test data, while the rest serves for training.
+There are 497,465 tokens in 34895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split to training and test. In our HamleDT experiments, we take one file (''Выживший_камикадзе.tgt'', 402 sentences, 3458 tokens) as the test data, while the rest serves for training.
  
 ==== Inside ====
  
-We have a Treex reader for the Syntagrus native format (.tgt). Note however that Dan converted the original windows-1251 encoding to utf-8.
+The native file format of Syntagrus is the XML-based ''.tgt'' format. It uses the Windows-1251 encoding, which can be converted to UTF-8. Converting the file names is more of a challenge, depending on the file system (one typically gets a zipped archive containing files whose names use the Cyrillic alphabet, and the file system may store the names in a codepage different from Windows-1251).
-
-Both versions (CoNLL 2007 and BDT-II) are in the CoNLL 2006/2007 format.
  
 Part of speech tag description (obtained per e-mail from Koldo Gojenola, thanks!):
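
The encoding note in the Inside section (Windows-1251 content, file names possibly stored in a different codepage inside a zipped archive) can be illustrated with a minimal Python sketch. This is not the conversion that was actually used for HamleDT. The function names are hypothetical, and the default ''cp866'' codepage for file names is only an assumption; the right codepage depends on how the archive was created.

```python
import zipfile


def recode_text(raw: bytes, source_encoding: str = "cp1251") -> bytes:
    """Decode bytes from source_encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


def recover_name(info: zipfile.ZipInfo, name_codepage: str = "cp866") -> str:
    """Return the member name, re-decoded from name_codepage when needed.

    name_codepage is an assumption; try other codepages if the result
    is not readable Cyrillic.
    """
    # Bit 11 (0x800) of the general-purpose flags marks a UTF-8 name.
    if info.flag_bits & 0x800:
        return info.filename
    # Without that flag, zipfile decodes names as cp437; undo that step
    # to get the raw bytes back, then decode with the assumed codepage.
    return info.filename.encode("cp437").decode(name_codepage)


def convert_archive(zip_path: str) -> dict[str, bytes]:
    """Map recovered member names to UTF-8 content for every .tgt file."""
    converted = {}
    with zipfile.ZipFile(zip_path) as archive:
        for info in archive.infolist():
            name = recover_name(info)
            if name.endswith(".tgt"):
                converted[name] = recode_text(archive.read(info))
    return converted
```

Calling ''convert_archive'' on a path to the zipped corpus would return the recovered names with their UTF-8 content. If the XML declaration inside a ''.tgt'' file names the encoding, it would need to be updated as well; the sketch does not handle that.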