Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:ru [2012/01/13 18:04] zeman Sample. |
user:zeman:treebanks:ru [2012/01/13 21:39] zeman Inside. |
||
---|---|---|---|
Line 32: | Line 32: | ||
* David Mareček, Natalia Kljueva: [[http:// | * David Mareček, Natalia Kljueva: [[http:// | ||
* Documentation | * Documentation | ||
- | * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]]. | + | * Daniel Zeman: {{: |
==== Domain ==== | ==== Domain ==== | ||
Line 40: | Line 40: | ||
==== Size ==== | ==== Size ==== | ||
- | There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test sets. In our HamleDT experiments, | + | There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test sets. In our HamleDT experiments, |
==== Inside ==== | ==== Inside ==== | ||
- | We have a Treex reader for the Syntagrus | + | The native |
- | Both versions | + | Morphological annotation has probably been done manually and it contains lemmas |
- | Part-of-speech tag description (obtained by e-mail from Koldo Gojenola, thanks!): | + | The syntactic trees do not contain punctuation. Punctuation tokens have not been removed, but they do not have independent nodes in the trees. |
- | + | ||
- | * IZE = noun | + | |
- | * ARR = common | + | |
- | * IZB = proper name | + | |
- | * LIB = place name | + | |
- | * ZKI = number | + | |
- | * ADJ = adjective | + | |
- | * ARR = common | + | |
- | * GAL = question | + | |
- | * ADI = verb | + | |
- | * SIN = simple | + | |
- | * ADK = composed | + | |
- | * ADP = periphrastic | + | |
- | * FAK = factitive | + | |
- | * ADB = adverb | + | |
- | * ARR = common | + | |
- | * GAL = question | + | |
- | * DET = determiner | + | |
- | * ERKARR = demonstrative common | + | |
- | * ERKIND = demonstrative emphatic | + | |
- | * NOLARR = indefinite common | + | |
- | * NOLGAL = indefinite question | + | |
- | * ZNB = number | + | |
- | * DZH = definite | + | |
- | * BAN = distributive | + | |
- | * ORD = ordinal | + | |
- | * DZG = indefinite | + | |
- | * ORO = general | + | |
- | * IOR = pronoun | + | |
- | * PERARR = personal common | + | |
- | * PERIND = personal emphatic | + | |
- | * IZGMGB = indefinite | + | |
- | * IZGGAL = question | + | |
- | * BIH = ??? | + | |
- | * ELK = ??? | + | |
- | * LOT = link | + | |
- | * LOK = connector | + | |
- | * JNT = conjunction | + | |
- | * PRT = particle | + | |
- | * ITJ = interjection | + | |
- | * BST = other | + | |
- | * ADL = auxiliary verb | + | |
- | * ADT = synthetic verb | + | |
- | * SIG = acronym | + | |
- | * SNB = symbol | + | |
- | * LAB = abbreviation | + | |
==== Sample ==== | ==== Sample ==== |