Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:ru [2012/01/13 21:33] zeman Documentation. |
user:zeman:treebanks:ru [2012/01/13 21:49] (current) zeman Data split. |
||
---|---|---|---|
Line 32: | Line 32: | ||
* David Mareček, Natalia Kljueva: [[http:// | * David Mareček, Natalia Kljueva: [[http:// | ||
* Documentation | * Documentation | ||
- | * Description of tags and feature values is hard to find; see also the [[# | ||
* Daniel Zeman: {{: | * Daniel Zeman: {{: | ||
Line 47: | Line 46: | ||
The native file format of Syntagrus is the XML-based '' | The native file format of Syntagrus is the XML-based '' | ||
- | Part of speech tag description (obtained per e-mail from Koldo Gojenola, thanks!): | + | Morphological annotation was probably done manually and includes lemmas (uppercased). See references for a description
- | * IZE = noun | + | The syntactic trees do not contain punctuation. Punctuation tokens have not been removed but they do not have independent nodes in the trees. |
- | * ARR = common | + | |
- | * IZB = proper name | + | |
- | * LIB = place name | + | |
- | * ZKI = number | + | |
- | * ADJ = adjective | + | |
- | * ARR = common | + | |
- | * GAL = question | + | |
- | * ADI = verb | + | |
- | * SIN = simple | + | |
- | * ADK = composed | + | |
- | * ADP = periphrastic | + | |
- | * FAK = factitive | + | |
- | * ADB = adverb | + | |
- | * ARR = common | + | |
- | * GAL = question | + | |
- | * DET = determiner | + | |
- | * ERKARR = demonstrative common | + | |
- | * ERKIND = demonstrative emphatic | + | |
- | * NOLARR = indefinite common | + | |
- | * NOLGAL = indefinite question | + | |
- | * ZNB = number | + | |
- | * DZH = definite | + | |
- | * BAN = distributive | + | |
- | * ORD = ordinal | + | |
- | * DZG = indefinite | + | |
- | * ORO = general | + | |
- | * IOR = pronoun | + | |
- | * PERARR = personal common | + | |
- | * PERIND = personal emphatic | + | |
- | * IZGMGB = indefinite | + | |
- | * IZGGAL = question | + | |
- | * BIH = ??? | + | |
- | * ELK = ??? | + | |
- | * LOT = link | + | |
- | * LOK = connector | + | |
- | * JNT = conjunction | + | |
- | * PRT = particle | + | |
- | * ITJ = interjection | + | |
- | * BST = other | + | |
- | * ADL = auxiliary verb | + | |
- | * ADT = synthetic verb | + | |
- | * SIG = acronym | + | |
- | * SNB = symbol | + | |
- | * LAB = abbreviation | + | |
==== Sample ==== | ==== Sample ==== | ||
Line 125: | Line 80: | ||
==== Parsing ==== | ==== Parsing ==== | ||
- | BDT is a mildly nonprojective treebank. 1925 of the 151,604 tokens | + | Nonprojective attachments are rare in SynTagRus: only 4,146 of the 497,465 tokens (0.83%) are attached nonprojectively.
- | The results | + | Parsing |
- | ^ Parser | + | ^ Parser ^ LAS ^ UAS ^ |
- | | Malt (Nilsson et al.) | 76.94 | 82.84 | | + | | Malt | 82.3 | 89.1 | |
- | | Titov et al. | 75.49 | 81.93 | | + | |
- | | Sagae | 74.64 | 81.19 | | + | |
- | | Carreras | 75.75 | 81.11 | | + | |
- | | Nakagawa | 72.56 | 81.04 | | + | |
- | | Malt (J. Hall et al.) | 74.99 | 80.61 | | + | |
- | | Johansson et al. | 75.08 | 80.43 | | + | |
- | The two Malt parser results of 2007 (single malt and blended) are described in [[http:// | ||
- | |||
- | Parsing results on BDT-II have been published in Kepa Bengoetxea, Koldo Gojenola: [[http:// |
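The share of nonprojectively attached tokens quoted in the Parsing section can be checked with a count like the one below. This is a minimal sketch, not the script used to produce the numbers on this page. It assumes each sentence is available as a list of head indices (tokens numbered from 1, head 0 for the artificial root), and the names `is_ancestor`, `nonprojective_tokens` and `share_percent` are hypothetical helpers introduced only for illustration. A token counts as nonprojectively attached when some token lying between it and its head is not dominated by that head.

```python
from typing import List, Optional


def is_ancestor(heads: List[Optional[int]], anc: int, node: int) -> bool:
    """Return True if `anc` dominates `node`.

    `heads[i]` is the head of token i; index 0 is unused (None) and
    head value 0 denotes the artificial root. Assumes an acyclic tree.
    """
    while node != 0:
        node = heads[node]
        if node == anc:
            return True
    return False


def nonprojective_tokens(heads: List[Optional[int]]) -> List[int]:
    """List the tokens whose incoming arc is nonprojective."""
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        # The arc head -> dep is projective only if every token strictly
        # between the two ends is a descendant of the head.
        if any(not is_ancestor(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


def share_percent(nonprojective: int, total: int) -> float:
    """Percentage of nonprojectively attached tokens, two decimals."""
    return round(100.0 * nonprojective / total, 2)


# Toy sentences (assumed data, not taken from the treebank).
projective = [None, 2, 0, 2]        # 1 <- 2, 2 <- root, 3 <- 2
crossing = [None, 3, 0, 2, 2]       # arc 3 -> 1 spans token 2, which 3 does not dominate

print(nonprojective_tokens(projective))
print(nonprojective_tokens(crossing))
print(share_percent(4146, 497465))
```

Summing `len(nonprojective_tokens(...))` over all sentences and dividing by the total token count gives the corpus-level figure. The ancestor walk is quadratic in the worst case, which is acceptable for sentence-length inputs.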