Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:ru [2012/01/13 18:00] zeman Domain and size. |
user:zeman:treebanks:ru [2012/01/13 21:48] zeman Parsing. |
||
---|---|---|---|
Line 32: | Line 32: | ||
* David Mareček, Natalia Kljueva: [[http:// | * David Mareček, Natalia Kljueva: [[http:// | ||
* Documentation | * Documentation | ||
- | * Description of tags and feature values is hard to find; see also the [[#Inside|Inside section below]]. | + | * Daniel Zeman: {{: |
==== Domain ==== | ==== Domain ==== | ||
Line 40: | Line 40: | ||
==== Size ==== | ==== Size ==== | ||
- | There are 497,465 tokens in 34895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split to training and test. In our HamleDT experiments, | + | There are 497,465 tokens in 34,895 sentences, yielding 14.26 tokens per sentence on average. The original data was not split into training and test sets. In our HamleDT experiments, |
==== Inside ==== | ==== Inside ==== | ||
- | We have a Treex reader for the Syntagrus | + | The native |
- | Both versions | + | Morphological annotation has probably been done manually and it contains lemmas |
- | Part of speech tag description (obtained per e-mail from Koldo Gojenola, thanks!): | + | The syntactic trees do not contain punctuation. Punctuation tokens have not been removed |
- | + | ||
- | * IZE = noun | + | |
- | * ARR = common | + | |
- | * IZB = proper name | + | |
- | * LIB = place name | + | |
- | * ZKI = number | + | |
- | * ADJ = adjective | + | |
- | * ARR = common | + | |
- | * GAL = question | + | |
- | * ADI = verb | + | |
- | * SIN = simple | + | |
- | * ADK = composed | + | |
- | * ADP = periphrastic | + | |
- | * FAK = factitive | + | |
- | * ADB = adverb | + | |
- | * ARR = common | + | |
- | * GAL = question | + | |
- | * DET = determiner | + | |
- | * ERKARR = demonstrative common | + | |
- | * ERKIND = demonstrative emphatic | + | |
- | * NOLARR = indefinite common | + | |
- | * NOLGAL = indefinite question | + | |
- | * ZNB = number | + | |
- | * DZH = definite | + | |
- | * BAN = distributive | + | |
- | * ORD = ordinal | + | |
- | * DZG = indefinite | + | |
- | * ORO = general | + | |
- | * IOR = pronoun | + | |
- | * PERARR = personal common | + | |
- | * PERIND = personal emphatic | + | |
- | * IZGMGB = indefinite | + | |
- | * IZGGAL = question | + | |
- | * BIH = ??? | + | |
- | * ELK = ??? | + | |
- | * LOT = link | + | |
- | * LOK = connector | + | |
- | * JNT = conjunction | + | |
- | * PRT = particle | + | |
- | * ITJ = interjection | + | |
- | * BST = other | + | |
- | * ADL = auxiliary verb | + | |
- | * ADT = synthetic verb | + | |
- | * SIG = acronym | + | |
- | * SNB = symbol | + | |
- | * LAB = abbreviation | + | |
- | + | ||
- | Main features: | + | |
- | + | ||
- | * KAS = case. Various descriptions of Basque grammar list different numbers of cases and it is not easy to match all of the BDT case tags with them. Some but not all of them are described | + | |
- | * KAS:ABL (984) = ablativo = ablative | + | |
- | * KAS:ABS (22805) = absolutivo = absolutive | + | |
- | * KAS:ABU (32) = adlativo terminal (" | + | |
- | * KAS:ABZ (27) = adlativo direccional (" | + | |
- | * KAS:ALA (1093) = adlativo = allative | + | |
- | * KAS:BNK (13) =? special case of the locative genitive (" | + | |
- | * KAS:DAT (1451) = dativo = dative | + | |
- | * KAS:DES (181) = destinativo = benefactive (" | + | |
- | * KAS:DESK (223) =? descriptive locative genitive (" | + | |
- | * KAS:EM (705) = multiword token with postposition (e.g. " | + | |
- | * KAS:ERG (6059) = ergativo = ergative | + | |
- | * KAS:GEL (6259) = genitivo locativo = locative genitive | + | |
- | * KAS:GEN (4307) = genitivo de posesión = possessive genitive | + | |
- | * KAS:INE (7690) = inesivo = inessive | + | |
- | * KAS:INS (1370) = instrumental | + | |
- | * KAS:MOT (165) = motivativo = causative | + | |
- | * KAS:PAR (930) = partitivo = partitive | + | |
- | * KAS:PRO (89) = prolativo = essive | + | |
- | * KAS:SOZ (928) = asociativo = comitative | + | |
- | * ASP = aspect | + | |
- | * ERL = relation (relative sentence, completive sentence, indirect question...) | + | |
- | + | ||
- | List of all 286 features found in the corpus with frequencies: | + | |
- | * ADM: | + | |
- | * ADM: | + | |
- | * ADM: | + | |
- | * ASP: | + | |
- | * ASP: | + | |
- | * ASP: | + | |
- | * ASP: | + | |
- | * BIZ: | + | |
- | * BIZ: | + | |
- | * ENT:??? | + | |
- | * ENT: | + | |
- | * ENT: | + | |
- | * ENT: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * ERL: | + | |
- | * HIT:NO 50 | + | |
- | * HIT:TO 38 | + | |
- | * IZAUR: | + | |
- | * IZAUR: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KAS: | + | |
- | * KLM:AM 80 | + | |
- | * KLM:HAS 2 | + | |
- | * MAI: | + | |
- | * MAI: | + | |
- | * MAI: | + | |
- | * MAI: | + | |
- | * MDN: | + | |
- | * MDN: | + | |
- | * MDN:A4 1 | + | |
- | * MDN: | + | |
- | * MDN: | + | |
- | * MDN: | + | |
- | * MDN:B3 11 | + | |
- | * MDN:B4 59 | + | |
- | * MDN:B5A 1 | + | |
- | * MDN: | + | |
- | * MDN:B6 1 | + | |
- | * MDN:B7 79 | + | |
- | * MDN:B8 38 | + | |
- | * MDN:C 52 | + | |
- | * MOD: | + | |
- | * MOD: | + | |
- | * MTKAT: | + | |
- | * MTKAT: | + | |
- | * MTKAT: | + | |
- | * MUG: | + | |
- | * MUG: | + | |
- | * MW:B 3615 | + | |
- | * NEUR: | + | |
- | * NMG: | + | |
- | * NMG: | + | |
- | * NMG: | + | |
- | * NOR: | + | |
- | * NOR: | + | |
- | * NOR:HI 20 | + | |
- | * NOR: | + | |
- | * NOR: | + | |
- | * NOR:ZU 93 | + | |
- | * NOR: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORI: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NORK: | + | |
- | * NUM: | + | |
- | * NUM: | + | |
- | * NUM: | + | |
- | * PER: | + | |
- | * PER: | + | |
- | * PER:HI 14 | + | |
- | * PER: | + | |
- | * PER: | + | |
- | * PER:ZU 60 | + | |
- | * PER: | + | |
- | * PLU:+ 149 | + | |
- | * PLU: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * POS: | + | |
- | * ZENB: | + | |
- | * _ 36940 | + | |
- | + | ||
- | The syntactic guidelines (structure and labels) are described in Spanish in this [[http:// | + | |
- | + | ||
- | Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. Espainia_Poliziak, | + | |
==== Sample ==== | ==== Sample ==== | ||
- | The first sentence of the CoNLL 2007 training data: | + | The first sentence of our test file in the native TGT format: |
- | | 1 | espainiako_poliziak | Espainia_Poliziak | IZE | IZE_LIB | PLU-<nowiki>|</nowiki>ENTI_LOC | 4 | ncsubj | _ | _ | | + | <code xml><?xml version=" |
- | | 2 | hiru | hiru | DET | DET_DZH | NMGP | 3 | detmod | _ | _ | | + | <text><inf> |
- | | 3 | gazte | gazte | IZE | IZE_ARR | ABS<nowiki>|</nowiki>MG | 4 | ncobj | _ | _ | | + | <author>Л, ТЕЛЕНЬ</author> |
- | | 4 | atxilotu | atxilotu | ADI | ADI_SIN | PART<nowiki>|</ | + | <title>Интервью " |
- | | 5 | ditu | *edun | ADL | ADL | A1< | + | <source/> |
- | | 6 | atarrabian | Atarrabia | IZE | IZE_LIB | PLU-< | + | <annot>ЛИ</annot> |
- | | 7 | , | , | PUNC | PUNC_KOMA | _ | 6 | PUNC | _ | _ | | + | <editor>ИБ</editor> |
- | | 8 | eta | eta | LOT | LOT_JNT | - | 0 | ROOT | _ | _ | | + | <date>10.12.03</date> |
- | | 9 | madrilera | Madril | IZE | IZE_LIB | PLU-< | + | <comment/> |
- | | 10 | eraman | eraman | ADI | ADI_SIN | PART<nowiki>|</ | + | </inf> |
- | | 11 | ditu | *edun | ADL | ADL | A1< | + | <body> |
- | | 12 | . | . | PUNC | PUNC_PUNC | _ | 11 | PUNC | _ | _ | | + | <S ID=" |
- | + | <W DOM=" | |
- | The first sentence of the CoNLL 2007 test data: | + | <W DOM=" |
- | + | <W DOM=" | |
- | | 1 | epaileek | epaile | IZE | IZE_ARR | BIZ+< | + | <W DOM=" |
- | | 2 | diote | esan | ADT | ADT | PNT< | + | <W DOM=" |
- | | 3 | eaeko | EAE | IZE | IZE_LIB | SIG<nowiki>|</nowiki>GEL< | + | <W DOM="3" FEAT="S ЕД СРЕД ВИН НЕОД" |
- | | 4 | parlamentarioek | parlamentario | ADJ | ADJ_ARR | IZAUR-<nowiki>|</nowiki>ERG< | + | <W DOM="8" FEAT="S ЕД СРЕД ВИН" |
- | | 5 | eaetik_kanpo | EAE | SIG | SIG- | DEK<nowiki>|</ | + | <W DOM=" |
- | | 6 | eginiko | egin | ADI | ADI_SIN | PART< | + | <W DOM=" |
- | | 7 | delituak | delitu | IZE | IZE_ARR | BIZ-< | + | <W DOM="9" FEAT="S ЕД МУЖ РОД ОД" ID="10" LEMMA=" |
- | | 8 | ikertzea | ikertu | ADI | ADI_SIN | ADIZE< | + | </ |
- | | 9 | eta | eta | LOT | LOT_JNT | - | | + | |
- | | 10 | epaitzea | epaitu | ADI | ADI_SIN | ADIZE< | + | |
- | | 11 | auzitegi_gorenari | auzitegi_gora | ADJ | ADJ_IZO | DEK< | + | |
- | | 12 | dagokiola | egon | ADT | ADT | PNT< | + | |
- | | 13 | , | , | PUNC | PUNC_KOMA | _ | | + | |
- | | 14 | baina | baina | LOT | LOT_JNT | AURK | | + | |
- | | 15 | atzerrian | atzerri | IZE | IZE_ARR | INE< | + | |
- | | 16 | izaniko | izan | ADI | ADI_SIN | PART< | + | |
- | | 17 | kontaktu | kontaktu | IZE | IZE_ARR | _ | | + | |
- | | 18 | horiek | horiek | DET | DET_ERKARR | ABS<nowiki>|</ | + | |
- | | 19 | ezin_direla | ezin_izan | ADI | ADI_ADK | PNT<nowiki>|</ | + | |
- | | 20 | delitutzat | delitu | IZE | IZE_ARR | BIZ-<nowiki> | + | |
- | | 21 | hartu | hartu | ADI | ADI_SIN | PART | | + | |
- | | 22 | . | . | PUNC | PUNC_PUNC | _ | | + | |
- | + | ||
- | The first sentence of the BDT-II training data: | + | |
- | + | ||
- | | 1 | Estatu_Batuetako_DEAko | Estatu_Batuak_DEA | IZE | LIB | PLU: | + | |
- | | 2 | buru | buru | IZE | ARR | _ | 4 | ncsubj | _ | _ | | + | |
- | | 3 | ohiak | ohi | ADJ | ARR | IZAUR:-<nowiki> | + | |
- | | 4 | aztertuko | aztertu | ADI | SIN | ADM:PART<nowiki>|</nowiki>ASP:GERO | 0 | ROOT | _ | _ | | + | |
- | | 5 | du | *edun | ADL | ADL | MDN:A1<nowiki> | + | |
- | | 6 | RUCen | RUC | IZE | IZB | MTKAT: | + | |
- | | 7 | erreforma | erreforma | IZE | ARR | KAS:ABS<nowiki>|</nowiki>NUM: | + | |
- | | 8 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 7 | PUNC | _ | _ | | + | |
- | + | ||
- | The first sentence of the BDT-II development data: | + | |
- | + | ||
- | | 1 | Irakaskuntzan | irakaskuntza | IZE | ARR | BIZ:-<nowiki> | + | |
- | | 2 | jardun | jardun | ADI | SIN | ADM: | + | |
- | | 3 | zuen | *edun | ADL | ADL | MDN: | + | |
- | | 4 | Miel | Miel | IZE | IZB | PLU: | + | |
- | | 5 | Anjel_Elustondok | Anjel_Elustondo | IZE | IZB | PLU: | + | |
- | | 6 | 1980 | 1980 | IZE | ZKI | _ | 7 | ncmod | _ | _ | | + | |
- | | 7 | urtetik | urte | IZE | ARR | BIZ:-<nowiki>|</nowiki>KAS: | + | |
- | | 8 | 1992ra | 1992 | IZE | ZKI | KAS:ALA<nowiki> | + | |
- | | 9 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 8 | PUNC | _ | _ | | + | |
- | | 10 | hauetatik | hauek | DET | ERKARR | KAS: | + | |
- | | 11 | hamar | hamar | DET | DZH | NMG:P | 12 | detmod | _ | _ | | + | |
- | | 12 | urtez | urte | IZE | ARR | BIZ: | + | |
- | | 13 | Azpeitiko | Azpeitia | IZE | LIB | PLU: | + | |
- | | 14 | ikastolan | ikastola | IZE | ARR | BIZ:-<nowiki>|</nowiki>KAS: | + | |
- | | 15 | irakasle | irakasle | IZE | ARR | KAS:ABS<nowiki> | + | |
- | | 16 | eta | eta | LOT | JNT | ERL:EMEN | 8 | aponcmod | _ | _ | | + | |
- | | 17 | beste | beste | DET | DZG | _ | 18 | detmod | _ | _ | | + | |
- | | 18 | biak | bi | IZE | ZKI | KAS: | + | |
- | | 19 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 18 | PUNC | _ | _ | | + | |
- | | 20 | Arabako | Araba | IZE | LIB | PLU:-<nowiki> | + | |
- | | 21 | ikastolen | ikastola | IZE | ARR | BIZ: | + | |
- | | 22 | elkartean | elkarte | IZE | ARR | BIZ: | + | |
- | | 23 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 22 | PUNC | _ | _ | | + | |
- | + | ||
- | The first sentence of the BDT-II test data: | + | |
- | + | ||
- | | 1 | Hegoaldean | hegoalde | IZE | ARR | KAS: | + | |
- | | 2 | iduri_zait | iduri_izan | ADI | ADK | ASP: | + | |
- | | 3 | euskararen | euskara | IZE | ARR | BIZ:-<nowiki>|</nowiki>KAS: | + | |
- | | 4 | mundu | mundu | IZE | ARR | BIZ:- | 7 | ncsubj | _ | _ | | + | |
- | | 5 | hau | hau | DET | ERKARR | KAS:ABS<nowiki> | + | |
- | | 6 | adi-adi | adi-adi | ADB | ARR | _ | 7 | ncmod | _ | _ | | + | |
- | | 7 | dagola | egon | ADT | ADT | ASP: | + | |
- | | 8 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 7 | PUNC | _ | _ | | + | |
- | | 9 | Euskaltzaindiak | Euskaltzaindia | IZE | LIB | PLU: | + | |
- | | 10 | zer | zer | DET | NOLGAL | NMG: | + | |
- | | 11 | erranen | erran | ADI | SIN | ADM: | + | |
- | | 12 | duen | *edun | ADL | ADL | ERL: | + | |
- | | 13 | zain | zain | ADB | ARR | _ | 7 | cmod | _ | _ | | + | |
- | | 14 | , | , | PUNT_MARKA | PUNT_KOMA | _ | 13 | PUNC | _ | _ | | + | |
- | | 15 | haren | hura | DET | ERKARR | KAS: | + | |
- | | 16 | arauen | arau | IZE | ARR | KAS: | + | |
- | | 17 | berehala | berehala | ADB | ARR | _ | 18 | ncmod | _ | _ | | + | |
- | | 18 | betetzeko | bete | ADI | SIN | ADM: | + | |
- | | 19 | . | . | PUNT_MARKA | PUNT_PUNT | _ | 18 | PUNC | _ | _ | | + | |
==== Parsing ==== | ==== Parsing ==== | ||
- | BDT is a mildly nonprojective treebank. 1925 of the 151,604 tokens | + | Nonprojective attachments are rare in SynTagRus: only 4,146 of the 497,465 tokens (0.83%) are attached nonprojectively. |
- | The results | + | Parsing |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ | ||
- | | Malt (Nilsson et al.) | 76.94 | 82.84 | | + | | Malt (Nilsson et al.) | 82.3 | 89.1 | |
- | | Titov et al. | 75.49 | 81.93 | | + | |
- | | Sagae | 74.64 | 81.19 | | + | |
- | | Carreras | 75.75 | 81.11 | | + | |
- | | Nakagawa | 72.56 | 81.04 | | + | |
- | | Malt (J. Hall et al.) | 74.99 | 80.61 | | + | |
- | | Johansson et al. | 75.08 | 80.43 | | + | |
- | + | ||
- | The two Malt parser results of 2007 (single malt and blended) are described in [[http:// | + | |
- | Parsing results on BDT-II have been published in Kepa Bengoetxea, Koldo Gojenola: [[http:// |
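
The nonprojectivity figure in the Parsing section can be reproduced in principle with a short script. The following is only a minimal sketch, not the procedure actually used for this page. It assumes that each sentence is an `<S>` element whose `<W>` children carry `ID` and `DOM` (head) attributes, as in the sample above, and that the root word has a non-numeric `DOM` value such as `_root` (an assumption). The function names and the toy sentence are hypothetical. A token counts as nonprojectively attached when some word lying between it and its head is not dominated by that head.

```python
import xml.etree.ElementTree as ET

# Toy sentence invented for illustration; it is NOT taken from SynTagRus.
TOY_SENTENCE = """<S ID="1">
<W DOM="2" FEAT="S" ID="1" LEMMA="A">a</W>
<W DOM="_root" FEAT="V" ID="2" LEMMA="B">b</W>
<W DOM="1" FEAT="S" ID="3" LEMMA="C">c</W>
</S>"""


def read_heads(sentence_xml):
    """Map each word ID to its head ID; 0 stands for the artificial root.

    Assumption: a non-numeric DOM value (e.g. "_root") marks the root word.
    """
    sentence = ET.fromstring(sentence_xml)
    heads = {}
    for word in sentence.iter("W"):
        dom = word.get("DOM")
        heads[int(word.get("ID"))] = int(dom) if dom.isdigit() else 0
    return heads


def is_dominated_by(node, ancestor, heads):
    """True if `ancestor` lies on the path from `node` up to the root.

    Assumes the heads form a well-formed tree (no cycles).
    """
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node]
    return ancestor == 0


def count_nonprojective(heads):
    """Count tokens whose attachment to their head is nonprojective."""
    count = 0
    for child, head in heads.items():
        low, high = sorted((child, head))
        between = (k for k in range(low + 1, high) if k in heads)
        if any(not is_dominated_by(k, head, heads) for k in between):
            count += 1
    return count


def percentage(part, whole):
    """Share of `part` in `whole`, in percent, rounded to two decimals."""
    return round(100.0 * part / whole, 2)


if __name__ == "__main__":
    heads = read_heads(TOY_SENTENCE)
    print(heads, count_nonprojective(heads))
    # Figures quoted on this page: 4146 nonprojective tokens out of 497,465.
    print(percentage(4146, 497465))
```

Because punctuation is not part of the syntactic trees, only `<W>` elements are visited; summing `count_nonprojective` over all sentences and dividing by the total token count would give a corpus-level percentage comparable to the one quoted above.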