Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:hr [2014/07/17 20:43]
zeman
user:zeman:treebanks:hr [2014/07/28 16:31] (current)
zeman Documentation of syntactic tags.
Line 28:
   * Documentation
     * [[http://nlp.ffzg.hr/data/tagging/msd-hr.html|Multext-East v5 Croatian Tagset]], 2013.
 +    * A discussion of the syntactic tags is in Danijela Merkler, Željko Agić, Ana Agić: [[http://www.sciencedirect.com/science/article/pii/S1877042813041931#|Babel Treebank of Public Messages in Croatian]]. In: Procedia – Social and Behavioral Sciences, vol. 95, pp. 490-497, 2013.
  
-==== XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ==== 
 ==== Domain ====
  
-Unknown (“A set of Bulgarian sentences marked-up with detailed syntactic information. These sentences are mainly extracted from authentic Bulgarian texts. They are chosen with regards two criteria. First, they cover the variety of syntactic structures of Bulgarian. Second, they show the statistical distribution of these phenomena in real texts.”) At least part of it is probably news (Novinar, Sega, Standart).
 +Croatian newspaper text from [[http://www.setimes.com/|Southeast European Times]].
  
 ==== Size ====
  
-The CoNLL 2006 version contains 196,151 tokens in 13221 sentences, yielding 14.84 tokens per sentence on average (CoNLL 2006 data split: 190,217 tokens / 12823 sentences training, 5934 tokens / 398 sentences test).
 +Version 1 contains 178,981 tokens in 7995 sentences, yielding 22.39 tokens per sentence on average. The file is a mixture of trees and non-trees, as only 2490 sentences have been annotated on the syntactic level. Part of the corpus (up to line number 93124) contains manually assigned lemmas and morphosyntactic descriptions (tags), while the rest contains automatic morphological annotation.
 + 
 +The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. 
 + 
 +There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing.
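The 90:10 split described above can be sketched as follows. This is only an illustration, not the actual HamleDT script: the function names are hypothetical, and it assumes the usual CoNLL layout of one token per line with a blank line between sentences.

```python
def read_conll_sentences(lines):
    """Group CoNLL lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_90_10(sentences):
    """Return (training, test): the first 90% of sentences and the rest.

    Integer arithmetic rounds the cut point down (an assumption), so
    3736 sentences give 3362 for training and 374 for testing.
    """
    cut = len(sentences) * 9 // 10
    return sentences[:cut], sentences[cut:]


def count_tokens(sentences):
    """Total number of tokens (lines) over all sentences."""
    return sum(len(sentence) for sentence in sentences)
```

Applied to the pre-release file, `count_tokens` on the two parts should reproduce the token counts quoted above, provided the split really was made by sentence count in file order.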
  
 ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset.
 +All sentences in the improved pre-release version are manually annotated on morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above.
  
-The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually.
 +The treebank is distributed in the [[:format-conll|CoNLL 2006]] file format. Multext-East morphosyntactic tags appear in both the CPOS and POS columns, while the FEAT column is empty.
  
-The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).
 +In Version 1, if a token has an empty ("_") value in the DEPREL column, then the sentence has not been syntactically annotated (even though there //are// numbers in the HEAD column; these are fake head links, typically they refer to the same node).
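A minimal sketch of the DEPREL check described above, assuming the standard CoNLL 2006 column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The function name and the two example sentences are made up for illustration; they are not taken from the corpus.

```python
# 0-based index of the DEPREL column in the CoNLL 2006 format.
DEPREL = 7


def is_syntactically_annotated(sentence):
    """True if no token of the sentence has an empty ("_") DEPREL."""
    return all(token[DEPREL] != "_" for token in sentence)


# Illustrative sentences (hypothetical annotation, not real corpus rows).
annotated = [
    ["1", "ozbiljno", "ozbiljno", "Rgp", "Rgp", "_", "2", "Adv", "_", "_"],
    ["2", "analizira", "analizirati", "Vmr3s", "Vmr3s", "_", "0", "Pred", "_", "_"],
]
unannotated = [
    # HEAD holds a fake link although DEPREL is empty.
    ["1", "ozbiljno", "ozbiljno", "Rgp", "Rgp", "_", "1", "_", "_", "_"],
    ["2", "analizira", "analizirati", "Vmr3s", "Vmr3s", "_", "2", "_", "_", "_"],
]

trees = [s for s in (annotated, unannotated) if is_syntactically_annotated(s)]
```

The check deliberately ignores the HEAD column, since the text above says it contains numbers even in sentences without syntactic annotation.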
  
-==== Sample ====
 +All sentences in the improved pre-release contain dependency information; however, in a few places errors introduced by the annotation software result in a cyclic graph (not a tree).
  
-The first three sentences of the CoNLL 2006 training data:
 +The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank; there are only 15 of them:
  
-| 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT |
-| 2 | трета | _ | M | Mo | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod |
-| ||||||||||
-| 1 | НАРОДНО | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod |
-| 2 | СЪБРАНИЕ | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
-| ||||||||||
-| 1 | Народното | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod |
-| 2 | събрание | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 3 | subj | 3 | subj |
-| 3 | осъществява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT |
-| 4 | законодателната | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 5 | mod | 5 | mod |
-| 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj |
-| 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj |
-| 7 | упражнява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 3 | conjarg | 3 | conjarg |
-| 8 | парламентарен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod |
-| 9 | контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 7 | obj | 7 | obj |
-| 10 |  | _ | Punct | Punct | _ | 3 | punct | 3 | punct |
 +^ Tag ^ Percent ^ Example ^ Description ^
 +| Adv | 5% | Kosovu | adverbial modifier |
 +| Ap | 3% | Esat | appositional modifier, incl. first name attached to last name |
 +| Atr | 26% | privatizacije | attribute modifying a noun phrase |
 +| Atv | 2% | iskoristiti |  |
 +| Aux | 7% | se |  |
 +| Co | 3% |  | conjunction as coordination head (Prague-style coordinations) |
 +| Elp | 0.6% | Proces | ellipsis |
 +| Obj | 7% | privatizacije | object of a verb |
 +| Oth | 2% | Barem | other |
 +| Pnom | 2% | složen | nominal predicate attached to copula |
 +| Pred | 10% | analizira | predicate (verbal) |
 +| Prep | 10% | na | preposition |
 +| Punc | 13% |  | punctuation |
 +| Sb | 7% | Kosovo | subject |
 +| Sub | 4% | da | subordinating conjunction |
  
-The first three sentences of the CoNLL 2006 test data:
 +(The sum of the percentages exceeds 100% because of rounding.)
  
-| 1 | Единственото | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod |
-| 2 | решение | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
 +=== Cycles ===
 +
 +Eight dependency graphs in the pre-release version contain cycles. Most of these are individual nodes attached to themselves (according to Željko, this is the default in the annotation software, so the annotator probably just forgot to attach the nodes). Five of them are punctuation nodes, and fixing their attachment should be relatively easy. The only complicated case is sentence #25 in the test file; its dependency graph is wrong in multiple places.
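Cycles of both kinds mentioned above (a node attached to itself, and a longer loop such as the one in test/001#25) can be found by following HEAD links from every node. The sketch below is an assumption about how one might check this, not the procedure actually used here; `nodes_on_cycles` is a hypothetical name, and the input is the list of HEAD values of one sentence, with 0 standing for the artificial root.

```python
def nodes_on_cycles(heads):
    """Return the set of node IDs (1-based) that lie on a cycle.

    heads[i - 1] is the HEAD of node i; 0 denotes the artificial root.
    Assumes every HEAD value is between 0 and len(heads).
    """
    n = len(heads)
    # 0 = not visited, 1 = on the path being followed, 2 = finished
    state = [0] * (n + 1)
    state[0] = 2  # the root has no head, so it is never on a cycle
    cyclic = set()
    for start in range(1, n + 1):
        path = []
        node = start
        while state[node] == 0:
            state[node] = 1
            path.append(node)
            node = heads[node - 1]
        if state[node] == 1:
            # We ran into the current path again: everything from the
            # first occurrence of `node` onwards forms the cycle.
            cyclic.update(path[path.index(node):])
        for visited in path:
            state[visited] = 2
    return cyclic


# Node 3 attached to itself, as in the punctuation cases.
self_loop = nodes_on_cycles([2, 0, 3])
# Four nodes attached in a ring, similar in shape to test/001#25.
ring = nodes_on_cycles([4, 1, 2, 3])
```

Each node is visited once, so the check is linear in sentence length. Nodes that merely hang below a cycle are not reported, only the nodes on the cycle itself.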
 + 
 +train/006#247:
 +Analitičari upozoravaju na kosovski trend: osnivanje novih političkih stranaka neposredno prije izbora, a od strane ljudi iz već postojećih političkih stranaka ili nekog drugog aspekta javnog života. 
 + 
 +train/006#381:mnogo 
 +"Ne možemo mnogo učiniti kako bismo je spriječili da ide malo šetati ili plivati. 
 + 
 +train/006#399:, 
 +U međuvremenu, troškovi života porasli su: najamnina za mali stan u Podgorici iznosi oko 200 eura mjesečno -- što mnogima otežava spajanje kraja s krajem. 
 + 
 +train/007#6:" 
 +"Nije riječ o tome da imamo jednu političku opciju koja tvrdi kako piramidu ne bi trebalo uništiti, dok druga smatra da je treba uništiti. 
 + 
 +train/007#190:
 +"Moramo biti svjesni kako se kod naroda stvara strah", izjavio je čelnik stranke ORA Veton Surroi kosovskom dnevniku Express, piše Reuters. 
 + 
 +train/007#359:, 
 +Ulaganja u Srbiji dosegnula su rekordnih 1,5 milijardi eura u 2005. godini, priopćila je u srijedu vlada, izražavajući očekivanja glede nastavka rasta i u sljedećoj godini. 
 + 
 +One more Punc-CYCLE:1 occurred somewhere else. 
 + 
 +test/001#25:toga 
 +Rezultat je toga da je artikulacija praktičnih zajedničkih interesa postala teža, kao i definiranje konkretnih misija. 
 +Translation with the help of Google Translate:
 +The result is that the articulation of practical common interests has become harder, as has the definition of concrete missions.
 +This is probably the only interesting case. It is not a node attached to itself. "Rezultat" is attached to "postala", "postala" to "da", "da" to "toga", and "toga" was then to be attached back to "Rezultat". On top of that, there is a rather wild nonprojectivity. In my opinion the whole sentence is analyzed incorrectly (there are several other errors), and we should redo it completely during harmonization.
 +
 +OTHER:
 +In sentence train/006#247 above: in "političkih stranaka", "političkih" is labeled as an apposition. Fix this error. If a node labeled as an apposition is an adjective that is attached to the following noun and agrees with it in gender, number and case, it is not an apposition but Atr.
 +
 +Sentence test/001#1 has the auxiliary verb "je" at the root, and its deprel is Aux, not Pred!
 + 
 +==== Sample ==== 
 + 
 +The first three sentences of the improved pre-release version:
 + 
 +| 1 | Proces | proces | Ncmsn | Ncmsn | <nowiki>_</nowiki> | 0 | Elp | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | privatizacije | privatizacija | Ncfsg | Ncfsg | <nowiki>_</nowiki> | 1 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | na | na | Sl | Sl | <nowiki>_</nowiki> |  | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | Kosovu | Kosovo | Npnsl | Npnsl | <nowiki>_</nowiki> | 3 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | pod | pod | Si | Si | <nowiki>_</nowiki> | 0 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | povećalom | povećalo | Ncnsi | Ncnsi | <nowiki>_</nowiki> | 5 | Elp | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | ||||||||||
-| 1 | Ерик | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
-| 2 | Франк | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i |  | mod |  | mod |
-| 3 | Ръсел | _ | H | Hm | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i |  | mod |  | mod |
 +| 1 | Kosovo | Kosovo | Npnsn | Npnsn | <nowiki>_</nowiki> |  | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | ozbiljno | ozbiljno | Rgp | Rgp | <nowiki>_</nowiki> | 3 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | analizira | analizirati | Vmr3s | Vmr3s | <nowiki>_</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | proces | proces | Ncmsan | Ncmsan | <nowiki>_</nowiki> | 3 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | privatizacije | privatizacija | Ncfsg | Ncfsg | <nowiki>_</nowiki> |  | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | u | u | Sl | Sl | <nowiki>_</nowiki> | 3 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | svjetlu | svjetlo | Ncnsl | Ncnsl | <nowiki>_</nowiki> | 6 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | učestalih | učestao | Agpfpg | Agpfpg | <nowiki>_</nowiki> |  | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | pritužbi | pritužba | Ncfpg | Ncfpg | <nowiki>_</nowiki> | 7 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | Z | <nowiki>_</nowiki> | 0 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | ||||||||||
-| 1 | Пълен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod |
-| 2 | мрак | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT |
-| 3 | и | _ | C | Cp | _ |  | conj |  | conj |
-| 4 | пълна | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i |  | mod |  | mod |
-| 5 | самота | _ | N | Nc | _ |  | conjarg |  | conjarg |
-| 6 | . | _ | Punct | Punct | _ |  | punct |  | punct |
 +| 1 | Barem | barem | Rgp | Rgp | <nowiki>_</nowiki> | 2 | Oth | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | na | na | Sl | Sl | <nowiki>_</nowiki> | 5 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | papiru | papir | Ncmsl | Ncmsl | <nowiki>_</nowiki> |  | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | Z | Z | <nowiki>_</nowiki> | 2 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | izgleda | izgledati | Vmr3s | Vmr3s | <nowiki>_</nowiki> |  | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | kao | kao | Cs | Cs | <nowiki>_</nowiki> | 8 | Oth | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | odlična | odličan | Agpfsn | Agpfsn | <nowiki>_</nowiki> |  | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | ideja | ideja | Ncfsn | Ncfsn | <nowiki>_</nowiki> |  | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | Z | <nowiki>_</nowiki> |  | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
  
 ==== Parsing ====
  
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).
 +Nonprojectivities in SETimes.HR are rare. Only 461 of the 83640 tokens in the pre-release version are attached nonprojectively (0.55%).
- +
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian: +
- +
-^ Parser (Authors) ^ LAS ^ UAS ^ +
-| MST (McDonald et al.) | 87.57 | 92.04 | +
-| Malt (Nivre et al.) | 87.41 | 91.72 | +
-| Nara (Yuchang Cheng) | 86.34 | 91.30 |+
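A nonprojectivity count like the one quoted for SETimes.HR could be computed along the following lines. The page does not say which definition was used, so this sketch assumes a common one: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function name is hypothetical, and the input is the list of HEAD values of one sentence (0 = artificial root).

```python
def count_nonprojective(heads):
    """Count tokens whose attachment to their head is nonprojective.

    heads[i - 1] is the HEAD of token i; 0 denotes the artificial root.
    """
    n = len(heads)

    def is_descendant(node, ancestor):
        # Climb towards the root; the step limit guards against cycles.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return ancestor == 0

    count = 0
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        left, right = sorted((dep, head))
        # Every token strictly between dep and head must be dominated by head.
        if any(not is_descendant(mid, head) for mid in range(left + 1, right)):
            count += 1
    return count


# Percentage for the figures quoted above: 461 of 83640 tokens.
rate = round(100 * 461 / 83640, 2)
```

Summing `count_nonprojective` over all sentences and dividing by the token count gives the rate; with a different definition of nonprojectivity the numbers may differ slightly from those on this page.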
  
 +//Are there any published parsing results on this corpus?//
