Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:hr [2014/07/17 20:59]
zeman Size and Inside.
user:zeman:treebanks:hr [2014/07/17 21:27]
zeman Finalizing the page.
Line 38: Line 38:
  
 The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average.
 +
 +There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing.
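The figures above can be re-derived with a few lines of arithmetic. The following is only a minimal sketch: the helper names `split_sizes` and `split_sentences` are hypothetical, and rounding the training share down is an assumption that happens to reproduce the reported 3362/374 division, not a documented HamleDT rule.

```python
def split_sizes(n_sentences, train_share_percent=90):
    """Return (train, test) sentence counts for a contiguous split.

    Assumption: the training part is rounded down to a whole sentence.
    """
    n_train = n_sentences * train_share_percent // 100
    return n_train, n_sentences - n_train


def split_sentences(sentences, train_share_percent=90):
    """Take the first part of the list for training, the rest for testing."""
    n_train, _ = split_sizes(len(sentences), train_share_percent)
    return sentences[:n_train], sentences[n_train:]


# Numbers quoted in the text: 3736 sentences, 83640 tokens.
train_count, test_count = split_sizes(3736)
tokens_per_sentence = round(83640 / 3736, 2)
print(train_count, test_count, tokens_per_sentence)
```

Because the split is contiguous, the training part is simply the beginning of the file and the test part is its end; no shuffling is involved.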
  
 ==== Inside ==== ==== Inside ====
Line 43: Line 45:
 All sentences in the improved pre-release version are manually annotated at the morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above. All sentences in the improved pre-release version are manually annotated at the morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above.
  
-==== XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ==== +The treebank is distributed in the [[:format-conll|CoNLL 2006]] file format. Multext-East morphosyntactic tags appear in both the CPOS and POS columns, while the FEAT column is empty.
-==== Sample ====+
  
-The first three sentences of the CoNLL 2006 training data:+In Version 1, if any token has an empty ("_") value in the DEPREL column, the sentence has not been syntactically annotated (even though there //are// numbers in the HEAD column; these are fake head links that typically refer to the same node).
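The DEPREL test described above could be scripted roughly as follows. This is a sketch under assumptions: the input is taken to be tab-separated CoNLL 2006 lines with blank lines between sentences, the function names are hypothetical, and the two sample sentences are made-up toy data, not lines from the treebank.

```python
DEPREL = 7  # zero-based index of the DEPREL column in CoNLL 2006


def read_conll(lines):
    """Yield sentences as lists of token rows (each row a list of columns)."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def is_syntactically_annotated(sentence):
    """False if any token has the empty value "_" in the DEPREL column."""
    return all(token[DEPREL] != "_" for token in sentence)


# Toy data (invented for illustration): the second sentence has empty DEPREL
# values and self-referring HEAD links, so it counts as unannotated.
sample = [
    "1\tKosovo\tKosovo\tNpnsn\tNpnsn\t_\t2\tSb\t_\t_",
    "2\tanalizira\tanalizirati\tVmr3s\tVmr3s\t_\t0\tPred\t_\t_",
    "",
    "1\tBarem\tbarem\tRgp\tRgp\t_\t1\t_\t_\t_",
    "2\tna\tna\tSl\tSl\t_\t2\t_\t_\t_",
]
flags = [is_syntactically_annotated(s) for s in read_conll(sample)]
print(flags)
```

Checking DEPREL rather than HEAD is deliberate: as noted above, the HEAD column is filled with numbers even in sentences that carry no real syntactic annotation.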
  
-| 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT | +All sentences in the improved pre-release version contain dependency information; however, in a few places there are errors introduced by the annotation software that result in a cyclic graph (not a tree).
-| 2 | трета | _ | M | Mo | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod | +
-| |||||||||| +
-| 1 | НАРОДНО | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | +
-| 2 | СЪБРАНИЕ | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | +
-| |||||||||| +
-| 1 | Народното | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod | +
-| 2 | събрание | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 3 | subj | 3 | subj | +
-| 3 | осъществява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT | +
-| 4 | законодателната | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 5 | mod | 5 | mod | +
-| 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj | +
-| 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj | +
-| 7 | упражнява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 3 | conjarg | 3 | conjarg | +
-| 8 | парламентарен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod | +
-| 9 | контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 7 | obj | 7 | obj | +
-| 10 | | _ | Punct | Punct | _ | 3 | punct | 3 | punct |+
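Sentences whose HEAD links form a cycle instead of a tree can be found by following each token's head chain towards the artificial root 0. The sketch below is only one possible way to do this; the function name `find_cycle_nodes` is hypothetical, and it assumes that every HEAD value lies between 0 and the sentence length.

```python
def find_cycle_nodes(heads):
    """Return the sorted ids of tokens whose head chain never reaches the root.

    heads[i] is the HEAD value of token i + 1; 0 denotes the artificial root.
    Tokens on a cycle and tokens hanging below a cycle are both reported.
    """
    reaches_root = {}  # token id -> True if its head chain ends at 0
    for start in range(1, len(heads) + 1):
        path = []
        on_path = set()
        node = start
        # Walk upwards until we hit the root, a known token, or our own path.
        while node != 0 and node not in reaches_root and node not in on_path:
            on_path.add(node)
            path.append(node)
            node = heads[node - 1]
        ok = node == 0 or reaches_root.get(node, False)
        for visited in path:
            reaches_root[visited] = ok
    return sorted(t for t, ok in reaches_root.items() if not ok)


print(find_cycle_nodes([0, 1, 2]))     # a well-formed chain
print(find_cycle_nodes([0, 3, 4, 3]))  # tokens 3 and 4 point at each other
```

A token that is its own head (a self-loop) is reported as well, since its chain never reaches 0.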
  
-The first three sentences of the CoNLL 2006 test data:+The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank; there are only 15 of them:
  
-Единственото An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod mod +^ Tag ^ Percent ^ Example ^ Description ^ 
-решение | _ | Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT | 0 | ROOT |+| Adv |  5% | Kosovu | adverbial modifier | 
 +Ap |  3% | Esat | appositional modifier, incl. first name attached to last name | 
 +| Atr |  26% | privatizacije | attribute modifying a noun phrase | 
 +| Atv |  2% | iskoristiti | ? | 
 +| Aux |  7% | se | ? | 
 +| Co |  3% | a | conjunction as coordination head (Prague-style coordinations) | 
 +| Elp |  0.6% | Proces | ellipsis | 
 +| Obj |  7% | privatizacije | object of a verb | 
 +| Oth |  2% | Barem | other | 
 +| Pnom |  2% | složen | nominal predicate attached to copula | 
 +| Pred |  10% | analizira | predicate (verbal) | 
 +| Prep |  10% | na | preposition | 
 +| Punc |  13% | . | punctuation | 
 +| Sb |  7% | Kosovo | subject | 
 +| Sub |  4% | da | subordinating conjunction | 
 + 
 +(The sum of the percentages exceeds 100% because of rounding.) 
 + 
 +==== Sample ==== 
 + 
 +The first three sentences of the improved pre-release version: 
 + 
 +| 1 | Proces | proces | Ncmsn | Ncmsn | <nowiki>_</nowiki> | 0 | Elp | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 2 | privatizacije | privatizacija | Ncfsg | Ncfsg | <nowiki>_</nowiki> | 1 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 3 | na | na | Sl | Sl | <nowiki>_</nowiki> |  | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | Kosovu | Kosovo | Npnsl | Npnsl | <nowiki>_</nowiki> | 3 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | pod | pod | Si | Si | <nowiki>_</nowiki> | 0 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | povećalom | povećalo | Ncnsi | Ncnsi | <nowiki>_</nowiki> | 5 | Elp | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | |||||||||| | ||||||||||
-| 1 | Ерик | _ | Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT | 0 | ROOT +| 1 | Kosovo | Kosovo | Npnsn | Npnsn | <nowiki>_</nowiki> |  | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-Франк Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 2 | ozbiljno | ozbiljno | Rgp | Rgp | <nowiki>_</nowiki> | 3 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 3 | Ръсел | _ | Hm gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod |+| 3 | analizira | analizirati | Vmr3s | Vmr3s | <nowiki>_</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 4 | proces | proces | Ncmsan | Ncmsan | <nowiki>_</nowiki> | 3 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 5 | privatizacije | privatizacija | Ncfsg | Ncfsg | <nowiki>_</nowiki> |  | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 6 | u | u | Sl | Sl | <nowiki>_</nowiki> | 3 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | svjetlu | svjetlo | Ncnsl | Ncnsl | <nowiki>_</nowiki> | 6 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | učestalih | učestao | Agpfpg | Agpfpg | <nowiki>_</nowiki> |  | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | pritužbi | pritužba | Ncfpg | Ncfpg | <nowiki>_</nowiki> | 7 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 10 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | Z | <nowiki>_</nowiki> | 0 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | |||||||||| | ||||||||||
-| 1 | Пълен Am gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod mod +| 1 | Barem | barem | Rgp | Rgp | <nowiki>_</nowiki> | 2 | Oth | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-мрак | _ | Nc gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT +| 2 | na | na | Sl | Sl | <nowiki>_</nowiki> | 5 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-и Cp | _ | conj conj +| 3 | papiru | papir | Ncmsl | Ncmsl | <nowiki>_</nowiki> |  | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-пълна Af gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 4 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | Z | Z | <nowiki>_</nowiki> | 2 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-самота Nc | _ | conjarg conjarg +| 5 | izgleda | izgledati | Vmr3s | Vmr3s | <nowiki>_</nowiki> |  | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| . | Punct Punct | _ | punct punct |+| 6 | kao | kao | Cs | Cs | <nowiki>_</nowiki> | 8 | Oth | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 7 | odlična | odličan | Agpfsn | Agpfsn | <nowiki>_</nowiki> |  | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 8 | ideja | ideja | Ncfsn | Ncfsn | <nowiki>_</nowiki> |  | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 +| 9 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | Z | <nowiki>_</nowiki> |  | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
  
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%). +Nonprojectivities in SETimes.HR are rare. Only 461 of the 83640 tokens in the pre-release version are attached nonprojectively (0.55%).
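How such a count might be obtained is sketched below. This is a hedged illustration, not the script used for the numbers above: it assumes the common definition under which a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head, it skips tokens attached directly to the root, and the function names are hypothetical.

```python
def is_ancestor(heads, ancestor, node):
    """True if `ancestor` lies on the head chain of `node`.

    heads[i] is the HEAD value of token i + 1; 0 denotes the artificial root.
    The step limit keeps the walk finite even if the graph contains a cycle.
    """
    steps = 0
    while node != 0 and steps <= len(heads):
        node = heads[node - 1]
        if node == ancestor:
            return True
        steps += 1
    return False


def nonprojective_tokens(heads):
    """Return ids of tokens whose attachment to their head is nonprojective."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # assumption: attachments to the root are not counted
        low, high = sorted((dep, head))
        # Every token strictly between dep and head must descend from head.
        if any(not is_ancestor(heads, head, k) for k in range(low + 1, high)):
            result.append(dep)
    return result


print(nonprojective_tokens([2, 0, 2]))     # projective toy tree
print(nonprojective_tokens([3, 4, 0, 3]))  # token 2 crosses over token 3
print(round(100 * 461 / 83640, 2))         # share reported in the text
```

Other definitions of nonprojectivity (for example, counting crossing arc pairs) would give different totals, so the 461 figure should only be compared with counts produced under the same definition.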
- +
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian: +
- +
-^ Parser (Authors) ^ LAS ^ UAS ^ +
-| MST (McDonald et al.) | 87.57 | 92.04 | +
-| Malt (Nivre et al.) | 87.41 | 91.72 | +
-| Nara (Yuchang Cheng) | 86.34 | 91.30 |+
  
 +//Are there any published parsing results on this corpus?//
