user:zeman:treebanks:hr [2014/07/17 21:16] zeman |
user:zeman:treebanks:hr [2014/07/17 21:27] zeman Finalizing the page. |
| |
The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. | The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. |
| |
| There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing. |
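The 90:10 split described above can be sketched as follows. This is a minimal illustration, not the actual HamleDT script: the function name `split_sentences` is hypothetical, and it assumes the boundary is rounded down, which reproduces the sentence counts quoted above (3362 and 374).

```python
def split_sentences(sentences, train_parts=9, total_parts=10):
    """Split a list of sentences into a leading training part and a trailing test part.

    Assumption: the boundary is rounded down (integer division).
    """
    n_train = len(sentences) * train_parts // total_parts
    return sentences[:n_train], sentences[n_train:]


# Placeholder data: only the number of sentences matters here.
sentences = [f"sentence {i}" for i in range(1, 3736 + 1)]
train, test = split_sentences(sentences)
print(len(train), len(test))

# Average sentence length from the totals given in the text.
print(round(83640 / 3736, 2))
```

Integer arithmetic is used for the boundary so that the result does not depend on floating-point rounding.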
| |
==== Inside ==== | ==== Inside ==== |
(The sum of the percentages exceeds 100% because of rounding.) | (The sum of the percentages exceeds 100% because of rounding.) |
| |
==== XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ==== | |
==== Sample ==== | ==== Sample ==== |
| |
The first three sentences of the CoNLL 2006 training data: | The first three sentences of the improved pre-release version: |
| |
| 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT | | | 1 | Proces | proces | Ncmsn | Ncmsn | <nowiki>_</nowiki> | 0 | Elp | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | трета | _ | M | Mo | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod | | | 2 | privatizacije | privatizacija | Ncfsg | Ncfsg | <nowiki>_</nowiki> | 1 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 3 | na | na | Sl | Sl | <nowiki>_</nowiki> | 1 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 4 | Kosovu | Kosovo | Npnsl | Npnsl | <nowiki>_</nowiki> | 3 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 5 | pod | pod | Si | Si | <nowiki>_</nowiki> | 0 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 6 | povećalom | povećalo | Ncnsi | Ncnsi | <nowiki>_</nowiki> | 5 | Elp | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | |||||||||| |
| 1 | НАРОДНО | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | | 1 | Kosovo | Kosovo | Npnsn | Npnsn | <nowiki>_</nowiki> | 3 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | СЪБРАНИЕ | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | | 2 | ozbiljno | ozbiljno | Rgp | Rgp | <nowiki>_</nowiki> | 3 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 3 | analizira | analizirati | Vmr3s | Vmr3s | <nowiki>_</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 4 | proces | proces | Ncmsan | Ncmsan | <nowiki>_</nowiki> | 3 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 5 | privatizacije | privatizacija | Ncfsg | Ncfsg | <nowiki>_</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 6 | u | u | Sl | Sl | <nowiki>_</nowiki> | 3 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 7 | svjetlu | svjetlo | Ncnsl | Ncnsl | <nowiki>_</nowiki> | 6 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 8 | učestalih | učestao | Agpfpg | Agpfpg | <nowiki>_</nowiki> | 9 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 9 | pritužbi | pritužba | Ncfpg | Ncfpg | <nowiki>_</nowiki> | 7 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 10 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | Z | <nowiki>_</nowiki> | 0 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | |||||||||| |
| 1 | Народното | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod | | | 1 | Barem | barem | Rgp | Rgp | <nowiki>_</nowiki> | 2 | Oth | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | събрание | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 3 | subj | 3 | subj | | | 2 | na | na | Sl | Sl | <nowiki>_</nowiki> | 5 | Prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | осъществява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 0 | ROOT | 0 | ROOT | | | 3 | papiru | papir | Ncmsl | Ncmsl | <nowiki>_</nowiki> | 2 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | законодателната | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 5 | mod | 5 | mod | | | 4 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | Z | Z | <nowiki>_</nowiki> | 2 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj | | | 5 | izgleda | izgledati | Vmr3s | Vmr3s | <nowiki>_</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj | | | 6 | kao | kao | Cs | Cs | <nowiki>_</nowiki> | 8 | Oth | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | упражнява | _ | V | Vpi | trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s | 3 | conjarg | 3 | conjarg | | | 7 | odlična | odličan | Agpfsn | Agpfsn | <nowiki>_</nowiki> | 8 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | парламентарен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 9 | mod | 9 | mod | | | 8 | ideja | ideja | Ncfsn | Ncfsn | <nowiki>_</nowiki> | 5 | Adv | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | контрол | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 7 | obj | 7 | obj | | | 9 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | Z | <nowiki>_</nowiki> | 0 | Punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct | | |
| |
The first three sentences of the CoNLL 2006 test data: | |
| |
| 1 | Единственото | _ | A | An | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod | 2 | mod | | |
| 2 | решение | _ | N | Nc | gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | |
| |||||||||| | |
| 1 | Ерик | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | |
| 2 | Франк | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 1 | mod | 1 | mod | | |
| 3 | Ръсел | _ | H | Hm | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | |
| |||||||||| | |
| 1 | Пълен | _ | A | Am | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 2 | mod | 2 | mod | | |
| 2 | мрак | _ | N | Nc | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT | 0 | ROOT | | |
| 3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj | | |
| 4 | пълна | _ | A | Af | gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 5 | mod | 5 | mod | | |
| 5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg | | |
| 6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct | | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%). | Nonprojectivities in SETimes.HR are rare. Only 461 of the 83640 tokens in the pre-release version are attached nonprojectively (0.55%). |
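For illustration, here is a minimal sketch of how nonprojective attachments can be counted from the HEAD column of CoNLL-style data. It uses one common definition (a token is attached nonprojectively if some token lying between it and its head is not dominated by that head); the page does not state which definition produced the figures above, so this is an assumption. The function name `nonprojective_tokens` is hypothetical.

```python
def nonprojective_tokens(heads):
    """Return 1-based ids of tokens attached nonprojectively.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    Assumes the heads form a well-formed tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Walk up the head chain from node; the artificial root dominates everything.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # The attachment is nonprojective if any token strictly between
        # the dependent and its head is not a descendant of the head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the second Croatian sample sentence ("Kosovo ozbiljno analizira ...").
sample = [3, 3, 0, 3, 4, 3, 6, 9, 7, 0]
print(nonprojective_tokens(sample))

# A constructed example: token 2 depends on token 4 across token 3, the head of token 4.
print(nonprojective_tokens([3, 4, 0, 3]))

# Share of nonprojective tokens, from the totals given in the text.
print(round(100 * 461 / 83640, 2))
```

Tokens attached directly to the artificial root are never counted as nonprojective under this definition, which matters for sentences with several root-attached tokens such as the samples above.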
| |
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian: | |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | |
| MST (McDonald et al.) | 87.57 | 92.04 | | |
| Malt (Nivre et al.) | 87.41 | 91.72 | | |
| Nara (Yuchang Cheng) | 86.34 | 91.30 | | |
| |
| //Are there any published parsing results on this corpus?// |