Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:hr [2014/07/17 20:59] zeman Size and Inside. |
user:zeman:treebanks:hr [2014/07/17 21:27] zeman Finalizing the page. |
||
---|---|---|---|
Line 38: | Line 38: | ||
The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. | The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. | ||
+ | |||
+ | There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing. | ||
==== Inside ==== | ==== Inside ==== | ||
Line 43: | Line 45: | ||
All sentences in the improved pre-release version are manually annotated on morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above. | All sentences in the improved pre-release version are manually annotated on morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above. | ||
- | ==== XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ==== | + | The treebank is distributed in the [[: |
- | ==== Sample ==== | + | |
- | The first three sentences | + | In Version 1, if there is a token that has empty (" |
- | | 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT | | + | All sentences in the improved pre-release contain dependency information; |
- | | 2 | трета | _ | M | Mo | gen=f< | + | |
- | | |||||||||| | + | |
- | | 1 | НАРОДНО | _ | A | An | gen=n< | + | |
- | | 2 | СЪБРАНИЕ | _ | N | Nc | gen=n< | + | |
- | | |||||||||| | + | |
- | | 1 | Народното | _ | A | An | gen=n< | + | |
- | | 2 | събрание | _ | N | Nc | gen=n< | + | |
- | | 3 | осъществява | _ | V | Vpi | trans=t< | + | |
- | | 4 | законодателната | _ | A | Af | gen=f< | + | |
- | | 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj | | + | |
- | | 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj | | + | |
- | | 7 | упражнява | _ | V | Vpi | trans=t< | + | |
- | | 8 | парламентарен | _ | A | Am | gen=m< | + | |
- | | 9 | контрол | _ | N | Nc | gen=m< | + | |
- | | 10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct | | + | |
- | The first three sentences of the CoNLL 2006 test data: | + | The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank; there are only 15 of them: |
- | | 1 | Единственото | + | ^ Tag ^ Percent ^ Example ^ Description ^ |
- | | 2 | решение | + | | Adv | |
+ | | Ap | 3% | Esat | appositional modifier, incl. first name attached to last name | | ||
+ | | Atr | 26% | privatizacije | attribute modifying a noun phrase | | ||
+ | | Atv | 2% | iskoristiti | ? | | ||
+ | | Aux | 7% | se | ? | | ||
+ | | Co | 3% | a | conjunction as coordination head (Prague-style coordinations) | | ||
+ | | Elp | 0.6% | Proces | ellipsis | | ||
+ | | Obj | 7% | privatizacije | object of a verb | | ||
+ | | Oth | 2% | Barem | other | | ||
+ | | Pnom | 2% | složen | nominal predicate attached to copula | | ||
+ | | Pred | 10% | analizira | predicate (verbal) | | ||
+ | | Prep | 10% | na | preposition | | ||
+ | | Punc | 13% | . | punctuation | | ||
+ | | Sb | 7% | Kosovo | subject | | ||
+ | | Sub | 4% | da | subordinating conjunction | | ||
+ | |||
+ | (The sum of the percentages exceeds 100% because of rounding.) | ||
+ | |||
+ | ==== Sample ==== | ||
+ | |||
+ | The first three sentences of the improved pre-release version: | ||
+ | |||
+ | | 1 | Proces | proces | Ncmsn | Ncmsn | < | ||
+ | | 2 | privatizacije | ||
+ | | 3 | na | na | Sl | Sl | < | ||
+ | | 4 | Kosovu | Kosovo | Npnsl | Npnsl | <nowiki> | ||
+ | | 5 | pod | pod | Si | Si | < | ||
+ | | 6 | povećalom | povećalo | Ncnsi | Ncnsi | < | ||
| |||||||||| | | |||||||||| | ||
- | | 1 | Ерик | + | | 1 | Kosovo |
- | | 2 | Франк | + | | 2 | ozbiljno | ozbiljno | Rgp | Rgp | <nowiki> |
- | | 3 | Ръсел | + | | 3 | analizira | analizirati | Vmr3s | Vmr3s | < |
+ | | 4 | proces | ||
+ | | 5 | privatizacije | privatizacija | Ncfsg | Ncfsg | < | ||
+ | | 6 | u | u | Sl | Sl | < | ||
+ | | 7 | svjetlu | svjetlo | Ncnsl | Ncnsl | <nowiki> | ||
+ | | 8 | učestalih | učestao | Agpfpg | Agpfpg | ||
+ | | 9 | pritužbi | pritužba | Ncfpg | Ncfpg | < | ||
+ | | 10 | < | ||
| |||||||||| | | |||||||||| | ||
- | | 1 | Пълен | + | | 1 | Barem | barem | Rgp | Rgp | < |
- | | 2 | мрак | + | | 2 | na | na | Sl | Sl | < |
- | | 3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj | | + | | 3 | papiru |
- | | 4 | пълна | + | | 4 | <nowiki>,</ |
- | | 5 | самота | + | | 5 | izgleda |
- | | 6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct | | + | | 6 | kao | kao | Cs | Cs | <nowiki> |
+ | | 7 | odlična | odličan | Agpfsn | Agpfsn | ||
+ | | 8 | ideja | ideja | Ncfsn | Ncfsn | < | ||
+ | | 9 | < | ||
==== Parsing ==== | ==== Parsing ==== | ||
- | Nonprojectivities in BTB are rare. Only 747 of the 196, | + | Nonprojectivities in SETimes.HR |
- | + | ||
- | The results of the CoNLL 2006 shared task are [[http:// | + | |
- | + | ||
- | ^ Parser (Authors) ^ LAS ^ UAS ^ | + | |
- | | MST (McDonald et al.) | 87.57 | 92.04 | | + | |
- | | Malt (Nivre et al.) | 87.41 | 91.72 | | + | |
- | | Nara (Yuchang Cheng) | 86.34 | 91.30 | | + | |
+ | //Are there any published parsing results on this corpus?// |
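The 90:10 training-test division described in the size section is a plain positional split: the leading sentences go to training and the rest to testing. The following is a minimal sketch of that idea, not the actual HamleDT tooling. It assumes a tab-separated file with one token per line and a blank line between sentences; the function names `read_conll` and `split_sentences` are hypothetical and chosen only for illustration. The constants are the figures quoted on this page.

```python
# Figures quoted on this page for the improved pre-release version.
TOTAL_SENTENCES, TOTAL_TOKENS = 3736, 83640
TRAIN_SENTENCES, TRAIN_TOKENS = 3362, 75236
TEST_SENTENCES, TEST_TOKENS = 374, 8404


def read_conll(lines):
    """Group token lines into sentences.

    Assumption: tokens are tab-separated, one per line, and sentences
    are separated by blank lines.
    """
    sentences, current = [], []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            current.append(line.split("\t"))
        elif current:
            sentences.append(current)
            current = []
    if current:  # last sentence without a trailing blank line
        sentences.append(current)
    return sentences


def split_sentences(sentences, n_train):
    """Return (training, test): the first n_train sentences and the rest."""
    return sentences[:n_train], sentences[n_train:]


def count_tokens(sentences):
    """Total number of tokens across a list of sentences."""
    return sum(len(sentence) for sentence in sentences)


# Consistency check of the quoted figures.
assert TRAIN_SENTENCES + TEST_SENTENCES == TOTAL_SENTENCES
assert TRAIN_TOKENS + TEST_TOKENS == TOTAL_TOKENS
```

Applied to the real data, one would call `split_sentences(read_conll(f), TRAIN_SENTENCES)` on an open file handle `f` and compare `count_tokens` of both parts against the token counts above.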