Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:hr [2014/07/17 20:59] zeman Size and Inside. |
user:zeman:treebanks:hr [2014/07/28 16:31] (current) zeman Documentation of syntactic tags. |
||
---|---|---|---|
Line 28: | Line 28: | ||
* Documentation | * Documentation | ||
* [[http:// | * [[http:// | ||
+ | * A discussion of the syntactic tags is in Danijela Merkler, Željko Agić, Ana Agić: [[http:// | ||
==== Domain ==== | ==== Domain ==== | ||
Line 38: | Line 39: | ||
The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. | The improved pre-release version contains 83640 tokens in 3736 sentences, yielding 22.39 tokens per sentence on average. | ||
+ | |||
+ | There is no official training-test division of the original data. For HamleDT, we have split the data 90:10, i.e. the first 3362 sentences (75236 tokens) for training and the remaining 374 sentences (8404 tokens) for testing. | ||
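The 90:10 split described above can be sketched as follows. This is a minimal illustration, assuming the data is a CoNLL-style text in which sentences are separated by blank lines; the function names `read_conll_sentences` and `split_train_test` are hypothetical and are not part of HamleDT.

```python
def read_conll_sentences(text):
    """Split CoNLL-style text into sentences (lists of token lines).

    Assumption: sentences are separated by one or more blank lines.
    """
    sentences, current = [], []
    for line in text.splitlines():
        if line.strip():
            current.append(line)
        elif current:
            sentences.append(current)
            current = []
    if current:
        sentences.append(current)
    return sentences


def split_train_test(sentences, n_train=3362):
    """Keep the original order: the first n_train sentences are training data,
    the rest are test data (3362 + 374 = 3736 sentences in the pre-release)."""
    return sentences[:n_train], sentences[n_train:]
```

Because the split is by position rather than random, it is reproducible without storing a seed or a list of sentence ids.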
==== Inside ==== | ==== Inside ==== | ||
Line 43: | Line 46: | ||
All sentences in the improved pre-release version are manually annotated on morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above. | All sentences in the improved pre-release version are manually annotated on morphological and syntactic levels. The officially available version 1 is a mixture of manual and automatic annotation; see the section on sizes above. | ||
- | ==== XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX ==== | + | The treebank is distributed in the [[: |
- | ==== Sample ==== | + | |
- | The first three sentences | + | In Version 1, if there is a token that has empty (" |
- | | 1 | Глава | _ | N | Nc | _ | 0 | ROOT | 0 | ROOT | | + | All sentences in the improved pre-release contain dependency information; |
- | | 2 | трета | _ | M | Mo | gen=f< | + | |
- | | |||||||||| | + | |
- | | 1 | НАРОДНО | _ | A | An | gen=n< | + | |
- | | 2 | СЪБРАНИЕ | _ | N | Nc | gen=n< | + | |
- | | |||||||||| | + | |
- | | 1 | Народното | _ | A | An | gen=n< | + | |
- | | 2 | събрание | _ | N | Nc | gen=n< | + | |
- | | 3 | осъществява | _ | V | Vpi | trans=t< | + | |
- | | 4 | законодателната | _ | A | Af | gen=f< | + | |
- | | 5 | власт | _ | N | Nc | _ | 3 | obj | 3 | obj | | + | |
- | | 6 | и | _ | C | Cp | _ | 3 | conj | 3 | conj | | + | |
- | | 7 | упражнява | _ | V | Vpi | trans=t< | + | |
- | | 8 | парламентарен | _ | A | Am | gen=m< | + | |
- | | 9 | контрол | _ | N | Nc | gen=m< | + | |
- | | 10 | . | _ | Punct | Punct | _ | 3 | punct | 3 | punct | | + | |
- | The first three sentences of the CoNLL 2006 test data: | + | The syntactic tags (DEPREL) are simplistic but somewhat inspired by the Prague Dependency Treebank; there are only 15 of them:
- | | 1 | Единственото | + | ^ Tag ^ Percent ^ Example ^ Description ^ |
- | | 2 | решение | + | | Adv | |
+ | | Ap | 3% | Esat | appositional modifier, incl. first name attached to last name | | ||
+ | | Atr | 26% | privatizacije | attribute modifying a noun phrase | | ||
+ | | Atv | 2% | iskoristiti | ? | | ||
+ | | Aux | 7% | se | ? | | ||
+ | | Co | 3% | a | conjunction as coordination head (Prague-style coordinations) | | ||
+ | | Elp | 0.6% | Proces | ellipsis | | ||
+ | | Obj | 7% | privatizacije | object of a verb | | ||
+ | | Oth | 2% | Barem | other | | ||
+ | | Pnom | 2% | složen | nominal predicate attached to copula | | ||
+ | | Pred | 10% | analizira | predicate (verbal) | | ||
+ | | Prep | 10% | na | preposition | | ||
+ | | Punc | 13% | . | punctuation | | ||
+ | | Sb | 7% | Kosovo | subject | | ||
+ | | Sub | 4% | da | subordinating conjunction | | ||
+ | |||
+ | (The sum of the percentages exceeds 100% because of rounding.) | ||
+ | |||
+ | === Cycles === | ||
+ | |||
+ | Eight dependency graphs in the pre-release version contain cycles. Most of these are individual nodes attached to themselves (according to Željko, this is the default in the annotation software, so the annotator probably just forgot to attach the nodes). Five of them are punctuation nodes, and fixing their attachment should be relatively easy. The only complicated case is sentence #25 in the test file, whose dependency graph is wrong in several places. | ||
+ | |||
+ | train/ | ||
+ | Analitičari upozoravaju na kosovski trend: osnivanje novih političkih stranaka neposredno prije izbora, a od strane ljudi iz već postojećih političkih stranaka ili nekog drugog aspekta javnog života. | ||
+ | |||
+ | train/ | ||
+ | "Ne možemo mnogo učiniti kako bismo je spriječili da ide malo šetati ili plivati. | ||
+ | |||
+ | train/ | ||
+ | U međuvremenu, | ||
+ | |||
+ | train/ | ||
+ | "Nije riječ o tome da imamo jednu političku opciju koja tvrdi kako piramidu ne bi trebalo uništiti, dok druga smatra da je treba uništiti. | ||
+ | |||
+ | train/ | ||
+ | " | ||
+ | |||
+ | train/ | ||
+ | Ulaganja u Srbiji dosegnula su rekordnih 1,5 milijardi eura u 2005. godini, priopćila je u srijedu vlada, izražavajući očekivanja glede nastavka rasta i u sljedećoj godini. | ||
+ | |||
+ | One more Punc-CYCLE: | ||
+ | |||
+ | test/ | ||
+ | Rezultat je toga da je artikulacija praktičnih zajedničkih interesa postala teža, kao i definiranje konkretnih misija. | ||
+ | Translation with the help of Google Translate: | ||
+ | The consequence is that the structuring of practical common interests has become harder, as has the definition of concrete missions. | ||
+ | This is probably the only interesting case. It is not a matter of a node being attached to itself. "
+ | |||
+ | OTHER: | ||
+ | In the sentence train/
+ | ||
+ | The sentence test/001#1 has at its root the auxiliary verb "
+ | |||
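Cycles of the kind listed in this section, both self-attached nodes and longer loops, can be found by following HEAD links from every token. The sketch below assumes each sentence is given as a list of 1-based head indices with 0 standing for the artificial root, and that every head index is within range; the function name `nodes_on_cycles` is hypothetical.

```python
def nodes_on_cycles(heads):
    """Return the set of 1-based token ids that lie on a cycle.

    heads[i] is the head of token i+1; 0 denotes the artificial root.
    Assumption: all head values are in the range 0..len(heads).
    """
    n = len(heads)
    # 0 = unvisited, 1 = on the path currently being followed, 2 = finished
    state = [0] * (n + 1)
    state[0] = 2  # the root has no head, so it can never be on a cycle
    on_cycle = set()
    for start in range(1, n + 1):
        if state[start] != 0:
            continue
        path = []
        node = start
        while state[node] == 0:
            state[node] = 1
            path.append(node)
            node = heads[node - 1]
        if state[node] == 1:
            # We returned to a node on the current path: everything from
            # that node to the end of the path forms a cycle.
            on_cycle.update(path[path.index(node):])
        for visited in path:
            state[visited] = 2
    return on_cycle
```

For example, `nodes_on_cycles([0, 1, 3])` reports token 3, which is attached to itself, while a well-formed tree yields an empty set.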
+ | ==== Sample ==== | ||
+ | |||
+ | The first three sentences of the improved pre-release version: | ||
+ | |||
+ | | 1 | Proces | proces | Ncmsn | Ncmsn | < | ||
+ | | 2 | privatizacije | ||
+ | | 3 | na | na | Sl | Sl | < | ||
+ | | 4 | Kosovu | Kosovo | Npnsl | Npnsl | <nowiki> | ||
+ | | 5 | pod | pod | Si | Si | < | ||
+ | | 6 | povećalom | povećalo | Ncnsi | Ncnsi | < | ||
| |||||||||| | | |||||||||| | ||
- | | 1 | Ерик | + | | 1 | Kosovo |
- | | 2 | Франк | + | | 2 | ozbiljno | ozbiljno | Rgp | Rgp | <nowiki> |
- | | 3 | Ръсел | + | | 3 | analizira | analizirati | Vmr3s | Vmr3s | < |
+ | | 4 | proces | ||
+ | | 5 | privatizacije | privatizacija | Ncfsg | Ncfsg | < | ||
+ | | 6 | u | u | Sl | Sl | < | ||
+ | | 7 | svjetlu | svjetlo | Ncnsl | Ncnsl | <nowiki> | ||
+ | | 8 | učestalih | učestao | Agpfpg | Agpfpg | ||
+ | | 9 | pritužbi | pritužba | Ncfpg | Ncfpg | < | ||
+ | | 10 | < | ||
| |||||||||| | | |||||||||| | ||
- | | 1 | Пълен | + | | 1 | Barem | barem | Rgp | Rgp | < |
- | | 2 | мрак | + | | 2 | na | na | Sl | Sl | < |
- | | 3 | и | _ | C | Cp | _ | 2 | conj | 2 | conj | | + | | 3 | papiru |
- | | 4 | пълна | + | | 4 | <nowiki>,</ |
- | | 5 | самота | + | | 5 | izgleda |
- | | 6 | . | _ | Punct | Punct | _ | 2 | punct | 2 | punct | | + | | 6 | kao | kao | Cs | Cs | <nowiki> |
+ | | 7 | odlična | odličan | Agpfsn | Agpfsn | ||
+ | | 8 | ideja | ideja | Ncfsn | Ncfsn | < | ||
+ | | 9 | < | ||
==== Parsing ==== | ==== Parsing ==== | ||
- | Nonprojectivities in BTB are rare. Only 747 of the 196, | + | Nonprojectivities in SETimes.HR |
- | + | ||
- | The results of the CoNLL 2006 shared task are [[http:// | + | |
- | + | ||
- | ^ Parser (Authors) ^ LAS ^ UAS ^ | + | |
- | | MST (McDonald et al.) | 87.57 | 92.04 | | + | |
- | | Malt (Nivre et al.) | 87.41 | 91.72 | | + | |
- | | Nara (Yuchang Cheng) | 86.34 | 91.30 | | + | |
+ | //Are there any published parsing results on this corpus?// |