Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks:et [2011/11/21 13:26] zeman Obtaining, license, references and domain. |
user:zeman:treebanks:et [2011/11/28 09:48] zeman Size and nonprojectivity. |
||
---|---|---|---|
Line 13: | Line 13: | ||
The EKP is freely [[http:// | The EKP is freely [[http:// | ||
- | EKP was created / coordinated (?) by Kaili Müürisep, [[http:// | + | EKP was created / coordinated (?) by Kaili Müürisep, [[http:// |
==== References ==== | ==== References ==== | ||
Line 37: | Line 37: | ||
==== Size ==== | ==== Size ==== | ||
- | According to their website, | + | All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. |
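Since no official split is defined, one way to build a test set that draws on all four parts is to take every k-th sentence from each file. The following is only a sketch: the function name `sample_test_indices` and the 1-in-10 ratio are assumptions made for illustration, not something the treebank prescribes. The sentence counts are the per-file counts listed in this section.

```python
# Sentence counts per file, as listed in the Size section.
parts = {
    "arborest.xml": 175,
    "piialaused.xml": 732,
    "ratsepalaused.xml": 388,
    "sul.xml": 20,
}

def sample_test_indices(n_sentences, every=10):
    """Return 0-based indices of every `every`-th sentence of one part.

    The ratio (1 in 10) is an arbitrary assumption, not an official split.
    """
    return list(range(every - 1, n_sentences, every))

# Sample from each part separately so that every domain is represented.
test_split = {name: sample_test_indices(n) for name, n in parts.items()}
test_sizes = {name: len(indices) for name, indices in test_split.items()}
```

Sampling per file rather than from the concatenated corpus keeps even the smallest part (sul.xml) represented in the test set.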
- | The CoNLL 2006 version contains 705,304 tokens in 39573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens | + | ^ File ^ Sentences ^ Terminals ^ Terminals per sentence ^ |
- | + | | arborest.xml | 175 | 2451 | 14.01 | | |
- | The CoNLL 2009 version contains 712,332 tokens in 40020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36020 sentences training, 32033 tokens / 2000 sentences development, | + | | piialaused.xml | 732 | 4505 | 6.15 | |
+ | | ratsepalaused.xml | 388 | 2348 | 6.05 | | ||
+ | | sul.xml | 20 | 187 | 9.35 | | ||
+ | | **total** | 1315 | 9491 | 7.22 | | ||
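The averages in the table can be re-derived from the sentence and terminal counts. A minimal sketch, with the counts copied from the table; the variable and function names are chosen here for illustration only:

```python
# (sentences, terminals) per file, copied from the table above.
parts = {
    "arborest.xml": (175, 2451),
    "piialaused.xml": (732, 4505),
    "ratsepalaused.xml": (388, 2348),
    "sul.xml": (20, 187),
}

def average_tokens(sentences, terminals):
    """Average number of terminals per sentence, rounded to two decimals."""
    return round(terminals / sentences, 2)

per_file = {name: average_tokens(s, t) for name, (s, t) in parts.items()}
total_sentences = sum(s for s, _ in parts.values())
total_terminals = sum(t for _, t in parts.values())
overall = average_tokens(total_sentences, total_terminals)

print(per_file)
print(total_sentences, total_terminals, overall)
```

Note that the overall average is computed from the totals, not as the mean of the four per-file averages, which would overweight the small files.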
==== Inside ==== | ==== Inside ==== | ||
- | The treebank is part of the [[http:// | + | The treebank is part of the [[http:// |
- | All versions contain // | + | The annotation contains lemmas, |
- | It is not clear what the //semi-automatic// annotation means (probably first auto-tagging, then manual correction? | + | Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities. |
- | + | ||
- | The original treebank is phrase-based. The dependencies in the CoNLL versions must have thus been drawn using a head-selection procedure. Besides CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format. | + | |
==== Sample ==== | ==== Sample ==== | ||
- | The first sentence of TIGER Treebank 2.1 in the TIGER-XML format: | + | The first sentence of the corpus |
- | <code xml>< | + | <code xml>< |
- | <graph root=" | + | <graph root="ratsep-13_501"> |
- | < | + | < |
- | <t id=" | + | <t id="ratsep-13_1" word="Peeter" lemma=" |
- | <t id=" | + | <t id="ratsep-13_2" word="aerutas" lemma=" |
- | <t id="s1_3" word="Perot" lemma=" | + | <t id="ratsep-13_3" word="üle" lemma=" |
- | <t id="s1_4" word="wäre" lemma=" | + | <t id="ratsep-13_4" word="väina" lemma=" |
- | <t id="s1_5" word="vielleicht" lemma=" | + | <t id="ratsep-13_5" word="saarele" lemma=" |
- | <t id="s1_6" word="ein" lemma=" | + | <t id="ratsep-13_6" word="puhkama" lemma=" |
- | <t id="s1_7" word="prächtiger" lemma=" | + | <t id="ratsep-13_7" word="." lemma=" |
- | <t id="s1_8" word="Diktator" lemma=" | + | </ |
- | <t id="s1_9" word="'' | + | |
- | </ | + | < |
- | < | + | <nt id="ratsep-13_501" cat="VROOT"> |
- | <nt id="s1_500" cat="PN"> | + | <edge label=" |
- | <edge label=" | + | </ |
- | < | + | <nt id="ratsep-13_502" cat="fcl"> |
- | | + | <edge label=" |
- | <nt id="s1_501" cat="NP"> | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | </ | + | <edge label=" |
- | <nt id=" | + | </ |
- | | + | <nt id="ratsep-13_503" cat="pp"> |
- | <edge label=" | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | < | + | </ |
- | | + | </ |
- | <nt id="s1_VROOT" cat="VROOT"> | + | </ |
- | <edge label=" | + | |
- | <edge label=" | + | |
- | <edge label=" | + | |
- | </ | + | |
- | </ | + | |
- | </ | + | |
</ | </ | ||
- | |||
- | The first sentence of the CoNLL 2006 training data: | ||
- | |||
- | | 1 | `` | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | | ||
- | | 2 | Ross | _ | NE | NE | _ | 4 | SB | 4 | SB | | ||
- | | 3 | Perot | _ | NE | NE | _ | 2 | PNC | 2 | PNC | | ||
- | | 4 | wäre | _ | VAFIN | VAFIN | _ | 0 | ROOT | 0 | ROOT | | ||
- | | 5 | vielleicht | _ | ADV | ADV | _ | 4 | MO | 4 | MO | | ||
- | | 6 | ein | _ | ART | ART | _ | 8 | NK | 8 | NK | | ||
- | | 7 | prächtiger | _ | ADJA | ADJA | _ | 8 | NK | 8 | NK | | ||
- | | 8 | Diktator | _ | NN | NN | _ | 4 | PD | 4 | PD | | ||
- | | 9 | < | ||
- | |||
- | The first sentence of the CoNLL 2006 test data: | ||
- | |||
- | | 1 | Zwei | _ | CARD | CARD | _ | 2 | NK | 2 | NK | | ||
- | | 2 | Themen | _ | NN | NN | _ | 14 | SB | 14 | SB | | ||
- | | 3 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | | ||
- | | 4 | die | _ | PRELS | PRELS | _ | 8 | OA | 8 | OA | | ||
- | | 5 | Perot | _ | NE | NE | _ | 8 | SB | 8 | SB | | ||
- | | 6 | immer | _ | ADV | ADV | _ | 7 | MO | 7 | MO | | ||
- | | 7 | wieder | _ | ADV | ADV | _ | 8 | MO | 8 | MO | | ||
- | | 8 | anspricht | _ | VVFIN | VVFIN | _ | 2 | RC | 2 | RC | | ||
- | | 9 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | | ||
- | | 10 | Rezession | _ | NN | NN | _ | 2 | APP | 2 | APP | | ||
- | | 11 | und | _ | KON | KON | _ | 10 | CD | 10 | CD | | ||
- | | 12 | Bürokratie | _ | NN | NN | _ | 10 | CJ | 10 | CJ | | ||
- | | 13 | , | _ | $, | $, | _ | 14 | PUNC | 14 | PUNC | | ||
- | | 14 | machen | _ | VVFIN | VVFIN | _ | 0 | ROOT | 0 | ROOT | | ||
- | | 15 | ihnen | _ | PPER | PPER | _ | 18 | DA | 18 | DA | | ||
- | | 16 | besonders | _ | ADV | ADV | _ | 18 | MO | 18 | MO | | ||
- | | 17 | zu | _ | PTKZU | PTKZU | _ | 18 | PM | 18 | PM | | ||
- | | 18 | schaffen | _ | VVINF | VVINF | _ | 14 | OC | 14 | OC | | ||
- | | 19 | . | _ | $. | $. | _ | 14 | PUNC | 14 | PUNC | | ||
- | |||
- | The first sentence of the CoNLL 2009 training data: | ||
- | |||
- | | 1 | `` | _ | `` | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | | ||
- | | 2 | Ross | Ross | Roß | NE | NN | Nom< | ||
- | | 3 | Perot | Perot | Perot | NE | NE | Nom< | ||
- | | 4 | wäre | sein | sein | VAFIN | VAFIN | 3< | ||
- | | 5 | vielleicht | vielleicht | vielleicht | ADV | ADV | _ | _ | 4 | 4 | MO | MO | _ | _ | | ||
- | | 6 | ein | ein | ein | ART | ART | Nom< | ||
- | | 7 | prächtiger | prächtig | prächtig | ADJA | ADJA | Pos< | ||
- | | 8 | Diktator | Diktator | Diktator | NN | NN | Nom< | ||
- | | 9 | < | ||
- | |||
- | The first sentence of the CoNLL 2009 development data: | ||
- | |||
- | | 1 | Maschinenbau | Maschinenbau | Maschinenbau | NN | NN | Nom< | ||
- | | 2 | / | _ | / | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | | ||
- | | 3 | ( | _ | ( | $( | $( | _ | _ | 0 | 4 | PUNC | PUNC | _ | _ | | ||
- | | 4 | Zusammenfassung | Zusammenfassung | Zusammenfassung | NN | NN | Nom< | ||
- | | 5 | ) | _ | ) | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | | ||
- | |||
- | The first sentence of the CoNLL 2009 test data: | ||
- | |||
- | | 1 | Gegen | gegen | gegen | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 2 | eine | ein | ein | ART | ART | Acc< | ||
- | | 3 | Erweiterung | Erweiterung | Erweiterung | NN | NN | Acc< | ||
- | | 4 | ihrer | ihr | ihr | PPOSAT | PPOSAT | Gen< | ||
- | | 5 | Organisation | Organisation | Organisation | NN | NN | Gen< | ||
- | | 6 | zu | zu | zu | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 7 | einem | ein | ein | ART | ART | Dat< | ||
- | | 8 | sicherheitspolitischen | sicherheitspolitisch | sicherheitspolitisch | ADJA | ADJA | Pos< | ||
- | | 9 | Forum | Forum | Forum | NN | NN | Dat< | ||
- | | 10 | sprachen | sprechen | sprechen | VVFIN | VVFIN | 3< | ||
- | | 11 | sich | sich | er< | ||
- | | 12 | die | der | d | ART | ART | Nom< | ||
- | | 13 | meisten | meister | meist | PIAT | PIAT | Nom< | ||
- | | 14 | Staaten | Staat | Staat | NN | NN | Nom< | ||
- | | 15 | beim | bei | beim | APPRART | APPRART | Dat< | ||
- | | 16 | Gipfeltreffen | Gipfeltreffen | Gipfeltreffen | NN | NN | Dat< | ||
- | | 17 | für | für | für | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 18 | Asiatisch-Pazifische | asiatisch-pazifisch | Asiatisch-Pazifische | ADJA | NN | Pos< | ||
- | | 19 | Wirtschaftskooperation | Wirtschaftskooperation | Wirtschaftskooperation | NN | NN | Acc< | ||
- | | 20 | ( | _ | ( | $( | $( | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 21 | Apec | Apec | _ | NE | NE | Nom< | ||
- | | 22 | ) | _ | ) | $( | $( | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 23 | in | in | in | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 24 | Osaka | Osaka | Osaka | NE | NE | Dat< | ||
- | | 25 | aus | aus | aus | PTKVZ | PTKVZ | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 26 | . | _ | . | $. | $. | _ | _ | _ | _ | _ | _ | _ | | ||
==== Parsing ==== | ==== Parsing ==== | ||
- | TIGER is a mildly nonprojective treebank. 15875 of the 680, | + | Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%). |
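For reference, a token is usually counted as nonprojectively attached when some token lying between it and its head is not dominated by that head. The sketch below applies this common definition to a dependency tree given as a list of head indices; it is an illustration under that assumption, not necessarily the exact procedure used to obtain the figure above, and the function name is hypothetical.

```python
def nonprojective_tokens(heads):
    """Return 1-based positions of tokens attached nonprojectively.

    heads[i] is the 1-based head of token i+1; 0 marks the root.
    The input is assumed to be a well-formed tree (no cycles).
    """
    def is_dominated_by(ancestor, node):
        # Walk up from node towards the root, looking for ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return False

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        if head == 0:
            continue
        left, right = sorted((dep, head))
        # Projective arc: every token strictly between head and
        # dependent must be dominated by the head.
        if any(not is_dominated_by(head, k) for k in range(left + 1, right)):
            result.append(dep)
    return result

# Token 2 depends on token 4, but token 3 in between hangs on token 1.
example = nonprojective_tokens([0, 4, 1, 1])

# Share of nonprojective attachments reported for EKP: 7 of 9491 tokens.
percentage = round(100 * 7 / 9491, 3)
```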
- | + | ||
- | The results of the CoNLL 2006 shared task are [[http:// | + | |
- | + | ||
- | ^ Parser (Authors) ^ LAS ^ UAS ^ | + | |
- | | MST (McDonald et al.) | 87.34 | 90.38 | | + | |
- | | Riedel et al. | 86.24 | 89.76 | | + | |
- | | Basis (O' | + | |
- | | Malt (Nivre et al.) | 85.82 | 88.76 | | + | |
- | + | ||
- | The results of the CoNLL 2009 shared task are [[http:// | + | |
- | ^ Parser (Authors) ^ LAS ^ | + | There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. However, I cannot rule out that the treebank described here is simply the output of that parser. |
- | | Bohnet | 87.48 | | + | |
- | | Merlo | 87.29 | | + | |
- | | Chen | 86.24 | | + | |
- | | Che | 86.19 | | + | |