Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
user:zeman:treebanks:et [2011/11/21 10:30] zeman created |
user:zeman:treebanks:et [2011/11/28 17:10] (current) zeman New training/test data split. |
||
---|---|---|---|
Line 1: | Line 1: | ||
===== Estonian (et) ===== | ===== Estonian (et) ===== | ||
- | [[http:// | + | [[http:// |
==== Versions ==== | ==== Versions ==== | ||
- | * TIGER Treebank 1 (2003) | + | * Downloadable on-line, part of Arborest project |
- | * TIGER Treebank 2 (2005) | + | * 8.12.2010 arborest.xml downloadable from the same site (same size, improved markup) |
- | * TIGER Treebank 2.1 (2007) in [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/ | + | * http://vvv.cs.ut.ee/~kaili/Korpus/pindmine/ |
- | * CoNLL 2006 | + | |
- | * CoNLL 2009 | + | |
==== Obtaining and License ==== | ==== Obtaining and License ==== | ||
- | The TIGER Treebank | + | The EKP is freely |
- | Republication of the two CoNLL versions in LDC is planned but it has not happened yet. | + | EKP was created / coordinated |
- | + | ||
- | The license in short: | + | |
- | + | ||
- | * non-commercial research and evaluation usage by academic or educational institutions | + | |
- | * no redistribution | + | |
- | * acknowledge the use of the corpus in publications | + | |
- | + | ||
- | The TIGER Treebank | + | |
- | * [[http:// | + | |
- | * [[http:// | + | |
- | * [[http:// | + | |
==== References ==== | ==== References ==== | ||
* Website | * Website | ||
- | * http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/ | + | * http://vvv.cs.ut.ee/~kaili/Korpus/puud/ ([[http:// |
* Data | * Data | ||
* //no separate citation// | * //no separate citation// | ||
* Principal publications | * Principal publications | ||
- | * Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, George Smith: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/paper/treeling2002.pdf|The TIGER Treebank]]. In: Proceedings | + | * Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo: [[https://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/RANLP/ |
- | * [[http:// | + | * Documentation |
- | * [[http:// | + | * [[http://beta.visl.sdu.dk/treebanks.html# |
- | * [[http://www.ims.uni-stuttgart.de/projekte/ | + | * The header of the TIGER-XML version of the treebank |
- | * Berthold Crysmann, Silvia Hansen-Schirra, | + | |
- | * Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Hagen Hirschmann, Juliane Janitzek, Carolin Kirstein, Robert Langner, Lukas Michelbacher, | + | |
- | * The header of the XML version of the TIGER Treebank | + | |
==== Domain ==== | ==== Domain ==== | ||
- | Mostly newswire (Frankfurter Rundschau). | + | Mixed: |
+ | * 388 tailored sentences with movement verbs | ||
+ | * 732 sentences with movement verbs from the Estonian FrameNet corpus | ||
+ | * 175 sentences from the Arborest corpus | ||
+ | * 20 sentences of spoken language | ||
==== Size ==== | ==== Size ==== | ||
- | According to their website, | + | All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training/test data split is defined. Because the treebank is small and its domains are unusually diverse, a good test set should sample from all four parts of the treebank. Our HamleDT experimental data split, shown in the last two rows of the table, does so. |
- | The CoNLL 2006 version contains 705,304 tokens in 39573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens | + | ^ File ^ Sentences ^ Terminals ^ Average t/s ^ |
- | + | | arborest.xml | 175 | 2451 | 14.01 | | |
- | The CoNLL 2009 version contains 712,332 tokens in 40020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36020 sentences | + | | piialaused.xml | 732 | 4505 | 6.15 | |
+ | | ratsepalaused.xml | 388 | 2348 | 6.05 | | ||
+ | | sul.xml | 20 | 187 | 9.35 | | ||
+ | | **total** | **1315** | **9491** | **7.22** | | ||
+ | | training | ||
+ | | test | 131 | 956 | 7.30 | | ||
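The per-file averages and the totals in the table can be rechecked with a few lines of Python. This is only a sketch: the counts are copied from the table rows above, while the names `PARTS`, `tokens_per_sentence` and `summarize` are made up for this illustration and are not part of any treebank tooling.

```python
# Sentence and terminal counts copied from the size table (file -> (sentences, terminals)).
PARTS = {
    "arborest.xml": (175, 2451),
    "piialaused.xml": (732, 4505),
    "ratsepalaused.xml": (388, 2348),
    "sul.xml": (20, 187),
}


def tokens_per_sentence(sentences, terminals):
    """Average number of terminals per sentence, rounded as in the table."""
    return round(terminals / sentences, 2)


def summarize(parts):
    """Return total sentences, total terminals, overall average and per-file averages."""
    total_sentences = sum(s for s, _ in parts.values())
    total_terminals = sum(t for _, t in parts.values())
    averages = {name: tokens_per_sentence(s, t) for name, (s, t) in parts.items()}
    overall = tokens_per_sentence(total_sentences, total_terminals)
    return total_sentences, total_terminals, overall, averages


TOTAL_SENTENCES, TOTAL_TERMINALS, TOTAL_AVERAGE, AVERAGES = summarize(PARTS)

if __name__ == "__main__":
    for name, average in AVERAGES.items():
        print(name, average)
    print("total", TOTAL_SENTENCES, TOTAL_TERMINALS, TOTAL_AVERAGE)
```

The same helper applies to the test row (131 sentences, 956 terminals).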
==== Inside ==== | ==== Inside ==== | ||
- | All versions contain | + | The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http:// |
- | It is not clear what the // | + | The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. |
- | The original treebank is phrase-based. The dependencies in the CoNLL versions must have thus been drawn using a head-selection procedure. Besides CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format. | + | Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities. |
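Because TIGER-XML lists the terminals in surface order and attaches them to nonterminals only through `idref` links, a constituent may cover words that are not adjacent. The sketch below finds such discontinuous nonterminals. It assumes the standard TIGER-XML element and attribute names (`t`, `nt`, `edge`, `idref`); the toy sentence is invented for illustration and is not taken from the treebank, and `discontinuous_nodes` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Invented toy sentence: ids, words, categories and labels are placeholders,
# not real annotation from the Estonian treebank.
TOY = """
<s id="toy_1">
  <graph root="toy_1_502">
    <terminals>
      <t id="toy_1_1" word="w1"/>
      <t id="toy_1_2" word="w2"/>
      <t id="toy_1_3" word="w3"/>
      <t id="toy_1_4" word="w4"/>
    </terminals>
    <nonterminals>
      <nt id="toy_1_501" cat="X">
        <edge label="a" idref="toy_1_1"/>
        <edge label="b" idref="toy_1_3"/>
      </nt>
      <nt id="toy_1_502" cat="ROOT">
        <edge label="c" idref="toy_1_501"/>
        <edge label="d" idref="toy_1_2"/>
        <edge label="e" idref="toy_1_4"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def discontinuous_nodes(sentence_xml):
    """Return ids of nonterminals whose terminal yield has a gap in word order."""
    sentence = ET.fromstring(sentence_xml.strip())
    # Word order is given by the order of the <t> elements.
    position = {t.get("id"): i for i, t in enumerate(sentence.iter("t"))}
    # Structure is given separately by the <edge idref="..."> links.
    children = {
        nt.get("id"): [edge.get("idref") for edge in nt.findall("edge")]
        for nt in sentence.iter("nt")
    }

    def yield_positions(node_id):
        # Collect the word positions of all terminals below node_id.
        if node_id in position:
            return [position[node_id]]
        found = []
        for child in children[node_id]:
            found.extend(yield_positions(child))
        return found

    result = []
    for nt_id in children:
        span = sorted(yield_positions(nt_id))
        # A contiguous yield of n words covers exactly n consecutive positions.
        if span and span[-1] - span[0] + 1 != len(span):
            result.append(nt_id)
    return result


if __name__ == "__main__":
    print(discontinuous_nodes(TOY))
```

In the toy sentence, node `toy_1_501` covers the first and third word but not the second, so it is reported; the root covers all four words and is not.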
==== Sample ==== | ==== Sample ==== | ||
- | The first sentence of TIGER Treebank 2.1 in the TIGER-XML format: | + | The first sentence of the corpus |
- | <code xml>< | + | <code xml>< |
- | <graph root=" | + | <graph root="ratsep-13_501"> |
- | < | + | < |
- | <t id=" | + | <t id="ratsep-13_1" word="Peeter" lemma=" |
- | <t id=" | + | <t id="ratsep-13_2" word="aerutas" lemma=" |
- | <t id="s1_3" word="Perot" lemma=" | + | <t id="ratsep-13_3" word="üle" lemma=" |
- | <t id="s1_4" word="wäre" lemma=" | + | <t id="ratsep-13_4" word="väina" lemma=" |
- | <t id="s1_5" word="vielleicht" lemma=" | + | <t id="ratsep-13_5" word="saarele" lemma=" |
- | <t id="s1_6" word="ein" lemma=" | + | <t id="ratsep-13_6" word="puhkama" lemma=" |
- | <t id="s1_7" word="prächtiger" lemma=" | + | <t id="ratsep-13_7" word="." lemma=" |
- | <t id="s1_8" word="Diktator" lemma=" | + | </ |
- | <t id="s1_9" word="'' | + | |
- | </ | + | < |
- | < | + | <nt id="ratsep-13_501" cat="VROOT"> |
- | <nt id="s1_500" cat="PN"> | + | <edge label=" |
- | <edge label=" | + | </ |
- | < | + | <nt id="ratsep-13_502" cat="fcl"> |
- | | + | <edge label=" |
- | <nt id="s1_501" cat="NP"> | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | </ | + | <edge label=" |
- | <nt id=" | + | </ |
- | | + | <nt id="ratsep-13_503" cat="pp"> |
- | <edge label=" | + | <edge label=" |
- | <edge label=" | + | <edge label=" |
- | < | + | </ |
- | | + | </ |
- | <nt id="s1_VROOT" cat="VROOT"> | + | </ |
- | <edge label=" | + | |
- | <edge label=" | + | |
- | <edge label=" | + | |
- | </ | + | |
- | </ | + | |
- | </ | + | |
</ | </ | ||
- | |||
- | The first sentence of the CoNLL 2006 training data: | ||
- | |||
- | | 1 | `` | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | | ||
- | | 2 | Ross | _ | NE | NE | _ | 4 | SB | 4 | SB | | ||
- | | 3 | Perot | _ | NE | NE | _ | 2 | PNC | 2 | PNC | | ||
- | | 4 | wäre | _ | VAFIN | VAFIN | _ | 0 | ROOT | 0 | ROOT | | ||
- | | 5 | vielleicht | _ | ADV | ADV | _ | 4 | MO | 4 | MO | | ||
- | | 6 | ein | _ | ART | ART | _ | 8 | NK | 8 | NK | | ||
- | | 7 | prächtiger | _ | ADJA | ADJA | _ | 8 | NK | 8 | NK | | ||
- | | 8 | Diktator | _ | NN | NN | _ | 4 | PD | 4 | PD | | ||
- | | 9 | < | ||
- | |||
- | The first sentence of the CoNLL 2006 test data: | ||
- | |||
- | | 1 | Zwei | _ | CARD | CARD | _ | 2 | NK | 2 | NK | | ||
- | | 2 | Themen | _ | NN | NN | _ | 14 | SB | 14 | SB | | ||
- | | 3 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | | ||
- | | 4 | die | _ | PRELS | PRELS | _ | 8 | OA | 8 | OA | | ||
- | | 5 | Perot | _ | NE | NE | _ | 8 | SB | 8 | SB | | ||
- | | 6 | immer | _ | ADV | ADV | _ | 7 | MO | 7 | MO | | ||
- | | 7 | wieder | _ | ADV | ADV | _ | 8 | MO | 8 | MO | | ||
- | | 8 | anspricht | _ | VVFIN | VVFIN | _ | 2 | RC | 2 | RC | | ||
- | | 9 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | | ||
- | | 10 | Rezession | _ | NN | NN | _ | 2 | APP | 2 | APP | | ||
- | | 11 | und | _ | KON | KON | _ | 10 | CD | 10 | CD | | ||
- | | 12 | Bürokratie | _ | NN | NN | _ | 10 | CJ | 10 | CJ | | ||
- | | 13 | , | _ | $, | $, | _ | 14 | PUNC | 14 | PUNC | | ||
- | | 14 | machen | _ | VVFIN | VVFIN | _ | 0 | ROOT | 0 | ROOT | | ||
- | | 15 | ihnen | _ | PPER | PPER | _ | 18 | DA | 18 | DA | | ||
- | | 16 | besonders | _ | ADV | ADV | _ | 18 | MO | 18 | MO | | ||
- | | 17 | zu | _ | PTKZU | PTKZU | _ | 18 | PM | 18 | PM | | ||
- | | 18 | schaffen | _ | VVINF | VVINF | _ | 14 | OC | 14 | OC | | ||
- | | 19 | . | _ | $. | $. | _ | 14 | PUNC | 14 | PUNC | | ||
- | |||
- | The first sentence of the CoNLL 2009 training data: | ||
- | |||
- | | 1 | `` | _ | `` | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | | ||
- | | 2 | Ross | Ross | Roß | NE | NN | Nom< | ||
- | | 3 | Perot | Perot | Perot | NE | NE | Nom< | ||
- | | 4 | wäre | sein | sein | VAFIN | VAFIN | 3< | ||
- | | 5 | vielleicht | vielleicht | vielleicht | ADV | ADV | _ | _ | 4 | 4 | MO | MO | _ | _ | | ||
- | | 6 | ein | ein | ein | ART | ART | Nom< | ||
- | | 7 | prächtiger | prächtig | prächtig | ADJA | ADJA | Pos< | ||
- | | 8 | Diktator | Diktator | Diktator | NN | NN | Nom< | ||
- | | 9 | < | ||
- | |||
- | The first sentence of the CoNLL 2009 development data: | ||
- | |||
- | | 1 | Maschinenbau | Maschinenbau | Maschinenbau | NN | NN | Nom< | ||
- | | 2 | / | _ | / | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | | ||
- | | 3 | ( | _ | ( | $( | $( | _ | _ | 0 | 4 | PUNC | PUNC | _ | _ | | ||
- | | 4 | Zusammenfassung | Zusammenfassung | Zusammenfassung | NN | NN | Nom< | ||
- | | 5 | ) | _ | ) | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | | ||
- | |||
- | The first sentence of the CoNLL 2009 test data: | ||
- | |||
- | | 1 | Gegen | gegen | gegen | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 2 | eine | ein | ein | ART | ART | Acc< | ||
- | | 3 | Erweiterung | Erweiterung | Erweiterung | NN | NN | Acc< | ||
- | | 4 | ihrer | ihr | ihr | PPOSAT | PPOSAT | Gen< | ||
- | | 5 | Organisation | Organisation | Organisation | NN | NN | Gen< | ||
- | | 6 | zu | zu | zu | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 7 | einem | ein | ein | ART | ART | Dat< | ||
- | | 8 | sicherheitspolitischen | sicherheitspolitisch | sicherheitspolitisch | ADJA | ADJA | Pos< | ||
- | | 9 | Forum | Forum | Forum | NN | NN | Dat< | ||
- | | 10 | sprachen | sprechen | sprechen | VVFIN | VVFIN | 3< | ||
- | | 11 | sich | sich | er< | ||
- | | 12 | die | der | d | ART | ART | Nom< | ||
- | | 13 | meisten | meister | meist | PIAT | PIAT | Nom< | ||
- | | 14 | Staaten | Staat | Staat | NN | NN | Nom< | ||
- | | 15 | beim | bei | beim | APPRART | APPRART | Dat< | ||
- | | 16 | Gipfeltreffen | Gipfeltreffen | Gipfeltreffen | NN | NN | Dat< | ||
- | | 17 | für | für | für | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 18 | Asiatisch-Pazifische | asiatisch-pazifisch | Asiatisch-Pazifische | ADJA | NN | Pos< | ||
- | | 19 | Wirtschaftskooperation | Wirtschaftskooperation | Wirtschaftskooperation | NN | NN | Acc< | ||
- | | 20 | ( | _ | ( | $( | $( | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 21 | Apec | Apec | _ | NE | NE | Nom< | ||
- | | 22 | ) | _ | ) | $( | $( | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 23 | in | in | in | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 24 | Osaka | Osaka | Osaka | NE | NE | Dat< | ||
- | | 25 | aus | aus | aus | PTKVZ | PTKVZ | _ | _ | _ | _ | _ | _ | _ | | ||
- | | 26 | . | _ | . | $. | $. | _ | _ | _ | _ | _ | _ | _ | | ||
==== Parsing ==== | ==== Parsing ==== | ||
- | TIGER is a mildly nonprojective treebank. 15875 of the 680, | + | Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%). |
- | + | ||
- | The results of the CoNLL 2006 shared task are [[http:// | + | |
- | + | ||
- | ^ Parser (Authors) ^ LAS ^ UAS ^ | + | |
- | | MST (McDonald et al.) | 87.34 | 90.38 | | + | |
- | | Riedel et al. | 86.24 | 89.76 | | + | |
- | | Basis (O' | + | |
- | | Malt (Nivre et al.) | 85.82 | 88.76 | | + | |
- | + | ||
- | The results of the CoNLL 2009 shared task are [[http:// | + | |
- | ^ Parser (Authors) ^ LAS ^ | + | There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. I also cannot rule out that the treebank described here is simply the output of that parser. |
- | | Bohnet | 87.48 | | + | |
- | | Merlo | 87.29 | | + | |
- | | Chen | 86.24 | | + | |
- | | Che | 86.19 | | + | |
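The nonprojectivity rate quoted for EKP in the Parsing section (7 of 9491 tokens) can be reproduced as follows. The page does not say which definition of a nonprojective attachment was used, so this sketch assumes a common one: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name and the four-token example are invented for illustration.

```python
def nonprojective_tokens(heads):
    """Return 1-based indices of tokens attached nonprojectively.

    heads[i] is the 1-based head of token i+1; 0 marks the artificial root.
    The input is assumed to be a well-formed tree (no cycles).
    """

    def dominated_by(node, ancestor):
        # Follow head links upwards until the ancestor or the root is reached.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        # Every token strictly between dependent and head must hang under head.
        if any(not dominated_by(k, head) for k in range(low + 1, high)):
            result.append(dependent)
    return result


# Invented example: token 2 depends on token 4, but token 3 (the root's child)
# lies between them and is not dominated by token 4.
EXAMPLE_HEADS = [3, 4, 0, 3]

# Counts quoted in the Parsing section.
NONPROJECTIVE = 7
TOKENS = 9491
RATE_PERCENT = round(100 * NONPROJECTIVE / TOKENS, 3)

if __name__ == "__main__":
    print(nonprojective_tokens(EXAMPLE_HEADS))
    print(RATE_PERCENT)
```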