Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:et [2011/11/21 10:30]
zeman created
+++ user:zeman:treebanks:et [2011/11/28 17:10] (current)
zeman New training/test data split.
@@ Line 1: / Line 1: @@
 ===== Estonian (et) =====
-[[http://vvv.cs.ut.ee/~kaili/Korpus/puud/|Eesti keele puudepank]] ([[http://translate.google.cz/translate?sl=et&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fvvv.cs.ut.ee%2F~kaili%2FKorpus%2Fpuud%2F&act=url|Google translate]])
+[[http://vvv.cs.ut.ee/~kaili/Korpus/puud/|Eesti keele puudepank]] ([[http://translate.google.cz/translate?sl=et&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fvvv.cs.ut.ee%2F~kaili%2FKorpus%2Fpuud%2F&act=url|Google translate]]) (EKP)
 ==== Versions ====
-  * TIGER Treebank 1 (2003)
+  * Downloadable on-line, part of Arborest project (puudepank)
-  * TIGER Treebank 2 (2005)
+  * 8.12.2010 arborest.xml downloadable from the same site (same size, improved markup)
-  * TIGER Treebank 2.1 (2007) in [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] or Negra export (text) format
+  * http://vvv.cs.ut.ee/~kaili/Korpus/pindmine/
-  * CoNLL 2006
-  * CoNLL 2009
 ==== Obtaining and License ====
-The TIGER Treebank is freely downloadable after you accept the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/license/htmllicense.shtml|license terms]] by pressing a button.
+The EKP is freely [[http://vvv.cs.ut.ee/~kaili/Korpus/puud/|downloadable from here]] in [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|VISL]] or [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] format. Licensing terms are unknown.
-Republication of the two CoNLL versions in LDC is planned but it has not happened yet.
+EKP was created / coordinated (?) by Kaili Müürisep, [[http://www.cs.ut.ee/|Institute of Computer Science]] (Arvutiteaduse instituut), University of Tartu (Tartu Ülikool), Liivi 2, 50409 Tartu, Estonia.
-The license in short:
-  * non-commercial research and evaluation usage by academic or educational institutions
-  * no redistribution
-  * acknowledge the use of the corpus in publications
-The TIGER Treebank was created by members of three institutes:
-  * [[http://www.coli.uni-saarland.de/|Department of Computational Linguistics and Phonetics]] (Computerlinguistik, CoLi), Saarland University (Universität des Saarlandes), Postfach 151150, D-66041 Saarbrücken, Germany.
-  * [[http://www.ims.uni-stuttgart.de/|Institute for Natural Language Processing]] (Institut für Maschinelle Sprachverarbeitung, IMS), University of Stuttgart (Universität Stuttgart), Azenbergstraße 12, D-70174 Stuttgart, Germany.
-  * [[http://www.uni-potsdam.de/germanistik/|German Department]] (Institut für Germanistik), Philosophische Fakultät, Universität Potsdam, Am Neuen Palais 10, Haus 05, D-14469 Potsdam, Germany.
 ==== References ====
   * Website
-    * http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/
+    * http://vvv.cs.ut.ee/~kaili/Korpus/puud/ ([[http://translate.google.cz/translate?sl=et&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fvvv.cs.ut.ee%2F~kaili%2FKorpus%2Fpuud%2F&act=url|Google translate]])
   * Data
     * //no separate citation//
   * Principal publications
-    * Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, George Smith: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/paper/treeling2002.pdf|The TIGER Treebank]]. In: Proceedings of the Workshop on Treebanks and Linguistic Theories (TLT), Sozopol, Bulgaria, 2002.
+    * Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo: [[https://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/RANLP/papers/p16.pdf|A New Language for Constraint Grammar: Estonian]]. In: International Conference Recent Advances in Natural Language Processing. Proceedings, pp. 304-310, Borovets, Bulgaria, 2003.
-    * [[http://www.ims.uni-stuttgart.de/projekte/TIGER/paper/|List of publications]]
+  * Documentation
-  * [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/annotation/|Documentation]]
+    * [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|File formats]]
-    * [[http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html|Stuttgart-Tübingen Tagset]] (part of speech)
+    * The header of the TIGER-XML version of the treebank contains lists of various sorts of tags with brief explanation.
-    * Berthold Crysmann, Silvia Hansen-Schirra, George Smith, Dorothea Ziegler-Eisele: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/annotation/tiger_scheme-morph.pdf|TIGER Morphologie-Annotationsschema]], 2005.
-    * Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Hagen Hirschmann, Juliane Janitzek, Carolin Kirstein, Robert Langner, Lukas Michelbacher, Oliver Plaehn, Cordula Preis, Marcus Pußel, Marco Rower, Bettina Schrader, Anne Schwartz, George Smith, Hans Uszkoreit: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/annotation/tiger_scheme-syntax.pdf|TIGER Annotationsschema]] //(syntax)//, 2003.
-    * The header of the XML version of the TIGER Treebank contains lists of various sorts of tags with brief explanation.
 ==== Domain ====
-Mostly newswire (Frankfurter Rundschau).
+Mixed:
+  * 388 tailored sentences with movement verbs
+  * 732 sentences with movement verbs from the Estonian FrameNet corpus
+  * 175 sentences from the Arborest corpus
+  * 20 sentences of spoken language
 ==== Size ====
-According to their website, the TIGER Treebank version 1 contains approximately 700,000 tokens in 40,000 sentences. Version 2.1 contains approximately 900,000 tokens in 50,000 sentences.
+All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training/test data split is defined. Because the treebank is small and its domains are unusually diverse, a good test set should sample from all four parts of the treebank. Our experimental HamleDT data split does so; it is shown in the last two rows of the table.
-The CoNLL 2006 version contains 705,304 tokens in 39573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens / 39216 sentences training, 5694 tokens / 357 sentences test).
+^ File ^ Sentences ^ Terminals ^ Average t/s ^
+| arborest.xml |  175 |  2451 |  14.01 |
-The CoNLL 2009 version contains 712,332 tokens in 40020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36020 sentences training, 32033 tokens / 2000 sentences development, 31622 tokens / 2000 sentences test).
+| piialaused.xml |  732 |  4505 |  6.15 |
+| ratsepalaused.xml |  388 |  2348 |  6.05 |
+| sul.xml |  20 |  187 |  9.35 |
+| **total** |  **1315** |  **9491** |  **7.22** |
+| training |  1184 |  8535 |  7.21 |
+| test |  131 |  956 |  7.30 |
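The idea of sampling the test set from all four parts can be sketched as follows. This is only an illustration, not the procedure that produced the HamleDT split: the rule "every tenth sentence of each file goes to test" is an assumption, the function names are hypothetical, and placeholder sentence ids stand in for real trees. Only the file names and sentence counts come from the table.

```python
from typing import Dict, List, Tuple

# Sentence counts per file, taken from the size table on this page.
PART_SIZES: Dict[str, int] = {
    "arborest.xml": 175,
    "piialaused.xml": 732,
    "ratsepalaused.xml": 388,
    "sul.xml": 20,
}


def split_part(sentences: List[str], every: int = 10) -> Tuple[List[str], List[str]]:
    """Send every `every`-th sentence of one part to test, the rest to training."""
    train: List[str] = []
    test: List[str] = []
    for index, sentence in enumerate(sentences):
        # index is 0-based, so with every=10 the 10th, 20th, ... sentence is held out
        if index % every == every - 1:
            test.append(sentence)
        else:
            train.append(sentence)
    return train, test


def split_treebank(parts: Dict[str, List[str]], every: int = 10) -> Tuple[List[str], List[str]]:
    """Split each part separately so that every domain is represented in test."""
    train: List[str] = []
    test: List[str] = []
    for name in sorted(parts):
        part_train, part_test = split_part(parts[name], every)
        train.extend(part_train)
        test.extend(part_test)
    return train, test


if __name__ == "__main__":
    # Placeholder ids such as "sul.xml#3" stand in for the real sentences.
    parts = {
        name: [f"{name}#{i}" for i in range(1, size + 1)]
        for name, size in PART_SIZES.items()
    }
    train, test = split_treebank(parts)
    print(len(train), len(test))
```

With this assumed rule the sketch holds out 130 sentences and keeps 1185 for training, which is close to but not identical with the 1184 / 131 split in the table, so the actual split must have been drawn slightly differently.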
 ==== Inside ====
-All versions contain //semi-automatic// part of speech tags ([[http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html|Stuttgart-Tübingen Tagset]], STTS) and syntactic structure. Lemmas and morphosyntactic features are available only for newer versions (TIGER Treebank version 2 and onwards, and CoNLL 2009). The parts of speech are heavily context-dependent, e.g. many words can be used both substantively (pronouns) and attributively (determiners), which is distinguished by different POS tags.
+The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http://beta.visl.sdu.dk/|VISL]] (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] format. Two of them are also available in the [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|VISL]] format.
-It is not clear what the //semi-automatic// annotation means (probably first auto-tagging, then manual correction?) and whether it also applies to the morphosyntactic annotation. The CoNLL 2009 version also contains automatically disambiguated lemmas, tags and features.
+The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually.
-The original treebank is phrase-based. The dependencies in the CoNLL versions must have thus been drawn using a head-selection procedure. Besides CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format.
+Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities.
 ==== Sample ====
-The first sentence of TIGER Treebank 2.1 in the TIGER-XML format:
+The first sentence of the corpus in the TIGER-XML format:
-<code xml><s id="s1">
+<code xml><s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama">
-  <graph root="s1_VROOT">
+	<graph root="ratsep-13_501">
-    <terminals>
+		<terminals>
-      <t id="s1_1" word="``" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
+			<t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
-      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
+			<t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
-      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
+			<t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
-      <t id="s1_4" word="wäre" lemma="sein" pos="VAFIN" morph="3.Sg.Past.Subj" case="--" number="Sg" gender="--" person="3" degree="--" tense="Past" mood="Subj" />
+			<t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
-      <t id="s1_5" word="vielleicht" lemma="vielleicht" pos="ADV" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
+			<t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
-      <t id="s1_6" word="ein" lemma="ein" pos="ART" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
+			<t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
-      <t id="s1_7" word="prächtiger" lemma="prächtig" pos="ADJA" morph="Pos.Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="Pos" tense="--" mood="--" />
+			<t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
-      <t id="s1_8" word="Diktator" lemma="Diktator" pos="NN" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" />
+		</terminals>
-      <t id="s1_9" word="''" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" />
-    </terminals>
+		<nonterminals>
-    <nonterminals>
+			<nt id="ratsep-13_501" cat="VROOT">
-      <nt id="s1_500" cat="PN">
+				<edge label="STA" idref="ratsep-13_502"/>
-        <edge label="PNC" idref="s1_2" />
+			</nt>
-        <edge label="PNC" idref="s1_3" />
+			<nt id="ratsep-13_502" cat="fcl">
-      </nt>
+				<edge label="S" idref="ratsep-13_1"/>
-      <nt id="s1_501" cat="NP">
+				<edge label="P" idref="ratsep-13_2"/>
-        <edge label="NK" idref="s1_6" />
+				<edge label="A" idref="ratsep-13_503"/>
-        <edge label="NK" idref="s1_7" />
+				<edge label="A" idref="ratsep-13_5"/>
-        <edge label="NK" idref="s1_8" />
+				<edge label="A" idref="ratsep-13_6"/>
-      </nt>
+				<edge label="FST" idref="ratsep-13_7"/>
-      <nt id="s1_502" cat="S">
+			</nt>
-        <edge label="SB" idref="s1_500" />
+			<nt id="ratsep-13_503" cat="pp">
-        <edge label="HD" idref="s1_4" />
+				<edge label="H" idref="ratsep-13_3"/>
-        <edge label="MO" idref="s1_5" />
+				<edge label="D" idref="ratsep-13_4"/>
-        <edge label="PD" idref="s1_501" />
+			</nt>
-      </nt>
+		</nonterminals>
-      <nt id="s1_VROOT" cat="VROOT">
+	</graph>
-        <edge label="--" idref="s1_1" />
-        <edge label="--" idref="s1_502" />
-        <edge label="--" idref="s1_9" />
-      </nt>
-    </nonterminals>
-  </graph>
 </s></code>
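A minimal sketch of how such a sentence could be read with the Python standard library. It assumes only the element and attribute names visible in the sample (''s'', ''graph'', ''t'', ''nt'', ''edge'', ''idref'', ''label'', ''cat''); the function name ''read_sentence'' is hypothetical, and the attributes of ''<s>'' are shortened to the id to keep the example compact.

```python
import xml.etree.ElementTree as ET

# The sample sentence from this page; only the id attribute of <s> is kept.
SAMPLE = """<s id="ratsep-13">
<graph root="ratsep-13_501">
<terminals>
<t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
<t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
<t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
<t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
<t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
<t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
<t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
</terminals>
<nonterminals>
<nt id="ratsep-13_501" cat="VROOT">
<edge label="STA" idref="ratsep-13_502"/>
</nt>
<nt id="ratsep-13_502" cat="fcl">
<edge label="S" idref="ratsep-13_1"/>
<edge label="P" idref="ratsep-13_2"/>
<edge label="A" idref="ratsep-13_503"/>
<edge label="A" idref="ratsep-13_5"/>
<edge label="A" idref="ratsep-13_6"/>
<edge label="FST" idref="ratsep-13_7"/>
</nt>
<nt id="ratsep-13_503" cat="pp">
<edge label="H" idref="ratsep-13_3"/>
<edge label="D" idref="ratsep-13_4"/>
</nt>
</nonterminals>
</graph>
</s>"""


def read_sentence(xml_text):
    """Return the root id, the terminals in word order, nonterminal categories and parent links."""
    sentence = ET.fromstring(xml_text)
    graph = sentence.find("graph")
    # Terminals are listed in surface order, independently of the tree structure.
    terminals = [(t.get("id"), t.get("word"), t.get("pos")) for t in graph.iter("t")]
    categories = {nt.get("id"): nt.get("cat") for nt in graph.iter("nt")}
    # Map every child id (terminal or nonterminal) to (parent id, edge label).
    parents = {}
    for nt in graph.iter("nt"):
        for edge in nt.findall("edge"):
            parents[edge.get("idref")] = (nt.get("id"), edge.get("label"))
    return graph.get("root"), terminals, categories, parents


if __name__ == "__main__":
    root, terminals, categories, parents = read_sentence(SAMPLE)
    for node_id, word, pos in terminals:
        parent_id, label = parents[node_id]
        # word, part of speech, edge label, category of the parent phrase
        print(f"{word}\t{pos}\t{label}\t{categories[parent_id]}")
```

Because the structure is stored only as ''idref'' links, the parent map is all that is needed to walk from any word up to the ''VROOT'' node.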
-The first sentence of the CoNLL 2006 training data:
-| 1 | `` | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC |
-| 2 | Ross | _ | NE | NE | _ | 4 | SB | 4 | SB |
-| 3 | Perot | _ | NE | NE | _ | 2 | PNC | 2 | PNC |
-| 4 | wäre | _ | VAFIN | VAFIN | _ | 0 | ROOT | 0 | ROOT |
-| 5 | vielleicht | _ | ADV | ADV | _ | 4 | MO | 4 | MO |
-| 6 | ein | _ | ART | ART | _ | 8 | NK | 8 | NK |
-| 7 | prächtiger | _ | ADJA | ADJA | _ | 8 | NK | 8 | NK |
-| 8 | Diktator | _ | NN | NN | _ | 4 | PD | 4 | PD |
-| 9 | <nowiki>''</nowiki> | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC |
-The first sentence of the CoNLL 2006 test data:
-| 1 | Zwei | _ | CARD | CARD | _ | 2 | NK | 2 | NK |
-| 2 | Themen | _ | NN | NN | _ | 14 | SB | 14 | SB |
-| 3 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC |
-| 4 | die | _ | PRELS | PRELS | _ | 8 | OA | 8 | OA |
-| 5 | Perot | _ | NE | NE | _ | 8 | SB | 8 | SB |
-| 6 | immer | _ | ADV | ADV | _ | 7 | MO | 7 | MO |
-| 7 | wieder | _ | ADV | ADV | _ | 8 | MO | 8 | MO |
-| 8 | anspricht | _ | VVFIN | VVFIN | _ | 2 | RC | 2 | RC |
-| 9 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC |
-| 10 | Rezession | _ | NN | NN | _ | 2 | APP | 2 | APP |
-| 11 | und | _ | KON | KON | _ | 10 | CD | 10 | CD |
-| 12 | Bürokratie | _ | NN | NN | _ | 10 | CJ | 10 | CJ |
-| 13 | , | _ | $, | $, | _ | 14 | PUNC | 14 | PUNC |
-| 14 | machen | _ | VVFIN | VVFIN | _ | 0 | ROOT | 0 | ROOT |
-| 15 | ihnen | _ | PPER | PPER | _ | 18 | DA | 18 | DA |
-| 16 | besonders | _ | ADV | ADV | _ | 18 | MO | 18 | MO |
-| 17 | zu | _ | PTKZU | PTKZU | _ | 18 | PM | 18 | PM |
-| 18 | schaffen | _ | VVINF | VVINF | _ | 14 | OC | 14 | OC |
-| 19 | . | _ | $. | $. | _ | 14 | PUNC | 14 | PUNC |
-The first sentence of the CoNLL 2009 training data:
-| 1 | `` | _ | `` | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ |
-| 2 | Ross | Ross | Roß | NE | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | _ | 3 | 3 | PNC | PNC | _ | _ |
-| 3 | Perot | Perot | Perot | NE | NE | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | _ | 4 | 4 | SB | SB | _ | _ |
-| 4 | wäre | sein | sein | VAFIN | VAFIN | 3<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Past<nowiki>|</nowiki>Subj | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Past<nowiki>|</nowiki>Subj | 0 | 0 | ROOT | ROOT | _ | _ |
-| 5 | vielleicht | vielleicht | vielleicht | ADV | ADV | _ | _ | 4 | 4 | MO | MO | _ | _ |
-| 6 | ein | ein | ein | ART | ART | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | 8 | 8 | NK | NK | _ | _ |
-| 7 | prächtiger | prächtig | prächtig | ADJA | ADJA | Pos<nowiki>|</nowiki>Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>*<nowiki>|</nowiki>* | 8 | 8 | NK | NK | _ | _ |
-| 8 | Diktator | Diktator | Diktator | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | 4 | 4 | PD | PD | _ | _ |
-| 9 | <nowiki>''</nowiki> | _ | <nowiki>''</nowiki> | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ |
-The first sentence of the CoNLL 2009 development data:
-| 1 | Maschinenbau | Maschinenbau | Maschinenbau | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | 0 | 4 | ROOT | NK | _ | _ |
-| 2 | / | _ | / | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ |
-| 3 | ( | _ | ( | $( | $( | _ | _ | 0 | 4 | PUNC | PUNC | _ | _ |
-| 4 | Zusammenfassung | Zusammenfassung | Zusammenfassung | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | 0 | 0 | ROOT | ROOT | _ | _ |
-| 5 | ) | _ | ) | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ |
-The first sentence of the CoNLL 2009 test data:
-| 1 | Gegen | gegen | gegen | APPR | APPR | _ | _ | _ | _ | _ | _ | _ |
-| 2 | eine | ein | ein | ART | ART | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ |
-| 3 | Erweiterung | Erweiterung | Erweiterung | NN | NN | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ |
-| 4 | ihrer | ihr | ihr | PPOSAT | PPOSAT | Gen<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 5 | Organisation | Organisation | Organisation | NN | NN | Gen<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ |
-| 6 | zu | zu | zu | APPR | APPR | _ | _ | _ | _ | _ | _ | _ |
-| 7 | einem | ein | ein | ART | ART | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 8 | sicherheitspolitischen | sicherheitspolitisch | sicherheitspolitisch | ADJA | ADJA | Pos<nowiki>|</nowiki>Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Pos<nowiki>|</nowiki>*<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 9 | Forum | Forum | Forum | NN | NN | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ |
-| 10 | sprachen | sprechen | sprechen | VVFIN | VVFIN | 3<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Past<nowiki>|</nowiki>Ind | *<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Past<nowiki>|</nowiki>Ind | _ | _ | _ | _ | Y |
-| 11 | sich | sich | er<nowiki>|</nowiki>es<nowiki>|</nowiki>sie<nowiki>|</nowiki>Sie | PRF | PRF | 3<nowiki>|</nowiki>Acc<nowiki>|</nowiki>Pl | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 12 | die | der | d | ART | ART | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 13 | meisten | meister | meist | PIAT | PIAT | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 14 | Staaten | Staat | Staat | NN | NN | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | _ | _ | _ | _ | _ |
-| 15 | beim | bei | beim | APPRART | APPRART | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 16 | Gipfeltreffen | Gipfeltreffen | Gipfeltreffen | NN | NN | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>*<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ |
-| 17 | für | für | für | APPR | APPR | _ | _ | _ | _ | _ | _ | _ |
-| 18 | Asiatisch-Pazifische | asiatisch-pazifisch | Asiatisch-Pazifische | ADJA | NN | Pos<nowiki>|</nowiki>Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ |
-| 19 | Wirtschaftskooperation | Wirtschaftskooperation | Wirtschaftskooperation | NN | NN | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ |
-| 20 | ( | _ | ( | $( | $( | _ | _ | _ | _ | _ | _ | _ |
-| 21 | Apec | Apec | _ | NE | NE | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | _ |
-| 22 | ) | _ | ) | $( | $( | _ | _ | _ | _ | _ | _ | _ |
-| 23 | in | in | in | APPR | APPR | _ | _ | _ | _ | _ | _ | _ |
-| 24 | Osaka | Osaka | Osaka | NE | NE | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ |
-| 25 | aus | aus | aus | PTKVZ | PTKVZ | _ | _ | _ | _ | _ | _ | _ |
-| 26 | . | _ | . | $. | $. | _ | _ | _ | _ | _ | _ | _ |
 ==== Parsing ====
-TIGER is a mildly nonprojective treebank. 15875 of the 680,710 tokens in the CoNLL 2009 training+development datasets are attached nonprojectively (2.33%).
+Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%).
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for German:
-^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 87.34 | 90.38 |
-| Riedel et al. | 86.24 | 89.76 |
-| Basis (O'Neil) | 85.36 | 89.16 |
-| Malt (Nivre et al.) | 85.82 | 88.76 |
-The results of the CoNLL 2009 shared task are [[http://ufal.mff.cuni.cz/conll2009-st/results/results.php|available online]]. They have been published in [[http://aclweb.org/anthology/W/W09/W09-1201.pdf|(Hajič et al., 2009)]]. Unlabeled attachment score was not published. These are the best results for German:
-^ Parser (Authors) ^ LAS ^
+There is a Constraint Grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. Moreover, I cannot rule out that the treebank described here is simply the output of that parser.
-| Bohnet | 87.48 |
-| Merlo | 87.29 |
-| Chen | 86.24 |
-| Che | 86.19 |
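For readers who want to reproduce a count like the nonprojectivity figure for EKP, here is a minimal sketch. It assumes a common definition that this page does not spell out: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. It also assumes the phrase structure has already been converted to dependencies, given as a list of head indices (1-based, 0 for the artificial root). The function names and the toy trees are hypothetical.

```python
from typing import List


def is_descendant(node: int, ancestor: int, heads: List[int]) -> bool:
    """True if `ancestor` dominates `node`; heads[i - 1] is the head of token i, 0 is the root."""
    while node != 0:
        node = heads[node - 1]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based indices of tokens whose attachment is nonprojective."""
    result: List[int] = []
    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        left, right = sorted((dependent, head))
        # Every token strictly between the dependent and its head must be
        # dominated by the head; otherwise the attachment is nonprojective.
        for between in range(left + 1, right):
            if not is_descendant(between, head, heads):
                result.append(dependent)
                break
    return result


def percentage(count: int, total: int) -> float:
    """Share of `count` in `total`, in percent, rounded to three decimals."""
    return round(100.0 * count / total, 3)


if __name__ == "__main__":
    # Toy trees, not taken from the treebank.
    print(nonprojective_tokens([2, 0, 2]))     # a projective tree
    print(nonprojective_tokens([3, 0, 2, 2]))  # token 1 attaches across its grandparent
    # The figures quoted on this page: 7 of 9491 tokens.
    print(percentage(7, 9491))
```

Attachments to the artificial root are never counted as nonprojective under this definition; other conventions exist and would give slightly different counts.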
