Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:fi [2011/12/05 13:38]
zeman created
+++ user:zeman:treebanks:fi [2011/12/05 14:46]
zeman Sample.
@@ Line 20: / Line 20: @@
   * Website
-    * http://vvv.cs.ut.ee/~kaili/Korpus/puud/ ([[http://translate.google.cz/translate?sl=et&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fvvv.cs.ut.ee%2F~kaili%2FKorpus%2Fpuud%2F&act=url|Google translate]])
+    * http://bionlp.utu.fi/fintreebank.html
   * Data
     * //no separate citation//
   * Principal publications
-    * Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo: [[https://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/RANLP/papers/p16.pdf|A New Language for Constraint Grammar: Estonian]]. In: International Conference Recent Advances in Natural Language Processing. Proceedings, pp. 304-310, Borovets, Bulgaria, 2003.
+    * Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski: [[http://bionlp.utu.fi/sites/default/files/haverinen-et-al-2009.pdf|Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank]]. In: Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8). Milano, Italy, 2009.
+    * Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, Tapio Salakoski: [[http://dspace.utlib.ee/dspace/handle/10062/15936|Treebanking Finnish]]. In: Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. Tartu, Estonia, 2010.
   * Documentation
-    * [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|File formats]]
+    * The file FILE-FORMAT.txt in the distribution
-    * The header of the TIGER-XML version of the treebank contains lists of various sorts of tags with brief explanation.
+    * [[http://www2.lingsoft.fi/doc/fintwol/intro/tags.html|Partial list of part-of-speech tags with descriptions]] (POS tagging has been done by www.lingsoft.fi)
 ==== Domain ====
-Mixed:
+Mixed (Wikipedia, Wikinews, university web-magazine and blogs).
-  * 388 tailored sentences with movement verbs
-  * 732 sentences with movement verbs from the Estonian FrameNet corpus
-  * 175 sentences from the Arborest corpus
-  * 20 sentences of spoken language
 ==== Size ====
-All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. This is the case of our HamleDT experimental data split, shown in the last two rows of the table.
+TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90&nbsp;% (53151 tokens / 3877 sentences) for training and the remaining 10&nbsp;% (5425 tokens / 430 sentences) for testing.
-^ File ^ Sentences ^ Terminals ^ Average t/s ^
-| arborest.xml |  175 |  2451 |  14.01 |
-| piialaused.xml |  732 |  4505 |  6.15 |
-| ratsepalaused.xml |  388 |  2348 |  6.05 |
-| sul.xml |  20 |  187 |  9.35 |
-| **total** |  **1315** |  **9491** |  **7.22** |
-| training |  1184 |  8535 |  7.21 |
-| test |  131 |  956 |  7.30 |
 ==== Inside ====
@@ Line 60: / Line 48: @@
 ==== Sample ====
-The first sentence of the corpus in the TIGER-XML format:
+The first two sentences of the corpus in its native XML format:
+<code xml><treeset name="http://ranneliike.net/blogi.php?nick=Aboa Kirjoitettu: 02.02.2010, 15:41:06">
+  <sentence txt="Kävelyreitti III">
+    <token charOff="0-12">
+      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
+    </token>
+    <token charOff="13-16">
+      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
+      <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" />
+      <posreading CG="true" baseform="iii" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
+    </token>
+    <dep dep="1" gov="0" type="num" />
+  </sentence>
+  <sentence txt="Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin.">
+    <token charOff="0-6">
+      <posreading CG="true" baseform="jää" rawtags="N ADE SG &lt;up&gt;" />
+    </token>
+    <token charOff="7-13">
+      <posreading CG="true" baseform="kävely" rawtags="DV-U N NOM SG" />
+    </token>
+    <token charOff="14-18">
+      <posreading CG="true" baseform="avata" rawtags="V PRES ACT SG3" />
+      <posreading CG="false" baseform="avata" rawtags="V PRES ACT NEG" />
+      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT SG2" />
+      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT NEG" />
+    </token>
+    <token charOff="19-23">
+      <posreading CG="true" baseform="aina" rawtags="ADV" />
+    </token>
+    <token charOff="24-32">
+      <posreading CG="true" baseform="hauska" rawtags="A POS PTV PL" />
+    </token>
+    <token charOff="33-35">
+      <posreading CG="true" baseform="ja" rawtags="COORD C" />
+    </token>
+    <token charOff="36-45">
+      <posreading CG="true" baseform="erikoinen" rawtags="A POS PTV PL" />
+    </token>
+    <token charOff="46-56">
+      <posreading CG="true" baseform="näkö#kulma" rawtags="N PTV PL" />
+    </token>
+    <token charOff="57-67">
+      <posreading CG="true" baseform="kaupunki" rawtags="N ILL SG" />
+    </token>
+    <token charOff="67-68">
+      <posreading CG="true" baseform="." rawtags="PUNCT" />
+    </token>
+    <dep dep="0" gov="1" type="nommod" />
+    <dep dep="1" gov="2" type="nsubj" />
+    <dep dep="3" gov="2" type="advmod" />
+    <dep dep="7" gov="2" type="dobj" />
+    <dep dep="9" gov="2" type="punct" />
+    <dep dep="5" gov="4" type="cc" />
+    <dep dep="6" gov="4" type="conj" />
+    <dep dep="4" gov="7" type="amod" />
+    <dep dep="8" gov="7" type="nommod" />
+  </sentence></code>
-<code xml><s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama">
+The same two sentences in the CoNLL format:
-	<graph root="ratsep-13_501">
-		<terminals>
-			<t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/>
-			<t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/>
-			<t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/>
-			<t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/>
-			<t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/>
-			<t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/>
-			<t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/>
-		</terminals>
-		<nonterminals>
+| # b101.d.xml/1 ||||||||||
-			<nt id="ratsep-13_501" cat="VROOT">
+| 1 | Kävelyreitti | kävely<nowiki>|</nowiki>reitti | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 0 | ROOT | _ | _ |
-				<edge label="STA" idref="ratsep-13_502"/>
+| 2 | III | III | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | _ | 1 | num | _ | _ |
-			</nt>
+| ||||||||||
-			<nt id="ratsep-13_502" cat="fcl">
+| # b101.d.xml/2 ||||||||||
-				<edge label="S" idref="ratsep-13_1"/>
+| 1 | Jäällä | jää | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | _ | 2 | nommod | _ | _ |
-				<edge label="P" idref="ratsep-13_2"/>
+| 2 | kävely | kävely | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 3 | nsubj | _ | _ |
-				<edge label="A" idref="ratsep-13_503"/>
+| 3 | avaa | avata | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | _ | 0 | ROOT | _ | _ |
-				<edge label="A" idref="ratsep-13_5"/>
+| 4 | aina | aina | ADV | ADV | _ | 3 | advmod | _ | _ |
-				<edge label="A" idref="ratsep-13_6"/>
+| 5 | hauskoja | hauska | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 8 | amod | _ | _ |
-				<edge label="FST" idref="ratsep-13_7"/>
+| 6 | ja | ja | C<nowiki>|</nowiki>COORD | C<nowiki>|</nowiki>COORD | _ | 5 | cc | _ | _ |
-			</nt>
+| 7 | erikoisia | erikoinen | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 5 | conj | _ | _ |
-			<nt id="ratsep-13_503" cat="pp">
+| 8 | näkökulmia | näkö<nowiki>|</nowiki>kulma | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | _ | 3 | dobj | _ | _ |
-				<edge label="H" idref="ratsep-13_3"/>
+| 9 | kaupunkiin | kaupunki | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 8 | nommod | _ | _ |
-				<edge label="D" idref="ratsep-13_4"/>
+| 10 | . | . | PUNCT | PUNCT | _ | 3 | punct | _ | _ |
-			</nt>
-		</nonterminals>
-	</graph>
-</s></code>
 ==== Parsing ====
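
The correspondence between the two sample formats above can be sketched in Python. This is a minimal illustration inferred from the two samples only, not the converter actually used for HamleDT. The names `SAMPLE`, `sentence_to_rows` and `treeset_to_rows` are hypothetical, the `treeset` name is shortened, and a closing `</treeset>` is added because the excerpt on this page is truncated. The sketch assumes that `charOff` is an end-exclusive character range into `txt`, that `dep` and `gov` are 0-based token indices, that a token without an incoming `dep` element is the root, and that the first reading marked `CG="true"` is the one to keep. It does not reproduce the tag order seen in the CoNLL sample.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the first sample sentence; </treeset> added for well-formedness.
SAMPLE = """<treeset name="sample">
  <sentence txt="Kävelyreitti III">
    <token charOff="0-12">
      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
    </token>
    <token charOff="13-16">
      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
      <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" />
    </token>
    <dep dep="1" gov="0" type="num" />
  </sentence>
</treeset>"""


def sentence_to_rows(sentence):
    """Return (id, form, lemma, tags, head, deprel) tuples for one <sentence>."""
    text = sentence.get("txt")
    # Map 0-based dependent index -> (1-based head id, relation label).
    heads = {}
    for dep in sentence.findall("dep"):
        heads[int(dep.get("dep"))] = (int(dep.get("gov")) + 1, dep.get("type"))
    rows = []
    for index, token in enumerate(sentence.findall("token")):
        start, end = (int(x) for x in token.get("charOff").split("-"))
        readings = token.findall("posreading")
        # Assumption: keep the first reading flagged CG="true" (needs >= 1 reading).
        chosen = next((r for r in readings if r.get("CG") == "true"), readings[0])
        # Compound boundaries "#" appear as "|" in the CoNLL sample.
        lemma = chosen.get("baseform").replace("#", "|")
        # Angle brackets around tags such as <up> are dropped in the CoNLL sample.
        tags = "|".join(t.strip("<>") for t in chosen.get("rawtags").split())
        # Tokens with no governing <dep> are treated as the root (head 0).
        head, label = heads.get(index, (0, "ROOT"))
        rows.append((index + 1, text[start:end], lemma, tags, head, label))
    return rows


def treeset_to_rows(xml_text):
    """Parse a <treeset> document and convert every sentence."""
    root = ET.fromstring(xml_text)
    return [sentence_to_rows(s) for s in root.findall("sentence")]


if __name__ == "__main__":
    for sentence_rows in treeset_to_rows(SAMPLE):
        for row in sentence_rows:
            print("\t".join(str(field) for field in row))
        print()
```

Shifting the indices by one is what frees the value 0 for the artificial root, which matches the `0 | ROOT` rows in the CoNLL sample.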


Institute of Formal and Applied Linguistics Wiki
