Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fi [2011/12/05 13:38]
zeman created
user:zeman:treebanks:fi [2011/12/05 14:46]
zeman Sample.
Line 20: Line 20:
  
   * Website   * Website
-    * http://vvv.cs.ut.ee/~kaili/Korpus/puud/ ([[http://translate.google.cz/translate?sl=et&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fvvv.cs.ut.ee%2F~kaili%2FKorpus%2Fpuud%2F&act=url|Google translate]]) +    * http://bionlp.utu.fi/fintreebank.html
   * Data   * Data
     * //no separate citation//     * //no separate citation//
   * Principal publications   * Principal publications
-    * Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo: [[https://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/RANLP/papers/p16.pdf|A New Language for Constraint Grammar: Estonian]]. In: International Conference Recent Advances in Natural Language Processing. Proceedings, pp. 304-310, Borovets, Bulgaria, 2003. +    * Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski: [[http://bionlp.utu.fi/sites/default/files/haverinen-et-al-2009.pdf|Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank]]. In: Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8), Milano, Italy, 2009.
 +    * Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, Tapio Salakoski: [[http://dspace.utlib.ee/dspace/handle/10062/15936|Treebanking Finnish]]. In: Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. Tartu, Estonia, 2010.
   * Documentation   * Documentation
-    * [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|File formats]] +    * The file FILE-FORMAT.txt in the distribution 
-    * The header of the TIGER-XML version of the treebank contains lists of various sorts of tags with brief explanation.+    * [[http://www2.lingsoft.fi/doc/fintwol/intro/tags.html|Partial list of part-of-speech tags with descriptions]] (POS tagging has been done by www.lingsoft.fi)
  
 ==== Domain ==== ==== Domain ====
  
-Mixed +Mixed (Wikipedia, Wikinews, university web-magazine and blogs).
-  * 388 tailored sentences with movement verbs +
-  * 732 sentences with movement verbs from the Estonian FrameNet corpus +
-  * 175 sentences from the Arborest corpus +
-  * 20 sentences of spoken language+
  
 ==== Size ==== ==== Size ====
  
-All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. This is the case of our HamleDT experimental data split, shown in the last two rows of the table. +TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing.
- +
-^ File ^ Sentences ^ Terminals ^ Average t/s ^ +
-| arborest.xml |  175 |  2451 |  14.01 | +
-| piialaused.xml |  732 |  4505 |  6.15 | +
-| ratsepalaused.xml |  388 |  2348 |  6.05 | +
-| sul.xml |  20 |  187 |  9.35 | +
-| **total** |  **1315** |  **9491** |  **7.22** | +
-| training |  1184 |  8535 |  7.21 | +
-| test |  131 |  956 |  7.30 |+
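The TDT figures quoted in the new revision can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check of the published counts; the rule "round the 90 % cut upwards" is an assumption that happens to reproduce the 3877-sentence training part, not a documented description of how the split was made.

```python
# Counts copied from the Size section (new revision, TDT).
TOTAL_TOKENS, TOTAL_SENTENCES = 58576, 4307
TRAIN_TOKENS, TRAIN_SENTENCES = 53151, 3877
TEST_TOKENS, TEST_SENTENCES = 5425, 430

# Average sentence length, formatted to two decimals as in the text.
average = f"{TOTAL_TOKENS / TOTAL_SENTENCES:.2f}"

# Assumption: the cut point is 90 % of the sentences, rounded up.
# Integer arithmetic avoids floating-point rounding surprises.
cut = -(-TOTAL_SENTENCES * 9 // 10)

# The two parts must add up to the whole treebank.
parts_consistent = (
    TRAIN_TOKENS + TEST_TOKENS == TOTAL_TOKENS
    and TRAIN_SENTENCES + TEST_SENTENCES == TOTAL_SENTENCES
)

print(average, cut, parts_consistent)
```

With a list of sentences in corpus order, the same cut would be applied as `sentences[:cut]` for training and `sentences[cut:]` for testing.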
  
 ==== Inside ==== ==== Inside ====
Line 60: Line 48:
 ==== Sample ==== ==== Sample ====
  
-The first sentence of the corpus in the TIGER-XML format:+The first two sentences of the corpus in its native XML format: 
 + 
 +<code xml><treeset name="http://ranneliike.net/blogi.php?nick=Aboa Kirjoitettu: 02.02.2010, 15:41:06"> 
 +  <sentence txt="Kävelyreitti III"> 
 +    <token charOff="0-12"> 
 +      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" /> 
 +    </token> 
 +    <token charOff="13-16"> 
 +      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" /> 
 +      <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" /> 
 +      <posreading CG="true" baseform="iii" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" /> 
 +    </token> 
 +    <dep dep="1" gov="0" type="num" /> 
 +  </sentence> 
 +  <sentence txt="Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin."> 
 +    <token charOff="0-6"> 
 +      <posreading CG="true" baseform="jää" rawtags="N ADE SG &lt;up&gt;" /> 
 +    </token> 
 +    <token charOff="7-13"> 
 +      <posreading CG="true" baseform="kävely" rawtags="DV-U N NOM SG" /> 
 +    </token> 
 +    <token charOff="14-18"> 
 +      <posreading CG="true" baseform="avata" rawtags="V PRES ACT SG3" /> 
 +      <posreading CG="false" baseform="avata" rawtags="V PRES ACT NEG" /> 
 +      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT SG2" /> 
 +      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT NEG" /> 
 +    </token> 
 +    <token charOff="19-23"> 
 +      <posreading CG="true" baseform="aina" rawtags="ADV" /> 
 +    </token> 
 +    <token charOff="24-32"> 
 +      <posreading CG="true" baseform="hauska" rawtags="A POS PTV PL" /> 
 +    </token> 
 +    <token charOff="33-35"> 
 +      <posreading CG="true" baseform="ja" rawtags="COORD C" /> 
 +    </token> 
 +    <token charOff="36-45"> 
 +      <posreading CG="true" baseform="erikoinen" rawtags="A POS PTV PL" /> 
 +    </token> 
 +    <token charOff="46-56"> 
 +      <posreading CG="true" baseform="näkö#kulma" rawtags="N PTV PL" /> 
 +    </token> 
 +    <token charOff="57-67"> 
 +      <posreading CG="true" baseform="kaupunki" rawtags="N ILL SG" /> 
 +    </token> 
 +    <token charOff="67-68"> 
 +      <posreading CG="true" baseform="." rawtags="PUNCT" /> 
 +    </token> 
 +    <dep dep="0" gov="1" type="nommod" /> 
 +    <dep dep="1" gov="2" type="nsubj" /> 
 +    <dep dep="3" gov="2" type="advmod" /> 
 +    <dep dep="7" gov="2" type="dobj" /> 
 +    <dep dep="9" gov="2" type="punct" /> 
 +    <dep dep="5" gov="4" type="cc" /> 
 +    <dep dep="6" gov="4" type="conj" /> 
 +    <dep dep="4" gov="7" type="amod" /> 
 +    <dep dep="8" gov="7" type="nommod" /> 
 +  </sentence></code>
  
-<code xml><s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama"> +The same two sentences in the CoNLL format:
- <graph root="ratsep-13_501"> +
- <terminals> +
- <t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/> +
- <t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/> +
- <t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/> +
- <t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/> +
- <t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/> +
- <t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/> +
- <t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/> +
- </terminals>+
  
- <nonterminals> +| # b101.d.xml/1 |||||||||| 
- <nt id="ratsep-13_501" cat="VROOT"> +| 1 | Kävelyreitti | kävely<nowiki>|</nowiki>reitti | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 0 | ROOT | _ | _ | 
- <edge label="STA" idref="ratsep-13_502"/> +| 2 | III | III | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | _ | 1 | num | _ | _ | 
- </nt> +| ||||||||||
- <nt id="ratsep-13_502" cat="fcl"> +| # b101.d.xml/2 ||||||||||
- <edge label="S" idref="ratsep-13_1"/> +| 1 | Jäällä | jää | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | _ | 2 | nommod | _ | _ | 
- <edge label="P" idref="ratsep-13_2"/> +| 2 | kävely | kävely | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 3 | nsubj | _ | _ | 
- <edge label="A" idref="ratsep-13_503"/> +| 3 | avaa | avata | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | _ | 0 | ROOT | _ | _ | 
- <edge label="A" idref="ratsep-13_5"/> +| 4 | aina | aina | ADV | ADV | _ | 3 | advmod | _ | _ | 
- <edge label="A" idref="ratsep-13_6"/> +| 5 | hauskoja | hauska | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 8 | amod | _ | _ | 
- <edge label="FST" idref="ratsep-13_7"/> +| 6 | ja | ja | C<nowiki>|</nowiki>COORD | C<nowiki>|</nowiki>COORD | _ | 5 | cc | _ | _ | 
- </nt> +| 7 | erikoisia | erikoinen | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 5 | conj | _ | _ |
- <nt id="ratsep-13_503" cat="pp"> +| 8 | näkökulmia | näkö<nowiki>|</nowiki>kulma | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | _ | 3 | dobj | _ | _ | 
- <edge label="H" idref="ratsep-13_3"/> +| 9 | kaupunkiin | kaupunki | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 8 | nommod | _ | _ | 
- <edge label="D" idref="ratsep-13_4"/> +| 10 | . | . | PUNCT | PUNCT | _ | 3 | punct | _ | _ |
- </nt> +
- </nonterminals> +
- </graph> +
-</s></code>+
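The two samples in the new revision use different indexing. In the native XML, tokens are numbered from 0 and each `dep` element names a dependent (`dep`) and its governor (`gov`). In the CoNLL table, tokens are numbered from 1 and a token with no incoming dependency gets head 0 and the label ROOT. The following is a minimal sketch of that mapping, not the conversion actually used for HamleDT. It assumes that word forms can be recovered by slicing the `txt` attribute with `charOff`, that the first reading with `CG="true"` supplies the lemma, and that `#` in `baseform` corresponds to `|` in the CoNLL lemma, as the sample suggests. The function name `sentence_to_rows` is hypothetical.

```python
import xml.etree.ElementTree as ET

# First sentence of the sample, shortened to one reading per token.
SAMPLE = """<treeset name="sample">
  <sentence txt="Kävelyreitti III">
    <token charOff="0-12">
      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
    </token>
    <token charOff="13-16">
      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
    </token>
    <dep dep="1" gov="0" type="num" />
  </sentence>
</treeset>"""


def sentence_to_rows(sentence):
    """Turn one <sentence> element into CoNLL-like rows (hypothetical helper)."""
    txt = sentence.get("txt")
    rows = []
    for index, token in enumerate(sentence.findall("token"), start=1):
        # charOff holds "start-end" character offsets into the txt attribute.
        start, end = (int(part) for part in token.get("charOff").split("-"))
        # Assumption: the first reading marked CG="true" is the selected one.
        chosen = next(
            (r for r in token.findall("posreading") if r.get("CG") == "true"),
            None,
        )
        lemma = chosen.get("baseform").replace("#", "|") if chosen is not None else "_"
        # Default to ROOT; overwritten below if the token has a governor.
        rows.append({"id": index, "form": txt[start:end], "lemma": lemma,
                     "head": 0, "deprel": "ROOT"})
    for dep in sentence.findall("dep"):
        row = rows[int(dep.get("dep"))]         # 0-based index of the dependent
        row["head"] = int(dep.get("gov")) + 1   # shift governor to 1-based ids
        row["deprel"] = dep.get("type")
    return rows


rows = sentence_to_rows(ET.fromstring(SAMPLE).find("sentence"))
for row in rows:
    print(row["id"], row["form"], row["lemma"], row["head"], row["deprel"], sep="\t")
```

Applied to the second sample sentence, the same shift turns `<dep dep="0" gov="1" type="nommod" />` into token 1 with head 2, which agrees with the CoNLL table.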
  
 ==== Parsing ==== ==== Parsing ====
