Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:fi [2011/12/05 13:38]
zeman created
user:zeman:treebanks:fi [2011/12/05 14:37]
zeman Size.
Line 20:
  
   * Website
-    * http://vvv.cs.ut.ee/~kaili/Korpus/puud/ ([[http://translate.google.cz/translate?sl=et&tl=en&js=n&prev=_t&hl=cs&ie=UTF-8&layout=2&eotf=1&u=http%3A%2F%2Fvvv.cs.ut.ee%2F~kaili%2FKorpus%2Fpuud%2F&act=url|Google translate]])
+    * http://bionlp.utu.fi/fintreebank.html
   * Data
     * //no separate citation//
   * Principal publications
-    * Kaili Müürisep, Tiina Puolakainen, Kadri Muischnek, Mare Koit, Tiit Roosmaa, Heli Uibo: [[https://nats-www.informatik.uni-hamburg.de/intern/proceedings/2003/RANLP/papers/p16.pdf|A New Language for Constraint Grammar: Estonian]]. In: International Conference Recent Advances in Natural Language Processing. Proceedings, pp. 304-310, Borovets, Bulgaria, 2003.
+    * Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski: [[http://bionlp.utu.fi/sites/default/files/haverinen-et-al-2009.pdf|Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank]]. In: Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8), Milano, Italy, 2009.
+    * Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, Tapio Salakoski: [[http://dspace.utlib.ee/dspace/handle/10062/15936|Treebanking Finnish]]. In: Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. Tartu, Estonia, 2010.
   * Documentation
-    * [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|File formats]]
-    * The header of the TIGER-XML version of the treebank contains lists of various sorts of tags with brief explanation.
+    * The file FILE-FORMAT.txt in the distribution
+    * [[http://www2.lingsoft.fi/doc/fintwol/intro/tags.html|Partial list of part-of-speech tags with descriptions]] (POS tagging has been done by www.lingsoft.fi)
  
 ==== Domain ====
  
-Mixed
+Mixed (Wikipedia, Wikinews, university web-magazine and blogs).
-  * 388 tailored sentences with movement verbs
-  * 732 sentences with movement verbs from the Estonian FrameNet corpus
-  * 175 sentences from the Arborest corpus
-  * 20 sentences of spoken language
  
 ==== Size ====
  
-All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank. This is the case of our HamleDT experimental data split, shown in the last two rows of the table.
+TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing.
-
-^ File ^ Sentences ^ Terminals ^ Average t/s ^
-| arborest.xml |  175 |  2451 |  14.01 |
-| piialaused.xml |  732 |  4505 |  6.15 |
-| ratsepalaused.xml |  388 |  2348 |  6.05 |
-| sul.xml |  20 |  187 |  9.35 |
-| **total** |  **1315** |  **9491** |  **7.22** |
-| training |  1184 |  8535 |  7.21 |
-| test |  131 |  956 |  7.30 |
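
The figures in the new revision's Size paragraph can be cross-checked with a few lines of arithmetic. The following is only a minimal sketch: the numbers are copied from that paragraph, while the variable and function names are illustrative and not part of the treebank distribution or of HamleDT.

```python
# Figures quoted in the Size paragraph of the new revision (TDT).
total_tokens, total_sentences = 58576, 4307
train_tokens, train_sentences = 53151, 3877
test_tokens, test_sentences = 5425, 430


def tokens_per_sentence(tokens, sentences):
    """Average sentence length, rounded to two decimals as on the wiki page."""
    return round(tokens / sentences, 2)


def split_is_consistent():
    """True if the training and test parts add up to the whole treebank."""
    return (train_tokens + test_tokens == total_tokens
            and train_sentences + test_sentences == total_sentences)


average = tokens_per_sentence(total_tokens, total_sentences)
# Share of sentences that ended up in the training part, in percent.
train_share = round(100 * train_sentences / total_sentences, 1)

print(average, split_is_consistent(), train_share)
```

The check confirms that the two parts sum to the stated totals and that the training part holds roughly 90 % of the sentences.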
  
 ==== Inside ====
