===== Finnish (fi) =====

[[http://bionlp.utu.fi/fintreebank.html|Turku Dependency Treebank]] (TDT)

==== Versions ====

  * 23.5.2011 Downloadable from the website of the treebank

==== Obtaining and License ====

The TDT is freely [[http://bionlp.utu.fi/fintreebank-download.html|downloadable from here]] under the [[http://creativecommons.org/licenses/by-sa/3.0/|Creative Commons Attribution-Share Alike]] license. The license in short:

  * any usage, commercial or not
  * modification and redistribution permitted
  * linking to the [[http://bionlp.utu.fi/fintreebank.html|treebank website]] and citing the principal publication in publications required

TDT was created by members of the [[http://bionlp.utu.fi/|Turku BioNLP Group]], University of Turku (Turun yliopisto), 20014 Turku, Finland.

==== References ====

  * Website
    * http://bionlp.utu.fi/fintreebank.html
  * Data
    * //no separate citation//
  * Principal publications
    * Katri Haverinen, Filip Ginter, Veronika Laippala, Timo Viljanen, Tapio Salakoski: [[http://bionlp.utu.fi/sites/default/files/haverinen-et-al-2009.pdf|Dependency Annotation of Wikipedia: First Steps Towards a Finnish Treebank]]. In: Proceedings of The Eighth International Workshop on Treebanks and Linguistic Theories (TLT8). Milano, Italy, 2009.
    * Katri Haverinen, Timo Viljanen, Veronika Laippala, Samuel Kohonen, Filip Ginter, Tapio Salakoski: [[http://dspace.utlib.ee/dspace/handle/10062/15936|Treebanking Finnish]]. In: Proceedings of The Ninth International Workshop on Treebanks and Linguistic Theories (TLT9), pp. 79-90. Tartu, Estonia, 2010.
  * Documentation
    * The file FILE-FORMAT.txt in the distribution
    * [[http://www2.lingsoft.fi/doc/fintwol/intro/tags.html|Partial list of part-of-speech tags with descriptions]] (POS tagging has been done by www.lingsoft.fi)

==== Domain ====

Mixed (Wikipedia, Wikinews, university web-magazine and blogs).

==== Size ====

TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average.
No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing.

==== Inside ====

The native file format of the treebank is based on XML. Besides that, TDT is also distributed in the [[:format-conll|CoNLL-X format]]. The part-of-speech tag and the morphosyntactic features are joined into one feature string, which is copied into both the CPOS and the POS columns of the CoNLL format. The FEAT column is empty (i.e. it contains the underscore character). Lemmas are available, too.

Morphological annotation and disambiguation are automatic; they are not gold standard. The native XML format shows all morphological readings of every word based on the lexicon, and disambiguation is left to the user.

==== Sample ====

The first two sentences of the corpus in its native XML format:

The same two sentences in the CoNLL format:

| # b101.d.xml/1 ||||||||||
| 1 | Kävelyreitti | kävely|reitti | NOM|up|SG|N | NOM|up|SG|N | _ | 0 | ROOT | _ | _ |
| 2 | III | III | roman|NOM|up|SG|ABBR | roman|NOM|up|SG|ABBR | _ | 1 | num | _ | _ |
| ||||||||||
| # b101.d.xml/2 ||||||||||
| 1 | Jäällä | jää | ADE|SG|up|N | ADE|SG|up|N | _ | 2 | nommod | _ | _ |
| 2 | kävely | kävely | DV-U|NOM|SG|N | DV-U|NOM|SG|N | _ | 3 | nsubj | _ | _ |
| 3 | avaa | avata | SG3|ACT|PRES|V | SG3|ACT|PRES|V | _ | 0 | ROOT | _ | _ |
| 4 | aina | aina | ADV | ADV | _ | 3 | advmod | _ | _ |
| 5 | hauskoja | hauska | A|PTV|POS|PL | A|PTV|POS|PL | _ | 8 | amod | _ | _ |
| 6 | ja | ja | C|COORD | C|COORD | _ | 5 | cc | _ | _ |
| 7 | erikoisia | erikoinen | A|PTV|POS|PL | A|PTV|POS|PL | _ | 5 | conj | _ | _ |
| 8 | näkökulmia | näkö|kulma | PTV|PL|N | PTV|PL|N | _ | 3 | dobj | _ | _ |
| 9 | kaupunkiin | kaupunki | ILL|SG|N | ILL|SG|N | _ | 8 | nommod | _ | _ |
| 10 | . | . | PUNCT | PUNCT | _ | 3 | punct | _ | _ |

==== Parsing ====

Nonprojectivities in TDT are rare.
Only 299 out of the 58576 tokens are attached nonprojectively (0.51%). I am not aware of any published evaluation of Finnish parsing accuracy.
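
A count of nonprojectively attached tokens can be reproduced along the following lines. This is only a minimal sketch, not the script used for the figures above. It assumes the common definition (a token is attached nonprojectively if some token lying between it and its head is not dominated by that head), tab-separated CoNLL-X columns with blank lines between sentences, and sentence-id lines starting with ''#''. The function names ''read_conll_heads'' and ''count_nonprojective'' are hypothetical.

```python
def read_conll_heads(lines):
    """Yield one {token_id: head_id} dict per sentence from CoNLL-X lines."""
    heads = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if heads:
                yield heads
                heads = {}
            continue
        if line.startswith("#"):
            # Assumption: sentence-id comment lines, if present, start with '#'.
            continue
        cols = line.split("\t")
        heads[int(cols[0])] = int(cols[6])  # ID and HEAD columns
    if heads:
        yield heads


def count_nonprojective(heads):
    """Count tokens whose edge to their head is nonprojective.

    heads maps token id -> head id; 0 is the artificial root.
    """
    def dominated(node, ancestor):
        # Climb from node towards the root, looking for ancestor.
        seen = set()
        while node != 0 and node not in seen:
            if node == ancestor:
                return True
            seen.add(node)
            node = heads[node]
        # Reaching the root counts only if the root is the ancestor.
        return node == 0 and ancestor == 0

    count = 0
    for dep, head in heads.items():
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must hang under head.
        if any(not dominated(k, head) for k in range(lo + 1, hi)):
            count += 1
    return count


# Usage on the first sample sentence (columns as in the sample above).
sample = [
    "1\tKävelyreitti\tkävely|reitti\tNOM|up|SG|N\tNOM|up|SG|N\t_\t0\tROOT\t_\t_",
    "2\tIII\tIII\troman|NOM|up|SG|ABBR\troman|NOM|up|SG|ABBR\t_\t1\tnum\t_\t_",
]
total = nonproj = 0
for sentence in read_conll_heads(sample):
    total += len(sentence)
    nonproj += count_nonprojective(sentence)
print(nonproj, "of", total, "tokens attached nonprojectively")
```

To run it on the whole treebank, pass an open file handle over the CoNLL file instead of ''sample''. Whether the result matches the 299 tokens reported above depends on whether the same definition of nonprojectivity was used.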