===== Swedish (sv) =====

[[http://stp.lingfil.uu.se/~nivre/research/Talbanken05.html|Talbanken05]]

==== Versions ====

  * Talbanken76 (1976, Lund University)
  * MAMBA (original syntactic and lexical annotation)
  * FPS (flat phrase structure annotation, TIGER-XML)
  * DPS (deepened phrase structure annotation, TIGER-XML)
  * Dep (dependency structure annotation) in Malt-XML
  * CoNLL 2006 conversion of Dep

==== Obtaining and License ====

Talbanken is freely available for download at http://w3.msi.vxu.se/~nivre/research/Talbanken05_1.1.tar.gz. The license in short:

  * use for research and educational purposes
  * credit the work done to produce the treebank
  * cite the principal publication in any publications
  * redistribution is not discussed; it might be permitted under the same license terms

Talbanken was created by members of [[http://www.lu.se/|Lund University]] and Växjö University (today [[http://lnu.se/|Linnaeus University]]).

==== References ====

  * Website
    * http://w3.msi.vxu.se/~nivre/research/Talbanken05.html
  * Data
    * Jan Einarsson, Tor G. Hultman, Nils Jörgensen, Ulf Teleman, Margareta Westman: Talbanken76 corpus
    * Jens Nilsson, Johan Hall, Joakim Nivre: conversion of Talbanken76 to Talbanken05
  * Principal publications
    * Joakim Nivre, Jens Nilsson, Johan Hall: [[http://w3.msi.vxu.se/~nivre/papers/talbanken05.pdf|Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation]]. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), May 24-26, Genoa, Italy, 2006.
    * Jens Nilsson, Johan Hall, Joakim Nivre: [[http://w3.msi.vxu.se/~nivre/papers/treebank_nodalida05.pdf|MAMBA Meets TIGER: Reconstructing a Swedish Treebank from Antiquity]]. In: Proceedings of the NODALIDA Special Session on Treebanks. 2005.
  * Documentation
    * Jens Nilsson, Johan Hall: [[http://w3.msi.vxu.se/~jni/papers/msi_report05067.pdf|Reconstruction of the Swedish Treebank Talbanken]]. MSI report 05067. Växjö University: School of Mathematics and Systems Engineering. ISSN 1650-2647. Växjö, Sweden, 2005.
    * Daniel Zeman: [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset interpretation of the Mamba tagset]]. Prague, Czechia, 2007.

==== Domain ====

Mixed: “The text material consists of two sections, the so-called professional prose section (P), with data from textbooks, brochures, newspapers, etc., and a collection of high school students' essays (G).”

==== Size ====

The CoNLL 2006 version contains 197,123 tokens in 11,431 sentences, yielding 17.24 tokens per sentence on average (CoNLL 2006 data split: 191,467 tokens / 11,042 sentences training, 5,656 tokens / 389 sentences test).

==== Inside ====

The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags were probably assigned manually. The tagset is very coarse: there are no morphological features, only the part of speech. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset]] to inspect the tagset.

==== Sample ====

The first sentence of the CoNLL 2006 training data:

^ ID ^ FORM ^ LEMMA ^ CPOSTAG ^ POSTAG ^ FEATS ^ HEAD ^ DEPREL ^ PHEAD ^ PDEPREL ^
| 1 | Äktenskapet | _ | NN | NN | _ | 4 | SS | _ | _ |
| 2 | och | _ | ++ | ++ | _ | 3 | ++ | _ | _ |
| 3 | familjen | _ | NN | NN | _ | 1 | CC | _ | _ |
| 4 | är | _ | AV | AV | _ | 0 | ROOT | _ | _ |
| 5 | en | _ | EN | EN | _ | 7 | DT | _ | _ |
| 6 | gammal | _ | AJ | AJ | _ | 7 | AT | _ | _ |
| 7 | institution | _ | NN | NN | _ | 4 | SP | _ | _ |
| 8 | , | _ | IK | IK | _ | 7 | IK | _ | _ |
| 9 | som | _ | PO | PO | _ | 10 | SS | _ | _ |
| 10 | funnits | _ | VV | VV | _ | 7 | ET | _ | _ |
| 11 | sedan | _ | PR | PR | _ | 10 | TA | _ | _ |
| 12 | 1800-talet | _ | NN | NN | _ | 11 | PA | _ | _ |
| 13 | . | _ | IP | IP | _ | 4 | IP | _ | _ |

The first two sentences of the CoNLL 2006 test data:

^ ID ^ FORM ^ LEMMA ^ CPOSTAG ^ POSTAG ^ FEATS ^ HEAD ^ DEPREL ^ PHEAD ^ PDEPREL ^
| 1 | Individuell | _ | AJ | AJ | _ | 2 | AT | _ | _ |
| 2 | beskattning | _ | VN | VN | _ | 0 | ROOT | _ | _ |
| 3 | av | _ | PR | PR | _ | 2 | ET | _ | _ |
| 4 | arbetsinkomster | _ | NN | NN | _ | 3 | PA | _ | _ |
| ||||||||||
| 1 | Genom | _ | PR | PR | _ | 3 | AA | _ | _ |
| 2 | skattereformen | _ | NN | NN | _ | 1 | PA | _ | _ |
| 3 | införs | _ | VV | VV | _ | 0 | ROOT | _ | _ |
| 4 | individuell | _ | AJ | AJ | _ | 5 | AT | _ | _ |
| 5 | beskattning | _ | VN | VN | _ | 3 | SS | _ | _ |
| 6 | ( | _ | IR | IR | _ | 5 | IR | _ | _ |
| 7 | särbeskattning | _ | VN | VN | _ | 5 | AN | _ | _ |
| 8 | ) | _ | IR | IR | _ | 5 | JR | _ | _ |
| 9 | av | _ | PR | PR | _ | 5 | ET | _ | _ |
| 10 | arbetsinkomster | _ | NN | NN | _ | 9 | PA | _ | _ |
| 11 | . | _ | IP | IP | _ | 3 | IP | _ | _ |

==== Parsing ====

Nonprojective attachments are infrequent in Talbanken. Only 1,928 of the 197,123 tokens in the CoNLL 2006 version are attached nonprojectively (0.98%).

The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Swedish, sorted by UAS:

^ Parser (Authors) ^ LAS ^ UAS ^
| Microsoft (Corston-Oliver and Aue) | 79.69 | 89.54 |
| Malt (Nivre et al.) | 84.58 | 89.50 |
| Illinois (Do and Chang) | 82.31 | 89.05 |
| MST (McDonald et al.) | 82.55 | 88.93 |
| Kenji Sagae | 82.00 | 88.57 |
| Nara (Yuchang Cheng) | 81.08 | 88.57 |
| Basis (John O'Neil) | 81.78 | 88.45 |
| Riedel et al. | 80.66 | 88.33 |
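
The two measurements mentioned in this section can be sketched in Python. The block below is only an illustration, not the official CoNLL 2006 scorer. It assumes that a token counts as punctuation when every character of its word form belongs to a Unicode punctuation category, and it treats an attachment as nonprojective when some token lying between the dependent and its head is not a descendant of that head. ''GOLD'' is the second test sentence from the sample above; ''PREDICTED'' is a hypothetical parser output with two deliberate errors, and all function names are made up for this example.

```python
import unicodedata

# Second test sentence from the Sample section: (id, form, head, deprel).
GOLD = [
    (1, "Genom", 3, "AA"),
    (2, "skattereformen", 1, "PA"),
    (3, "införs", 0, "ROOT"),
    (4, "individuell", 5, "AT"),
    (5, "beskattning", 3, "SS"),
    (6, "(", 5, "IR"),
    (7, "särbeskattning", 5, "AN"),
    (8, ")", 5, "JR"),
    (9, "av", 5, "ET"),
    (10, "arbetsinkomster", 9, "PA"),
    (11, ".", 3, "IP"),
]

# Hypothetical parser output (not real system output) with two errors.
PREDICTED = list(GOLD)
PREDICTED[6] = (7, "särbeskattning", 5, "ET")  # right head, wrong label
PREDICTED[8] = (9, "av", 7, "ET")              # wrong head


def is_punctuation(form):
    """Assumption: punctuation = every character is in a Unicode P* category."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent, skipping punctuation tokens."""
    scored = las_hits = uas_hits = 0
    for (_, form, g_head, g_rel), (_, _, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


def is_nonprojective(heads, dep):
    """True if a token between dep and its head is not dominated by that head.

    heads maps token id -> head id (0 is the artificial root); the input is
    assumed to be a well-formed tree without cycles.
    """
    head = heads[dep]
    low, high = sorted((head, dep))
    for between in range(low + 1, high):
        ancestor = between
        # Climb towards the root until we meet the head or run out of tree.
        while ancestor != 0 and ancestor != head:
            ancestor = heads[ancestor]
        if ancestor != head:
            return True
    return False


def count_nonprojective(sentence):
    """Count tokens whose attachment to their head is nonprojective."""
    heads = {tok_id: head for tok_id, _, head, _ in sentence}
    return sum(is_nonprojective(heads, tok_id) for tok_id in heads)


if __name__ == "__main__":
    print(attachment_scores(GOLD, PREDICTED))  # (75.0, 87.5)
    print(count_nonprojective(GOLD))           # 0
```

In this sentence, eight of the eleven tokens are scored, because the two parentheses and the full stop are skipped. The hypothetical output therefore receives LAS 75.0 and UAS 87.5, and none of the gold attachments is nonprojective.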