===== Swedish (sv) =====
[[http://stp.lingfil.uu.se/~nivre/research/Talbanken05.html|Talbanken05]]
==== Versions ====
* Talbanken76 (1976, Lund University)
* MAMBA (original syntactic and lexical annotation)
* FPS (flat phrase structure annotation, TIGER-XML)
* DPS (deepened phrase structure annotation, TIGER-XML)
* Dep (dependency structure annotation, Malt-XML)
* CoNLL 2006 conversion of Dep
==== Obtaining and License ====
Talbanken is freely available for download at http://w3.msi.vxu.se/~nivre/research/Talbanken05_1.1.tar.gz. The license in short:
* research and educational purposes
* credit the work done to produce the treebank
* cite the principal publication in any publications
* redistribution is not addressed; it may be permitted under the same license terms
Talbanken was created by members of [[http://www.lu.se/|Lunds universitet]] and Växjö universitet (today [[http://lnu.se/|Linnéuniversitetet]]).
==== References ====
* Website
  * http://w3.msi.vxu.se/~nivre/research/Talbanken05.html
* Data
  * Jan Einarsson, Tor G. Hultman, Nils Jörgensen, Ulf Teleman, Margareta Westman: Talbanken76 corpus
  * Jens Nilsson, Johan Hall, Joakim Nivre: conversion of Talbanken76 to Talbanken05
* Principal publications
  * Joakim Nivre, Jens Nilsson, Johan Hall: [[http://w3.msi.vxu.se/~nivre/papers/talbanken05.pdf|Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation]]. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC2006), May 24-26, Genova, Italy, 2006.
  * Jens Nilsson, Johan Hall, Joakim Nivre: [[http://w3.msi.vxu.se/~nivre/papers/treebank_nodalida05.pdf|MAMBA Meets TIGER: Reconstructing a Swedish Treebank from Antiquity]]. In: Proceedings of the NODALIDA Special Session on Treebanks. 2005.
* Documentation
  * Jens Nilsson, Johan Hall: [[http://w3.msi.vxu.se/~jni/papers/msi_report05067.pdf|Reconstruction of the Swedish Treebank Talbanken]]. MSI report 05067. Växjö universitet: School of Mathematics and Systems Engineering. ISSN 1650-2647. Växjö, Sweden, 2005.
  * Daniel Zeman: [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset interpretation of the Mamba tagset]]. Praha, Czechia, 2007.
==== Domain ====
Mixed: “The text material consists of two sections, the so-called professional prose section (P), with data from textbooks, brochures, newspapers, etc., and a collection of high school students' essays (G).”
==== Size ====
The CoNLL 2006 version contains 197,123 tokens in 11,431 sentences, yielding 17.24 tokens per sentence on average (CoNLL 2006 data split: 191,467 tokens / 11,042 sentences training, 5,656 tokens / 389 sentences test).
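The figures above can be re-derived from the data files. The following is a minimal sketch, assuming the usual CoNLL 2006 layout of one token per line with a blank line between sentences; the function name ''corpus_stats'' is made up for this illustration.

```python
def corpus_stats(lines):
    """Count tokens and sentences in CoNLL-formatted lines.

    Assumes one token per non-empty line and a blank line after
    each sentence (a missing final blank line is tolerated).
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:
        sentences += 1
    average = tokens / sentences if sentences else 0.0
    return tokens, sentences, average


# Sanity check of the average reported for the whole corpus.
reported_average = round(197123 / 11431, 2)  # 17.24
```

To apply it to a file, pass an open file handle, e.g. ''corpus_stats(open(path, encoding="utf-8"))'', where ''path'' points to wherever the training or test file is stored locally.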
==== Inside ====
The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags have (probably) been assigned manually. The tagset is very coarse: there are no morphological features, only the part of speech. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=sv::mamba|DZ Interset]] to inspect the tagset.
==== Sample ====
The first sentence of the CoNLL 2006 training data:
| 1 | Äktenskapet | _ | NN | NN | _ | 4 | SS | _ | _ |
| 2 | och | _ | ++ | ++ | _ | 3 | ++ | _ | _ |
| 3 | familjen | _ | NN | NN | _ | 1 | CC | _ | _ |
| 4 | är | _ | AV | AV | _ | 0 | ROOT | _ | _ |
| 5 | en | _ | EN | EN | _ | 7 | DT | _ | _ |
| 6 | gammal | _ | AJ | AJ | _ | 7 | AT | _ | _ |
| 7 | institution | _ | NN | NN | _ | 4 | SP | _ | _ |
| 8 | , | _ | IK | IK | _ | 7 | IK | _ | _ |
| 9 | som | _ | PO | PO | _ | 10 | SS | _ | _ |
| 10 | funnits | _ | VV | VV | _ | 7 | ET | _ | _ |
| 11 | sedan | _ | PR | PR | _ | 10 | TA | _ | _ |
| 12 | 1800-talet | _ | NN | NN | _ | 11 | PA | _ | _ |
| 13 | . | _ | IP | IP | _ | 4 | IP | _ | _ |
The first two sentences of the CoNLL 2006 test data:
| 1 | Individuell | _ | AJ | AJ | _ | 2 | AT | _ | _ |
| 2 | beskattning | _ | VN | VN | _ | 0 | ROOT | _ | _ |
| 3 | av | _ | PR | PR | _ | 2 | ET | _ | _ |
| 4 | arbetsinkomster | _ | NN | NN | _ | 3 | PA | _ | _ |
| ||||||||||
| 1 | Genom | _ | PR | PR | _ | 3 | AA | _ | _ |
| 2 | skattereformen | _ | NN | NN | _ | 1 | PA | _ | _ |
| 3 | införs | _ | VV | VV | _ | 0 | ROOT | _ | _ |
| 4 | individuell | _ | AJ | AJ | _ | 5 | AT | _ | _ |
| 5 | beskattning | _ | VN | VN | _ | 3 | SS | _ | _ |
| 6 | ( | _ | IR | IR | _ | 5 | IR | _ | _ |
| 7 | särbeskattning | _ | VN | VN | _ | 5 | AN | _ | _ |
| 8 | ) | _ | IR | IR | _ | 5 | JR | _ | _ |
| 9 | av | _ | PR | PR | _ | 5 | ET | _ | _ |
| 10 | arbetsinkomster | _ | NN | NN | _ | 9 | PA | _ | _ |
| 11 | . | _ | IP | IP | _ | 3 | IP | _ | _ |
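The ten columns shown in the samples are the CoNLL 2006 fields ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL. As a rough sketch of how such a sentence can be read, the code below assumes tab-separated columns as in the distributed files; the names ''Token'', ''parse_sentence'' and ''describe'' are hypothetical helpers, not part of any toolkit.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int      # 0 means the artificial root
    deprel: str
    phead: str
    pdeprel: str


def parse_sentence(block):
    """Parse one sentence (tab-separated, one token per line)."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                            cols[5], int(cols[6]), cols[7], cols[8], cols[9]))
    return tokens


def describe(tokens):
    """Render each dependency as 'dependent --label--> head'."""
    forms = {t.id: t.form for t in tokens}
    forms[0] = "<root>"
    return [f"{t.form} --{t.deprel}--> {forms[t.head]}" for t in tokens]


# First sentence of the test data, rebuilt from the table above.
rows = [
    ("1", "Individuell", "_", "AJ", "AJ", "_", "2", "AT", "_", "_"),
    ("2", "beskattning", "_", "VN", "VN", "_", "0", "ROOT", "_", "_"),
    ("3", "av", "_", "PR", "PR", "_", "2", "ET", "_", "_"),
    ("4", "arbetsinkomster", "_", "NN", "NN", "_", "3", "PA", "_", "_"),
]
sample = "\n".join("\t".join(row) for row in rows)
tokens = parse_sentence(sample)
for line in describe(tokens):
    print(line)
```

Keeping HEAD as an integer and the unused columns as plain strings mirrors the data as distributed, where LEMMA, FEATS, PHEAD and PDEPREL are all ''_''.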
==== Parsing ====
Nonprojective attachments are rare in Talbanken: only 1,928 of the 197,123 tokens in the CoNLL 2006 version are attached nonprojectively (0.98%).
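The exact counting procedure behind this figure is not stated here. The sketch below assumes the common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head. The function name ''nonprojective_tokens'' is made up for this illustration.

```python
def nonprojective_tokens(heads):
    """Return the IDs of nonprojectively attached tokens.

    heads[i] is the head of token i + 1 (IDs are 1-based, 0 is the
    artificial root). Assumed definition: the arc head -> dep is
    nonprojective if a token strictly between them is not a
    descendant of head. The root (0) dominates everything.
    """
    n = len(heads)
    head_of = [0] + list(heads)  # index by token ID

    def dominated_by(node, ancestor):
        if ancestor == 0:
            return True
        steps = 0
        # Follow head links upward; the step bound guards against cycles.
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = head_of[node]
            steps += 1
        return False

    result = []
    for dep in range(1, n + 1):
        head = head_of[dep]
        lo, hi = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first training sentence in the Sample section.
first_training = [4, 3, 1, 0, 7, 7, 4, 7, 10, 7, 10, 11, 4]
print(nonprojective_tokens(first_training))

# Toy tree (not from the treebank): token 2 attaches to token 4
# across token 3, which token 4 does not dominate.
print(nonprojective_tokens([3, 4, 0, 3]))
```

Under this definition the first training sentence is fully projective, while the toy tree has exactly one nonprojective token. Other definitions (for example, counting crossing arcs) can give slightly different totals.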
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Swedish, sorted by UAS:
^ Parser (Authors) ^ LAS ^ UAS ^
| Microsoft (Corston-Oliver and Aue) | 79.69 | 89.54 |
| Malt (Nivre et al.) | 84.58 | 89.50 |
| Illinois (Do and Chang) | 82.31 | 89.05 |
| MST (McDonald et al.) | 82.55 | 88.93 |
| Kenji Sagae | 82.00 | 88.57 |
| Nara (Yuchang Cheng) | 81.08 | 88.57 |
| Basis (John O'Neil) | 81.78 | 88.45 |
| Riedel et al. | 80.66 | 88.33 |
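LAS (labeled attachment score) is the percentage of scored tokens with the correct head and dependency label; UAS (unlabeled attachment score) requires only the correct head. The sketch below shows how both can be computed while skipping punctuation tokens, as the shared task did. It is not the official scorer: it assumes that a punctuation token is one whose form consists solely of characters in a Unicode punctuation category, and the function names are hypothetical.

```python
import unicodedata


def is_punctuation(form):
    """Assumption: every character is in a Unicode 'P*' category."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent, ignoring punctuation tokens.

    gold and predicted are aligned lists of (form, head, deprel).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # excluded from both scores
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Toy data, not taken from the treebank or from any parser output.
gold = [("Genom", 3, "AA"), ("reformen", 1, "PA"),
        ("införs", 0, "ROOT"), (".", 3, "IP")]
predicted = [("Genom", 3, "TA"), ("reformen", 1, "PA"),
             ("införs", 0, "ROOT"), (".", 1, "IP")]
las, uas = attachment_scores(gold, predicted)
```

In the toy example the wrongly attached full stop does not count against either score, and the one wrong label lowers LAS but not UAS.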