Swedish (sv)
Versions
- Talbanken76 (1976, Lund University)
- MAMBA (original syntactic and lexical annotation)
- FPS (flat phrase structure annotation, TIGER-XML)
- DPS (deepened phrase structure annotation, TIGER-XML)
- Dep (dependency structure annotation, Malt-XML)
- CoNLL 2006 conversion of Dep
Obtaining and License
Talbanken is freely available for download at http://w3.msi.vxu.se/~nivre/research/Talbanken05_1.1.tar.gz. The license, in short:
- use for research and educational purposes
- credit the work done to produce the treebank
- cite the principal publication in publications
- redistribution is not discussed; it might be permitted under the same license terms
Talbanken was created by members of Lund University (Lunds universitet) and Växjö University (Växjö universitet, today Linnéuniversitetet).
References
- Website
- Data
- Jan Einarsson, Tor G. Hultman, Nils Jörgensen, Ulf Teleman, Margareta Westman: Talbanken76 corpus
- Jens Nilsson, Johan Hall, Joakim Nivre: conversion of Talbanken76 to Talbanken05
- Principal publications
- Joakim Nivre, Jens Nilsson, Johan Hall: Talbanken05: A Swedish Treebank with Phrase Structure and Dependency Annotation. In: Proceedings of the Fifth International Conference on Language Resources and Evaluation (LREC 2006), May 24-26, Genoa, Italy, 2006.
- Jens Nilsson, Johan Hall, Joakim Nivre: MAMBA Meets TIGER: Reconstructing a Swedish Treebank from Antiquity. In: Proceedings of the NODALIDA Special Session on Treebanks. 2005.
- Documentation
- Jens Nilsson, Johan Hall: Reconstruction of the Swedish Treebank Talbanken. MSI report 05067. Växjö universitet: School of Mathematics and Systems Engineering. ISSN 1650-2647. Växjö, Sweden, 2005.
- Daniel Zeman: DZ Interset interpretation of the Mamba tagset. Prague, Czechia, 2007.
Domain
Mixed: “The text material consists of two sections, the so-called professional prose section (P), with data from textbooks, brochures, newspapers, etc., and a collection of high school students' essays (G).”
Size
The CoNLL 2006 version contains 197,123 tokens in 11,431 sentences, yielding 17.24 tokens per sentence on average (CoNLL 2006 data split: 191,467 tokens / 11,042 sentences training, 5,656 tokens / 389 sentences test).
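As a quick sanity check, the train/test split adds up to the reported totals (figures taken from the paragraph above):

```python
# Reported Talbanken05 / CoNLL 2006 split counts.
train_tokens, train_sents = 191_467, 11_042
test_tokens, test_sents = 5_656, 389

total_tokens = train_tokens + test_tokens  # 197,123
total_sents = train_sents + test_sents     # 11,431

print(total_tokens, total_sents)                # 197123 11431
print(round(total_tokens / total_sents, 2))     # 17.24 tokens per sentence
```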
Inside
The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags were (probably) assigned manually. The tagset is very coarse: there are no morphological features, only the part of speech. DZ Interset can be used to inspect the tagset.
Sample
The first sentence of the CoNLL 2006 training data:
```
1    Äktenskapet    _   NN   NN   _    4   SS     _   _
2    och            _   ++   ++   _    3   ++     _   _
3    familjen       _   NN   NN   _    1   CC     _   _
4    är             _   AV   AV   _    0   ROOT   _   _
5    en             _   EN   EN   _    7   DT     _   _
6    gammal         _   AJ   AJ   _    7   AT     _   _
7    institution    _   NN   NN   _    4   SP     _   _
8    ,              _   IK   IK   _    7   IK     _   _
9    som            _   PO   PO   _   10   SS     _   _
10   funnits        _   VV   VV   _    7   ET     _   _
11   sedan          _   PR   PR   _   10   TA     _   _
12   1800-talet     _   NN   NN   _   11   PA     _   _
13   .              _   IP   IP   _    4   IP     _   _
```
The first two sentences of the CoNLL 2006 test data:
```
1    Individuell       _   AJ   AJ   _   2   AT     _   _
2    beskattning       _   VN   VN   _   0   ROOT   _   _
3    av                _   PR   PR   _   2   ET     _   _
4    arbetsinkomster   _   NN   NN   _   3   PA     _   _

1    Genom             _   PR   PR   _   3   AA     _   _
2    skattereformen    _   NN   NN   _   1   PA     _   _
3    införs            _   VV   VV   _   0   ROOT   _   _
4    individuell       _   AJ   AJ   _   5   AT     _   _
5    beskattning       _   VN   VN   _   3   SS     _   _
6    (                 _   IR   IR   _   5   IR     _   _
7    särbeskattning    _   VN   VN   _   5   AN     _   _
8    )                 _   IR   IR   _   5   JR     _   _
9    av                _   PR   PR   _   5   ET     _   _
10   arbetsinkomster   _   NN   NN   _   9   PA     _   _
11   .                 _   IP   IP   _   3   IP     _   _
```
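Each line of the samples above is one token in the ten-column CoNLL-X format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). A minimal reader sketch (the function and class names are illustrative, not part of any Talbanken tooling):

```python
from typing import NamedTuple

class Token(NamedTuple):
    id: int       # 1-based position in the sentence
    form: str     # word form
    lemma: str    # '_' in Talbanken05: no lemmas
    cpostag: str  # coarse POS (equals POSTAG here)
    postag: str
    feats: str    # '_' in Talbanken05: no morphological features
    head: int     # 0 = artificial root
    deprel: str   # MAMBA-derived relation (SS, ROOT, ...)

def read_conllx(text: str) -> list[list[Token]]:
    """Split tab-separated CoNLL-X text into sentences of Tokens.
    Columns 9-10 (PHEAD, PDEPREL) are unused ('_') in Talbanken05
    and are dropped here."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():          # blank line ends a sentence
            if current:
                sentences.append(current)
                current = []
            continue
        c = line.split("\t")
        current.append(Token(int(c[0]), c[1], c[2], c[3],
                             c[4], c[5], int(c[6]), c[7]))
    if current:
        sentences.append(current)
    return sentences

# First test sentence shown above, re-joined with tabs:
rows = ["1 Individuell _ AJ AJ _ 2 AT _ _",
        "2 beskattning _ VN VN _ 0 ROOT _ _",
        "3 av _ PR PR _ 2 ET _ _",
        "4 arbetsinkomster _ NN NN _ 3 PA _ _"]
sample = "\n".join("\t".join(r.split()) for r in rows)
sents = read_conllx(sample)
print(len(sents), sents[0][1].deprel)  # 1 ROOT
```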
Parsing
Nonprojective dependencies are not frequent in Talbanken: only 1,928 of the 197,123 tokens in the CoNLL 2006 version are attached nonprojectively (0.98%).
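An arc from head to dependent is nonprojective if some token strictly between the two is not dominated by the head. A sketch of that check over one sentence's HEAD column (the helper name is illustrative):

```python
def nonprojective_tokens(heads: list[int]) -> list[int]:
    """Return 1-based ids of tokens attached nonprojectively.
    heads[i] is the head of token i+1; head 0 is the artificial root.
    An arc (h, d) is nonprojective if a token strictly between h and d
    is not dominated by h."""
    n = len(heads)

    def dominates(h: int, t: int) -> bool:
        # Walk up from t; guard against malformed cyclic trees.
        seen = set()
        while t != 0 and t not in seen:
            if t == h:
                return True
            seen.add(t)
            t = heads[t - 1]
        return h == 0  # the root dominates everything

    result = []
    for d in range(1, n + 1):
        h = heads[d - 1]
        lo, hi = min(h, d), max(h, d)
        if any(not dominates(h, t) for t in range(lo + 1, hi)):
            result.append(d)
    return result

# HEAD column of the first training sentence shown above:
heads = [4, 3, 1, 0, 7, 7, 4, 7, 10, 7, 10, 11, 4]
print(nonprojective_tokens(heads))  # [] -- this sentence is projective
```

Running this over all 11,431 sentences would recover the 1,928 nonprojectively attached tokens reported above.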
The results of the CoNLL 2006 shared task are available online and were published in Buchholz and Marsi (2006). The evaluation procedure was non-standard in that it excluded punctuation tokens. These are the best results for Swedish:
Parser (Authors) | LAS | UAS |
---|---|---|
Microsoft (Corston-Oliver and Aue) | 79.69 | 89.54 |
Malt (Nivre et al.) | 84.58 | 89.50 |
Illinois (Do and Chang) | 82.31 | 89.05 |
MST (McDonald et al.) | 82.55 | 88.93 |
Kenji Sagae | 82.00 | 88.57 |
Nara (Yuchang Cheng) | 81.08 | 88.57 |
Basis (John O'Neil) | 81.78 | 88.45 |
Riedel et al. | 80.66 | 88.33 |
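For reference, LAS counts tokens whose head and relation both match the gold standard, while UAS counts only the head; the CoNLL 2006 scorer skipped tokens whose form contains no letter or digit. A sketch of that scoring (an approximation of the shared task's eval.pl, not the official script):

```python
def las_uas(gold, pred):
    """Compute (LAS, UAS) in percent over parallel lists of
    (form, head, deprel) triples, skipping pure-punctuation tokens
    as in the CoNLL 2006 evaluation: a token is scored only if its
    form contains at least one letter or digit."""
    total = las = uas = 0
    for (form, gh, gd), (_, ph, pd) in zip(gold, pred):
        if not any(ch.isalnum() for ch in form):
            continue  # skip punctuation like ',' '.' '(' ')'
        total += 1
        if gh == ph:
            uas += 1
            if gd == pd:
                las += 1
    return 100.0 * las / total, 100.0 * uas / total

# Toy example based on the second test sentence above, with one
# fabricated parser error on the head of 'skattereformen':
gold = [("Genom", 3, "AA"), ("skattereformen", 1, "PA"),
        ("införs", 0, "ROOT"), (".", 3, "IP")]
pred = [("Genom", 3, "AA"), ("skattereformen", 3, "PA"),
        ("införs", 0, "ROOT"), (".", 3, "IP")]
print(las_uas(gold, pred))  # both ~66.67 over 3 scoring tokens
```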