Institute of Formal and Applied Linguistics Wiki


This is an old revision of the document!


Swedish (sv)

Talbanken05

Versions

Obtaining and License

Talbanken is freely available for download at http://w3.msi.vxu.se/~nivre/research/Talbanken05_1.1.tar.gz. The license in short:

Talbanken was created by researchers at Lund University (Lunds universitet) and Växjö University (Växjö universitet, today Linnéuniversitetet).

References

Domain

Mixed: “The text material consists of two sections, the so-called professional prose section (P), with data from textbooks, brochures, newspapers, etc., and a collection of high school students' essays (G).”

Size

The CoNLL 2006 version contains 197,123 tokens in 11,431 sentences, yielding 17.24 tokens per sentence on average (CoNLL 2006 data split: 191,467 tokens / 11,042 sentences training, 5,656 tokens / 389 sentences test).
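The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch that recomputes the totals and the average from the training and test counts quoted here; the variable names are illustrative and not part of any official tooling.

```python
# Counts quoted above for the CoNLL 2006 version of Talbanken05.
TRAIN_TOKENS, TRAIN_SENTENCES = 191_467, 11_042
TEST_TOKENS, TEST_SENTENCES = 5_656, 389

# Totals are the sum of the training and test portions.
total_tokens = TRAIN_TOKENS + TEST_TOKENS
total_sentences = TRAIN_SENTENCES + TEST_SENTENCES

# Average sentence length in tokens, rounded to two decimals.
average_length = round(total_tokens / total_sentences, 2)

print(total_tokens, total_sentences, average_length)
```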

Inside

The morphological analysis in the CoNLL 2006 version does not include lemmas. The part-of-speech tags were probably assigned manually. The tagset is very coarse: there are no morphological features, only the part of speech. Use DZ Interset to inspect the tagset.

Sample

The first sentence of the CoNLL 2006 training data:

1 Äktenskapet _ NN NN _ 4 SS _ _
2 och _ ++ ++ _ 3 ++ _ _
3 familjen _ NN NN _ 1 CC _ _
4 är _ AV AV _ 0 ROOT _ _
5 en _ EN EN _ 7 DT _ _
6 gammal _ AJ AJ _ 7 AT _ _
7 institution _ NN NN _ 4 SP _ _
8 , _ IK IK _ 7 IK _ _
9 som _ PO PO _ 10 SS _ _
10 funnits _ VV VV _ 7 ET _ _
11 sedan _ PR PR _ 10 TA _ _
12 1800-talet _ NN NN _ 11 PA _ _
13 . _ IP IP _ 4 IP _ _

The first two sentences of the CoNLL 2006 test data:

1 Individuell _ AJ AJ _ 2 AT _ _
2 beskattning _ VN VN _ 0 ROOT _ _
3 av _ PR PR _ 2 ET _ _
4 arbetsinkomster _ NN NN _ 3 PA _ _

1 Genom _ PR PR _ 3 AA _ _
2 skattereformen _ NN NN _ 1 PA _ _
3 införs _ VV VV _ 0 ROOT _ _
4 individuell _ AJ AJ _ 5 AT _ _
5 beskattning _ VN VN _ 3 SS _ _
6 ( _ IR IR _ 5 IR _ _
7 särbeskattning _ VN VN _ 5 AN _ _
8 ) _ IR IR _ 5 JR _ _
9 av _ PR PR _ 5 ET _ _
10 arbetsinkomster _ NN NN _ 9 PA _ _
11 . _ IP IP _ 3 IP _ _
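The samples above follow the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with a blank line between sentences. The following is a minimal reader sketch, assuming whitespace-separated columns as shown on this page (the distributed files use tabs, which `split()` also handles). The `Token` type and the `read_conll` function are illustrative names, not part of any existing library.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One row of a CoNLL-X file (hypothetical helper type)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.strip()
        if not line:
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split()
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                              cols[5], int(cols[6]), cols[7], cols[8], cols[9]))
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence


# The first two test sentences from this page.
sample = """\
1 Individuell _ AJ AJ _ 2 AT _ _
2 beskattning _ VN VN _ 0 ROOT _ _
3 av _ PR PR _ 2 ET _ _
4 arbetsinkomster _ NN NN _ 3 PA _ _

1 Genom _ PR PR _ 3 AA _ _
2 skattereformen _ NN NN _ 1 PA _ _
3 införs _ VV VV _ 0 ROOT _ _
4 individuell _ AJ AJ _ 5 AT _ _
5 beskattning _ VN VN _ 3 SS _ _
6 ( _ IR IR _ 5 IR _ _
7 särbeskattning _ VN VN _ 5 AN _ _
8 ) _ IR IR _ 5 JR _ _
9 av _ PR PR _ 5 ET _ _
10 arbetsinkomster _ NN NN _ 9 PA _ _
11 . _ IP IP _ 3 IP _ _
"""

sentences = list(read_conll(sample.splitlines()))
roots = [tok.form for sent in sentences for tok in sent if tok.head == 0]
print(len(sentences), [len(s) for s in sentences], roots)
```

Splitting on arbitrary whitespace is a convenience for text copied from this page; a stricter reader for the original files would split on tabs only.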

Parsing

Nonprojective attachments are rare in Talbanken: only 1,928 of the 197,123 tokens in the CoNLL 2006 version (0.98%) are attached nonprojectively.
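To make the notion concrete, here is a small sketch that flags nonprojectively attached tokens in a single sentence. It assumes one common definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The page does not state which definition produced the count above, so this sketch is not guaranteed to reproduce it exactly. The function name and the second example are hypothetical.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return 1-based IDs of tokens whose attachment is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    n = len(heads)

    def is_descendant(node: int, ancestor: int) -> bool:
        # Walk up the head chain; the step limit guards against cycles.
        steps = 0
        while node != 0 and steps <= n:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    flagged = []
    for dep, head in enumerate(heads, start=1):
        if head == 0:
            continue  # attachment to the artificial root is not checked here
        lo, hi = sorted((dep, head))
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            flagged.append(dep)
    return flagged


# HEAD column of the first training sentence shown on this page.
first_training_sentence = [4, 3, 1, 0, 7, 7, 4, 7, 10, 7, 10, 11, 4]
print(nonprojective_tokens(first_training_sentence))

# Hypothetical four-token tree with crossing arcs (1 -> 3 and 2 -> 4).
crossing_example = [3, 4, 0, 3]
print(nonprojective_tokens(crossing_example))

# Share of nonprojective tokens quoted above, in percent.
share = round(100 * 1928 / 197123, 2)
print(share)
```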

The results of the CoNLL 2006 shared task are available online and were published in Buchholz and Marsi (2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Swedish, ordered by UAS:

Parser (Authors)                      LAS     UAS
Microsoft (Corston-Oliver and Aue)    79.69   89.54
Malt (Nivre et al.)                   84.58   89.50
Illinois (Do and Chang)               82.31   89.05
MST (McDonald et al.)                 82.55   88.93
Kenji Sagae                           82.00   88.57
Nara (Yuchang Cheng)                  81.08   88.57
Basis (John O'Neil)                   81.78   88.45
Riedel et al.                         80.66   88.33
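As an illustration of how the two metrics relate, the sketch below computes LAS (correct head and label) and UAS (correct head) over non-punctuation tokens. It is not the official shared-task scorer. It assumes that a token counts as punctuation when every character of its form falls in a Unicode punctuation category, which is how the exclusion is modelled here. The function names and the predicted parse are hypothetical.

```python
import unicodedata
from typing import List, Tuple

Row = Tuple[str, int, str]  # (form, head, deprel)


def is_punctuation(form: str) -> bool:
    """True if every character belongs to a Unicode punctuation category (P*)."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold: List[Row], predicted: List[Row]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, skipping punctuation tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scored = las_hits = uas_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # excluded from scoring
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Gold rows: the first test sentence from this page, plus a full stop
# added only to show that punctuation is skipped.
gold = [
    ("Individuell", 2, "AT"),
    ("beskattning", 0, "ROOT"),
    ("av", 2, "ET"),
    ("arbetsinkomster", 3, "PA"),
    (".", 2, "IP"),
]
# Hypothetical parser output: one wrong label, one wrong punctuation head.
predicted = [
    ("Individuell", 2, "AT"),
    ("beskattning", 0, "ROOT"),
    ("av", 2, "AA"),
    ("arbetsinkomster", 3, "PA"),
    (".", 1, "IP"),
]

las, uas = attachment_scores(gold, predicted)
print(las, uas)
```

Because a label is only counted when the head is also correct, LAS can never exceed UAS for the same parse, which matches every row of the table.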