Dutch (nl)
Versions
- The Alpino Treebank 1.0 (2002) in an XML-based format
- CoNLL 2006
- Lassy (2007 and later) builds upon Alpino; it is much larger but is not available under the same license
Obtaining and License
Alpino is available under the GNU General Public License. Download the original distribution (DTAG + TIGER-XML formats) from http://odur.let.rug.nl/~vannoord/ftp/AlpinoCDROM/. Download the CoNLL 2006 conversion from http://ilk.uvt.nl/conll/free_data.html. The license in short:
- any usage, commercial or not
- modification and redistribution under same license permitted
- citation in publications not required (but it is common decency)
Alpino was created by members of Alfa-informatica, Faculty of Arts (Faculteit der Letteren), University of Groningen (Rijksuniversiteit Groningen), Oude Kijk in 't Jatstraat 26, NL-9712 EK Groningen, The Netherlands.
References
- Website
- http://ilk.uvt.nl/conll/free_data.html (CoNLL 2006)
- Data
- no separate citation
- Principal publications
- Robert Malouf, Gertjan van Noord: Wide Coverage Parsing with Stochastic Attribute Value Grammars. In: Proceedings of Beyond Shallow Analyses – Formalisms and Statistical Modeling for Deep Analyses Workshop, IJCNLP, Sanya, Hainan, China, 2004.
- Leonoor van der Beek, Gosse Bouma, Jan Daciuk, Tanja Gaustad, Robert Malouf, Gertjan van Noord, Robbert Prins, Begoña Villada: Algorithms for Linguistic Processing NWO PIONIER Progress Report. Groningen, Netherlands, 2002.
- Documentation
Domain
Newspaper. The Alpino Treebank consists of “the full cdbl (newspaper) part of the Eindhoven corpus.”
Size
The CoNLL 2006 version contains 200,654 tokens in 13,735 sentences, yielding 14.61 tokens per sentence on average (CoNLL 2006 data split: 195,069 tokens / 13,349 sentences training, 5,585 tokens / 386 sentences test).
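As a quick consistency check, the totals and the average sentence length can be recomputed from the training and test counts quoted above. This is only a sketch of the arithmetic; the variable names are illustrative and not part of any distribution.

```python
# Counts as reported above for the CoNLL 2006 version of the treebank.
train_tokens, train_sentences = 195_069, 13_349
test_tokens, test_sentences = 5_585, 386

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
avg_length = total_tokens / total_sentences

# Totals and average tokens per sentence, rounded to two decimals.
print(total_tokens, total_sentences, round(avg_length, 2))
```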
Inside
In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file `tagset.txt`. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file `syn_prot.pdf` (Dutch only). The file `diff.pdf` attempts to describe a number of differences between the CGN and Alpino annotation practice; it is heavily out of date, but the number of differences has been reduced. Conversion issues: head selection, multi-word units, discourse units.
Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. “Economische_en_Monetaire_Unie”). They have the special part-of-speech tag `MWU`; their sub-parts of speech and features may describe the individual parts of the unit. For example, “aan_het” has CPOS `MWU`, (sub)POS `Prep_Art` and features `voor_bep|onzijd|neut`.
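The following minimal sketch shows how such a concatenated token might be taken apart again. It assumes that the underscore separates the parts in the form, the POS and the features alike, which matches the “aan_het” example but is not guaranteed for every unit in the treebank. The function name `split_mwu` is hypothetical. When the three fields do not split into the same number of parts, the token is returned unchanged.

```python
def split_mwu(form, cpos, pos, feats):
    """Split a multi-word unit into (word, pos, feats) triples.

    Assumption: '_' joins the per-word parts in all three fields.
    Tokens that are not MWUs, or whose fields do not align, are
    returned as a single unchanged triple.
    """
    if cpos != "MWU":
        return [(form, pos, feats)]
    words = form.split("_")
    tags = pos.split("_")
    feat_groups = feats.split("_")
    if not (len(words) == len(tags) == len(feat_groups)):
        return [(form, pos, feats)]
    return list(zip(words, tags, feat_groups))


parts = split_mwu("aan_het", "MWU", "Prep_Art", "voor_bep|onzijd|neut")
print(parts)
```

Falling back to the unsplit token keeps the sketch safe for units whose sub-tags describe the expression as a whole rather than its individual words.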
Sample
The first two sentences of the CoNLL 2006 training data:
1 | Cathy | Cathy | N | N | eigen|ev|neut | 2 | su | _ | _ |
2 | zag | zie | V | V | trans|ovt|1of2of3|ev | 0 | ROOT | _ | _ |
3 | hen | hen | Pron | Pron | per|3|mv|datofacc | 2 | obj1 | _ | _ |
4 | wild | wild | Adj | Adj | attr|stell|onverv | 5 | mod | _ | _ |
5 | zwaaien | zwaai | N | N | soort|mv|neut | 2 | vc | _ | _ |
6 | . | . | Punc | Punc | punt | 5 | punct | _ | _ |

1 | Ze | ze | Pron | Pron | per|3|evofmv|nom | 2 | su | _ | _ |
2 | had | heb | V | V | trans|ovt|1of2of3|ev | 0 | ROOT | _ | _ |
3 | met | met | Prep | Prep | voor | 8 | mod | _ | _ |
4 | haar | haar | Pron | Pron | bez|3|ev|neut|attr | 5 | det | _ | _ |
5 | moeder | moeder | N | N | soort|ev|neut | 3 | obj1 | _ | _ |
6 | kunnen | kan | V | V | hulp|ott|1of2of3|mv | 2 | vc | _ | _ |
7 | gaan | ga | V | V | hulp|inf | 6 | vc | _ | _ |
8 | winkelen | winkel | V | V | intrans|inf | 11 | cnj | _ | _ |
9 | , | , | Punc | Punc | komma | 8 | punct | _ | _ |
10 | zwemmen | zwem | V | V | intrans|inf | 11 | cnj | _ | _ |
11 | of | of | Conj | Conj | neven | 7 | vc | _ | _ |
12 | terrassen | terras | N | N | soort|mv|neut | 11 | cnj | _ | _ |
13 | . | . | Punc | Punc | punt | 12 | punct | _ | _ |
The first two sentences of the CoNLL 2006 test data:
1 | BASISTAKENPAKKET | basis_taken_pakket | Prep | Prep | voor | 0 | ROOT | _ | _ |
2 | JEUGDGEZONDHEIDSZORG | jeugd_gezondheid_zorg | N | N | eigen|ev|neut | 0 | ROOT | _ | _ |
3 | 0-19 | 0-19 | Num | Num | hoofd|bep|attr|onverv | 4 | det | _ | _ |
4 | JAAR | JAAR | N | N | eigen|ev|neut | 0 | ROOT | _ | _ |

1 | Daarvoor | daarvoor | Adv | Adv | pron|aanw | 3 | pc | _ | _ |
2 | is | ben | V | V | hulpofkopp|ott|3|ev | 0 | ROOT | _ | _ |
3 | gekozen | kies | V | V | trans|verldw|onverv | 2 | vc | _ | _ |
4 | omdat | omdat | Conj | Conj | onder|metfin | 3 | mod | _ | _ |
5 | gemeenten | gemeente | N | N | soort|mv|neut | 11 | su | _ | _ |
6 | bij | bij | Prep | Prep | voor | 12 | mod | _ | _ |
7 | uitstek | uitstek | N | N | soort|ev|neut | 6 | obj1 | _ | _ |
8 | het | het | Art | Art | bep|onzijd|neut | 10 | det | _ | _ |
9 | lokale | lokaal | Adj | Adj | attr|stell|vervneut | 10 | mod | _ | _ |
10 | gezondheidsbeleid | gezondheid_beleid | N | N | soort|ev|neut | 12 | obj1 | _ | _ |
11 | kunnen | kan | V | V | hulp|inf | 4 | body | _ | _ |
12 | toespitsen | spits_toe | V | V | refl|inf | 11 | vc | _ | _ |
13 | op | op | Prep | Prep | voor | 12 | pc | _ | _ |
14 | de | de | Art | Art | bep|zijdofmv|neut | 16 | det | _ | _ |
15 | specifieke | specifiek | Adj | Adj | attr|stell|vervneut | 16 | mod | _ | _ |
16 | gezondheidssituatie | gezondheid_situatie | N | N | soort|ev|neut | 17 | cnj | _ | _ |
17 | en | en | Conj | Conj | neven | 13 | obj1 | _ | _ |
18 | zorgbehoeften | zorg_behoefte | N | N | soort|mv|neut | 17 | cnj | _ | _ |
19 | van | van | Prep | Prep | voor | 16 | mod | _ | _ |
20 | kinderen | kind | N | N | soort|mv|neut | 21 | cnj | _ | _ |
21 | en | en | Conj | Conj | neven | 19 | obj1 | _ | _ |
22 | jongeren | jongere | Adj | Adj | zelfst|vergr|vervneut | 21 | cnj | _ | _ |
23 | in | in | Prep | Prep | voor | 20 | mod | _ | _ |
24 | de | de | Art | Art | bep|zijdofmv|neut | 26 | det | _ | _ |
25 | eigen | eigen | Pron | Pron | aanw|neut|attr|weigen | 26 | mod | _ | _ |
26 | gemeente | gemeente | N | N | soort|ev|neut | 23 | obj1 | _ | _ |
27 | . | . | Punc | Punc | punt | 26 | punct | _ | _ |
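The samples above are displayed with pipes between columns. The CoNLL-X format itself is tab-separated, with ten columns per token (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and a blank line after each sentence. A minimal reader for that format could look as follows. The names `Token` and `read_conll` are illustrative, and the sketch keeps only the first eight columns, since the last two are empty (`_`) in the samples shown here.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int
    deprel: str


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence from tab-separated lines."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                              cols[4], cols[5], int(cols[6]), cols[7]))
    if sentence:
        yield sentence


# First training sentence from the sample above, rewritten with tabs.
sample = [
    "1\tCathy\tCathy\tN\tN\teigen|ev|neut\t2\tsu\t_\t_",
    "2\tzag\tzie\tV\tV\ttrans|ovt|1of2of3|ev\t0\tROOT\t_\t_",
    "3\then\then\tPron\tPron\tper|3|mv|datofacc\t2\tobj1\t_\t_",
    "4\twild\twild\tAdj\tAdj\tattr|stell|onverv\t5\tmod\t_\t_",
    "5\tzwaaien\tzwaai\tN\tN\tsoort|mv|neut\t2\tvc\t_\t_",
    "6\t.\t.\tPunc\tPunc\tpunt\t5\tpunct\t_\t_",
    "",
]
sentences = list(read_conll(sample))
root = next(tok for tok in sentences[0] if tok.head == 0)
print(len(sentences), root.form)
```

To read a real file, pass an open file handle (for example one opened with `encoding="utf-8"`) instead of the `sample` list.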
Parsing
Nonprojective attachments are quite frequent in Alpino: 10,858 of the 200,654 tokens in the CoNLL 2006 version are attached nonprojectively (5.41%).
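To illustrate what a nonprojective attachment is, the sketch below applies one common definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. This page does not state which definition produced the count above, so the code may not reproduce that figure exactly. The function name `nonprojective_tokens` is hypothetical, and the input is assumed to be a well-formed tree given as a list of head indices with 0 as the artificial root.

```python
def nonprojective_tokens(heads):
    """Return the 1-based IDs of tokens attached nonprojectively.

    heads[i] is the head of token i + 1; 0 denotes the artificial root.
    """
    def dominates(ancestor, node):
        # Follow head links upward from `node`; the bound guards against cycles.
        for _ in range(len(heads)):
            node = heads[node - 1]
            if node == ancestor:
                return True
            if node == 0:
                return False
        return False

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Every token strictly between dep and head must descend from head.
        if any(not dominates(head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the second training sentence ("Ze had met haar moeder ...").
heads = [2, 0, 8, 5, 3, 2, 6, 11, 8, 11, 7, 11, 12]
print(nonprojective_tokens(heads))
```

In this sentence the preposition “met” (token 3) depends on “winkelen” (token 8), but “kunnen” and “gaan” lie between them and are not descendants of “winkelen”, so token 3 is the one reported.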
The results of the CoNLL 2006 shared task are available online and were published in Buchholz and Marsi (2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Dutch:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 79.19 | 83.57 |
Riedel et al. | 78.59 | 82.91 |
Basis (John O'Neil) | 77.51 | 81.73 |
Malt (Nivre et al.) | 78.59 | 81.35 |
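LAS (labeled attachment score) is the percentage of scored tokens that receive both the correct head and the correct dependency label; UAS (unlabeled attachment score) requires only the correct head. The sketch below computes both while skipping punctuation tokens. It assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category; this is an approximation and not necessarily identical to the official evaluation script. The function names and the predicted parse are made up for illustration.

```python
import unicodedata


def is_punctuation(form):
    # Assumption: punctuation means every character is in a Unicode "P" category.
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent over non-punctuation tokens.

    gold and predicted are aligned lists of (form, head, deprel) triples.
    """
    scored = correct_head = correct_both = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            correct_head += 1
            if g_rel == p_rel:
                correct_both += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * correct_both / scored, 100.0 * correct_head / scored


# Gold annotation of the first training sentence.
gold = [("Cathy", 2, "su"), ("zag", 0, "ROOT"), ("hen", 2, "obj1"),
        ("wild", 5, "mod"), ("zwaaien", 2, "vc"), (".", 5, "punct")]
# Hypothetical parser output: one wrong head, one wrong label,
# and a wrong punctuation attachment that is not scored.
predicted = [("Cathy", 2, "su"), ("zag", 0, "ROOT"), ("hen", 2, "obj1"),
             ("wild", 2, "mod"), ("zwaaien", 2, "obj1"), (".", 2, "punct")]

las, uas = attachment_scores(gold, predicted)
print(las, uas)
```

Five tokens are scored in this example: four have the correct head and three of those also have the correct label.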