user:zeman:treebanks:nl [ufal wiki]

Dutch (nl)

Versions

The Alpino Treebank 1.0 (2002) in an XML-based format
CoNLL 2006
Lassy (2007 and later) builds upon Alpino; it is much larger but is not distributed under the same license

Obtaining and License

Alpino is available under the GNU General Public License. Download the original distribution (DTAG + TIGER-XML formats) from http://odur.let.rug.nl/~vannoord/ftp/AlpinoCDROM/. Download the CoNLL 2006 conversion from http://ilk.uvt.nl/conll/free_data.html. The license in short:

any usage, commercial or not
modification and redistribution under same license permitted
citation in publications not required (but it is common decency)

Alpino was created by members of the Alfa-informatica, Faculty of Arts (Faculteit der Letteren), University of Groningen (Rijksuniversiteit Groningen), Oude Kijk in 't Jatstraat 26, NL-9712 EK Groningen, The Netherlands.

References

Website
- http://odur.let.rug.nl/~vannoord/trees/ (Alpino)
- http://odur.let.rug.nl/~vannoord/Lassy/ (Lassy)
- http://ilk.uvt.nl/conll/free_data.html (CoNLL 2006)
Data
- no separate citation
Principal publications
- Robert Malouf, Gertjan van Noord: Wide Coverage Parsing with Stochastic Attribute Value Grammars. In: Proceedings of Beyond Shallow Analyses – Formalisms and Statistical Modeling for Deep Analyses Workshop, IJCNLP, Sanya, Hainan, China, 2004.
- Leonoor van der Beek, Gosse Bouma, Jan Daciuk, Tanja Gaustad, Robert Malouf, Gertjan van Noord, Robbert Prins, Begoña Villada: Algorithms for Linguistic Processing NWO PIONIER Progress Report. Groningen, Netherlands, 2002.
Documentation
- The files doc/tagset.txt, doc/syn_prot.pdf and doc/diffs.pdf in the CoNLL 2006 distribution.

Domain

full cdbl (newspaper) part of the Eindhoven corpus

Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from a wide range of randomly selected linguistically representative Danish texts from 1983-1992.”)

Size

The CoNLL 2006 version contains 100,238 tokens in 5,512 sentences, yielding 18.19 tokens per sentence on average (CoNLL 2006 data split: 94,386 tokens / 5,190 sentences training, 5,852 tokens / 322 sentences test).
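
The average can be re-derived from the split counts quoted above. A minimal Python sketch; the variable names are chosen only for illustration:

```python
# Counts quoted above for the CoNLL 2006 version (training / test split).
train_tokens, train_sentences = 94386, 5190
test_tokens, test_sentences = 5852, 322

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length, rounded to two decimals as in the text.
avg_tokens_per_sentence = round(total_tokens / total_sentences, 2)
print(total_tokens, total_sentences, avg_tokens_per_sentence)
```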

Inside

CoNLL Alpino: The original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file tagset.txt.

The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file syn_prot.pdf (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file diff.pdf (which is heavily out of date, although the number of differences has recently been reduced considerably).

3.6 Conversion

Issues:
- head selection
- multi-word units
- discourse units

The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a 1-1 mapping between the DDT positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
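
To make the three tag columns concrete, here is a minimal Python sketch that reads one line of a CoNLL 2006 file. It assumes the usual tab-separated CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL) and features separated by a vertical bar. The `Token` type and the function names are hypothetical, not part of any distributed tool.

```python
from typing import Dict, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    cpos: str
    pos: str
    feats: Dict[str, str]
    head: int
    deprel: str


def parse_feats(feat_column: str) -> Dict[str, str]:
    """Turn 'a=1|b=2' into {'a': '1', 'b': '2'}; '_' means no features."""
    if feat_column == "_":
        return {}
    feats = {}
    for item in feat_column.split("|"):
        # A feature without '=' is kept as a name with an empty value.
        name, _, value = item.partition("=")
        feats[name] = value
    return feats


def parse_line(line: str) -> Token:
    """Parse one tab-separated token line (assumed CoNLL-X column order)."""
    cols = line.rstrip("\n").split("\t")
    return Token(int(cols[0]), cols[1], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


sample = "7\ttabte\t_\tV\tVA\tmood=indic|tense=past|voice=active\t1\trel\t_\t_"
token = parse_line(sample)
print(token.cpos, token.pos, token.feats)
```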

The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually.

Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.
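
A collapsed token can be split back into its parts at the underscore. The sketch below is only illustrative: `split_multiword` is a hypothetical helper, and it assumes that an underscore inside a word form always marks a joined multi-word expression, which may not hold for every token.

```python
from typing import List


def split_multiword(form: str) -> List[str]:
    """Split an underscore-joined token into its parts.

    Tokens without an internal underscore (including a bare '_')
    are returned unchanged as a one-element list.
    """
    parts = [part for part in form.split("_") if part]
    return parts if len(parts) > 1 else [form]


print(split_multiword("i_lørdags"))
print(split_multiword("cifre"))
```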

Sample

The first sentence of DDT 1.0 in the DTAG format:

<tei.2>
  <teiHeader type=text>
    <fileDesc>
      <titleStmt>
        <title>Tagged sample of: 'Jeltsins skæbnetime'</title>
      </titleStmt>
      <extent words=158>158 running words</extent>
      <publicationStmt>
         <distributor>PAROLE-DK</distributor>
         <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address>
         <date>1998-06-02</date>
         <availability status=restricted><p>by agreement with distributor</availability>
      </publicationStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <title>Jeltsins skæbnetime</title>
            <author gender=m born=1925>Nikulin, Leon</author>
          </analytic>
          <monogr>
            <imprint><pubPlace>Denmark</pubPlace>
              <publisher>Det Fri Aktuelt</publisher>
              <date>1992-12-01</date>
            </imprint>
          </monogr>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <profileDesc>
      <creation>1992-12-01</creation>
      <langUsage><language>Danish</langUsage>
      <textClass>
        <catRef target="P.M2">
        <catRef target="P.G4.8">
        <catRef target="P.T9.3">
      </textClass>
    </profileDesc>
  </teiHeader>
<text id=AJK>
<body>
<div1 type=main>
<p>
<s>
<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W>
<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W>
<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W>
<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W>
<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W>
<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W>
<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W>
<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W>
<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W>
<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W>
<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W>
<W lemma="," msd="XP" in="-2:pnct" out="">,</W>
<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W>
<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W>
<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W>
<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W>
<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W>
<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W>
<W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W>
<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W>
<W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W>
<W lemma="." msd="XP" in="-12:pnct" out="">.</W>
</s>
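
Comparing this sample with the CoNLL 2006 test sentence of the same text suggests that each entry of the `in` attribute is a relative offset to the head plus an edge label (for example, `in="9:subj"` on the first word corresponds to head 10). The Python sketch below converts such entries to absolute head positions under that assumption. Treating bracketed labels such as `[subj]` as secondary edges to be skipped is also an assumption, and the function names are hypothetical.

```python
from typing import List, Tuple


def incoming_edges(position: int, in_attr: str) -> List[Tuple[int, str]]:
    """Convert an `in` value such as '1:subj|2:[subj]' at a 1-based token
    position into (absolute_head, label) pairs."""
    edges = []
    for entry in filter(None, in_attr.split("|")):
        offset, label = entry.split(":", 1)
        edges.append((position + int(offset), label))
    return edges


def primary_head(position: int, in_attr: str) -> Tuple[int, str]:
    """First edge whose label is not bracketed; (0, 'ROOT') if there is none."""
    for head, label in incoming_edges(position, in_attr):
        if not label.startswith("["):
            return head, label
    return 0, "ROOT"


print(primary_head(1, "9:subj"))             # To
print(primary_head(14, "1:subj|2:[subj]"))   # Rusland
print(primary_head(10, ""))                  # tror
```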

The first sentence of the CoNLL 2006 training data:

1	Samme	_	A	AN	degree=pos|gender=common/neuter|number=sing/plur|case=unmarked|def=def/indef|transcat=unmarked	0	ROOT	_	_
2	cifre	_	N	NC	gender=neuter|number=plur|case=unmarked|def=indef	1	nobj	_	_
3	,	_	X	XP	_	1	pnct	_	_
4	de	_	P	PD	gender=common/neuter|number=plur|case=unmarked|register=unmarked	7	subj	_	_
5	norske	_	A	AN	degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked	4	mod	_	_
6	piger	_	N	NC	gender=common|number=plur|case=unmarked|def=indef	4	nobj	_	_
7	tabte	_	V	VA	mood=indic|tense=past|voice=active	1	rel	_	_
8	med	_	SP	SP	_	7	pobj	_	_
9	i_lørdags	_	RG	RG	degree=unmarked	7	mod	_	_
10	mod	_	SP	SP	_	7	pobj	_	_
11	VMs	_	N	NP	case=gen	10	nobj	_	_
12	værtsnation	_	N	NC	gender=common|number=sing|case=unmarked|def=indef	11	possd	_	_
13	.	_	X	XP	_	1	pnct	_	_

The first sentence of the CoNLL 2006 test data:

1	To	_	A	AC	case=unmarked	10	subj	_	_
2	kendte	_	A	AN	degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked	1	mod	_	_
3	russiske	_	A	AN	degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked	1	mod	_	_
4	historikere	_	N	NC	gender=common|number=plur|case=unmarked|def=indef	1	nobj	_	_
5	Andronik	_	N	NP	case=unmarked	6	namef	_	_
6	Mirganjan	_	N	NP	case=unmarked	1	appr	_	_
7	og	_	C	CC	_	6	coord	_	_
8	Igor	_	N	NP	case=unmarked	9	namef	_	_
9	Klamkin	_	N	NP	case=unmarked	7	conj	_	_
10	tror	_	V	VA	mood=indic|tense=present|voice=active	0	ROOT	_	_
11	ikke	_	RG	RG	degree=unmarked	10	mod	_	_
12	,	_	X	XP	_	10	pnct	_	_
13	at	_	C	CS	_	10	dobj	_	_
14	Rusland	_	N	NP	case=unmarked	15	subj	_	_
15	kan	_	V	VA	mood=indic|tense=present|voice=active	13	vobj	_	_
16	udvikles	_	V	VA	mood=infin|voice=passive	15	vobj	_	_
17	uden	_	SP	SP	_	15	mod	_	_
18	en	_	P	PI	gender=common|number=sing|case=unmarked|register=unmarked	17	nobj	_	_
19	“	_	X	XP	_	20	pnct	_	_
20	jernnæve	_	N	NC	gender=common|number=sing|case=unmarked|def=indef	18	nobj	_	_
21	”	_	X	XP	_	20	pnct	_	_
22	.	_	X	XP	_	10	pnct	_	_

Parsing

Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).
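
For readers who want to reproduce such a count, the following Python sketch shows one common way to test for nonprojective attachment: a token is counted when some token lying between it and its head is not a descendant of that head. This is an assumed definition, and it is not necessarily the exact procedure used to obtain the figure above; the function names are hypothetical. The example heads are taken from the HEAD column of the CoNLL 2006 test sentence shown in the Sample section.

```python
from typing import List


def is_descendant(heads: List[int], node: int, ancestor: int) -> bool:
    """True if `ancestor` lies on the head chain of `node`.

    heads[i] is the head of token i+1; 0 is the artificial root.
    """
    seen = set()
    while node != 0 and node not in seen:  # `seen` guards against cycles
        seen.add(node)
        node = heads[node - 1]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return 1-based IDs of tokens whose incoming arc is nonprojective."""
    result = []
    for dep, head in enumerate(heads, start=1):
        low, high = sorted((dep, head))
        # Every token strictly between head and dependent must descend from head.
        if any(not is_descendant(heads, k, head) for k in range(low + 1, high)):
            result.append(dep)
    return result


test_sentence_heads = [10, 1, 1, 1, 6, 1, 6, 9, 7, 0, 10, 10, 10,
                       15, 13, 15, 15, 17, 20, 18, 20, 10]
print(nonprojective_tokens(test_sentence_heads))

# Share of nonprojectively attached tokens, from the counts quoted above.
share_percent = round(100 * 988 / 100238, 2)
print(share_percent)
```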

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:

Parser (Authors)	LAS	UAS
MST (McDonald et al.)	84.79	90.58
Malt (Nivre et al.)	84.77	89.80
Riedel et al.	83.63	89.66
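
The two metrics can be sketched in Python as follows. LAS counts tokens with both the correct head and the correct label, UAS counts tokens with the correct head, and punctuation tokens are skipped. The sketch assumes that a token counts as punctuation when all of its characters belong to a Unicode punctuation category; it is an illustration with hypothetical function names, not the official evaluation script.

```python
import unicodedata
from typing import List, Tuple

Arc = Tuple[int, str]  # (head, dependency label)


def is_punctuation(form: str) -> bool:
    """Assumed criterion: every character is in a Unicode 'P*' category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(forms: List[str], gold: List[Arc],
                      predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over non-punctuation tokens."""
    scored = las_hits = uas_hits = 0
    for form, (g_head, g_rel), (p_head, p_rel) in zip(forms, gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


forms = ["tror", "ikke", ","]
gold = [(0, "ROOT"), (1, "mod"), (1, "pnct")]
predicted = [(0, "ROOT"), (1, "dobj"), (2, "pnct")]
las, uas = attachment_scores(forms, gold, predicted)
print(las, uas)
```

In this toy input the comma is ignored, so the wrong head predicted for it does not lower either score.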
