
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:nl [2012/01/10 12:11]
zeman Parsing.
user:zeman:treebanks:nl [2012/01/11 11:32]
zeman Typo.
Line 44: Line 44:
  
 In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file ''tagset.txt''. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file ''syn_prot.pdf'' (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file ''diff.pdf'' (which is heavily out of date, but the number of differences has been reduced). Conversion issues: head selection, multi-word units, discourse units.
 +
 +Multi-word expressions have been concatenated into one token, using the underscore as the joining character (e.g. "Economische_en_Monetaire_Unie"). They have the special part-of-speech tag ''MWU''; their sub-part-of-speech tags and features may describe the individual parts of the unit. E.g. "aan_het" has CPOS ''MWU'', (sub)POS ''Prep_Art'' and features ''voor_bep|onzijd|neut''.
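The underscore convention described above can be undone mechanically. The following is a minimal sketch, not part of the treebank tools: the function name ''split_mwu'' is hypothetical, and it assumes that the underscore separates the per-word parts in the (sub)POS and feature columns in the same way as in the word form. Where the parts do not line up with the words, the sketch returns ''None'' for that annotation instead of guessing.

```python
def split_mwu(form, cpos, pos, feats):
    """Split an underscore-joined multi-word unit into (word, pos, feats) triples.

    Assumption (not confirmed by the treebank documentation): the parts of
    POS and FEATS are joined by "_" in the same order as the words in FORM.
    """
    if cpos != "MWU":
        # Ordinary token: return it unchanged as a single triple.
        return [(form, pos, feats)]
    words = form.split("_")
    tags = pos.split("_")
    feat_parts = feats.split("_")
    # Fall back to None when the annotation does not align with the words.
    if len(tags) != len(words):
        tags = [None] * len(words)
    if len(feat_parts) != len(words):
        feat_parts = [None] * len(words)
    return list(zip(words, tags, feat_parts))


# The example token from the text above.
parts = split_mwu("aan_het", "MWU", "Prep_Art", "voor_bep|onzijd|neut")
```

For "aan_het" this pairs "aan" with ''Prep'' and ''voor'', and "het" with ''Art'' and ''bep|onzijd|neut''. Whether that pairing is intended for every ''MWU'' token would have to be checked against the data.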
  
 ==== Sample ====
Line 109: Line 111:
 Nonprojective attachments are quite frequent in Alpino: 10,858 of the 200,654 tokens in the CoNLL 2006 version are attached nonprojectively (5.41%).
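A count like the one above can be reproduced with a short script. The sketch below uses one common definition, which may differ in detail from the one behind the reported figure: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name ''nonprojective_tokens'' and the toy sentence are hypothetical; heads are given as in the HEAD column of the CoNLL format, with 0 for the artificial root.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens whose incoming arc is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    def dominated_by(node, ancestor):
        # Walk up from node towards the root; the step limit guards
        # against malformed input that contains a cycle.
        steps = 0
        while node != 0 and steps <= len(heads):
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return node == ancestor

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = min(dep, head), max(dep, head)
        # The arc is nonprojective if any token strictly between the
        # dependent and its head is not a descendant of the head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy 4-token tree: token 1 depends on token 3, but token 2 (the root's
# child) lies in between and is not dominated by token 3.
toy = nonprojective_tokens([3, 0, 2, 2])

# The share reported in the text: 10,858 of 200,654 tokens.
share = round(100 * 10858 / 200654, 2)
```

Applied sentence by sentence to the HEAD column, the summed lengths of the returned lists divided by the total token count give the nonprojective share under this definition.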
  
 -The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Dutch:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^
