
Institute of Formal and Applied Linguistics Wiki



user:zeman:treebanks:pt [2012/01/11 11:34] (zeman)
==== Inside ====
  
The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred to here, was fully revised; the other parts of the Floresta sintáctica project were only partially revised or not revised at all).
  
-The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard BickThe Parsing System "​Palavras"​Automatic Grammatical Analysis of Portuguese in a Constraint Grammar FrameworkDr.philthesisAarhus UniversityAarhus, DenmarkAarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).+Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''​CPOS'',​ ''​POS''​ and ''​FEAT''​ columns of the CoNLL formatUse [[http://quest.ms.mff.cuni.cz/​cgi-bin/​interset/​index.pl?​tagset=pt::conll|DZ Interset]] to inspect ​the CoNLL tagset.
  
Multi-word expressions have been concatenated into one token, using the underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton").
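The underscore convention is easy to undo when the individual parts of a unit are needed; a sketch in plain Python (the helper names are mine, not part of any Bosque tool):

```python
# Join and split multi-word expressions using the underscore convention
# described above.

def join_mwe(parts):
    """Concatenate the parts of a multi-word expression into one token."""
    return "_".join(parts)

def split_mwe(token):
    """Recover the individual parts of an underscore-joined token."""
    return token.split("_")

print(join_mwe(["Hillary", "Clinton"]))  # Hillary_Clinton
print(split_mwe("7_e_Meio"))             # ['7', 'e', 'Meio']
```

Note that the encoding is lossy in principle: a genuine underscore inside an original token could not be distinguished from a joining underscore.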
  
==== Sample ====

==== Parsing ====
  
Bosque is a mildly nonprojective treebank: 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%).
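A count like the one above can be reproduced by checking each arc for projectivity: an arc from head h to dependent d is projective iff every token strictly between them is dominated by h. A sketch, assuming a well-formed tree given as a 1-based head list (0 = artificial root; function names are mine):

```python
def nonprojectively_attached(heads):
    """Return 1-based indices of tokens whose incoming arc is nonprojective.

    heads[i] is the head (1-based; 0 = artificial root) of token i+1.
    Assumes the input is a well-formed tree (no cycles).
    """
    n = len(heads)

    def dominated_by(k, h):
        # Walk from k towards the root; True if h is on the path.
        while k != 0:
            k = heads[k - 1]
            if k == h:
                return True
        return False

    bad = []
    for d in range(1, n + 1):
        h = heads[d - 1]
        lo, hi = sorted((h, d))
        # Arc h -> d is nonprojective if some token strictly between
        # them is not a descendant of h.
        if any(not dominated_by(k, h) for k in range(lo + 1, hi)):
            bad.append(d)
    return bad

print(nonprojectively_attached([2, 0, 1, 2]))  # [3]: token 3 hangs on
                                               # token 1 across token 2
```

Running this over all sentences and dividing the number of flagged tokens by the total token count gives the percentage quoted above.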
  
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They were published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard in that it excluded punctuation tokens. These are the best results for Portuguese:
  
^ Parser (Authors) ^ LAS ^ UAS ^
| MST (McDonald et al.) | 86.82 | 91.36 |
| Malt (Nivre et al.) | 87.60 | 91.22 |
| Nara (Yuchang Cheng) | 85.07 | 90.30 |
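The scoring used above (labeled and unlabeled attachment accuracy over non-punctuation tokens) can be sketched as follows. The token triples and the all-punctuation test are simplifying assumptions of mine; this is not the official CoNLL ''eval.pl'' script:

```python
# Sketch of CoNLL-2006-style scoring: LAS and UAS computed over
# non-punctuation tokens only. Tokens are (form, head, deprel) triples.
import string

def attachment_scores(gold, pred):
    """Return (LAS, UAS) in percent, skipping punctuation-only forms."""
    las = uas = total = 0
    for (form, ghead, gdep), (_, phead, pdep) in zip(gold, pred):
        if form and all(ch in string.punctuation for ch in form):
            continue  # the 2006 evaluation excluded punctuation tokens
        total += 1
        if ghead == phead:
            uas += 1
            if gdep == pdep:
                las += 1
    return 100.0 * las / total, 100.0 * uas / total

# Invented 4-token example: one wrong label, final period ignored.
gold = [("O", 2, "DET"), ("corpus", 3, "SUBJ"),
        ("existe", 0, "ROOT"), (".", 3, "PUNC")]
pred = [("O", 2, "DET"), ("corpus", 3, "OBJ"),
        ("existe", 0, "ROOT"), (".", 2, "PUNC")]
las, uas = attachment_scores(gold, pred)
print(round(las, 2), uas)  # 66.67 100.0
```

Because punctuation is excluded, the misattached final period does not hurt either score here; only the mislabeled ''SUBJ''/''OBJ'' arc lowers the LAS.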
  
