Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:pt [2012/01/11 11:22]
zeman Sample.
user:zeman:treebanks:pt [2012/01/11 11:28]
zeman Inside.
Line 46:
 ==== Inside ====
  
-Texts from Portugal and Brasil.
+The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).
  
-The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).
+Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''CPOS'', ''POS'' and ''FEAT'' columns of the CoNLL format. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=pt::conll|DZ Interset]] to inspect the CoNLL tagset.
  
-In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file ''tagset.txt''. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file ''syn_prot.pdf'' (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file ''diff.pdf'' (which is heavily out of date, but the number of differences has been reduced). Conversion issues: head selection, multi-word units, discourse units.
-
-Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "Economische_en_Monetaire_Unie"). They have special part-of-speech tags ''MWU''; their subparts of speech and features may describe the individual parts of the unit. E.g. "aan_het" has CPOS ''MWU'', (sub)POS ''Prep_Art'' and features ''voor_bep|onzijd|neut''.
+Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton").
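The column layout and the underscore convention described in the new revision can be illustrated with a minimal Python sketch. It assumes the standard ten-column, tab-separated CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The names ''Token'' and ''parse_conll_line'' are hypothetical, and the tag, feature and dependency values in the sample line are made up for illustration; they are not taken from the corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    """Morphological fields of one CoNLL-X token line (hypothetical helper type)."""
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: List[str]
    parts: List[str]  # pieces of a multi-word expression, split on "_"


def parse_conll_line(line: str) -> Token:
    """Parse one token line, assuming the ten-column CoNLL-X layout."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
    # Columns 2-6 hold FORM, LEMMA, CPOSTAG, POSTAG and FEATS.
    form, lemma, cpos, pos, feats = cols[1:6]
    # "_" marks an empty feature set; otherwise features are "|"-separated.
    feat_list = [] if feats == "_" else feats.split("|")
    # Undo the underscore concatenation; a bare "_" token stays as it is.
    parts = [p for p in form.split("_") if p] or [form]
    return Token(form, lemma, cpos, pos, feat_list, parts)


if __name__ == "__main__":
    # Illustrative line only: the tag and feature values are invented.
    sample = "1\tHillary_Clinton\tHillary_Clinton\tprop\tprop\tF|S\t0\tROOT\t_\t_"
    token = parse_conll_line(sample)
    print(token.parts)  # ['Hillary', 'Clinton']
    print(token.feats)  # ['F', 'S']
```

Splitting on the underscore recovers the surface words of a multi-word expression, but it cannot distinguish a joining underscore from one that was already present in the original text, so the result should be treated as an approximation.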
  
 ==== Sample ====