
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


user:zeman:treebanks:ro [2012/01/12 17:17]
zeman Sample.
user:zeman:treebanks:ro [2012/01/12 17:29] (current)
zeman Inside and parsing.
Line 49: Line 49:
 ==== Inside ====
  
-The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press, November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).
+Sentences have been segmented into clauses and there is a separate tree for each clause. There are no punctuation nodes, punctuation has been removed. The text lacks diacritical marks, i.e. the Romanian letters //ă, â, î, ş, ţ// have been replaced by //a, a, i, s, t// respectively.
  
-Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''CPOS'', ''POS'' and ''FEAT'' columns of the CoNLL format. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=pt::conll|DZ Interset]] to inspect the CoNLL tagset.
-
-Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton").
+There are part-of-speech tags but no lemmas and no morphological features (gender, number, case etc.) The part-of-speech tags were probably assigned manually, as well as the syntactic structure.
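The diacritic replacement described in the new revision (//ă, â, î, ş, ţ// to //a, a, i, s, t//) can be illustrated with a minimal sketch. This is not the tool that produced the corpus; the function name ''strip_romanian_diacritics'' is hypothetical, and the uppercase and comma-below variants are assumptions added for completeness. It may be useful for normalizing other Romanian text so that it matches the treebank's spelling.

```python
# Lowercase cedilla forms follow the mapping stated in the text.
# Uppercase and comma-below forms (ș, ț) are assumptions, not from the source.
STRIP_MAP = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ş": "s", "ţ": "t",
    "Ă": "A", "Â": "A", "Î": "I", "Ş": "S", "Ţ": "T",
    "ș": "s", "ț": "t", "Ș": "S", "Ț": "T",
})


def strip_romanian_diacritics(text: str) -> str:
    """Replace Romanian letters with diacritics by their plain ASCII base letters."""
    return text.translate(STRIP_MAP)


if __name__ == "__main__":
    print(strip_romanian_diacritics("şi ţară în câmp"))
```

Note that the mapping is lossy: //ă// and //â// both become //a//, so the original spelling cannot be recovered from the stripped text without a lexicon.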
  
 ==== Sample ====
Line 291: Line 289:
 ==== Parsing ====
  
-Bosque is a mildly nonprojective treebank. 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%).
-
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Portuguese:
-
-^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 86.82 | 91.36 |
-| Malt (Nivre et al.) | 87.60 | 91.22 |
-| Nara (Yuchang Cheng) | 85.07 | 90.30 |
+The corpus is projective.
  
+I am not aware of any published evaluation of parsing accuracy on this data.
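A claim such as "the corpus is projective" can be checked mechanically. The following is a minimal sketch, assuming each tree is given as a list of 1-based head indices with 0 for the artificial root (as in the ''HEAD'' column of the CoNLL format) and that the input is a well-formed tree without cycles. The function name ''nonprojective_tokens'' is hypothetical. A token counts as nonprojectively attached if some token lying between it and its head is not dominated by that head.

```python
def nonprojective_tokens(heads):
    """Return the 1-based indices of tokens attached nonprojectively.

    heads[i] is the head of token i+1; 0 denotes the artificial root.
    Assumes a well-formed tree (every head chain ends at the root).
    """

    def is_dominated_by(node, ancestor):
        # Walk up the head chain; the artificial root dominates every token.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # The arc is projective only if every token strictly between
        # the dependent and its head is a descendant of the head.
        if any(not is_dominated_by(mid, head) for mid in range(lo + 1, hi)):
            result.append(dep)
    return result


if __name__ == "__main__":
    print(nonprojective_tokens([2, 0, 2]))     # projective tree
    print(nonprojective_tokens([3, 4, 0, 3]))  # token 2 crosses the root token
```

Summing ''len(nonprojective_tokens(heads))'' over all trees and dividing by the total token count gives a percentage of nonprojectively attached tokens; a projective corpus yields zero.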
