Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:ro [2012/01/12 17:17]
zeman Sample.
+++ user:zeman:treebanks:ro [2012/01/12 17:29] (current)
zeman Inside and parsing.
@@ Line 49: / Line 49: @@
 ==== Inside ====
-The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).
+Sentences have been segmented into clauses, and there is a separate tree for each clause. There are no punctuation nodes; punctuation has been removed. The text lacks diacritical marks, i.e. the Romanian letters //ă, â, î, ş, ţ// have been replaced by //a, a, i, s, t//, respectively.
-Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''CPOS'', ''POS'' and ''FEAT'' columns of the CoNLL format. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=pt::conll|DZ Interset]] to inspect the CoNLL tagset.
+There are part-of-speech tags but no lemmas and no morphological features (gender, number, case etc.). Both the part-of-speech tags and the syntactic structure were probably assigned manually.
-Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton").
 ==== Sample ====
@@ Line 291: / Line 289: @@
 ==== Parsing ====
-Bosque is a mildly nonprojective treebank. 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%).
+The corpus is projective.
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Portuguese:
-^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 86.82 | 91.36 |
-| Malt (Nivre et al.) | 87.60 | 91.22 |
-| Nara (Yuchang Cheng) | 85.07 | 90.30 |
+I am not aware of any published evaluation of parsing accuracy on this data.
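
The claim in the new Parsing text that the corpus is projective can be checked mechanically. The following is a minimal sketch, not the procedure used by the page author: it assumes each clause tree is available as a list of head indices (1-based token positions, 0 for the artificial root), and the function name `nonprojective_tokens` is hypothetical. A token counts as nonprojectively attached if some token lying strictly between it and its head is not a descendant of that head.

```python
def nonprojective_tokens(heads):
    """Return the 1-based positions of tokens attached nonprojectively.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    (Assumed input format; the treebank's actual file format may differ.)
    """
    n = len(heads)

    def is_ancestor(anc, node):
        # Walk from node up towards the root; stop early on malformed cycles.
        steps = 0
        while node != 0 and steps <= n:
            if node == anc:
                return True
            node = heads[node - 1]
            steps += 1
        # The artificial root dominates every token.
        return anc == 0

    bad = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # Every token inside the arc's span must be dominated by the head.
        if any(not is_ancestor(head, k) for k in range(lo + 1, hi)):
            bad.append(dep)
    return bad


# A projective three-token tree and a four-token tree with one crossing arc.
print(nonprojective_tokens([2, 0, 2]))
print(nonprojective_tokens([3, 0, 2, 2]))
```

Because this treebank has a separate tree for each clause, the check would be run once per clause tree; a projective corpus yields an empty list for every tree.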

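The letter replacement described in the new Inside text (//ă, â, î, ş, ţ// becoming //a, a, i, s, t//) can be reproduced on other Romanian text, for example to make it comparable with this corpus. This is a sketch under assumptions: the uppercase letters and the comma-below variants of ş and ţ are included here for convenience and are not mentioned on the page, and the function name `strip_romanian_diacritics` is hypothetical.

```python
# Character-by-character mapping; str.translate leaves all other characters alone.
ROMANIAN_TO_ASCII = str.maketrans({
    "ă": "a", "â": "a", "î": "i",
    "ş": "s", "ţ": "t",  # cedilla forms, as listed in the text
    "ș": "s", "ț": "t",  # comma-below forms (assumption: handled the same way)
    "Ă": "A", "Â": "A", "Î": "I",  # uppercase letters (assumption)
    "Ş": "S", "Ţ": "T",
    "Ș": "S", "Ț": "T",
})


def strip_romanian_diacritics(text):
    """Replace Romanian letters with diacritics by their plain ASCII base letters."""
    return text.translate(ROMANIAN_TO_ASCII)


print(strip_romanian_diacritics("învăţământ"))
```

Note that the mapping loses information: both //ă// and //â// become //a//, so the original spelling cannot be restored from the stripped text without a lexicon.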

Institute of Formal and Applied Linguistics Wiki
