
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:pt [2012/01/11 11:18]
zeman Domain and size.
user:zeman:treebanks:pt [2012/01/11 11:34] (current)
zeman Parsing results.
Line 46: Line 46:
 ==== Inside ==== ==== Inside ====
  
-Texts from Portugal and Brasil.
+The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).
  
-The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised).
+Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''CPOS'', ''POS'' and ''FEAT'' columns of the CoNLL format. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=pt::conll|DZ Interset]] to inspect the CoNLL tagset.
  
-In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file ''tagset.txt''. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file ''syn_prot.pdf'' (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file ''diff.pdf'' (which is heavily out of date, but the number of differences has been reduced). Conversion issues: head selection, multi-word units, discourse units.
+Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton").
- +
-Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "Economische_en_Monetaire_Unie"). They have special part-of-speech tags ''MWU''; their subparts of speech and features may describe the individual parts of the unit. E.g. "aan_het" has CPOS ''MWU'', (sub)POS ''Prep_Art'' and features ''voor_bep|onzijd|neut''.
+
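Since the parts of a multi-word expression are joined with underscores, the original words can be recovered by splitting on that character. The following is a minimal Python sketch. It assumes that an underscore inside a token always marks a join, which this page does not state explicitly, and ''split_mwe'' is a hypothetical helper name, not part of any treebank tool.

```python
def split_mwe(form):
    """Split a concatenated multi-word token into its parts.

    Assumption: an underscore inside a token always marks a join.
    A lone "_" is the CoNLL placeholder for an empty field and is kept as is.
    """
    if form == "_":
        return [form]
    return form.split("_")


print(split_mwe("7_e_Meio"))         # ['7', 'e', 'Meio']
print(split_mwe("Hillary_Clinton"))  # ['Hillary', 'Clinton']
print(split_mwe("noite"))            # ['noite']
```

Tokens without an underscore come back as a one-element list, so the helper can be applied to every FORM value without a prior check.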
  
 ==== Sample ==== ==== Sample ====
Line 58: Line 56:
 The first two sentences of the CoNLL 2006 training data: The first two sentences of the CoNLL 2006 training data:
  
-| 1 | Cathy Cathy | <nowiki>eigen|ev|neut</nowiki> | su | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 1 | Um um art art | <nowiki><arti>|M|S</nowiki> | 2 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 2 | zag | zie | V | V | <nowiki>trans|ovt|1of2of3|ev</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 2 | revivalismo revivalismo | <nowiki>M|S</nowiki>UTT | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 3 | hen | hen | Pron | Pron | <nowiki>per|3|mv|datofacc</nowiki> | 2 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki>+refrescante refrescante adj adj | <nowiki>M|S</nowiki> | 2 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 4 | wild wild Adj Adj | <nowiki>attr|stell|onverv</nowiki>mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+
-zwaaien zwaai | <nowiki>soort|mv|neut</nowiki> | 2 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> +
-| 6 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | punt | 5 | punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> |+
 | |||||||||| | ||||||||||
-| 1 | Ze ze Pron Pron | <nowiki>per|3|evofmv|nom</nowiki> | 2 | su | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 1 | art art | <nowiki><artd>|M|S</nowiki> | 2 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 2 | had | heb | V | V | <nowiki>trans|ovt|1of2of3|ev</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> +| 2 | <nowiki>7_e_Meio</nowiki> | <nowiki>7_e_Meio</nowiki> | prop | prop | <nowiki>M|S</nowiki> | 3 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 3 | met | met | Prep | Prep | voor | 8 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+é ser <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki>STA | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-haar haar Pron Pron | <nowiki>bez|3|ev|neut|attr</nowiki>det | <nowiki>_</nowiki> | <nowiki>_</nowiki>+um um art art | <nowiki><arti>|M|S</nowiki>| <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | 
-moeder moeder | <nowiki>soort|ev|neut</nowiki>3 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> +| <nowiki>ex-libris</nowiki> | <nowiki>ex-libris</nowiki>| <nowiki>M|P</nowiki>SC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 6 | kunnen | kan | V | V | <nowiki>hulp|ott|1of2of3|mv</nowiki>vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +de de prp prp | <nowiki><sam-></nowiki><nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-7 | gaan | ga | V | V | <nowiki>hulp|inf</nowiki>vc | <nowiki>_</nowiki> | <nowiki>_</nowiki>+7 | a | o | art | art | <nowiki><-sam>|<artd>|S</nowiki> | 8 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-winkelen winkel | <nowiki>intrans|inf</nowiki>11 cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki>+noite noite | <nowiki>F|S</nowiki>| <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| <nowiki>,</nowiki> | <nowiki>,</nowiki> | Punc | Punc | komma | 8 | punct | <nowiki>_</nowiki> | <nowiki>_</nowiki>+algarvia algarvio adj adj | <nowiki>F|S</nowiki><nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-10 zwemmen zwem | <nowiki>intrans|inf</nowiki>11 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> +10 | <nowiki>.</nowiki> | <nowiki>.</nowiki>punc punc <nowiki>_</nowiki> PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 11 | of | of | Conj | Conj | neven | 7 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki>+
-12 terrassen terras | <nowiki>soort|mv|neut</nowiki>11 cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki>+
-13 | <nowiki>.</nowiki> | <nowiki>.</nowiki>Punc Punc punt 12 punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> |+
  
 The first two sentences of the CoNLL 2006 test data: The first two sentences of the CoNLL 2006 test data:
  
-| 1 | BASISTAKENPAKKET | <nowiki>basis_taken_pakket</nowiki>Prep Prep voor ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 1 | É | é | adv | adv | <nowiki><foc></nowiki>FOC <nowiki>_</nowiki> <nowiki>_</nowiki>
-| 2 | JEUGDGEZONDHEIDSZORG | <nowiki>jeugd_gezondheid_zorg</nowiki>| <nowiki>eigen|ev|neut</nowiki>ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 2 | por | por | prp prp | <nowiki>_</nowiki> | 9 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| <nowiki>0-19</nowiki> | <nowiki>0-19</nowiki>Num Num | <nowiki>hoofd|bep|attr|onverv</nowiki>det | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 3 | isso | isso | pron | <nowiki>pron-indp</nowiki> | <nowiki><dem>|M|S</nowiki> | 2 | <nowiki>P<</nowiki> | <nowiki>_</nowiki><nowiki>_</nowiki> | 
-JAAR JAAR | N | <nowiki>eigen|ev|neut</nowiki>ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |+| 4 | que | que | adv | adv | <nowiki><foc></nowiki> 9 | FOC <nowiki>_</nowiki><nowiki>_</nowiki> | 
 +| 5 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +6 | explica | explicar | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki>| STA | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 7 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 8 | não | não | adv | adv | <nowiki>_</nowiki> | 9 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 9 | tem | ter | v | <nowiki>v-fin</nowiki><nowiki>PR|3S|IND</nowiki> | ACC <nowiki>_</nowiki><nowiki>_</nowiki> | 
 +| 10 | pena | pena | n | n | <nowiki>F|S</nowiki> | 9 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +11 de de prp prp | <nowiki>_</nowiki> | 10 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> <nowiki>_</nowiki> | 
 +| 12 | <nowiki>Hillary_Clinton</nowiki><nowiki>Hillary_Clinton</nowiki> | prop | prop | <nowiki>F|S</nowiki> | 11 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
 +| 13 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
 | |||||||||| | ||||||||||
-| 1 | Daarvoor daarvoor Adv Adv | <nowiki>pron|aanw</nowiki>pc | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 1 | <nowiki>«</nowiki> <nowiki>«</nowiki> punc punc | <nowiki>_</nowiki>PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 2 | is ben | <nowiki>hulpofkopp|ott|3|ev</nowiki>ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 2 | Eles ele pron <nowiki>pron-pers</nowiki> | <nowiki>M|3P|NOM</nowiki>SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 3 | gekozen kies | <nowiki>trans|verldw|onverv</nowiki>vc | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 3 | <nowiki>[</nowiki> <nowiki>[</nowiki> punc punc | <nowiki>_</nowiki>PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 4 | omdat omdat Conj Conj | <nowiki>onder|metfin</nowiki>mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 4 | Hillary Hillary prop prop | <nowiki>F|S</nowiki>APP | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 5 | gemeenten gemeente N | N | <nowiki>soort|mv|neut</nowiki>11 su | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 5 | conj | <nowiki>conj-c</nowiki> <nowiki><co-app></nowiki>CO | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 6 | bij bij Prep Prep voor 12 mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 6 | <nowiki>Bill_Clinton</nowiki> <nowiki>Bill_Clinton</nowiki> prop prop <nowiki>M|S</nowiki> | 4 CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 7 | uitstek uitstek | <nowiki>soort|ev|neut</nowiki>obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 7 | <nowiki>]</nowiki> <nowiki>]</nowiki> punc punc | <nowiki>_</nowiki>PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 8 | het het Art Art | <nowiki>bep|onzijd|neut</nowiki>10 det | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 8 | podem poder <nowiki>v-fin</nowiki> | <nowiki>PR|3P|IND</nowiki>QUE | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 9 | lokale lokaal Adj | Adj | <nowiki>attr|stell|vervneut</nowiki>10 mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 9 | ter ter | <nowiki>v-inf</nowiki> <nowiki>_</nowiki>MV | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 10 | gezondheidsbeleid | <nowiki>gezondheid_beleid</nowiki> | N | N | <nowiki>soort|ev|neut</nowiki>12 obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 10 | alguma | algum | pron | <nowiki>pron-det</nowiki> | <nowiki><quant>|F|S</nowiki>11 <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 11 | kunnen kan | <nowiki>hulp|inf</nowiki>body | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 11 | espécie espécie | <nowiki>F|S</nowiki>ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 12 | toespitsen | <nowiki>spits_toe</nowiki>V | V | <nowiki>refl|inf</nowiki> | 11 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 12 | de | de | prp | prp | <nowiki>_</nowiki>11 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 13 | op op Prep Prep voor | 12 | pc | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 13 | acordo acordo <nowiki>M|S</nowiki> | 12 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 14 | de de Art Art | <nowiki>bep|zijdofmv|neut</nowiki>16 det | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 14 | conj <nowiki>conj-c</nowiki> | <nowiki><co-vfin>|<co-fmc></nowiki>CO | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 15 | specifieke specifiek Adj Adj | <nowiki>attr|stell|vervneut</nowiki> | 16 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 15 | quem quem pron <nowiki>pron-indp</nowiki> | <nowiki><interr>|M/F|P</nowiki> | 16 | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 16 | gezondheidssituatie | <nowiki>gezondheid_situatie</nowiki> | N | N | <nowiki>soort|ev|neut</nowiki>17 cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 16 | somos | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|1P|IND</nowiki>CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 17 | en en Conj Conj neven 13 obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 17 | nós nós pron <nowiki>pron-pers</nowiki> <nowiki>M/F|1P|NOM</nowiki> | 16 SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 18 | zorgbehoeften <nowiki>zorg_behoefte</nowiki> | <nowiki>soort|mv|neut</nowiki>17 cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki>+| 18 | para para prp prp | <nowiki>_</nowiki>16 ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 19 | van van Prep | Prep | voor | 16 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +| 19 | dizer dizer | <nowiki>v-inf</nowiki> | <nowiki>_</nowiki>18 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-| 20 | kinderen | kind | N | N | <nowiki>soort|mv|neut</nowiki> | 21 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki>+20 se se conj | <nowiki>conj-s</nowiki> | <nowiki>_</nowiki>21 SUB | <nowiki>_</nowiki> <nowiki>_</nowiki> 
-21 en en Conj | Conj | neven | 19 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +| 21 | é | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki>19 ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-22 | jongeren | jongere | Adj | Adj | <nowiki>zelfst|vergr|vervneut</nowiki> | 21 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +22 bom bom adj adj | <nowiki>M|S</nowiki>21 SC | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-23 | in | in | Prep | Prep | voor | 20 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+23 ou ou conj | <nowiki>conj-c</nowiki> <nowiki><co-sc></nowiki>22 CO | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-24 de de Art Art | <nowiki>bep|zijdofmv|neut</nowiki>26 det | <nowiki>_</nowiki> | <nowiki>_</nowiki>+24 mau mau adj adj | <nowiki>M|S</nowiki>22 CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki>
-25 eigen eigen Pron | Pron | <nowiki>aanw|neut|attr|weigen</nowiki>26 mod | <nowiki>_</nowiki> | <nowiki>_</nowiki>+25 | <nowiki>?</nowiki> | <nowiki>?</nowiki>punc punc <nowiki>_</nowiki> PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-26 gemeente gemeente | <nowiki>soort|ev|neut</nowiki>23 obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki>+
-27 | <nowiki>.</nowiki> | <nowiki>.</nowiki>Punc Punc punt 26 punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> |+
  
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in Alpino are quite frequent: 10858 of the 200,654 tokens in the CoNLL 2006 version are attached nonprojectively (5.41%).
+Bosque is a mildly nonprojective treebank: 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%).
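A count like the one above can be reproduced from the HEAD column of each sentence. The sketch below uses one common definition, which is an assumption here because the page does not say which definition was used: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The function name ''nonprojective_tokens'' and the toy sentence are illustrative only.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens that are attached nonprojectively.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    Assumed definition: the arc dep -> head is nonprojective if a token
    strictly between dep and head is not dominated by head.
    """
    def dominated_by(node, ancestor):
        # Walk up from node towards the root; stop on a cycle as a safeguard.
        seen = set()
        while node != 0 and node not in seen:
            if node == ancestor:
                return True
            seen.add(node)
            node = heads[node - 1]
        return node == ancestor  # True only when ancestor is the root (0)

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        if any(not dominated_by(mid, head) for mid in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy 4-token sentence: token 2 hangs on token 4, across token 3 (the root child).
print(nonprojective_tokens([3, 4, 0, 3]))  # [2]
print(nonprojective_tokens([2, 0, 2]))     # []

# The percentage quoted for Bosque follows from the two counts:
print(round(100 * 2778 / 212545, 2))       # 1.31
```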
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:
+The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Portuguese:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^ ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 79.19 | 83.57 |
-| Riedel et al. | 78.59 | 82.91 |
-| Basis (John O'Neil) | 77.51 | 81.73 |
-| Malt (Nivre et al.) | 78.59 | 81.35 |
+| MST (McDonald et al.) | 86.82 | 91.36 |
+| Malt (Nivre et al.) | 87.60 | 91.22 |
+| Nara (Yuchang Cheng) | 85.07 | 90.30 |
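To illustrate what excluding punctuation tokens means for the LAS and UAS figures, here is a rough Python sketch of the two scores. It is not the official shared-task evaluation script. The punctuation test (every character of the form belongs to a Unicode punctuation category) is an assumption, and ''attachment_scores'' is a hypothetical name.

```python
import unicodedata


def is_punct(form):
    """Assumed test: all characters are in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, pred):
    """Return (LAS, UAS) in percent, skipping punctuation tokens.

    gold and pred are parallel lists of (form, head, deprel) triples.
    """
    scored = las = uas = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, pred):
        if is_punct(form):
            continue  # punctuation does not count towards either score
        scored += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * las / scored, 100.0 * uas / scored


# Gold triples follow the test sample above; the predictions are made up.
gold = [("pena", 9, "ACC"), ("de", 10, "N<"), ("Hillary_Clinton", 11, "P<"), (".", 6, "PUNC")]
pred = [("pena", 9, "ACC"), ("de", 9, "N<"), ("Hillary_Clinton", 11, "P<"), (".", 1, "PUNC")]
las, uas = attachment_scores(gold, pred)
print(round(las, 2), round(uas, 2))  # 66.67 66.67
```

The wrongly attached full stop in the predictions does not lower either score, which is the effect of leaving punctuation out of the evaluation.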
  
