
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ro [2012/01/12 17:11]
zeman References, domain and size.
user:zeman:treebanks:ro [2012/01/12 17:29] (current)
zeman Inside and parsing.
Line 49: Line 49:
 ==== Inside ==== ==== Inside ====
  
-The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press, November 2000.) and revised by linguists (the Bosque part, referred here, was totally revised; the other parts of the Floresta sintáctica project were either partially or not at all revised). +Sentences have been segmented into clauses and there is a separate tree for each clause. There are no punctuation nodes, punctuation has been removed. The text lacks diacritical marks, i.e. the Romanian letters //ă, â, î, ş, ţ// have been replaced by //a, a, i, s, t// respectively.
  
-Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''CPOS'', ''POS'' and ''FEAT'' columns of the CoNLL format. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=pt::conll|DZ Interset]] to inspect the CoNLL tagset. +There are part-of-speech tags but no lemmas and no morphological features (gender, number, case etc.). The part-of-speech tags were probably assigned manually, as well as the syntactic structure.
- +
-Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton"). +
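
The diacritic replacement described in the new text can be reproduced when other Romanian text has to be normalized to match this corpus. The following is only a minimal sketch: the name ''fold_diacritics'' is hypothetical, and the uppercase and comma-below (ș, ț) entries are assumptions that go beyond the five letters listed above.

```python
# Mapping for the five letters named in the text, plus assumed extras.
_FOLD = str.maketrans({
    "ă": "a", "â": "a", "î": "i", "ş": "s", "ţ": "t",
    # Assumption: uppercase forms are folded the same way.
    "Ă": "A", "Â": "A", "Î": "I", "Ş": "S", "Ţ": "T",
    # Assumption: comma-below variants are treated like the cedilla ones.
    "ș": "s", "ț": "t", "Ș": "S", "Ț": "T",
})


def fold_diacritics(text: str) -> str:
    """Replace Romanian diacritics with their unaccented base letters."""
    return text.translate(_FOLD)


print(fold_diacritics("reuşit"))  # prints "reusit"
```

Text that contains none of these letters is returned unchanged.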
  
 ==== Sample ==== ==== Sample ====
  
-The first two sentences of the CoNLL 2006 training data:+The first sentence of our training data:
  
-| 1 | Um | um | art | art | <nowiki><arti>|M|S</nowiki> | 2 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<code xml><?xml version="1.0" encoding="utf-8"?>
-| 2 | revivalismo | revivalismo | n | n | <nowiki>M|S</nowiki> | 0 | UTT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<!DOCTYPE DGAdoc SYSTEM "dga.dtd">
-| 3 | refrescante | refrescante | adj | adj | <nowiki>M|S</nowiki> | 2 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<DGAdoc>
-| |||||||||| +<s>
-| 1 | O | o | art | art | <nowiki><artd>|M|S</nowiki> | 2 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   <tok>
-| 2 | <nowiki>7_e_Meio</nowiki> | <nowiki>7_e_Meio</nowiki> | prop | prop | <nowiki>M|S</nowiki> | 3 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <orth>Marii</orth>
-| 3 | é | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 0 | STA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ordno>1</ordno>
-| 4 | um | um | art | art | <nowiki><arti>|M|S</nowiki> | 5 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ctag>adjectiv</ctag>
-| 5 | <nowiki>ex-libris</nowiki> | <nowiki>ex-libris</nowiki> | n | n | <nowiki>M|P</nowiki> | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <syn>
-| 6 | de | de | prp | prp | <nowiki><sam-></nowiki> | 5 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <head>2</head>
-| 7 | a | o | art | art | <nowiki><-sam>|<artd>|S</nowiki> | 8 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <reltype>atribut adj.</reltype>
-| 8 | noite | noite | n | n | <nowiki>F|S</nowiki> | 6 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      </syn>
-| 9 | algarvia | algarvio | adj | adj | <nowiki>F|S</nowiki> | 8 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   </tok>
-| 10 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | punc | punc | <nowiki>_</nowiki> | 3 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   <tok>
 +      <orth>bancheri</orth> 
 +      <ordno>2</ordno> 
 +      <ctag>substantiv</ctag> 
 +      <syn>
 +         <head>6</head> 
 +         <reltype>subiect</reltype> 
 +      </syn> 
 +   </tok> 
 +   <tok> 
 +      <orth>ai</orth>
 +      <ordno>3</ordno> 
 +      <ctag>art. poses.</ctag> 
 +      <syn> 
 +         <head>2</head>
 +         <reltype>atribut subst.</reltype> 
 +      </syn> 
 +   </tok> 
 +   <tok> 
 +      <orth>lumii</orth>
 +      <ordno>4</ordno> 
 +      <ctag>substantiv</ctag> 
 +      <syn> 
 +         <head>3</head> 
 +         <reltype>rel. poses.</reltype>
 +      </syn> 
 +   </tok> 
 +   <tok> 
 +      <orth>au</orth> 
 +      <ordno>5</ordno>
 +      <ctag>verb aux.</ctag> 
 +      <syn> 
 +         <head>6</head> 
 +         <reltype>rel. aux.</reltype>
 +      </syn> 
 +   </tok> 
 +   <tok> 
 +      <orth>reusit</orth>
 +      <ordno>6</ordno> 
 +      <ctag>verb</ctag> 
 +      <syn> 
 +         <head>7</head>
 +         <reltype>predicat</reltype> 
 +      </syn> 
 +   </tok> 
 +</s></code>
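
For further processing, the ''tok'' elements of this format can be read into flat rows, one per token. The following is a minimal sketch using Python's standard ''xml.etree.ElementTree''; the function name ''read_sentences'' is hypothetical and the sample is a compacted copy of the sentence above. Treating a head index outside 1..n as the root (stored as 0) is an assumption, based only on the observation that in the samples on this page the ''predicat'' token points one position past the last token.

```python
import xml.etree.ElementTree as ET

# Compacted copy of the first training sentence (one <tok> per line).
SAMPLE = """<DGAdoc><s>
<tok><orth>Marii</orth><ordno>1</ordno><ctag>adjectiv</ctag><syn><head>2</head><reltype>atribut adj.</reltype></syn></tok>
<tok><orth>bancheri</orth><ordno>2</ordno><ctag>substantiv</ctag><syn><head>6</head><reltype>subiect</reltype></syn></tok>
<tok><orth>ai</orth><ordno>3</ordno><ctag>art. poses.</ctag><syn><head>2</head><reltype>atribut subst.</reltype></syn></tok>
<tok><orth>lumii</orth><ordno>4</ordno><ctag>substantiv</ctag><syn><head>3</head><reltype>rel. poses.</reltype></syn></tok>
<tok><orth>au</orth><ordno>5</ordno><ctag>verb aux.</ctag><syn><head>6</head><reltype>rel. aux.</reltype></syn></tok>
<tok><orth>reusit</orth><ordno>6</ordno><ctag>verb</ctag><syn><head>7</head><reltype>predicat</reltype></syn></tok>
</s></DGAdoc>"""


def read_sentences(xml_text):
    """Yield one list of (ordno, orth, ctag, head, reltype) rows per <s>."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("s"):
        tokens = sentence.findall("tok")
        rows = []
        for tok in tokens:
            head = int(tok.findtext("syn/head"))
            # Assumption: a head outside 1..n marks the root; store it as 0.
            if not 1 <= head <= len(tokens):
                head = 0
            rows.append((
                int(tok.findtext("ordno")),
                tok.findtext("orth").strip(),
                tok.findtext("ctag").strip(),
                head,
                tok.findtext("syn/reltype").strip(),
            ))
        yield rows


# Print the first sentence as tab-separated rows.
for row in next(read_sentences(SAMPLE)):
    print("\t".join(str(field) for field in row))
```

The rows carry no lemma or morphological feature columns because, as noted above, the data does not provide them.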
  
-The first two sentences of the CoNLL 2006 test data:+The first sentence of our test data:
  
-| 1 | É | é | adv | adv | <nowiki><foc></nowiki> | 9 | FOC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<code xml><?xml version="1.0" encoding="utf-8"?>
-| 2 | por | por | prp | prp | <nowiki>_</nowiki> | 9 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<!DOCTYPE DGAdoc SYSTEM "dga.dtd">
-| 3 | isso | isso | pron | <nowiki>pron-indp</nowiki> | <nowiki><dem>|M|S</nowiki> | 2 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<DGAdoc>
-| 4 | que | que | adv | adv | <nowiki><foc></nowiki> | 9 | FOC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +<s>
-| 5 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   <tok>
-| 6 | explica | explicar | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 0 | STA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <orth>Judecatorul</orth>
-| 7 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ordno>1</ordno>
-| 8 | não | não | adv | adv | <nowiki>_</nowiki> | 9 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ctag>substantiv</ctag>
-| 9 | tem | ter | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 6 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <syn>
-| 10 | pena | pena | n | n | <nowiki>F|S</nowiki> | 9 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <head>9</head>
-| 11 | de | de | prp | prp | <nowiki>_</nowiki> | 10 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <reltype>subiect</reltype>
-| 12 | <nowiki>Hillary_Clinton</nowiki> | <nowiki>Hillary_Clinton</nowiki> | prop | prop | <nowiki>F|S</nowiki> | 11 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      </syn>
-| 13 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   </tok>
-| |||||||||| +   <tok>
-| 1 | <nowiki>«</nowiki> | <nowiki>«</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <orth>sindic</orth>
-| 2 | Eles | ele | pron | <nowiki>pron-pers</nowiki> | <nowiki>M|3P|NOM</nowiki> | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ordno>2</ordno>
-| 3 | <nowiki>[</nowiki> | <nowiki>[</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ctag>adjectiv</ctag>
-| 4 | Hillary | Hillary | prop | prop | <nowiki>F|S</nowiki> | 9 | APP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <syn>
-| 5 | e | e | conj | <nowiki>conj-c</nowiki> | <nowiki><co-app></nowiki> | 4 | CO | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <head>1</head>
-| 6 | <nowiki>Bill_Clinton</nowiki> | <nowiki>Bill_Clinton</nowiki> | prop | prop | <nowiki>M|S</nowiki> | 4 | CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <reltype>atribut adj.</reltype>
-| 7 | <nowiki>]</nowiki> | <nowiki>]</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      </syn>
-| 8 | podem | poder | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3P|IND</nowiki> | 0 | QUE | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   </tok>
-| 9 | ter | ter | v | <nowiki>v-inf</nowiki> | <nowiki>_</nowiki> | 8 | MV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   <tok>
-| 10 | alguma | algum | pron | <nowiki>pron-det</nowiki> | <nowiki><quant>|F|S</nowiki> | 11 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <orth>IonBriac</orth>
-| 11 | espécie | espécie | n | n | <nowiki>F|S</nowiki> | 9 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ordno>3</ordno>
-| 12 | de | de | prp | prp | <nowiki>_</nowiki> | 11 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ctag>substantiv</ctag>
-| 13 | acordo | acordo | n | n | <nowiki>M|S</nowiki> | 12 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <syn>
-| 14 | e | e | conj | <nowiki>conj-c</nowiki> | <nowiki><co-vfin>|<co-fmc></nowiki> | 8 | CO | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <head>1</head>
-| 15 | quem | quem | pron | <nowiki>pron-indp</nowiki> | <nowiki><interr>|M/F|P</nowiki> | 16 | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <reltype>atribut subst.</reltype>
-| 16 | somos | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|1P|IND</nowiki> | 8 | CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      </syn>
-| 17 | nós | nós | pron | <nowiki>pron-pers</nowiki> | <nowiki>M/F|1P|NOM</nowiki> | 16 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   </tok>
-| 18 | para | para | prp | prp | <nowiki>_</nowiki> | 16 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +   <tok>
-| 19 | dizer | dizer | v | <nowiki>v-inf</nowiki> | <nowiki>_</nowiki> | 18 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <orth>de</orth>
-| 20 | se | se | conj | <nowiki>conj-s</nowiki> | <nowiki>_</nowiki> | 21 | SUB | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ordno>4</ordno>
-| 21 | é | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 19 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <ctag>prepozitie</ctag>
-| 22 | bom | bom | adj | adj | <nowiki>M|S</nowiki> | 21 | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      <syn>
-| 23 | ou | ou | conj | <nowiki>conj-c</nowiki> | <nowiki><co-sc></nowiki> | 22 | CO | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <head>1</head>
-| 24 | mau | mau | adj | adj | <nowiki>M|S</nowiki> | 22 | CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +         <reltype>atribut subst.</reltype>
-| 25 | <nowiki>?</nowiki> | <nowiki>?</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | +      </syn>
 +   </tok> 
 +   <tok>
 +      <orth>la</orth> 
 +      <ordno>5</ordno> 
 +      <ctag>prepozitie</ctag>
 +      <syn> 
 +         <head>4</head> 
 +         <reltype>rel. prepoz.</reltype> 
 +      </syn> 
 +   </tok>
 +   <tok> 
 +      <orth>Tribunalul</orth> 
 +      <ordno>6</ordno> 
 +      <ctag>substantiv</ctag> 
 +      <syn> 
 +         <head>4</head>
 +         <reltype>rel. prepoz.</reltype> 
 +      </syn> 
 +   </tok> 
 +   <tok> 
 +      <orth>CarasSeverin</orth> 
 +      <ordno>7</ordno>
 +      <ctag>substantiv</ctag> 
 +      <syn> 
 +         <head>6</head> 
 +         <reltype>atribut subst.</reltype> 
 +      </syn> 
 +   </tok>
 +   <tok> 
 +      <orth>a</orth> 
 +      <ordno>8</ordno> 
 +      <ctag>verb aux.</ctag>
 +      <syn> 
 +         <head>9</head> 
 +         <reltype>rel. aux.</reltype> 
 +      </syn> 
 +   </tok>
 +   <tok> 
 +      <orth>decis</orth> 
 +      <ordno>9</ordno>
 +      <ctag>verb</ctag> 
 +      <syn> 
 +         <head>19</head> 
 +         <reltype>predicat</reltype> 
 +      </syn>
 +   </tok> 
 +   <tok> 
 +      <orth>ieri</orth> 
 +      <ordno>10</ordno> 
 +      <ctag>adverb</ctag>
 +      <syn> 
 +         <head>9</head> 
 +         <reltype>complement circumst.</reltype> 
 +      </syn> 
 +   </tok>
 +   <tok> 
 +      <orth>declansarea</orth> 
 +      <ordno>11</ordno> 
 +      <ctag>substantiv</ctag>
 +      <syn> 
 +         <head>9</head> 
 +         <reltype>complement dir.</reltype> 
 +      </syn>
 +   </tok> 
 +   <tok> 
 +      <orth>reorganizarii</orth> 
 +      <ordno>12</ordno> 
 +      <ctag>substantiv</ctag> 
 +      <syn> 
 +         <head>11</head> 
 +         <reltype>atribut subst.</reltype> 
 +      </syn> 
 +   </tok>
 +   <tok> 
 +      <orth>judiciare</orth> 
 +      <ordno>13</ordno> 
 +      <ctag>adjectiv</ctag>
 +      <syn> 
 +         <head>12</head> 
 +         <reltype>atribut adj.</reltype> 
 +      </syn>
 +   </tok> 
 +   <tok> 
 +      <orth>a</orth> 
 +      <ordno>14</ordno> 
 +      <ctag>art. poses.</ctag>
 +      <syn> 
 +         <head>12</head> 
 +         <reltype>atribut subst.</reltype> 
 +      </syn>
 +   </tok> 
 +   <tok> 
 +      <orth>Combinatului</orth> 
 +      <ordno>15</ordno>
 +      <ctag>substantiv</ctag> 
 +      <syn> 
 +         <head>14</head> 
 +         <reltype>rel. poses.</reltype>
 +      </syn> 
 +   </tok> 
 +   <tok>
 +      <orth>Siderurgic</orth> 
 +      <ordno>16</ordno> 
 +      <ctag>adjectiv</ctag> 
 +      <syn> 
 +         <head>15</head>
 +         <reltype>atribut adj.</reltype> 
 +      </syn> 
 +   </tok> 
 +   <tok>
 +      <orth>din</orth> 
 +      <ordno>17</ordno> 
 +      <ctag>prepozitie</ctag> 
 +      <syn>
 +         <head>15</head> 
 +         <reltype>atribut subst.</reltype> 
 +      </syn>
 +   </tok> 
 +   <tok> 
 +      <orth>Resita</orth> 
 +      <ordno>18</ordno> 
 +      <ctag>substantiv</ctag>
 +      <syn> 
 +         <head>17</head> 
 +         <reltype>rel. prepoz.</reltype>
 +      </syn> 
 +   </tok> 
 +</s></code>
  
 ==== Parsing ==== ==== Parsing ====
  
-Bosque is a mildly nonprojective treebank. 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%). +The corpus is projective.
- +
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Portuguese: +
- +
-^ Parser (Authors) ^ LAS ^ UAS ^ +
-| MST (McDonald et al.) | 86.82 | 91.36 | +
-| Malt (Nivre et al.) | 87.60 | 91.22 | +
-| Nara (Yuchang Cheng) | 85.07 | 90.30 |+
  
 +I am not aware of any published evaluation of parsing accuracy on this data.
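
The projectivity statement above can be verified per tree with the usual definition: an arc is projective if every token lying between the head and the dependent is dominated by the head. The following is a minimal sketch under that definition, not the procedure actually used for this page; the name ''is_projective'' is hypothetical, and heads are assumed to be 1-based with 0 marking the root.

```python
def is_projective(heads):
    """heads[i] is the 1-based head of token i+1; 0 marks the root."""
    n = len(heads)

    def dominated_by(node, ancestor):
        # Walk up from node; bounded by n+1 steps so a cycle cannot loop forever.
        for _ in range(n + 1):
            if node == ancestor:
                return True
            if node == 0:
                return False
            node = heads[node - 1]
        return False

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        if head == 0:
            continue  # arcs from the artificial root cannot be crossed
        lo, hi = sorted((head, dep))
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


# Heads of the first training sentence, with the root stored as 0 (assumption).
print(is_projective([2, 6, 2, 3, 6, 0]))  # prints True
```

A tree such as ''[3, 4, 0, 3]'' is rejected, because token 3 lies between token 4 and its dependent 2 without being dominated by token 4.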
