user:zeman:treebanks:pt [2012/01/11 11:11] zeman References. |
user:zeman:treebanks:pt [2012/01/11 11:28] zeman Inside. |
==== Domain ==== | ==== Domain ==== |
| |
Newspaper. The Alpino Treebank consists of “the full cdbl (newspaper) part of the Eindhoven corpus.” | Newspaper. Bosque contains 9368 sentences, mostly from two primary sources: CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian newspaper Folha de São Paulo, 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the European Portuguese newspaper Público, April 2000). |
| |
==== Size ==== | ==== Size ==== |
| |
Bosque contains 9368 sentences, mostly from two primary sources: CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian newspaper Folha de São Paulo, 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the European Portuguese newspaper Público, April 2000). | The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test). |
| |
The CoNLL 2006 version contains 200,654 tokens in 13735 sentences, yielding 14.61 tokens per sentence on average (CoNLL 2006 data split: 195,069 tokens / 13349 sentences training, 5585 tokens / 386 sentences test). | |
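The size figures quoted for the Portuguese CoNLL 2006 data can be cross-checked with a few lines of arithmetic. The following is only a sanity check on the published numbers; the variable names are illustrative and not part of any CoNLL tooling.

```python
# Training/test split quoted for the CoNLL 2006 version of Bosque.
train_tokens, train_sentences = 206_678, 9_071
test_tokens, test_sentences = 5_867, 288

# The totals should match the figures given for the whole data set.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens.
avg = total_tokens / total_sentences

print(total_tokens, total_sentences, round(avg, 2))
```

The split sums to 212,545 tokens in 9359 sentences, and the average rounds to 22.71 tokens per sentence, in agreement with the text.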
| |
==== Inside ==== | ==== Inside ==== |
| |
Texts from Portugal and Brasil. | The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred to here, was fully revised; the other parts of the Floresta sintáctica project were revised only partially or not at all). |
| |
The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System "Palavras": Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred to here, was fully revised; the other parts of the Floresta sintáctica project were revised only partially or not at all). | |
| |
In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file ''tagset.txt''. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file ''syn_prot.pdf'' (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file ''diff.pdf'' (which is heavily out of date, but the number of differences has been reduced). Conversion issues: head selection, multi-word units, discourse units. | Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the ''CPOS'', ''POS'' and ''FEAT'' columns of the CoNLL format. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=pt::conll|DZ Interset]] to inspect the CoNLL tagset. |
| |
Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "Economische_en_Monetaire_Unie"). They have special part-of-speech tags ''MWU'', their subparts of speech and features may describe the individual parts of the unit. E.g. "aan_het" has CPOS ''MWU'', (sub)POS ''Prep_Art'' and features ''voor_bep|onzijd|neut''. | Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. "7_e_Meio", "Hillary_Clinton"). |
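A minimal sketch of how the concatenated multi-word tokens might be split back into their parts. The helper name ''split_multiword'' is hypothetical, and the approach is deliberately naive: it assumes that every underscore inside a form is a joining character, which cannot be verified from the token alone.

```python
def split_multiword(form):
    """Split a concatenated multi-word token on underscores.

    A lone underscore is the CoNLL placeholder for an empty field,
    so it is returned unchanged instead of being split.
    """
    if form == "_":
        return [form]
    return form.split("_")


# Examples taken from the Portuguese data described above.
for form in ["7_e_Meio", "Hillary_Clinton", "noite"]:
    print(form, "->", split_multiword(form))
```

Single-word tokens such as "noite" come back as a one-element list, so the function can be applied uniformly to every form.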
| |
==== Sample ==== | ==== Sample ==== |
The first two sentences of the CoNLL 2006 training data: | The first two sentences of the CoNLL 2006 training data: |
| |
| 1 | Cathy | Cathy | N | N | <nowiki>eigen|ev|neut</nowiki> | 2 | su | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | Um | um | art | art | <nowiki><arti>|M|S</nowiki> | 2 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | zag | zie | V | V | <nowiki>trans|ovt|1of2of3|ev</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | revivalismo | revivalismo | n | n | <nowiki>M|S</nowiki> | 0 | UTT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | hen | hen | Pron | Pron | <nowiki>per|3|mv|datofacc</nowiki> | 2 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | refrescante | refrescante | adj | adj | <nowiki>M|S</nowiki> | 2 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | wild | wild | Adj | Adj | <nowiki>attr|stell|onverv</nowiki> | 5 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 5 | zwaaien | zwaai | N | N | <nowiki>soort|mv|neut</nowiki> | 2 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 6 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | punt | 5 | punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| |||||||||| | | |||||||||| |
| 1 | Ze | ze | Pron | Pron | <nowiki>per|3|evofmv|nom</nowiki> | 2 | su | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | O | o | art | art | <nowiki><artd>|M|S</nowiki> | 2 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | had | heb | V | V | <nowiki>trans|ovt|1of2of3|ev</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | <nowiki>7_e_Meio</nowiki> | <nowiki>7_e_Meio</nowiki> | prop | prop | <nowiki>M|S</nowiki> | 3 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | met | met | Prep | Prep | voor | 8 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | é | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 0 | STA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | haar | haar | Pron | Pron | <nowiki>bez|3|ev|neut|attr</nowiki> | 5 | det | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | um | um | art | art | <nowiki><arti>|M|S</nowiki> | 5 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | moeder | moeder | N | N | <nowiki>soort|ev|neut</nowiki> | 3 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 5 | <nowiki>ex-libris</nowiki> | <nowiki>ex-libris</nowiki> | n | n | <nowiki>M|P</nowiki> | 3 | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | kunnen | kan | V | V | <nowiki>hulp|ott|1of2of3|mv</nowiki> | 2 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 6 | de | de | prp | prp | <nowiki><sam-></nowiki> | 5 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | gaan | ga | V | V | <nowiki>hulp|inf</nowiki> | 6 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 7 | a | o | art | art | <nowiki><-sam>|<artd>|S</nowiki> | 8 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | winkelen | winkel | V | V | <nowiki>intrans|inf</nowiki> | 11 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 8 | noite | noite | n | n | <nowiki>F|S</nowiki> | 6 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | Punc | Punc | komma | 8 | punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 9 | algarvia | algarvio | adj | adj | <nowiki>F|S</nowiki> | 8 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | zwemmen | zwem | V | V | <nowiki>intrans|inf</nowiki> | 11 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 10 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | punc | punc | <nowiki>_</nowiki> | 3 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | of | of | Conj | Conj | neven | 7 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 12 | terrassen | terras | N | N | <nowiki>soort|mv|neut</nowiki> | 11 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 13 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | punt | 12 | punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| |
The first two sentences of the CoNLL 2006 test data: | The first two sentences of the CoNLL 2006 test data: |
| |
| 1 | BASISTAKENPAKKET | <nowiki>basis_taken_pakket</nowiki> | Prep | Prep | voor | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | É | é | adv | adv | <nowiki><foc></nowiki> | 9 | FOC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | JEUGDGEZONDHEIDSZORG | <nowiki>jeugd_gezondheid_zorg</nowiki> | N | N | <nowiki>eigen|ev|neut</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | por | por | prp | prp | <nowiki>_</nowiki> | 9 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | <nowiki>0-19</nowiki> | <nowiki>0-19</nowiki> | Num | Num | <nowiki>hoofd|bep|attr|onverv</nowiki> | 4 | det | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | isso | isso | pron | <nowiki>pron-indp</nowiki> | <nowiki><dem>|M|S</nowiki> | 2 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | JAAR | JAAR | N | N | <nowiki>eigen|ev|neut</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | que | que | adv | adv | <nowiki><foc></nowiki> | 9 | FOC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 5 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 6 | explica | explicar | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 0 | STA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 7 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 8 | não | não | adv | adv | <nowiki>_</nowiki> | 9 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 9 | tem | ter | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 6 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 10 | pena | pena | n | n | <nowiki>F|S</nowiki> | 9 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 11 | de | de | prp | prp | <nowiki>_</nowiki> | 10 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 12 | <nowiki>Hillary_Clinton</nowiki> | <nowiki>Hillary_Clinton</nowiki> | prop | prop | <nowiki>F|S</nowiki> | 11 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 13 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | punc | punc | <nowiki>_</nowiki> | 6 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | |||||||||| |
| 1 | Daarvoor | daarvoor | Adv | Adv | <nowiki>pron|aanw</nowiki> | 3 | pc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | <nowiki>«</nowiki> | <nowiki>«</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | is | ben | V | V | <nowiki>hulpofkopp|ott|3|ev</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | Eles | ele | pron | <nowiki>pron-pers</nowiki> | <nowiki>M|3P|NOM</nowiki> | 8 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | gekozen | kies | V | V | <nowiki>trans|verldw|onverv</nowiki> | 2 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | <nowiki>[</nowiki> | <nowiki>[</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | omdat | omdat | Conj | Conj | <nowiki>onder|metfin</nowiki> | 3 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | Hillary | Hillary | prop | prop | <nowiki>F|S</nowiki> | 9 | APP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | gemeenten | gemeente | N | N | <nowiki>soort|mv|neut</nowiki> | 11 | su | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 5 | e | e | conj | <nowiki>conj-c</nowiki> | <nowiki><co-app></nowiki> | 4 | CO | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | bij | bij | Prep | Prep | voor | 12 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 6 | <nowiki>Bill_Clinton</nowiki> | <nowiki>Bill_Clinton</nowiki> | prop | prop | <nowiki>M|S</nowiki> | 4 | CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | uitstek | uitstek | N | N | <nowiki>soort|ev|neut</nowiki> | 6 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 7 | <nowiki>]</nowiki> | <nowiki>]</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | het | het | Art | Art | <nowiki>bep|onzijd|neut</nowiki> | 10 | det | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 8 | podem | poder | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3P|IND</nowiki> | 0 | QUE | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | lokale | lokaal | Adj | Adj | <nowiki>attr|stell|vervneut</nowiki> | 10 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 9 | ter | ter | v | <nowiki>v-inf</nowiki> | <nowiki>_</nowiki> | 8 | MV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | gezondheidsbeleid | <nowiki>gezondheid_beleid</nowiki> | N | N | <nowiki>soort|ev|neut</nowiki> | 12 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 10 | alguma | algum | pron | <nowiki>pron-det</nowiki> | <nowiki><quant>|F|S</nowiki> | 11 | <nowiki>>N</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | kunnen | kan | V | V | <nowiki>hulp|inf</nowiki> | 4 | body | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 11 | espécie | espécie | n | n | <nowiki>F|S</nowiki> | 9 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | toespitsen | <nowiki>spits_toe</nowiki> | V | V | <nowiki>refl|inf</nowiki> | 11 | vc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 12 | de | de | prp | prp | <nowiki>_</nowiki> | 11 | <nowiki>N<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 13 | op | op | Prep | Prep | voor | 12 | pc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 13 | acordo | acordo | n | n | <nowiki>M|S</nowiki> | 12 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 14 | de | de | Art | Art | <nowiki>bep|zijdofmv|neut</nowiki> | 16 | det | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 14 | e | e | conj | <nowiki>conj-c</nowiki> | <nowiki><co-vfin>|<co-fmc></nowiki> | 8 | CO | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 15 | specifieke | specifiek | Adj | Adj | <nowiki>attr|stell|vervneut</nowiki> | 16 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 15 | quem | quem | pron | <nowiki>pron-indp</nowiki> | <nowiki><interr>|M/F|P</nowiki> | 16 | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 16 | gezondheidssituatie | <nowiki>gezondheid_situatie</nowiki> | N | N | <nowiki>soort|ev|neut</nowiki> | 17 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 16 | somos | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|1P|IND</nowiki> | 8 | CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 17 | en | en | Conj | Conj | neven | 13 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 17 | nós | nós | pron | <nowiki>pron-pers</nowiki> | <nowiki>M/F|1P|NOM</nowiki> | 16 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 18 | zorgbehoeften | <nowiki>zorg_behoefte</nowiki> | N | N | <nowiki>soort|mv|neut</nowiki> | 17 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 18 | para | para | prp | prp | <nowiki>_</nowiki> | 16 | ADVL | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 19 | van | van | Prep | Prep | voor | 16 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 19 | dizer | dizer | v | <nowiki>v-inf</nowiki> | <nowiki>_</nowiki> | 18 | <nowiki>P<</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 20 | kinderen | kind | N | N | <nowiki>soort|mv|neut</nowiki> | 21 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 20 | se | se | conj | <nowiki>conj-s</nowiki> | <nowiki>_</nowiki> | 21 | SUB | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 21 | en | en | Conj | Conj | neven | 19 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 21 | é | ser | v | <nowiki>v-fin</nowiki> | <nowiki>PR|3S|IND</nowiki> | 19 | ACC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 22 | jongeren | jongere | Adj | Adj | <nowiki>zelfst|vergr|vervneut</nowiki> | 21 | cnj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 22 | bom | bom | adj | adj | <nowiki>M|S</nowiki> | 21 | SC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 23 | in | in | Prep | Prep | voor | 20 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 23 | ou | ou | conj | <nowiki>conj-c</nowiki> | <nowiki><co-sc></nowiki> | 22 | CO | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 24 | de | de | Art | Art | <nowiki>bep|zijdofmv|neut</nowiki> | 26 | det | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 24 | mau | mau | adj | adj | <nowiki>M|S</nowiki> | 22 | CJT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 25 | eigen | eigen | Pron | Pron | <nowiki>aanw|neut|attr|weigen</nowiki> | 26 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 25 | <nowiki>?</nowiki> | <nowiki>?</nowiki> | punc | punc | <nowiki>_</nowiki> | 8 | PUNC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 26 | gemeente | gemeente | N | N | <nowiki>soort|ev|neut</nowiki> | 23 | obj1 | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 27 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | punt | 26 | punct | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
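The samples above follow the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). As a rough sketch, assuming the data files are tab-separated with one token per line, the first Portuguese training sentence could be read as follows. The ''Token'' class and ''parse_sentence'' function are illustrative names, not part of any distributed tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: list   # FEATS split on "|"; empty list when the field is "_"
    head: int     # 0 means the token attaches to the artificial root
    deprel: str


def parse_sentence(block):
    """Parse one sentence given as tab-separated CoNLL-X lines."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        feats = [] if cols[5] == "_" else cols[5].split("|")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            cols[4], feats, int(cols[6]), cols[7]))
    return tokens


# First sentence of the Portuguese training sample shown above.
SAMPLE = (
    "1\tUm\tum\tart\tart\t<arti>|M|S\t2\t>N\t_\t_\n"
    "2\trevivalismo\trevivalismo\tn\tn\tM|S\t0\tUTT\t_\t_\n"
    "3\trefrescante\trefrescante\tadj\tadj\tM|S\t2\tN<\t_\t_\n"
)

sentence = parse_sentence(SAMPLE)
root = next(t for t in sentence if t.head == 0)
print(root.form, root.deprel)
```

The PHEAD and PDEPREL columns are ignored here because they are empty ("_") in all the samples shown.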
| |
==== Parsing ==== | ==== Parsing ==== |