The CoNLL 2006 README file cites the project web site: “Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank.” The cited English web page is no longer accessible, and as of January 2012 I was unable to identify any license terms on the Portuguese web site. (Note: another part of Floresta sintá(c)tica, called Amazônia, is released under the Creative Commons Attribution Non-Commercial Share-Alike license; however, the same license does not seem to apply to Bosque.) In any case, the treebanks continue to be freely available for download at http://www.linguateca.pt/Floresta/levantamento.html. The CoNLL 2006 conversion can be downloaded from http://ilk.uvt.nl/conll/free_data.html.
I tend to interpret the “publicly available” statement in the following way:
Floresta sintá(c)tica is a joint project of Linguateca (people from Lisboa, Coimbra and Rio de Janeiro) and VISL (Visual Interactive Syntax Learning project, based at Syddansk Universitet).
Newspaper text. Bosque contains 9368 sentences, mostly from two primary sources: CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian newspaper Folha de São Paulo, year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the European Portuguese newspaper Público, April 2000).
The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test).
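The totals and the average above follow directly from the training/test split. A minimal Python check, using only the figures quoted in the text:

```python
# Figures quoted in the text for the CoNLL 2006 version of Bosque.
train_tokens, train_sentences = 206_678, 9071
test_tokens, test_sentences = 5867, 288

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
average_length = total_tokens / total_sentences

print(total_tokens, total_sentences, round(average_length, 2))
# prints: 212545 9359 22.71
```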
The corpus contains texts from Portugal and Brazil. The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and then revised by linguists. The Bosque part, referred to here, was fully revised; the other parts of the Floresta sintáctica project were revised only partially or not at all.
Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the CPOS, POS and FEAT columns of the CoNLL format. Use DZ Interset to inspect the CoNLL tagset.
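To illustrate how the CPOS, POS and FEAT columns sit in a CoNLL 2006 file, here is a minimal reader sketch in Python. It assumes the usual CoNLL-X layout (ten tab-separated columns, sentences separated by blank lines, features joined by `|`, `_` for an empty value). The `Token` type and the function names are hypothetical helpers introduced for this example, not part of any distributed tool.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One row of a CoNLL-X file (PHEAD and PDEPREL are ignored here)."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: List[str]
    head: int
    deprel: str


def parse_token(line: str) -> Token:
    # Assumption: columns are separated by tabs, as in the CoNLL-X format.
    cols = line.rstrip("\n").split("\t")
    # "_" marks an empty feature set; otherwise features are joined by "|".
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of tokens per sentence; blank lines end a sentence."""
    sentence: List[Token] = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# The first training sentence, written out with explicit tabs.
sample = [
    "1\tUm\tum\tart\tart\t<arti>|M|S\t2\t>N\t_\t_",
    "2\trevivalismo\trevivalismo\tn\tn\tM|S\t0\tUTT\t_\t_",
    "3\trefrescante\trefrescante\tadj\tadj\tM|S\t2\tN<\t_\t_",
    "",
]
sentences = list(read_sentences(sample))
for token in sentences[0]:
    print(token.id, token.form, token.postag, token.feats, token.head, token.deprel)
```

With a real file, `read_sentences(open(path, encoding="utf-8"))` would be the natural call, where `path` points to the downloaded training or test file.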
Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. “7_e_Meio”, “Hillary_Clinton”).
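If the individual words of a multi-word expression are needed, the join can be undone by splitting on the underscore. The sketch below is a hypothetical helper, not part of the treebank tools. It assumes that an underscore inside a token always marks a joined expression, which cannot be verified from the CoNLL data alone. A token consisting only of underscores is returned unchanged.

```python
from typing import List


def split_multiword(form: str) -> List[str]:
    """Split a token such as "7_e_Meio" back into its component words."""
    parts = [part for part in form.split("_") if part]
    # A token made only of underscores has no parts; keep it as it is.
    return parts if parts else [form]


print(split_multiword("7_e_Meio"))         # ['7', 'e', 'Meio']
print(split_multiword("Hillary_Clinton"))  # ['Hillary', 'Clinton']
print(split_multiword("noite"))            # ['noite']
```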
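The first two sentences of the CoNLL 2006 training data (columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL; note that the values inside FEATS are themselves separated by the `|` character):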
1 | Um | um | art | art | <arti>|M|S | 2 | >N | _ | _ |
2 | revivalismo | revivalismo | n | n | M|S | 0 | UTT | _ | _ |
3 | refrescante | refrescante | adj | adj | M|S | 2 | N< | _ | _ |
1 | O | o | art | art | <artd>|M|S | 2 | >N | _ | _ |
2 | 7_e_Meio | 7_e_Meio | prop | prop | M|S | 3 | SUBJ | _ | _ |
3 | é | ser | v | v-fin | PR|3S|IND | 0 | STA | _ | _ |
4 | um | um | art | art | <arti>|M|S | 5 | >N | _ | _ |
5 | ex-libris | ex-libris | n | n | M|P | 3 | SC | _ | _ |
6 | de | de | prp | prp | <sam-> | 5 | N< | _ | _ |
7 | a | o | art | art | <-sam>|<artd>|S | 8 | >N | _ | _ |
8 | noite | noite | n | n | F|S | 6 | P< | _ | _ |
9 | algarvia | algarvio | adj | adj | F|S | 8 | N< | _ | _ |
10 | . | . | punc | punc | _ | 3 | PUNC | _ | _ |
The first two sentences of the CoNLL 2006 test data:
1 | É | é | adv | adv | <foc> | 9 | FOC | _ | _ |
2 | por | por | prp | prp | _ | 9 | ADVL | _ | _ |
3 | isso | isso | pron | pron-indp | <dem>|M|S | 2 | P< | _ | _ |
4 | que | que | adv | adv | <foc> | 9 | FOC | _ | _ |
5 | , | , | punc | punc | _ | 6 | PUNC | _ | _ |
6 | explica | explicar | v | v-fin | PR|3S|IND | 0 | STA | _ | _ |
7 | , | , | punc | punc | _ | 6 | PUNC | _ | _ |
8 | não | não | adv | adv | _ | 9 | ADVL | _ | _ |
9 | tem | ter | v | v-fin | PR|3S|IND | 6 | ACC | _ | _ |
10 | pena | pena | n | n | F|S | 9 | ACC | _ | _ |
11 | de | de | prp | prp | _ | 10 | N< | _ | _ |
12 | Hillary_Clinton | Hillary_Clinton | prop | prop | F|S | 11 | P< | _ | _ |
13 | . | . | punc | punc | _ | 6 | PUNC | _ | _ |
1 | « | « | punc | punc | _ | 8 | PUNC | _ | _ |
2 | Eles | ele | pron | pron-pers | M|3P|NOM | 8 | SUBJ | _ | _ |
3 | [ | [ | punc | punc | _ | 8 | PUNC | _ | _ |
4 | Hillary | Hillary | prop | prop | F|S | 9 | APP | _ | _ |
5 | e | e | conj | conj-c | <co-app> | 4 | CO | _ | _ |
6 | Bill_Clinton | Bill_Clinton | prop | prop | M|S | 4 | CJT | _ | _ |
7 | ] | ] | punc | punc | _ | 8 | PUNC | _ | _ |
8 | podem | poder | v | v-fin | PR|3P|IND | 0 | QUE | _ | _ |
9 | ter | ter | v | v-inf | _ | 8 | MV | _ | _ |
10 | alguma | algum | pron | pron-det | <quant>|F|S | 11 | >N | _ | _ |
11 | espécie | espécie | n | n | F|S | 9 | ACC | _ | _ |
12 | de | de | prp | prp | _ | 11 | N< | _ | _ |
13 | acordo | acordo | n | n | M|S | 12 | P< | _ | _ |
14 | e | e | conj | conj-c | <co-vfin>|<co-fmc> | 8 | CO | _ | _ |
15 | quem | quem | pron | pron-indp | <interr>|M/F|P | 16 | SC | _ | _ |
16 | somos | ser | v | v-fin | PR|1P|IND | 8 | CJT | _ | _ |
17 | nós | nós | pron | pron-pers | M/F|1P|NOM | 16 | SUBJ | _ | _ |
18 | para | para | prp | prp | _ | 16 | ADVL | _ | _ |
19 | dizer | dizer | v | v-inf | _ | 18 | P< | _ | _ |
20 | se | se | conj | conj-s | _ | 21 | SUB | _ | _ |
21 | é | ser | v | v-fin | PR|3S|IND | 19 | ACC | _ | _ |
22 | bom | bom | adj | adj | M|S | 21 | SC | _ | _ |
23 | ou | ou | conj | conj-c | <co-sc> | 22 | CO | _ | _ |
24 | mau | mau | adj | adj | M|S | 22 | CJT | _ | _ |
25 | ? | ? | punc | punc | _ | 8 | PUNC | _ | _ |
Bosque is a mildly nonprojective treebank. 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%).
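The following Python sketch shows one common way to decide whether a token is attached nonprojectively: the arc between a token and its head is nonprojective if some token lying strictly between them is not a descendant of that head. This is an assumed definition chosen for illustration; the figure quoted above may have been computed with a slightly different one, so the code is not guaranteed to reproduce it. The function name is hypothetical, and the input must be a well-formed tree (no cycles).

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based ids of tokens whose incoming arc is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    def is_descendant(node: int, ancestor: int) -> bool:
        # Climb from node towards the root, looking for the ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0  # every token descends from the root

    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        if any(not is_descendant(k, head) for k in range(low + 1, high)):
            result.append(dependent)
    return result


# HEAD column of the second training sentence ("O 7_e_Meio é um ex-libris ...").
print(nonprojective_tokens([2, 3, 0, 5, 3, 5, 8, 6, 8, 3]))
# prints: []

# HEAD column of the first test sentence ("É por isso que , explica , ...").
print(nonprojective_tokens([9, 9, 2, 9, 6, 0, 6, 9, 6, 9, 10, 11, 6]))
# prints: [1, 2, 4]
```

In the second example, tokens 1, 2 and 4 attach to token 9 across token 6 (“explica”), which is the head of token 9 rather than its descendant.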
The results of the CoNLL 2006 shared task are available online. They were published by Buchholz and Marsi (2006). The evaluation procedure was non-standard in that punctuation tokens were excluded from scoring. These are the best results for Portuguese:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 86.82 | 91.36 |
Malt (Nivre et al.) | 87.60 | 91.22 |
Nara (Yuchang Cheng) | 85.07 | 90.30 |
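For readers who want to score their own parser output in a comparable way, here is a minimal Python sketch of labeled and unlabeled attachment scores (LAS and UAS) that skips punctuation tokens. It assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category; this is my reading of the shared task's convention, and it should be checked against the official evaluation script before comparing numbers. The function names and the predicted values in the example are hypothetical.

```python
import unicodedata
from typing import List, Tuple

Row = Tuple[str, int, str]  # (FORM, HEAD, DEPREL)


def is_punctuation(form: str) -> bool:
    """True if every character of the form is Unicode punctuation (P*)."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold: List[Row], predicted: List[Row]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent, ignoring punctuation tokens."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    scored = uas_hits = las_hits = 0
    for (form, gold_head, gold_rel), (_, pred_head, pred_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # punctuation is excluded from scoring
        scored += 1
        if gold_head == pred_head:
            uas_hits += 1
            if gold_rel == pred_rel:
                las_hits += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Toy example: a shortened gold sentence and an invented parser output.
gold = [("O", 2, ">N"), ("7_e_Meio", 3, "SUBJ"), ("é", 0, "STA"), (".", 3, "PUNC")]
pred = [("O", 2, ">N"), ("7_e_Meio", 3, "ACC"), ("é", 0, "STA"), (".", 2, "PUNC")]
las, uas = attachment_scores(gold, pred)
print(round(las, 2), round(uas, 2))
# prints: 66.67 100.0
```

The wrong head predicted for the final period does not lower UAS because that token is skipped, whereas the wrong label on “7_e_Meio” lowers LAS.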