Institute of Formal and Applied Linguistics Wiki


Portuguese (pt)

Bosque (Floresta sintá(c)tica)

Versions

Obtaining and License

The CoNLL 2006 README file cites the project web site: “Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank.” The cited English web page is no longer accessible, and as of January 2012 I could not identify any license terms on the Portuguese web site. (Another part of Floresta sintá(c)tica, called Amazônia, is released under the Creative Commons Attribution Non-Commercial Share-Alike license; however, that license does not seem to apply to Bosque.) In any case, the treebanks remain freely available for download at http://www.linguateca.pt/Floresta/levantamento.html. The CoNLL 2006 conversion can be downloaded from http://ilk.uvt.nl/conll/free_data.html.

I tend to interpret the “publicly available” statement in the following way:

Floresta sintá(c)tica is a joint project of Linguateca (people from Lisboa, Coimbra and Rio de Janeiro) and VISL (Visual Interactive Syntax Learning project, based at Syddansk Universitet).

References

Domain

Newspaper. Both primary sources of Bosque, CETENFolha and CETEMPúblico, are corpora of newspaper text.

Size

Bosque contains 9368 sentences mostly from two primary sources, the CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian journal Folha de São Paulo, the year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the Portuguese (European) journal Público, April 2000).

The CoNLL 2006 version contains 200,654 tokens in 13,735 sentences, yielding 14.61 tokens per sentence on average (CoNLL 2006 data split: 195,069 tokens / 13,349 sentences training, 5,585 tokens / 386 sentences test).
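
The average follows directly from the token and sentence counts. A quick check, using only the figures quoted above (the variable and function names are merely illustrative):

```python
# Figures quoted above for the CoNLL 2006 version.
train_tokens, train_sentences = 195_069, 13_349
test_tokens, test_sentences = 5_585, 386

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences


def tokens_per_sentence(tokens, sentences):
    """Average sentence length, rounded to two decimals."""
    return round(tokens / sentences, 2)


print(total_tokens, total_sentences)
print(tokens_per_sentence(total_tokens, total_sentences))
```

The training and test counts add up to the quoted totals, and the rounded ratio matches the quoted average.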

Inside

Texts from Portugal and Brazil.

The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists. The Bosque part, referred to here, was fully revised; the other parts of the Floresta sintá(c)tica project were revised either partially or not at all.

In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file tagset.txt. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file syn_prot.pdf (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file diff.pdf (which is heavily out of date, but the number of differences has been reduced). Conversion issues: head selection, multi-word units, discourse units.

Multi-word expressions have been concatenated into one token, using the underscore as the joining character (e.g. “Economische_en_Monetaire_Unie”). They have the special coarse part-of-speech tag MWU; their fine-grained part of speech and features may describe the individual parts of the unit. E.g. “aan_het” has CPOS MWU, (sub)POS Prep_Art and features voor_bep|onzijd|neut.
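
A minimal sketch of how such a unit could be taken apart again. It assumes that the underscore separates the parts consistently in the form, the POS and the feature string; that alignment is inferred from the single “aan_het” example and may not hold for every unit. The function name `split_mwu` is hypothetical.

```python
def split_mwu(form, pos, feats):
    """Split a multi-word unit into (word, pos, features) triples.

    Assumption: form, pos and feats all use "_" to separate the parts,
    and "|" separates the features within one part. Returns None when
    the three strings do not split into the same number of parts.
    """
    words = form.split("_")
    tags = pos.split("_")
    feat_groups = feats.split("_")
    if not (len(words) == len(tags) == len(feat_groups)):
        return None  # the alignment assumption does not hold here
    return [(w, t, f.split("|")) for w, t, f in zip(words, tags, feat_groups)]


print(split_mwu("aan_het", "Prep_Art", "voor_bep|onzijd|neut"))
```

Returning None instead of guessing keeps units whose parts do not line up visible for manual inspection.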

Sample

The first two sentences of the CoNLL 2006 training data:

1 Cathy Cathy N N eigen|ev|neut 2 su _ _
2 zag zie V V trans|ovt|1of2of3|ev 0 ROOT _ _
3 hen hen Pron Pron per|3|mv|datofacc 2 obj1 _ _
4 wild wild Adj Adj attr|stell|onverv 5 mod _ _
5 zwaaien zwaai N N soort|mv|neut 2 vc _ _
6 . . Punc Punc punt 5 punct _ _

1 Ze ze Pron Pron per|3|evofmv|nom 2 su _ _
2 had heb V V trans|ovt|1of2of3|ev 0 ROOT _ _
3 met met Prep Prep voor 8 mod _ _
4 haar haar Pron Pron bez|3|ev|neut|attr 5 det _ _
5 moeder moeder N N soort|ev|neut 3 obj1 _ _
6 kunnen kan V V hulp|ott|1of2of3|mv 2 vc _ _
7 gaan ga V V hulp|inf 6 vc _ _
8 winkelen winkel V V intrans|inf 11 cnj _ _
9 , , Punc Punc komma 8 punct _ _
10 zwemmen zwem V V intrans|inf 11 cnj _ _
11 of of Conj Conj neven 7 vc _ _
12 terrassen terras N N soort|mv|neut 11 cnj _ _
13 . . Punc Punc punt 12 punct _ _
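
The sample uses the ten-column CoNLL 2006 format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The following is a minimal reading sketch, not the official tooling: it splits on any whitespace (the distributed files are tab-separated) and starts a new sentence at a blank line or when the ID restarts at 1. The names `Token` and `read_sentences` are illustrative.

```python
from collections import namedtuple

Token = namedtuple(
    "Token", "id form lemma cpos pos feats head deprel phead pdeprel"
)


def read_sentences(lines):
    """Group CoNLL 2006 token lines into sentences (lists of Token)."""
    sentences, current = [], []
    for line in lines:
        fields = line.split()
        if not fields:  # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = Token(int(fields[0]), *fields[1:6], int(fields[6]), *fields[7:])
        if token.id == 1 and current:  # ID restarted: new sentence
            sentences.append(current)
            current = []
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


sample_lines = [
    "1 Cathy Cathy N N eigen|ev|neut 2 su _ _",
    "2 zag zie V V trans|ovt|1of2of3|ev 0 ROOT _ _",
    "3 hen hen Pron Pron per|3|mv|datofacc 2 obj1 _ _",
    "1 Ze ze Pron Pron per|3|evofmv|nom 2 su _ _",
    "2 had heb V V trans|ovt|1of2of3|ev 0 ROOT _ _",
]
sentences = read_sentences(sample_lines)
print(len(sentences), [t.form for t in sentences[0]])
```

HEAD 0 marks the artificial root; the FEATS column can be split further on "|" when individual features are needed.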

The first two sentences of the CoNLL 2006 test data:

1 BASISTAKENPAKKET basis_taken_pakket Prep Prep voor 0 ROOT _ _
2 JEUGDGEZONDHEIDSZORG jeugd_gezondheid_zorg N N eigen|ev|neut 0 ROOT _ _
3 0-19 0-19 Num Num hoofd|bep|attr|onverv 4 det _ _
4 JAAR JAAR N N eigen|ev|neut 0 ROOT _ _

1 Daarvoor daarvoor Adv Adv pron|aanw 3 pc _ _
2 is ben V V hulpofkopp|ott|3|ev 0 ROOT _ _
3 gekozen kies V V trans|verldw|onverv 2 vc _ _
4 omdat omdat Conj Conj onder|metfin 3 mod _ _
5 gemeenten gemeente N N soort|mv|neut 11 su _ _
6 bij bij Prep Prep voor 12 mod _ _
7 uitstek uitstek N N soort|ev|neut 6 obj1 _ _
8 het het Art Art bep|onzijd|neut 10 det _ _
9 lokale lokaal Adj Adj attr|stell|vervneut 10 mod _ _
10 gezondheidsbeleid gezondheid_beleid N N soort|ev|neut 12 obj1 _ _
11 kunnen kan V V hulp|inf 4 body _ _
12 toespitsen spits_toe V V refl|inf 11 vc _ _
13 op op Prep Prep voor 12 pc _ _
14 de de Art Art bep|zijdofmv|neut 16 det _ _
15 specifieke specifiek Adj Adj attr|stell|vervneut 16 mod _ _
16 gezondheidssituatie gezondheid_situatie N N soort|ev|neut 17 cnj _ _
17 en en Conj Conj neven 13 obj1 _ _
18 zorgbehoeften zorg_behoefte N N soort|mv|neut 17 cnj _ _
19 van van Prep Prep voor 16 mod _ _
20 kinderen kind N N soort|mv|neut 21 cnj _ _
21 en en Conj Conj neven 19 obj1 _ _
22 jongeren jongere Adj Adj zelfst|vergr|vervneut 21 cnj _ _
23 in in Prep Prep voor 20 mod _ _
24 de de Art Art bep|zijdofmv|neut 26 det _ _
25 eigen eigen Pron Pron aanw|neut|attr|weigen 26 mod _ _
26 gemeente gemeente N N soort|ev|neut 23 obj1 _ _
27 . . Punc Punc punt 26 punct _ _

Parsing

Nonprojective attachments in Alpino are quite frequent: 10,858 of the 200,654 tokens in the CoNLL 2006 version (5.41%) are attached nonprojectively.
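
For illustration, here is one common way to count such tokens. It assumes the definition that an attachment is nonprojective when some token lying between the dependent and its head is not dominated by that head; the count quoted above may have been obtained with a slightly different definition. The function name is hypothetical.

```python
def nonprojective_tokens(heads):
    """Return the IDs of tokens attached nonprojectively.

    heads[i - 1] is the head of token i; 0 is the artificial root.
    """
    def dominated(node, ancestor):
        # Walk up from node; stop at the root or if a cycle is detected.
        seen = set()
        while node != 0 and node not in seen:
            if node == ancestor:
                return True
            seen.add(node)
            node = heads[node - 1]
        return ancestor == 0  # the root dominates every token

    result = []
    for dep, head in enumerate(heads, start=1):
        left, right = sorted((dep, head))
        if any(not dominated(k, head) for k in range(left + 1, right)):
            result.append(dep)
    return result


# HEAD column of the second training sentence in the sample above.
heads = [2, 0, 8, 5, 3, 2, 6, 11, 8, 11, 7, 11, 12]
print(nonprojective_tokens(heads))

# Share of nonprojective tokens, from the figures quoted above.
print(round(100 * 10858 / 200654, 2))
```

In that sentence, token 3 (“met”) is attached to token 8 (“winkelen”) across token 6 (“kunnen”), which is not dominated by token 8.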

The results of the CoNLL 2006 shared task are available online and were published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Dutch:

Parser (Authors)        LAS     UAS
MST (McDonald et al.)   79.19   83.57
Riedel et al.           78.59   82.91
Basis (John O'Neil)     77.51   81.73
Malt (Nivre et al.)     78.59   81.35
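
LAS (labeled attachment score) counts tokens whose head and dependency label are both correct; UAS (unlabeled attachment score) counts tokens whose head is correct. The sketch below shows the punctuation-excluding variant under one assumption: a token is treated as punctuation when every character of its form is in a Unicode punctuation category. This approximates the shared task's evaluation and is not the official script. The predicted analysis is made up for the example.

```python
import unicodedata


def is_punct(form):
    """True if every character belongs to a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent, skipping punctuation tokens.

    gold and predicted are token-aligned lists of (form, head, deprel).
    """
    scored = uas_hits = las_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punct(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Gold: first training sentence of the sample. Predicted: hypothetical output.
gold = [("Cathy", 2, "su"), ("zag", 0, "ROOT"), ("hen", 2, "obj1"),
        ("wild", 5, "mod"), ("zwaaien", 2, "vc"), (".", 5, "punct")]
predicted = [("Cathy", 2, "su"), ("zag", 0, "ROOT"), ("hen", 2, "obj2"),
             ("wild", 2, "mod"), ("zwaaien", 2, "vc"), (".", 2, "punct")]
las, uas = attachment_scores(gold, predicted)
print(las, uas)
```

The wrongly attached full stop does not affect either score, because punctuation tokens are skipped before counting.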
