user:zeman:treebanks:pt [ufal wiki]

- Bosque 7.3 in its own text format
- CoNLL 2006 (based on Bosque 7.3)

The CoNLL 2006 README file cites the project web site: “Floresta Sintá(c)tica (syntactic forest) is a publicly available treebank.” The cited English web page is no longer accessible, and as of January 2012 I was not able to identify any license terms on the Portuguese web site. (Note: another part of Floresta sintá(c)tica, called Amazônia, is released under the Creative Commons Attribution Non-Commercial Share-Alike license; however, the same license does not appear to apply to Bosque.) In any case, the treebanks remain freely available for download at http://www.linguateca.pt/Floresta/levantamento.html. The CoNLL 2006 conversion can be downloaded from http://ilk.uvt.nl/conll/free_data.html.

I tend to interpret the “publicly available” statement in the following way:

- any usage, commercial or not, is permitted
- modification and redistribution under a free license are permitted, provided the original source is mentioned
- citation in publications is not required (but it is common decency)

Floresta sintá(c)tica is a joint project of Linguateca (people from Lisboa, Coimbra and Rio de Janeiro) and VISL (Visual Interactive Syntax Learning project, based at Syddansk Universitet).

Website
- http://www.linguateca.pt/Floresta/principal.html (Floresta)
- http://ilk.uvt.nl/conll/free_data.html (CoNLL 2006)
Data
- no separate citation
Principal publications
- Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos: Floresta sintá(c)tica: um treebank para o português. In: Encontro da associação portuguesa de linguística, XVII, Lisboa, 2001.
- Cláudia Freitas, Paulo Rocha, Eckhard Bick: Um mundo novo na Floresta Sintá(c)tica - o treebank para Português. Calidoscópio - Revista de Pós Graduação em Lingüística Aplicada da Unisinos, Rio Grande do Sul 6.3 (2008), pp. 142-148.
Documentation
- Documentation
- Cláudia Freitas, Susana Afonso: Bíblia Florestal: Um manual lingüístico da Floresta Sintá(c)tica, 2008
- Glossário de etiquetas florestais (glossary of tags)
- Statistics of morphosyntactic tags

Newspaper. Bosque contains 9368 sentences, mostly from two primary sources: CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian newspaper Folha de São Paulo, year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the European Portuguese newspaper Público, April 2000).

The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test).
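The average above follows directly from the token and sentence counts. As a minimal sketch (the function name `corpus_stats` is hypothetical, and it assumes the usual CoNLL-X layout of one token per line with a blank line after each sentence), the figures could be recomputed along these lines:

```python
def corpus_stats(lines):
    """Count tokens and sentences in CoNLL-X formatted lines.

    Assumption: one token per non-blank line, blank line between sentences.
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in lines:
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:  # last sentence without a trailing blank line
        sentences += 1
    return tokens, sentences


# Sanity check of the counts quoted in the text.
train_tokens, train_sents = 206678, 9071
test_tokens, test_sents = 5867, 288
total_tokens = train_tokens + test_tokens
total_sents = train_sents + test_sents
average = total_tokens / total_sents
print(total_tokens, total_sents, round(average, 2))  # 212545 9359 22.71
```

Applied to an open file, `corpus_stats(open(path, encoding="utf-8"))` would return the same pair of counts for that file, where `path` stands for wherever the CoNLL 2006 data was unpacked.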

Texts from Portugal and Brazil.

The texts were automatically parsed using the PALAVRAS parser (Bick 2000: Eckhard Bick. The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.) and revised by linguists (the Bosque part, referred to here, was fully revised; the other parts of the Floresta sintá(c)tica project were revised either partially or not at all).

In the CoNLL version, the original POS tags from the Alpino Treebank were replaced by POS tags from the Memory-based part-of-speech tagger using the WOTAN tagset, which is described in the file tagset.txt. The morphological annotation includes lemmas. The syntactic annotation is mostly identical to that of the Corpus Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the file syn_prot.pdf (Dutch only). An attempt to describe a number of differences between the CGN and Alpino annotation practice is given in the file diff.pdf (which is heavily out of date, but the number of differences has been reduced). Conversion issues: head selection, multi-word units, discourse units.

Multi-word expressions have been concatenated into one token, using the underscore as the joining character (e.g. “Economische_en_Monetaire_Unie”). They have the special part-of-speech tag MWU; their sub-POS and features may describe the individual parts of the unit. E.g. “aan_het” has CPOS MWU, (sub)POS Prep_Art and features voor_bep|onzijd|neut.

The first two sentences of the CoNLL 2006 training data:

1	Um	um	art	art	<arti>|M|S	2	>N	_	_
2	revivalismo	revivalismo	n	n	M|S	0	UTT	_	_
3	refrescante	refrescante	adj	adj	M|S	2	N<	_	_

1	O	o	art	art	<artd>|M|S	2	>N	_	_
2	7_e_Meio	7_e_Meio	prop	prop	M|S	3	SUBJ	_	_
3	é	ser	v	v-fin	PR|3S|IND	0	STA	_	_
4	um	um	art	art	<arti>|M|S	5	>N	_	_
5	ex-libris	ex-libris	n	n	M|P	3	SC	_	_
6	de	de	prp	prp	<sam->	5	N<	_	_
7	a	o	art	art	<-sam>|<artd>|S	8	>N	_	_
8	noite	noite	n	n	F|S	6	P<	_	_
9	algarvia	algarvio	adj	adj	F|S	8	N<	_	_
10	.	.	punc	punc	_	3	PUNC	_	_

The first two sentences of the CoNLL 2006 test data:

1	É	é	adv	adv	<foc>	9	FOC	_	_
2	por	por	prp	prp	_	9	ADVL	_	_
3	isso	isso	pron	pron-indp	<dem>|M|S	2	P<	_	_
4	que	que	adv	adv	<foc>	9	FOC	_	_
5	,	,	punc	punc	_	6	PUNC	_	_
6	explica	explicar	v	v-fin	PR|3S|IND	0	STA	_	_
7	,	,	punc	punc	_	6	PUNC	_	_
8	não	não	adv	adv	_	9	ADVL	_	_
9	tem	ter	v	v-fin	PR|3S|IND	6	ACC	_	_
10	pena	pena	n	n	F|S	9	ACC	_	_
11	de	de	prp	prp	_	10	N<	_	_
12	Hillary_Clinton	Hillary_Clinton	prop	prop	F|S	11	P<	_	_
13	.	.	punc	punc	_	6	PUNC	_	_

1	«	«	punc	punc	_	8	PUNC	_	_
2	Eles	ele	pron	pron-pers	M|3P|NOM	8	SUBJ	_	_
3	[	[	punc	punc	_	8	PUNC	_	_
4	Hillary	Hillary	prop	prop	F|S	9	APP	_	_
5	e	e	conj	conj-c	<co-app>	4	CO	_	_
6	Bill_Clinton	Bill_Clinton	prop	prop	M|S	4	CJT	_	_
7	]	]	punc	punc	_	8	PUNC	_	_
8	podem	poder	v	v-fin	PR|3P|IND	0	QUE	_	_
9	ter	ter	v	v-inf	_	8	MV	_	_
10	alguma	algum	pron	pron-det	<quant>|F|S	11	>N	_	_
11	espécie	espécie	n	n	F|S	9	ACC	_	_
12	de	de	prp	prp	_	11	N<	_	_
13	acordo	acordo	n	n	M|S	12	P<	_	_
14	e	e	conj	conj-c	<co-vfin>|<co-fmc>	8	CO	_	_
15	quem	quem	pron	pron-indp	<interr>|M/F|P	16	SC	_	_
16	somos	ser	v	v-fin	PR|1P|IND	8	CJT	_	_
17	nós	nós	pron	pron-pers	M/F|1P|NOM	16	SUBJ	_	_
18	para	para	prp	prp	_	16	ADVL	_	_
19	dizer	dizer	v	v-inf	_	18	P<	_	_
20	se	se	conj	conj-s	_	21	SUB	_	_
21	é	ser	v	v-fin	PR|3S|IND	19	ACC	_	_
22	bom	bom	adj	adj	M|S	21	SC	_	_
23	ou	ou	conj	conj-c	<co-sc>	22	CO	_	_
24	mau	mau	adj	adj	M|S	22	CJT	_	_
25	?	?	punc	punc	_	8	PUNC	_	_
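The samples above use the ten tab-separated columns of the CoNLL-X format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with features separated by a vertical bar and an underscore for empty values. The following is only a sketch of how such data might be read; the names `Token`, `parse_token` and `read_sentences` are hypothetical and not part of any distributed tool:

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: List[str]
    head: int
    deprel: str


def parse_token(line):
    """Parse one CoNLL-X token line; PHEAD and PDEPREL are ignored."""
    cols = line.rstrip("\n").split("\t")
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


def read_sentences(lines):
    """Yield sentences (lists of Token); blank lines separate sentences."""
    sentence = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# First training sentence from the sample above.
sample = [
    "1\tUm\tum\tart\tart\t<arti>|M|S\t2\t>N\t_\t_",
    "2\trevivalismo\trevivalismo\tn\tn\tM|S\t0\tUTT\t_\t_",
    "3\trefrescante\trefrescante\tadj\tadj\tM|S\t2\tN<\t_\t_",
    "",
]
sentences = list(read_sentences(sample))
root = [t for t in sentences[0] if t.head == 0][0]
print(root.form, root.deprel, sentences[0][0].feats)
```

Keeping FEATS as a list of strings preserves the angle-bracket tags such as `<arti>` unchanged, which seems the safest choice when the meaning of individual features is not needed.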

Nonprojectivities in Alpino are quite frequent. 10858 of the 200,654 tokens in the CoNLL 2006 version are attached nonprojectively (5.41%).
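How such a count can be obtained is sketched below. The sketch assumes the common definition that a token is attached nonprojectively if some token lying between it and its head is not dominated by that head; whether the figures above were computed with exactly this definition is an assumption. The function names are hypothetical. The example uses the HEAD column of the first test sentence shown in the sample.

```python
def is_dominated(heads, node, ancestor):
    """True if `ancestor` is `node` itself or lies on its path to the root 0."""
    for _ in range(len(heads)):  # bounded loop guards against cycles
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node]
    return False


def nonprojective_tokens(heads):
    """Return the 1-based IDs of tokens attached nonprojectively.

    heads[i] is the head of token i; heads[0] is an unused placeholder.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        # The arc is projective only if the head dominates every token
        # strictly between the head and the dependent.
        if any(not is_dominated(heads, k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first test sentence ("É por isso que , explica , ...").
heads = [0, 9, 9, 2, 9, 6, 0, 6, 9, 6, 9, 10, 11, 6]
print(nonprojective_tokens(heads))  # [1, 2, 4]

# Share of nonprojective attachments quoted in the text.
print(round(100 * 10858 / 200654, 2))  # 5.41
```

In that sentence tokens 1, 2 and 4 attach to token 9 across the comma (token 5), which depends on token 6 and is therefore not dominated by token 9.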

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:

Parser (Authors)	LAS	UAS
MST (McDonald et al.)	79.19	83.57
Riedel et al.	78.59	82.91
Basis (John O'Neil)	77.51	81.73
Malt (Nivre et al.)	78.59	81.35
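LAS (labeled attachment score) is the percentage of scored tokens with both the correct head and the correct dependency label; UAS (unlabeled attachment score) requires only the correct head. The sketch below illustrates scoring with punctuation excluded. It assumes that a token counts as punctuation when all of its characters belong to a Unicode punctuation category; it is not the official evaluation script, the function names are hypothetical, and the predicted parse is invented for illustration.

```python
import unicodedata


def is_punctuation(form):
    """Assumption: punctuation = every character has a Unicode category P*."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent over non-punctuation tokens.

    gold and predicted are aligned lists of (form, head, deprel) triples.
    """
    scored = las_hits = uas_hits = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # punctuation tokens are not scored
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Toy example: a made-up gold parse and a made-up predicted parse.
gold = [("Um", 2, ">N"), ("revivalismo", 0, "UTT"),
        ("refrescante", 2, "N<"), (".", 2, "PUNC")]
predicted = [("Um", 2, ">N"), ("revivalismo", 0, "UTT"),
             ("refrescante", 2, "SC"), (".", 1, "PUNC")]
las, uas = attachment_scores(gold, predicted)
print(round(las, 2), round(uas, 2))  # 66.67 100.0
```

The wrongly attached full stop does not lower either score, which is the effect of excluding punctuation from the evaluation.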
