
Italian (it)

Italian Syntactic-Semantic Treebank (ISST) or Treebank Sintattico Semantica dell'Italiano (TreSSI)

Versions

- TreSSI (1999 – 2001)
- CoNLL 2007

Obtaining and License

There does not seem to be a regular distribution channel for ISST since the CoNLL 2007 shared task. Republication of the CoNLL 2007 version through the LDC is planned but has not happened yet. In the meantime, one can contact the authors or maintainers and ask about the availability of the data, e.g. Simonetta Montemagni (simonetta (dot) montemagni (at) ilc (dot) cnr (dot) it).

The CoNLL 2007 license in short:

- research purposes
- no redistribution
- cite the principal publication (see below) in publications

ISST was created by members of the Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC), Consiglio Nazionale delle Ricerche (CNR), together with Venezia University/CVR, ITC-IRST, "Tor Vergata" University/CERTIA and Synthema. The conversion for the CoNLL 2007 shared task was done by Simonetta Montemagni and Maria Simi.

References

Website
- http://www.ilc.cnr.it/viewpage.php/sez=ricerca/id=874/vers=ing
Data
- no separate citation
Principal publications
- Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, Rodolfo Delmonte: Building the Italian Syntactic-Semantic Treebank. In: Anne Abeillé (ed.): Building and using Parsed Corpora, pp. 189-210, Language and Speech series, Kluwer, Dordrecht, The Netherlands.
Documentation
- The doc/README file in the CoNLL 2007 data distribution contains a quick guide to part of speech tags, morphosyntactic features and dependency relation labels.
- Linea 1.1 Specifiche tecniche (in Italian): the morphosyntactic tags are described from page 115 onward.
- Linea 1.3 Manuale operativo e valutazione della Treebank (in Italian): the dependency relations are described from page 42 onward.

Domain

Newspapers (Corriere della Sera) and periodicals.

Size

According to the README file, ISST contains 305,547 word tokens. Only a fragment was converted to dependencies for the CoNLL 2007 version: 76,295 tokens in 3,359 sentences, i.e. 22.71 tokens per sentence on average. The training set has 71,199 tokens in 3,110 sentences; the test set has 5,096 tokens in 249 sentences.
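The figures above can be re-derived with a few lines of Python. This is only a sketch: the function name `count_tokens_and_sentences` is made up for illustration, and it assumes the usual CoNLL 2006/2007 layout of one token per non-blank line with a blank line after each sentence.

```python
def count_tokens_and_sentences(conll_text):
    """Count tokens and sentences in CoNLL 2006/2007 text.

    Assumption: one token per non-blank line, blank line after each sentence.
    """
    tokens = 0
    sentences = 0
    in_sentence = False
    for line in conll_text.splitlines():
        if line.strip():
            tokens += 1
            in_sentence = True
        elif in_sentence:
            sentences += 1
            in_sentence = False
    if in_sentence:  # last sentence without a trailing blank line
        sentences += 1
    return tokens, sentences


# Re-deriving the totals from the split sizes quoted in the text.
train_tokens, train_sentences = 71199, 3110
test_tokens, test_sentences = 5096, 249
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
tokens_per_sentence = round(total_tokens / total_sentences, 2)
```

Applied to the full text of the training or test file, the function should reproduce the per-split counts, provided the files follow the assumed layout.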

Inside

The original ISST is a phrase-based treebank. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified), distributed in the CoNLL 2006/2007 format.

Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. In the CoNLL version, each tag was decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column.
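To illustrate the decomposition, a token line can be split into its columns and the FEAT column turned into a dictionary. This is a minimal sketch, not part of the treebank distribution: the names `parse_features` and `parse_token` are hypothetical, and it assumes tab-separated columns in the CoNLL 2006/2007 order with features joined by a vertical bar and `_` marking an empty value.

```python
def parse_features(feats):
    """Turn 'gen=M|num=S' into {'gen': 'M', 'num': 'S'}; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_token(line):
    """Parse one tab-separated token line (first eight CoNLL columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 8:
        raise ValueError("expected at least 8 tab-separated columns")
    return {
        "id": int(cols[0]),
        "form": cols[1],
        "lemma": cols[2],
        "cpos": cols[3],
        "pos": cols[4],
        "feats": parse_features(cols[5]),
        "head": int(cols[6]),
        "deprel": cols[7],
    }


# Third token of the first training sentence.
token = parse_token("3\trendiamo\trendere\tV\tV\tnum=P|per=1|mod=I|tmp=P\t0\tROOT\t_\t_")
```

Here `token["feats"]` holds the four feature-value pairs of the verb and `token["head"]` is 0, the artificial root.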

Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. a_causa_di).
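If the individual words of a collapsed expression are needed, the form can be split again at the underscores. The helper below is a hypothetical sketch; it assumes that an underscore inside a word form only ever joins the parts of a multi-word expression, which may not hold for every token in the data.

```python
def split_mwe(form):
    """Split a collapsed multi-word token such as 'a_causa_di' into its words.

    Assumption: underscores inside a form only join multi-word expressions.
    A form without an inner underscore (including a bare '_') is returned as is.
    """
    if "_" not in form.strip("_"):
        return [form]
    return [part for part in form.split("_") if part]


words = split_mwe("a_causa_di")
```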

Sample

The first sentence of the CoNLL 2007 training data:

1	Non	non	B	B	_	3	mod	_	_
2	ci	ci	P	PQ	gen=N|num=P|per=1	3	clit	_	_
3	rendiamo	rendere	V	V	num=P|per=1|mod=I|tmp=P	0	ROOT	_	_
4	conto	conto	S	S	gen=M|num=S	3	ogg_d	_	_
5	del	di	E	E	gen=M|num=S	4	mod	_	_
6	lavoro	lavoro	S	S	gen=M|num=S	5	prep	_	_
7	psicologico	psicologico	A	A	gen=M|num=S	6	mod	_	_
8	,	,	PU	PU	_	5	con	_	_
9	dei	di	E	E	gen=M|num=P	5	cong	_	_
10	prodigi	prodigio	S	S	gen=M|num=P	9	prep	_	_
11	di	di	E	E	_	10	mod	_	_
12	equilibrio	equilibrio	S	S	gen=M|num=S	11	prep	_	_
13	,	,	PU	PU	_	11	con	_	_
14	di	di	E	E	_	11	cong	_	_
15	diplomazia	diplomazia	S	S	gen=F|num=S	14	prep	_	_
16	che	che	P	PR	gen=N|num=N	17	ogg_d	_	_
17	fanno	fare	V	V	num=P|per=3|mod=I|tmp=P	6	mod_rel	_	_
18	per	per	E	E	_	17	mod	_	_
19	noi	noi	P	PQ	gen=N|num=P|per=1	18	prep	_	_
20	.	.	PU	PU	_	19	punc	_	_

The first two sentences of the CoNLL 2007 test data:

1	LONDRA	londra	S	SP	gen=N|num=N	0	ROOT	_	_
2	.	.	PU	PU	_	1	punc	_	_

1	Gas	gas	S	S	gen=M|num=N	0	ROOT	_	_
2	dalla	da	E	E	gen=F|num=S	1	mod	_	_
3	statua	statua	S	S	gen=F|num=S	2	prep	_	_
4	Evacuata	evacuare	V	V	gen=F|num=S|mod=P|tmp=R	7	mod	_	_
5	la	lo	R	RD	gen=F|num=S	6	det	_	_
6	Tate	tate	S	SP	gen=N|num=N	7	mod	_	_
7	Gallery	gallery	S	SP	gen=N|num=N	0	ROOT	_	_
8	.	.	PU	PU	_	7	punc	_	_

Parsing

Nonprojective attachments are rare in ISST-CoNLL: only 354 of the 76,295 tokens in the CoNLL 2007 version (0.46%) are attached nonprojectively.
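A count of this kind can be sketched as follows. The code uses one common definition, which is an assumption here because the page does not say how the 354 tokens were counted: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name `nonprojective_tokens` is made up for illustration.

```python
def nonprojective_tokens(heads):
    """Return the ids of tokens whose incoming arc is nonprojective.

    heads[i - 1] is the head of token i; 0 is the artificial root.
    Assumed definition: an arc is nonprojective if a token strictly between
    the dependent and its head is not a descendant of that head.
    """
    n = len(heads)

    def dominated_by(node, ancestor):
        steps = 0
        while node != 0 and steps <= n:  # step limit guards against cycles
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return ancestor == 0  # the artificial root dominates every token

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        low, high = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(low + 1, high)):
            result.append(dep)
    return result


# Head column of the first sentence of the CoNLL 2007 training data.
first_sentence_heads = [3, 3, 0, 3, 4, 5, 6, 5, 5, 9,
                        10, 11, 11, 11, 14, 17, 6, 17, 18, 19]
crossing = nonprojective_tokens(first_sentence_heads)

# The ratio quoted in the text, as a percentage.
share = round(100 * 354 / 76295, 2)
```

Under this definition the first training sentence contains one such token: the relative-clause verb (token 17) attaches to token 6 across the coordinated prepositional phrases, which are not dominated by token 6.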

The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). Compared with CoNLL 2006, the evaluation procedure was changed to include punctuation tokens. The best results for Italian, as labeled (LAS) and unlabeled (UAS) attachment scores in percent:

Parser (Authors)	LAS	UAS
Nakagawa	83.61	87.91
Malt (Nilsson et al.)	84.40	87.77
Sagae	83.91	87.68
Carreras	83.46	87.19
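For readers unfamiliar with the two scores: UAS is the percentage of tokens that receive the correct head, and LAS the percentage that receive both the correct head and the correct dependency label. The sketch below is not the official shared task scorer; the function name is hypothetical, and it counts every token, punctuation included.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent for two aligned lists of (head, deprel).

    Every token is scored, including punctuation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Toy example: four tokens, one wrong label and one wrong head.
gold = [(3, "mod"), (3, "clit"), (0, "ROOT"), (3, "ogg_d")]
predicted = [(3, "mod"), (3, "mod"), (0, "ROOT"), (2, "ogg_d")]
las, uas = attachment_scores(gold, predicted)
```

In the toy example three of four heads are correct but only two tokens also carry the right label, so LAS is lower than UAS, as in every row of the table.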

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007); the parser configuration is documented online.
