Italian (it)
Italian Syntactic-Semantic Treebank (ISST) or Treebank Sintattico Semantica dell'Italiano (TreSSI)
Versions
- TreSSI (1999 – 2001)
- CoNLL 2007
Obtaining and License
There does not seem to be a regular distribution channel for the ISST after the CoNLL 2007 shared task. Republication of the CoNLL 2007 version in the LDC is planned but it has not happened yet. In the meantime, one can contact the authors / maintainers and inquire about the availability of the data, e.g. Simonetta Montemagni (simonetta (dot) montemagni (at) ilc (dot) cnr (dot) it).
The CoNLL 2007 license in short:
- research purposes
- no redistribution
- cite the principal publication (see below) in publications
ISST was created by members of the Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC), Consiglio Nazionale delle Ricerche (CNR), together with Venezia University/CVR, ITC-IRST, “Tor Vergata” University/CERTIA and Synthema. The conversion for the CoNLL 2007 shared task was done by Simonetta Montemagni and Maria Simi.
References
- Website
- Data
- no separate citation
- Principal publications
- Simonetta Montemagni, Francesco Barsotti, Marco Battista, Nicoletta Calzolari, Ornella Corazzari, Alessandro Lenci, Antonio Zampolli, Francesca Fanciulli, Maria Massetani, Remo Raffaelli, Roberto Basili, Maria Teresa Pazienza, Dario Saracino, Fabio Zanzotto, Nadia Mana, Fabio Pianesi, Rodolfo Delmonte: Building the Italian Syntactic-Semantic Treebank. In: Anne Abeillé (ed.): Building and using Parsed Corpora, pp. 189-210, Language and Speech series, Kluwer, Dordrecht, The Netherlands.
- Documentation
- The doc/README file in the CoNLL 2007 data distribution contains a quick guide to part of speech tags, morphosyntactic features and dependency relation labels.
- Linea 1.1 Specifiche tecniche (in Italian), from page 115 on there is a description of the morphosyntactic tags.
- Linea 1.3 Manuale operativo e valutazione della Treebank (in Italian), from page 42 on there is a description of the dependency relations.
Domain
Newspapers (Corriere della Sera) and periodicals.
Size
According to the README file, ISST contains 305,547 word tokens. Only a fragment was converted to dependencies for the CoNLL 2007 version: 76,295 tokens in 3,359 sentences, or 22.71 tokens per sentence on average. Of these, 71,199 tokens (3,110 sentences) form the training set and 5,096 tokens (249 sentences) form the test set.
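The figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted from the README and the CoNLL 2007 split; the variable names are illustrative.

```python
# Token and sentence counts as quoted in the text above.
ISST_TOKENS = 305_547
train_tokens, train_sents = 71_199, 3_110
test_tokens, test_sents = 5_096, 249

conll_tokens = train_tokens + test_tokens
conll_sents = train_sents + test_sents
avg_len = conll_tokens / conll_sents          # tokens per sentence
share = 100 * conll_tokens / ISST_TOKENS      # percent of the full ISST

print(f"{conll_tokens} tokens in {conll_sents} sentences")
print(f"{avg_len:.2f} tokens per sentence on average")
print(f"{share:.1f}% of the full ISST was converted")
```

The converted fragment thus covers roughly a quarter of the original treebank.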
Inside
The original ISST is a phrase-based treebank. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified), distributed in the CoNLL 2006/2007 format.
Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. In the CoNLL version, the tags were decomposed into a coarse-grained tag in the CPOSTAG column, a fine-grained tag in the POSTAG column, and a list of feature-value pairs in the FEATS column.
Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. a_causa_di).
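If the individual words of a collapsed expression are needed, the token can be split on the underscore again. The following is a minimal sketch; `split_mwe` is a hypothetical helper, not part of any distributed tool, and it is only a heuristic, since a form that contains a literal underscore would be split as well.

```python
def split_mwe(form):
    """Split a collapsed multi-word token such as 'a_causa_di' into its parts."""
    # A bare underscore is the CoNLL placeholder for an empty value,
    # not a joining character, so it is returned unchanged.
    if "_" not in form.strip("_"):
        return [form]
    return form.split("_")


print(split_mwe("a_causa_di"))
print(split_mwe("lavoro"))
```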
Sample
The first sentence of the CoNLL 2007 training data:
1 | Non | non | B | B | _ | 3 | mod | _ | _ |
2 | ci | ci | P | PQ | gen=N|num=P|per=1 | 3 | clit | _ | _ |
3 | rendiamo | rendere | V | V | num=P|per=1|mod=I|tmp=P | 0 | ROOT | _ | _ |
4 | conto | conto | S | S | gen=M|num=S | 3 | ogg_d | _ | _ |
5 | del | di | E | E | gen=M|num=S | 4 | mod | _ | _ |
6 | lavoro | lavoro | S | S | gen=M|num=S | 5 | prep | _ | _ |
7 | psicologico | psicologico | A | A | gen=M|num=S | 6 | mod | _ | _ |
8 | , | , | PU | PU | _ | 5 | con | _ | _ |
9 | dei | di | E | E | gen=M|num=P | 5 | cong | _ | _ |
10 | prodigi | prodigio | S | S | gen=M|num=P | 9 | prep | _ | _ |
11 | di | di | E | E | _ | 10 | mod | _ | _ |
12 | equilibrio | equilibrio | S | S | gen=M|num=S | 11 | prep | _ | _ |
13 | , | , | PU | PU | _ | 11 | con | _ | _ |
14 | di | di | E | E | _ | 11 | cong | _ | _ |
15 | diplomazia | diplomazia | S | S | gen=F|num=S | 14 | prep | _ | _ |
16 | che | che | P | PR | gen=N|num=N | 17 | ogg_d | _ | _ |
17 | fanno | fare | V | V | num=P|per=3|mod=I|tmp=P | 6 | mod_rel | _ | _ |
18 | per | per | E | E | _ | 17 | mod | _ | _ |
19 | noi | noi | P | PQ | gen=N|num=P|per=1 | 18 | prep | _ | _ |
20 | . | . | PU | PU | _ | 19 | punc | _ | _ |
The first two sentences of the CoNLL 2007 test data:
1 | LONDRA | londra | S | SP | gen=N|num=N | 0 | ROOT | _ | _ |
2 | . | . | PU | PU | _ | 1 | punc | _ | _ |
1 | Gas | gas | S | S | gen=M|num=N | 0 | ROOT | _ | _ |
2 | dalla | da | E | E | gen=F|num=S | 1 | mod | _ | _ |
3 | statua | statua | S | S | gen=F|num=S | 2 | prep | _ | _ |
4 | Evacuata | evacuare | V | V | gen=F|num=S|mod=P|tmp=R | 7 | mod | _ | _ |
5 | la | lo | R | RD | gen=F|num=S | 6 | det | _ | _ |
6 | Tate | tate | S | SP | gen=N|num=N | 7 | mod | _ | _ |
7 | Gallery | gallery | S | SP | gen=N|num=N | 0 | ROOT | _ | _ |
8 | . | . | PU | PU | _ | 7 | punc | _ | _ |
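The rows in the samples above follow the ten-column CoNLL 2006/2007 layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The sketch below reads one row as it is rendered on this page, where columns are separated by ` | ` and the feature-value pairs inside FEATS by a bare `|`. In the distributed files the columns are expected to be tab-separated, so `sep="\t"` would be passed instead. The function name `parse_row` and the dictionary keys are assumptions made for illustration.

```python
COLUMNS = ("id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel")


def parse_row(line, sep=" | "):
    """Turn one token row into a dict keyed by column name."""
    text = line.strip()
    # The rendering on this page ends each row with a dangling separator.
    if text.endswith(" |"):
        text = text[:-2]
    fields = text.split(sep)
    if len(fields) != len(COLUMNS):
        raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
    row = dict(zip(COLUMNS, fields))
    row["id"] = int(row["id"])
    row["head"] = int(row["head"])
    # "_" marks an empty feature list; otherwise split into name=value pairs.
    if row["feats"] == "_":
        row["feats"] = {}
    else:
        row["feats"] = dict(pair.split("=", 1) for pair in row["feats"].split("|"))
    return row


row = parse_row("2 | ci | ci | P | PQ | gen=N|num=P|per=1 | 3 | clit | _ | _ |")
print(row["form"], row["postag"], row["feats"], row["head"], row["deprel"])
```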
Parsing
Nonprojectivities in ISST-CoNLL are rare: only 354 of the 76,295 tokens of the CoNLL 2007 version are attached nonprojectively (0.46%).
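One common way to count such attachments, which is not necessarily the exact procedure behind the figure above, is to call a token nonprojectively attached if some token lying strictly between it and its head is not dominated by that head. The sketch below applies this definition to the HEAD column of the first training sentence shown in the Sample section. It assumes a well-formed tree with 0 as the artificial root; `nonprojective_tokens` is a hypothetical name.

```python
def nonprojective_tokens(heads):
    """Return the 1-based IDs of tokens whose arc to their head is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    def dominated_by(node, ancestor):
        # Walk up the tree from node until the ancestor or the root is reached.
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Every token strictly between dep and head must descend from head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first sentence of the CoNLL 2007 training data.
heads = [3, 3, 0, 3, 4, 5, 6, 5, 5, 9, 10, 11, 11, 11, 14, 17, 6, 17, 18, 19]
print(nonprojective_tokens(heads))

# Share of nonprojective tokens quoted in the text.
print(f"{100 * 354 / 76295:.2f}%")
```

Under this definition, token 17 (fanno) is the only nonprojective attachment in that sentence: it depends on token 6 (lavoro), but the comma in between (token 8) is attached to token 5, which lies outside the subtree of token 6.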
The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). For that shared task, the evaluation procedure was changed to include punctuation tokens. These are the best results for Italian, given as labeled (LAS) and unlabeled (UAS) attachment scores:
Parser (Authors) | LAS | UAS |
---|---|---|
Nakagawa | 83.61 | 87.91 |
Malt (Nilsson et al.) | 84.40 | 87.77 |
Sagae | 83.91 | 87.68 |
Carreras | 83.46 | 87.19 |
The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.