
Institute of Formal and Applied Linguistics Wiki



Italian (it)

Italian Syntactic-Semantic Treebank (ISST) or Treebank Sintattico Semantica dell'Italiano (TreSSI)

Versions

Obtaining and License

There does not seem to have been a regular distribution channel for the ISST since the CoNLL 2007 shared task. Republication of the CoNLL 2007 version by the LDC is planned but has not happened yet. In the meantime, one can contact the authors or maintainers and ask about the availability of the data, e.g. Simonetta Montemagni (simonetta (dot) montemagni (at) ilc (dot) cnr (dot) it).

The CoNLL 2007 license in short:

ISST was created by members of the Istituto di Linguistica Computazionale "Antonio Zampolli" (ILC), Consiglio Nazionale delle Ricerche (CNR), together with Venezia University/CVR, ITC-IRST, "Tor Vergata" University/CERTIA and Synthema. The conversion for the CoNLL 2007 shared task was done by Simonetta Montemagni and Maria Simi.

References

Domain

Newspapers (Corriere della Sera) and periodicals.

Size

According to the README file, ISST contains 305,547 word tokens. Only a fragment was converted to dependencies in the CoNLL 2007 version: 76,295 tokens in 3,359 sentences, i.e. 22.71 tokens per sentence on average. Of these, 71,199 tokens (3,110 sentences) form the training data and 5,096 tokens (249 sentences) form the test data.
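The figures above can be cross-checked with a few lines of Python. This is only a sanity check of the arithmetic; the variable names are made up for this example.

```python
# Token and sentence counts of the CoNLL 2007 version, as quoted above.
train_tokens, train_sents = 71199, 3110
test_tokens, test_sents = 5096, 249

total_tokens = train_tokens + test_tokens
total_sents = train_sents + test_sents
avg_len = total_tokens / total_sents

print(total_tokens, total_sents, round(avg_len, 2))  # 76295 3359 22.71
```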

Inside

The original ISST is a phrase-based treebank. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified), distributed in the CoNLL 2006/2007 format.

Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. In the CoNLL version, each tag was decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column.

Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. a_causa_di).
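As a rough illustration of the column layout described above, the following Python sketch splits one line of the CoNLL 2006/2007 format into its fields, unpacks the feature-value pairs and splits a collapsed multi-word token. The names Token, parse_feats, parse_line and mwe_parts are made up for this example and are not part of any ISST tooling. The sketch assumes whitespace-separated columns with no spaces inside a field, and it ignores the last two columns (PHEAD and PDEPREL).

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: dict
    head: int
    deprel: str


def parse_feats(feats):
    """Turn 'gen=M|num=S' into {'gen': 'M', 'num': 'S'}; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def parse_line(line):
    """Parse one token line (hypothetical helper, whitespace-separated columns)."""
    cols = line.split()
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


def mwe_parts(form):
    """Split a collapsed multi-word expression on the underscore."""
    return [part for part in form.split("_") if part] or [form]


tok = parse_line("3 rendiamo rendere V V num=P|per=1|mod=I|tmp=P 0 ROOT _ _")
print(tok.lemma, tok.feats["num"], tok.head)  # rendere P 0
print(mwe_parts("a_causa_di"))                # ['a', 'causa', 'di']
```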

Sample

The first sentence of the CoNLL 2007 training data:

1 Non non B B _ 3 mod _ _
2 ci ci P PQ gen=N|num=P|per=1 3 clit _ _
3 rendiamo rendere V V num=P|per=1|mod=I|tmp=P 0 ROOT _ _
4 conto conto S S gen=M|num=S 3 ogg_d _ _
5 del di E E gen=M|num=S 4 mod _ _
6 lavoro lavoro S S gen=M|num=S 5 prep _ _
7 psicologico psicologico A A gen=M|num=S 6 mod _ _
8 , , PU PU _ 5 con _ _
9 dei di E E gen=M|num=P 5 cong _ _
10 prodigi prodigio S S gen=M|num=P 9 prep _ _
11 di di E E _ 10 mod _ _
12 equilibrio equilibrio S S gen=M|num=S 11 prep _ _
13 , , PU PU _ 11 con _ _
14 di di E E _ 11 cong _ _
15 diplomazia diplomazia S S gen=F|num=S 14 prep _ _
16 che che P PR gen=N|num=N 17 ogg_d _ _
17 fanno fare V V num=P|per=3|mod=I|tmp=P 6 mod_rel _ _
18 per per E E _ 17 mod _ _
19 noi noi P PQ gen=N|num=P|per=1 18 prep _ _
20 . . PU PU _ 19 punc _ _

The first two sentences of the CoNLL 2007 test data:

1 LONDRA londra S SP gen=N|num=N 0 ROOT _ _
2 . . PU PU _ 1 punc _ _

1 Gas gas S S gen=M|num=N 0 ROOT _ _
2 dalla da E E gen=F|num=S 1 mod _ _
3 statua statua S S gen=F|num=S 2 prep _ _
4 Evacuata evacuare V V gen=F|num=S|mod=P|tmp=R 7 mod _ _
5 la lo R RD gen=F|num=S 6 det _ _
6 Tate tate S SP gen=N|num=N 7 mod _ _
7 Gallery gallery S SP gen=N|num=N 0 ROOT _ _
8 . . PU PU _ 7 punc _ _

Parsing

Nonprojective attachments are rare in ISST-CoNLL: 354 of the 76,295 tokens of the CoNLL 2007 version (0.46%) are attached nonprojectively.
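The page does not say which definition was used for this count. A minimal sketch under one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function name nonprojective_tokens is made up for this example, and the artificial root (head 0) is assumed to dominate every token.

```python
def nonprojective_tokens(heads):
    """heads[i - 1] is the head of token i (1-based); 0 is the artificial root.

    Returns the ids of tokens whose incoming arc is nonprojective.
    """
    n = len(heads)

    def dominated_by(node, ancestor):
        # Climb towards the root; the step counter guards against cycles.
        steps = 0
        while node != ancestor:
            if node == 0 or steps > n:
                return False
            node = heads[node - 1]
            steps += 1
        return True

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first training sentence shown in the Sample section.
first_train = [3, 3, 0, 3, 4, 5, 6, 5, 5, 9, 10, 11, 11, 11, 14, 17, 6, 17, 18, 19]
print(nonprojective_tokens(first_train))  # [17]
```

Under this definition, token 17 (fanno) of the first training sentence counts as nonprojective: it attaches to token 6 (lavoro), but the comma at position 8 between them hangs on token 5 (del), which is not below token 6.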

The results of the CoNLL 2007 shared task are available online and were published in (Nivre et al., 2007). Compared with the CoNLL 2006 shared task, the evaluation procedure was changed to include punctuation tokens. These are the best results for Italian:

Parser (Authors)        LAS     UAS
Nakagawa                83.61   87.91
Malt (Nilsson et al.)   84.40   87.77
Sagae                   83.91   87.68
Carreras                83.46   87.19

The two Malt parser systems of 2007 (Single Malt and Blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.
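For orientation, the two scores reported above can be computed as follows: UAS is the percentage of tokens that receive the correct head, and LAS is the percentage that receive both the correct head and the correct dependency label. This is a minimal sketch, not the official evaluation script; the function name attachment_scores is made up for this example. It counts every token, punctuation included, in line with the 2007 evaluation procedure.

```python
def attachment_scores(gold, predicted):
    """gold and predicted are equally long lists of (head, deprel) pairs.

    Returns (LAS, UAS) in percent, counting all tokens including punctuation.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    if not gold:
        raise ValueError("cannot score an empty token list")
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * las_hits / n, 100.0 * uas_hits / n


# Toy example: one wrong label (token 3) and one wrong head (token 4).
gold = [(2, "mod"), (0, "ROOT"), (2, "ogg_d"), (2, "punc")]
pred = [(2, "mod"), (0, "ROOT"), (2, "mod"), (3, "punc")]
print(attachment_scores(gold, pred))  # (50.0, 75.0)
```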

