user:zeman:treebanks:it [2012/01/03 15:02] zeman README file. |
user:zeman:treebanks:it [2012/01/03 15:48] (current) zeman Parsing results. |
==== Domain ==== | ==== Domain ==== |
| |
Mixed: | Newspapers (Corriere della Sera) and periodicals. |
* Fiction | |
* Short essays by 14 to 16 year-old students | |
* Newspapers (Népszabadság, Népszava, Magyar Hírlap, HVG) | |
* Texts related to computer science | |
* Legal texts | |
* Economic and financial short news | |
| |
==== Size ==== | ==== Size ==== |
| |
According to their website, SzTB 2.0 contains 1.2 million words plus 250 thousand punctuation tokens in 82,000 sentences. Only a fragment was converted to dependencies for the CoNLL 2007 version: 139,143 tokens in 6,424 sentences, an average of 21.66 tokens per sentence (training: 131,799 tokens in 6,034 sentences; test: 7,344 tokens in 390 sentences). | According to the README file, ISST contains 305,547 word tokens. Only a fragment was converted to dependencies for the CoNLL 2007 version: 76,295 tokens in 3,359 sentences, an average of 22.71 tokens per sentence (training: 71,199 tokens in 3,110 sentences; test: 5,096 tokens in 249 sentences). |
| |
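The totals and averages quoted above follow directly from the training/test split. A minimal sketch that recomputes them; the dictionary layout and the function name ''totals'' are illustrative choices, and only the counts themselves come from the text:

```python
# Token and sentence counts per split, copied from the Size section.
# Each entry is (tokens, sentences).
SPLITS = {
    "SzTB (Hungarian)": {"train": (131_799, 6_034), "test": (7_344, 390)},
    "ISST (Italian)": {"train": (71_199, 3_110), "test": (5_096, 249)},
}


def totals(splits):
    """Return (total tokens, total sentences, mean tokens per sentence)."""
    tokens = sum(t for t, _ in splits.values())
    sentences = sum(s for _, s in splits.values())
    return tokens, sentences, round(tokens / sentences, 2)


for name, splits in SPLITS.items():
    print(name, totals(splits))
```

The printed tuples can be compared against the figures in the paragraph above as a consistency check.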
==== Inside ==== | ==== Inside ==== |
| |
The original Szeged Treebank is a phrase-based treebank distributed in an XML-based, TEI-compliant format. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified) and is distributed in the CoNLL 2006/2007 format. | The original ISST is a phrase-based treebank. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified) and is distributed in the CoNLL 2006/2007 format. |
| |
Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. The tagset used in SzTB seems to be the same as, or similar to, [[http://nl.ijs.si/ME/V4/msd/html/msd-hu.html|Multext-East]]. In the CoNLL version, tags were decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column. | Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. In the CoNLL version, tags were decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column. |
| |
Personal names have been collapsed into one token, using underscore as the joining character (e.g. Torgyán_József). | Multi-word expressions have been collapsed into one token, using underscore as the joining character (e.g. a_causa_di). |
| |
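If the individual words of a collapsed token are needed, the underscore can simply be split again. A minimal sketch; the function name ''split_joined'' is hypothetical, and it assumes that an underscore inside a token is always the joining character and never part of an original word:

```python
def split_joined(form):
    """Split a collapsed token into its parts.

    A lone "_" is the CoNLL placeholder for an empty value,
    so it is returned unchanged rather than split.
    """
    if form == "_":
        return [form]
    return form.split("_")


print(split_joined("a_causa_di"))
print(split_joined("Torgyán_József"))
```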
==== Sample ==== | ==== Sample ==== |
The first sentence of the CoNLL 2007 training data: | The first sentence of the CoNLL 2007 training data: |
| |
| 1 | Az | az | T | Tf | <nowiki>def=yes</nowiki> | 4 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | Non | non | B | B | <nowiki>_</nowiki> | 3 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | elmúlt | elmúlt | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 4 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | ci | ci | P | PQ | <nowiki>gen=N|num=P|per=1</nowiki> | 3 | clit | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | nyolc | nyolc | M | Mc | <nowiki>n=singular|case=nominative</nowiki> | 4 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | rendiamo | rendere | V | V | <nowiki>num=P|per=1|mod=I|tmp=P</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | hónapban | hónap | N | Nc | <nowiki>n=singular|case=inessive|proper=no</nowiki> | 16 | INE | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | conto | conto | S | S | <nowiki>gen=M|num=S</nowiki> | 3 | <nowiki>ogg_d</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 5 | del | di | E | E | <nowiki>gen=M|num=S</nowiki> | 4 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | amelyből | amely | P | Pr | <nowiki>p=3rd|n=singular|case=elative</nowiki> | 11 | ELA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 6 | lavoro | lavoro | S | S | <nowiki>gen=M|num=S</nowiki> | 5 | prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | összesen | összesen | R | Rx | <nowiki>_</nowiki> | 8 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 7 | psicologico | psicologico | A | A | <nowiki>gen=M|num=S</nowiki> | 6 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | hatot | hat | M | Mc | <nowiki>n=singular|case=accusative</nowiki> | 11 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 8 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | PU | PU | <nowiki>_</nowiki> | 5 | con | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | kényszerűségből | kényszerűség | N | Nc | <nowiki>n=singular|case=elative|proper=no</nowiki> | 11 | ELA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 9 | dei | di | E | E | <nowiki>gen=M|num=P</nowiki> | 5 | cong | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | szabadságon | szabadság | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 11 | SUP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 10 | prodigi | prodigio | S | S | <nowiki>gen=M|num=P</nowiki> | 9 | prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | töltött | tölt | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=no</nowiki> | 16 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 11 | di | di | E | E | <nowiki>_</nowiki> | 10 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 14 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 12 | equilibrio | equilibrio | S | S | <nowiki>gen=M|num=S</nowiki> | 11 | prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 13 | parlamenti | parlamenti | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 14 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 13 | <nowiki>,</nowiki> | <nowiki>,</nowiki> | PU | PU | <nowiki>_</nowiki> | 11 | con | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 14 | ellenzék | ellenzék | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 11 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 14 | di | di | E | E | <nowiki>_</nowiki> | 11 | cong | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 15 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 15 | diplomazia | diplomazia | S | S | <nowiki>gen=F|num=S</nowiki> | 14 | prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 16 | megváltozott | megváltozik | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=no</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 16 | che | che | P | PR | <nowiki>gen=N|num=N</nowiki> | 17 | <nowiki>ogg_d</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 17 | itthon | itthon | R | Rx | <nowiki>_</nowiki> | 16 | LOCY | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 17 | fanno | fare | V | V | <nowiki>num=P|per=3|mod=I|tmp=P</nowiki> | 6 | <nowiki>mod_rel</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 18 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 19 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 18 | per | per | E | E | <nowiki>_</nowiki> | 17 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 19 | hatalommegosztás | hatalommegosztás | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 22 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 19 | noi | noi | P | PQ | <nowiki>gen=N|num=P|per=1</nowiki> | 18 | prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 20 | <nowiki>1990-ben</nowiki> | 1990 | M | Mc | <nowiki>n=singular|case=inessive</nowiki> | 21 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 20 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PU | PU | <nowiki>_</nowiki> | 19 | punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 21 | kialakított | kialakított | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 22 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 22 | rendszere | rendszer | N | Nc | <nowiki>n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 16 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 23 | <nowiki>:</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 24 | az | az | T | Tf | <nowiki>def=yes</nowiki> | 26 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 25 | e | e | P | Pd | <nowiki>p=3rd|n=singular|case=nominative</nowiki> | 26 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 26 | héten | hét | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 27 | audienciát | audiencia | N | Nc | <nowiki>n=singular|case=accusative|proper=no</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 28 | tartó | tartó | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 29 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 29 | kormányfő | kormányfő | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 31 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 30 | gyakorlatilag | gyakorlati | A | Af | <nowiki>deg=positive|n=singular|case=essive</nowiki> | 31 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 31 | kivonta | kivon | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=yes</nowiki> | 16 | CP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 32 | magát | maga | P | Px | <nowiki>p=3rd|n=singular|case=accusative</nowiki> | 31 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 33 | az | az | T | Tf | <nowiki>def=yes</nowiki> | 34 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 34 | Országgyűlés | Országgyűlés | N | Np | <nowiki>n=singular|case=nominative|proper=yes</nowiki> | 35 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 35 | ellenőrzése | ellenőrzés | N | Nc | <nowiki>n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 36 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 36 | alól | alól | S | St | <nowiki>_</nowiki> | 31 | PP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 37 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | SPUNCT | SPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| |
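The rows above follow the ten-column CoNLL 2006/2007 layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). A minimal reading sketch, assuming the data files are tab-separated with one token per line; the ''Token'' type and the function names are illustrative, not part of any distributed tool:

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: dict
    head: int
    deprel: str


def parse_feats(field):
    """Turn 'gen=M|num=S' into {'gen': 'M', 'num': 'S'}; '_' means no features."""
    if field == "_":
        return {}
    feats = {}
    for pair in field.split("|"):
        name, _, value = pair.partition("=")
        feats[name] = value
    return feats


def parse_row(line):
    """Parse one token line; the last two columns (PHEAD, PDEPREL) are ignored."""
    cols = line.rstrip("\n").split("\t")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 parse_feats(cols[5]), int(cols[6]), cols[7])


# Third token of the Italian training sample shown above.
row = "3\trendiamo\trendere\tV\tV\tnum=P|per=1|mod=I|tmp=P\t0\tROOT\t_\t_"
token = parse_row(row)
print(token.lemma, token.feats, token.head, token.deprel)
```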
The first sentence of the CoNLL 2007 test data: | The first two sentences of the CoNLL 2007 test data: |
| |
| 1 | A | a | T | Tf | <nowiki>def=yes</nowiki> | 2 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | LONDRA | londra | S | SP | <nowiki>gen=N|num=N</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | bankokkal | bank | N | Nc | <nowiki>n=plural|case=instrumental|proper=no</nowiki> | 4 | INS | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PU | PU | <nowiki>_</nowiki> | 1 | punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | kell | kell | V | Vm | <nowiki>mood=indicative|t=present|p=3rd|n=singular|def=no</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | |||||||||| |
| 4 | egyezkedniük | egyezkedik | V | Vm | <nowiki>mood=infinitive|t=present|p=3rd|n=plural</nowiki> | 3 | INF | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | Gas | gas | S | S | <nowiki>gen=M|num=N</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | azoknak | az | P | Pd | <nowiki>p=3rd|n=plural|case=dative</nowiki> | 8 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | dalla | da | E | E | <nowiki>gen=F|num=S</nowiki> | 1 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 8 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | statua | statua | S | S | <nowiki>gen=F|num=S</nowiki> | 2 | prep | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | mezőgazdasági | mezőgazdasági | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 8 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | Evacuata | evacuare | V | V | <nowiki>gen=F|num=S|mod=P|tmp=R</nowiki> | 7 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | termelőknek | termelő | N | Nc | <nowiki>n=plural|case=dative|proper=no</nowiki> | 4 | DAT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 5 | la | lo | R | RD | <nowiki>gen=F|num=S</nowiki> | 6 | det | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 3 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 6 | Tate | tate | S | SP | <nowiki>gen=N|num=N</nowiki> | 7 | mod | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | akik | aki | P | Pr | <nowiki>p=3rd|n=plural|case=nominative</nowiki> | 21 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 7 | Gallery | gallery | S | SP | <nowiki>gen=N|num=N</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | egy | egy | T | Ti | <nowiki>def=no</nowiki> | 19 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 8 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | PU | PU | <nowiki>_</nowiki> | 7 | punc | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 19 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 13 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 15 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 14 | múlt | múlt | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 15 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 15 | héten | hét | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 16 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 16 | megjelent | megjelent | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 19 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 17 | földművelésügyi | földművelésügyi | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 18 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 18 | minisztériumi | minisztériumi | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 19 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 19 | rendelet | rendelet | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 20 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 20 | alapján | alap | N | Nc | <nowiki>n=singular|case=superessive|proper=no|pperson=3rd|pnumber=singular</nowiki> | 21 | SUP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 21 | kérik | kér | V | Vm | <nowiki>mood=indicative|t=present|p=3rd|n=plural|def=yes</nowiki> | 5 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 22 | ősszel | ősszel | R | Rx | <nowiki>_</nowiki> | 23 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 23 | lejáró | lejáró | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 27 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 24 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 27 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 25 | éven | év | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 26 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 26 | belüli | belüli | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 27 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 27 | hiteleik | hitel | N | Nc | <nowiki>n=plural|case=nominative|proper=no|pperson=3rd|pnumber=plural</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 28 | átütemezését | átütemezés | N | Nc | <nowiki>n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 21 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 29 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | SPUNCT | SPUNCT | <nowiki>_</nowiki> | 3 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
SzTB is a mildly nonprojective treebank: 4,032 of the 139,143 tokens in the CoNLL 2007 version are attached nonprojectively (2.9%). | Nonprojective attachments are rare in ISST-CoNLL: 354 of the 76,295 tokens in the CoNLL 2007 version are attached nonprojectively (0.46%). |
| |
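The page does not state which criterion was used to count nonprojective attachments. The sketch below uses one common definition as an assumption: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function names are illustrative, and the input is assumed to be a well-formed tree given as a mapping from token ID to head ID, with 0 as the artificial root.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` is reached by following head links upward from `node`."""
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """Return the IDs of tokens whose attachment to their head is nonprojective."""
    result = []
    for dep, head in heads.items():
        lo, hi = sorted((dep, head))
        # Every token strictly between dependent and head must be under the head.
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Toy tree: token 1 depends on token 3, but token 2 (the root) lies in between.
toy = {1: 3, 2: 0, 3: 2}
print(nonprojective_tokens(toy))

# Percentages from the counts quoted above.
print(round(100 * 4032 / 139143, 1), round(100 * 354 / 76295, 2))
```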
The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Hungarian: | The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Italian: |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| Malt (Nilsson et al.) | 80.27 | 83.55 | | | Nakagawa | 83.61 | 87.91 | |
| Sagae | 79.53 | 83.51 | | | Malt (Nilsson et al.) | 84.40 | 87.77 | |
| Nakagawa | 76.74 | 82.49 | | | Sagae | 83.91 | 87.68 | |
| Titov et al. | 77.94 | 82.18 | | | Carreras | 83.46 | 87.19 | |
| |
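LAS (labeled attachment score) is the percentage of tokens that receive both the correct head and the correct dependency label; UAS (unlabeled attachment score) requires only the correct head. A minimal sketch of the computation, not the official evaluation script; the function name and the toy data are illustrative. Since punctuation tokens are included in the scores reported here, no tokens are filtered out.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent.

    Both arguments are lists of (head, deprel) pairs, one per token,
    in the same order.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("gold and predicted must be non-empty and equally long")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100 * las_hits / total, 100 * uas_hits / total


# Toy example: token 2 has a wrong label, token 4 has a wrong head.
gold = [(0, "ROOT"), (1, "mod"), (2, "prep"), (1, "punc")]
pred = [(0, "ROOT"), (1, "prep"), (2, "prep"), (2, "punc")]
las, uas = attachment_scores(gold, pred)
print(las, uas)
```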
The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. | The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. |
| |