user:zeman:treebanks:tr [2012/03/22 20:43] zeman ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi |
user:zeman:treebanks:tr [2012/03/22 21:04] zeman Inside. |
==== Domain ==== | ==== Domain ==== |
| |
Mixed: | Post-1990 written Turkish, sampled from various genres. |
* Fiction | |
* Short essays by 14 to 16 year-old students | |
* Newspapers (Népszabadság, Népszava, Magyar Hírlap, HVG) | |
* Texts related to computer science | |
* Legal texts | |
* Economic and financial short news | |
| |
==== Size ==== | ==== Size ==== |
| |
According to their website, SzTB 2.0 contains 1.2 million words plus 250 thousand punctuation tokens in 82000 sentences. Only a fragment was converted to dependencies in the CoNLL 2007 version: 139,143 tokens in 6424 sentences, yielding 21.66 tokens per sentence on average (131,799 tokens / 6034 sentences training, 7344 tokens / 390 sentences test). | According to their website, the treebank contains 7262 sentences. The CoNLL 2007 version contains 69695 tokens in 5935 sentences, yielding 11.74 tokens per sentence on average (65182 tokens / 5635 sentences training, 4513 tokens / 300 sentences test). |
| |
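The totals and the average quoted for the Turkish CoNLL 2007 data can be rechecked from the training/test split. The sketch below is only a sanity check on the numbers stated above; the function name ''tokens_per_sentence'' is illustrative and not part of any treebank tooling.

```python
# Figures quoted above for the CoNLL 2007 version of the Turkish treebank.
TRAIN_TOKENS, TRAIN_SENTENCES = 65182, 5635
TEST_TOKENS, TEST_SENTENCES = 4513, 300


def tokens_per_sentence(tokens: int, sentences: int) -> float:
    """Average sentence length, rounded to two decimals as in the text."""
    return round(tokens / sentences, 2)


total_tokens = TRAIN_TOKENS + TEST_TOKENS
total_sentences = TRAIN_SENTENCES + TEST_SENTENCES
average = tokens_per_sentence(total_tokens, total_sentences)

print(total_tokens, total_sentences, average)
```

The sums of the two parts agree with the totals given in the paragraph (69695 tokens, 5935 sentences).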
==== Inside ==== | ==== Inside ==== |
| |
The original Szeged Treebank is a phrase-based treebank and it is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified), distributed in the CoNLL 2006/2007 format. | The original METU-Sabanci Treebank is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is distributed in the [[:format-conll|CoNLL 2006/2007 format]]. |
| |
Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. The tagset used in SzTB seems to be the same as or similar to [[http://nl.ijs.si/ME/V4/msd/html/msd-hu.html|Multext-East]]. In the CoNLL version, tags were decomposed into the CPOS column, the POS column and a list of feature-value pairs in the FEAT column. | Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. |
| |
Personal names have been collapsed into one token, using underscore as the joining character (e.g. Torgyán_József). | There are special derivational nodes. Derived words have been split into several tokens (see also the sample below). |
| |
==== Sample ==== | ==== Sample ==== |
The first sentence of the CoNLL 2007 training data: | The first sentence of the CoNLL 2007 training data: |
| |
| 1 | Az | az | T | Tf | <nowiki>def=yes</nowiki> | 4 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | Ama | ama | Conj | Conj | <nowiki>_</nowiki> | 8 | <nowiki>S.MODIFIER</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | elmúlt | elmúlt | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 4 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | hiçbir | hiçbir | Det | Det | <nowiki>_</nowiki> | 3 | DETERMINER | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | nyolc | nyolc | M | Mc | <nowiki>n=singular|case=nominative</nowiki> | 4 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | şey | şey | Noun | Noun | <nowiki>A3sg|Pnon|Nom</nowiki> | 4 | OBJECT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | hónapban | hónap | N | Nc | <nowiki>n=singular|case=inessive|proper=no</nowiki> | 16 | INE | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | söylemedim | söyle | Verb | Verb | <nowiki>Neg|Past|A1sg</nowiki> | 8 | SENTENCE | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 5 | ki | ki | Conj | Conj | <nowiki>_</nowiki> | 4 | INTENSIFIER | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | amelyből | amely | P | Pr | <nowiki>p=3rd|n=singular|case=elative</nowiki> | 11 | ELA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 6 | ben | ben | Pron | PersP | <nowiki>A1sg|Pnon|Nom</nowiki> | 4 | SUBJECT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | összesen | összesen | R | Rx | <nowiki>_</nowiki> | 8 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 7 | sizlere | siz | Pron | PersP | <nowiki>A2pl|Pnon|Dat</nowiki> | 4 | <nowiki>DATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | hatot | hat | M | Mc | <nowiki>n=singular|case=accusative</nowiki> | 11 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 8 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | kényszerűségből | kényszerűség | N | Nc | <nowiki>n=singular|case=elative|proper=no</nowiki> | 11 | ELA | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 10 | szabadságon | szabadság | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 11 | SUP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 11 | töltött | tölt | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=no</nowiki> | 16 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 12 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 14 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 13 | parlamenti | parlamenti | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 14 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 14 | ellenzék | ellenzék | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 11 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 15 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 16 | megváltozott | megváltozik | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=no</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 17 | itthon | itthon | R | Rx | <nowiki>_</nowiki> | 16 | LOCY | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 18 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 19 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 19 | hatalommegosztás | hatalommegosztás | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 22 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 20 | <nowiki>1990-ben</nowiki> | 1990 | M | Mc | <nowiki>n=singular|case=inessive</nowiki> | 21 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 21 | kialakított | kialakított | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 22 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 22 | rendszere | rendszer | N | Nc | <nowiki>n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 16 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 23 | <nowiki>:</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 24 | az | az | T | Tf | <nowiki>def=yes</nowiki> | 26 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 25 | e | e | P | Pd | <nowiki>p=3rd|n=singular|case=nominative</nowiki> | 26 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 26 | héten | hét | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 27 | audienciát | audiencia | N | Nc | <nowiki>n=singular|case=accusative|proper=no</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 28 | tartó | tartó | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 29 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 29 | kormányfő | kormányfő | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 31 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 30 | gyakorlatilag | gyakorlati | A | Af | <nowiki>deg=positive|n=singular|case=essive</nowiki> | 31 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 31 | kivonta | kivon | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=yes</nowiki> | 16 | CP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 32 | magát | maga | P | Px | <nowiki>p=3rd|n=singular|case=accusative</nowiki> | 31 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 33 | az | az | T | Tf | <nowiki>def=yes</nowiki> | 34 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 34 | Országgyűlés | Országgyűlés | N | Np | <nowiki>n=singular|case=nominative|proper=yes</nowiki> | 35 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 35 | ellenőrzése | ellenőrzés | N | Nc | <nowiki>n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 36 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 36 | alól | alól | S | St | <nowiki>_</nowiki> | 31 | PP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 37 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | SPUNCT | SPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| |
The first sentence of the CoNLL 2007 test data: | The first sentence of the CoNLL 2007 test data: |
| |
| 1 | A | a | T | Tf | <nowiki>def=yes</nowiki> | 2 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 1 | <nowiki>_</nowiki> | ötele | Verb | Verb | Pos | 2 | DERIV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | bankokkal | bank | N | Nc | <nowiki>n=plural|case=instrumental|proper=no</nowiki> | 4 | INS | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 2 | Öteleme | <nowiki>_</nowiki> | Noun | NInf | <nowiki>A3sg|Pnon|Nom</nowiki> | 3 | CLASSIFIER | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | kell | kell | V | Vm | <nowiki>mood=indicative|t=present|p=3rd|n=singular|def=no</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 3 | işleminde | işlem | Noun | Noun | <nowiki>A3sg|P3sg|Loc</nowiki> | 10 | <nowiki>LOCATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | egyezkedniük | egyezkedik | V | Vm | <nowiki>mood=infinitive|t=present|p=3rd|n=plural</nowiki> | 3 | INF | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 4 | kuyrukta | kuyruk | Noun | Noun | <nowiki>A3sg|Pnon|Loc</nowiki> | 5 | <nowiki>LOCATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | azoknak | az | P | Pd | <nowiki>p=3rd|n=plural|case=dative</nowiki> | 8 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 5 | <nowiki>_</nowiki> | bekle | Verb | Verb | Pos | 6 | DERIV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 8 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 6 | bekleyen | <nowiki>_</nowiki> | Adj | APresPart | <nowiki>_</nowiki> | 7 | MODIFIER | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | mezőgazdasági | mezőgazdasági | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 8 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 7 | eleman | eleman | Noun | Noun | <nowiki>A3sg|Pnon|Nom</nowiki> | 10 | SUBJECT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | termelőknek | termelő | N | Nc | <nowiki>n=plural|case=dative|proper=no</nowiki> | 4 | DAT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 8 | yığına | yığın | Noun | Noun | <nowiki>A3sg|Pnon|Dat</nowiki> | 10 | <nowiki>DATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 3 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 9 | <nowiki>_</nowiki> | it | Verb | Verb | <nowiki>_</nowiki> | 10 | DERIV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | akik | aki | P | Pr | <nowiki>p=3rd|n=plural|case=nominative</nowiki> | 21 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 10 | itilir | <nowiki>_</nowiki> | Verb | Verb | <nowiki>Pass|Pos|Aor|A3sg</nowiki> | 11 | SENTENCE | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | egy | egy | T | Ti | <nowiki>def=no</nowiki> | 19 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | | 11 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 19 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 13 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 15 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 14 | múlt | múlt | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 15 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 15 | héten | hét | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 16 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 16 | megjelent | megjelent | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 19 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 17 | földművelésügyi | földművelésügyi | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 18 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 18 | minisztériumi | minisztériumi | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 19 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 19 | rendelet | rendelet | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 20 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 20 | alapján | alap | N | Nc | <nowiki>n=singular|case=superessive|proper=no|pperson=3rd|pnumber=singular</nowiki> | 21 | SUP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 21 | kérik | kér | V | Vm | <nowiki>mood=indicative|t=present|p=3rd|n=plural|def=yes</nowiki> | 5 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 22 | ősszel | ősszel | R | Rx | <nowiki>_</nowiki> | 23 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 23 | lejáró | lejáró | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 27 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 24 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 27 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 25 | éven | év | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 26 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 26 | belüli | belüli | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 27 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 27 | hiteleik | hitel | N | Nc | <nowiki>n=plural|case=nominative|proper=no|pperson=3rd|pnumber=plural</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 28 | átütemezését | átütemezés | N | Nc | <nowiki>n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 21 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| 29 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | SPUNCT | SPUNCT | <nowiki>_</nowiki> | 3 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> | | |
| |
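The Turkish samples above show the derivational nodes in practice: rows whose FORM column is ''_'' (e.g. ''ötele'' before ''Öteleme'') appear to be segments of the next token that carries a surface form. A minimal sketch of reading such rows and regrouping the segments follows. It assumes the standard ten tab-separated CoNLL 2006/2007 columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL); the grouping rule is inferred from the sample only, and the names ''Token'', ''parse_sentence'' and ''surface_words'' are hypothetical.

```python
from typing import List, NamedTuple, Tuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: str
    head: int
    deprel: str


def parse_sentence(block: str) -> List[Token]:
    """Parse one sentence: one token per line, tab-separated columns."""
    tokens = []
    for line in block.strip().splitlines():
        cols = line.split("\t")
        tokens.append(Token(int(cols[0]), cols[1], cols[2], cols[3],
                            cols[4], cols[5], int(cols[6]), cols[7]))
    return tokens


def surface_words(tokens: List[Token]) -> List[Tuple[str, List[int]]]:
    """Attach segments with FORM '_' to the next token that has a form.

    Assumption (from the sample, not from the treebank documentation):
    a derived word is a run of '_' rows closed by the row with the form.
    """
    words, pending = [], []
    for tok in tokens:
        pending.append(tok.id)
        if tok.form != "_":
            words.append((tok.form, pending))
            pending = []
    return words


# First three rows of the test sample above; whitespace is converted to
# tabs here only to keep the listing readable.
ROWS = [
    "1 _ ötele Verb Verb Pos 2 DERIV _ _",
    "2 Öteleme _ Noun NInf A3sg|Pnon|Nom 3 CLASSIFIER _ _",
    "3 işleminde işlem Noun Noun A3sg|P3sg|Loc 10 LOCATIVE.ADJUNCT _ _",
]
SAMPLE = "\n".join("\t".join(row.split()) for row in ROWS)

tokens = parse_sentence(SAMPLE)
print(surface_words(tokens))
```

Keeping the token IDs for each surface word makes it possible to map dependency heads, which point at individual segments, back to whole words.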
==== Parsing ==== | ==== Parsing ==== |