Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:tr [2012/03/22 20:48]
zeman Domain.
user:zeman:treebanks:tr [2012/03/22 21:04]
zeman Inside.
Line 39:
 ==== Size ====
  
-According to their website, SzTB 2.0 contains 1.2 million words plus 250 thousand punctuation tokens in 82000 sentences. Only a fragment was converted to dependencies in the CoNLL 2007 version: 139,143 tokens in 6424 sentences, yielding 21.66 tokens per sentence on average (131,799 tokens / 6034 sentences training, 7344 tokens / 390 sentences test).
+According to their website, the treebank contains 7262 sentences. The CoNLL 2007 version contains 69695 tokens in 5935 sentences, yielding 11.74 tokens per sentence on average (65182 tokens / 5635 sentences training, 4513 tokens / 300 sentences test).
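
The totals and the average quoted for the CoNLL 2007 Turkish data can be re-derived from the training and test counts. The Python sketch below only repeats that arithmetic; the variable names are illustrative and not part of the wiki page.

```python
# Counts quoted in the Size section for the CoNLL 2007 Turkish data.
train_tokens, train_sentences = 65182, 5635
test_tokens, test_sentences = 4513, 300

# Totals over training and test data.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens.
average = total_tokens / total_sentences

print(total_tokens, total_sentences, round(average, 2))
```

The same check applied to the removed Hungarian figures (131,799 + 7344 tokens, 6034 + 390 sentences) reproduces the 139,143 tokens, 6424 sentences and 21.66 tokens per sentence of the older revision.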
  
 ==== Inside ====
  
-The original Szeged Treebank is a phrase-based treebank and it is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is dependency-based (i.e. the head of each phrase was identified), distributed in the CoNLL 2006/2007 format.
+The original METU-Sabanci Treebank is distributed in XML-based, TEI-compliant format. The CoNLL 2007 version is distributed in the [[:format-conll|CoNLL 2006/2007 format]].
  
-Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually. The tagset used in SzTB seems to be same or similar to [[http://nl.ijs.si/ME/V4/msd/html/msd-hu.html|Multext-East]]. In the CoNLL version, tags were decomposed into CPOS column, POS column and the list of feature-value pairs in the FEAT column.
+Morphological annotation includes lemmas. Morphosyntactic tags were probably disambiguated manually.
  
-Personal names have been collapsed into one token, using underscore as the joining character (e.g. Torgyán_József).
+There are special derivational nodes. Derived words have been split into several tokens (see also the sample below).
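
To illustrate the derivational nodes, here is a minimal Python sketch, assuming the usual ten tab-separated columns of the CoNLL 2006/2007 format (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). It treats a row whose FORM is `_` as a derivational node belonging to the next row that carries a surface form; that grouping rule is an assumption read off the test sample on this page, not a documented property of the treebank. The three rows are taken from that sample (the IDs 2 and 3 are inferred from row order), and `Token`, `parse_row` and `group_words` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Token:
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: list
    head: int
    deprel: str


def parse_row(line):
    """Parse one tab-separated CoNLL 2006/2007 row (first eight columns)."""
    cols = line.split("\t")
    # FEATS is "_" when empty, otherwise a "|"-separated list.
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


def group_words(tokens):
    """Attach each run of FORM-less rows to the next row with a surface form.

    Assumption: a derivational node has FORM "_" and precedes its word.
    Trailing FORM-less rows without a following word are dropped.
    """
    words, pending = [], []
    for tok in tokens:
        pending.append(tok)
        if tok.form != "_":
            words.append((tok.form, pending))
            pending = []
    return words


# Rows 1-3 of the first test sentence shown in the sample.
rows = [
    "1\t_\tötele\tVerb\tVerb\tPos\t2\tDERIV\t_\t_",
    "2\tÖteleme\t_\tNoun\tNInf\tA3sg|Pnon|Nom\t3\tCLASSIFIER\t_\t_",
    "3\tişleminde\tişlem\tNoun\tNoun\tA3sg|P3sg|Loc\t10\tLOCATIVE.ADJUNCT\t_\t_",
]
words = group_words([parse_row(r) for r in rows])
for form, parts in words:
    print(form, [(p.lemma, p.pos) for p in parts])
```

Under this grouping, Öteleme is assembled from two rows (the verbal root ötele and the nominal node tagged NInf), while işleminde stays a single row.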
  
 ==== Sample ====
Line 53:
 The first sentence of the CoNLL 2007 training data:
  
-| 1 | Az | az |  | Tf | <nowiki>def=yes</nowiki> | 4 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 2 | elmúlt | elmúlt | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 4 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 3 | nyolc | nyolc |  | Mc | <nowiki>n=singular|case=nominative</nowiki> | 4 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 4 | hónapban | hónap | N | Nc | <nowiki>n=singular|case=inessive|proper=no</nowiki> | 16 | INE | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 5 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 6 | amelyből | amely | P | Pr | <nowiki>p=3rd|n=singular|case=elative</nowiki> | 11 | ELA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 7 | összesen | összesen |  | Rx | <nowiki>_</nowiki> |  | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 8 | hatot | hat | M | Mc | <nowiki>n=singular|case=accusative</nowiki> | 11 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 9 | kényszerűségből | kényszerűség |  | Nc | <nowiki>n=singular|case=elative|proper=no</nowiki> | 11 | ELA | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 10 | szabadságon | szabadság | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 11 | SUP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 11 | töltött | tölt |  | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=no</nowiki> | 16 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 12 |  |  |  | Tf | <nowiki>def=yes</nowiki> | 14 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 13 | parlamenti | parlamenti | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 14 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 14 | ellenzék | ellenzék | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 11 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 15 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 16 | megváltozott | megváltozik | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=no</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 17 | itthon | itthon | R | Rx | <nowiki>_</nowiki> | 16 | LOCY | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 18 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 19 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 19 | hatalommegosztás | hatalommegosztás | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 22 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 20 | <nowiki>1990-ben</nowiki> | 1990 | M | Mc | <nowiki>n=singular|case=inessive</nowiki> | 21 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 21 | kialakított | kialakított | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 22 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 22 | rendszere | rendszer | N | Nc | <nowiki>n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 16 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 23 | <nowiki>:</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 24 | az | az | T | Tf | <nowiki>def=yes</nowiki> | 26 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 25 | e | e | P | Pd | <nowiki>p=3rd|n=singular|case=nominative</nowiki> | 26 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 26 | héten | hét | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 27 | audienciát | audiencia | N | Nc | <nowiki>n=singular|case=accusative|proper=no</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 28 | tartó | tartó | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 29 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 29 | kormányfő | kormányfő | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 31 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 30 | gyakorlatilag | gyakorlati | A | Af | <nowiki>deg=positive|n=singular|case=essive</nowiki> | 31 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 31 | kivonta | kivon | V | Vm | <nowiki>mood=indicative|t=past|p=3rd|n=singular|def=yes</nowiki> | 16 | CP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 32 | magát | maga | P | Px | <nowiki>p=3rd|n=singular|case=accusative</nowiki> | 31 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 33 | az | az | T | Tf | <nowiki>def=yes</nowiki> | 34 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 34 | Országgyűlés | Országgyűlés | N | Np | <nowiki>n=singular|case=nominative|proper=yes</nowiki> | 35 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 35 | ellenőrzése | ellenőrzés | N | Nc | <nowiki>n=singular|case=nominative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 36 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 36 | alól | alól | S | St | <nowiki>_</nowiki> | 31 | PP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 37 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | SPUNCT | SPUNCT | <nowiki>_</nowiki> | 16 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 1 | Ama | ama | Conj | Conj | <nowiki>_</nowiki> |  | <nowiki>S.MODIFIER</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 2 | hiçbir | hiçbir | Det | Det | <nowiki>_</nowiki> |  | DETERMINER | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 3 | şey | şey | Noun | Noun | <nowiki>A3sg|Pnon|Nom</nowiki> |  | OBJECT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 4 | söylemedim | söyle | Verb | Verb | <nowiki>Neg|Past|A1sg</nowiki> | 8 | SENTENCE | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 5 | ki | ki | Conj | Conj | <nowiki>_</nowiki> |  | INTENSIFIER | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 6 | ben | ben | Pron | PersP | <nowiki>A1sg|Pnon|Nom</nowiki> |  | SUBJECT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 7 | sizlere | siz | Pron | PersP | <nowiki>A2pl|Pnon|Dat</nowiki> |  | <nowiki>DATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 8 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | <nowiki>_</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
  
 The first sentence of the CoNLL 2007 test data:
  
-| 1 | A | a | T | Tf | <nowiki>def=yes</nowiki> |  | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 2 | bankokkal | bank | N | Nc | <nowiki>n=plural|case=instrumental|proper=no</nowiki> | 4 | INS | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 3 | kell | kell | V | Vm | <nowiki>mood=indicative|t=present|p=3rd|n=singular|def=no</nowiki> | 0 | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 4 | egyezkedniük | egyezkedik | V | Vm | <nowiki>mood=infinitive|t=present|p=3rd|n=plural</nowiki> | 3 | INF | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 5 | azoknak | az |  | Pd | <nowiki>p=3rd|n=plural|case=dative</nowiki> | 8 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 6 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 8 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 7 | mezőgazdasági | mezőgazdasági | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 8 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 8 | termelőknek | termelő | N | Nc | <nowiki>n=plural|case=dative|proper=no</nowiki> | 4 | DAT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 9 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 3 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 10 | akik | aki | P | Pr | <nowiki>p=3rd|n=plural|case=nominative</nowiki> | 21 | SUBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 11 | egy | egy | T | Ti | <nowiki>def=no</nowiki> | 19 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 12 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 19 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 13 | a | a | T | Tf | <nowiki>def=yes</nowiki> | 15 | DET | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 14 | múlt | múlt | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 15 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 15 | héten | hét | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 16 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 16 | megjelent | megjelent |  | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 19 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 17 | földművelésügyi | földművelésügyi | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 18 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 18 | minisztériumi | minisztériumi | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 19 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 19 | rendelet | rendelet | N | Nc | <nowiki>n=singular|case=nominative|proper=no</nowiki> | 20 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 20 | alapján | alap | N | Nc | <nowiki>n=singular|case=superessive|proper=no|pperson=3rd|pnumber=singular</nowiki> | 21 | SUP | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 21 | kérik | kér | V | Vm | <nowiki>mood=indicative|t=present|p=3rd|n=plural|def=yes</nowiki> | 5 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 22 | ősszel | ősszel | R | Rx | <nowiki>_</nowiki> | 23 | ADV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 23 | lejáró | lejáró | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 27 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 24 | <nowiki>,</nowiki> | <nowiki>_</nowiki> | WPUNCT | WPUNCT | <nowiki>_</nowiki> | 27 | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 25 | éven | év | N | Nc | <nowiki>n=singular|case=superessive|proper=no</nowiki> | 26 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 26 | belüli | belüli | A | Af | <nowiki>deg=positive|n=singular|case=nominative</nowiki> | 27 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 27 | hiteleik | hitel | N | Nc | <nowiki>n=plural|case=nominative|proper=no|pperson=3rd|pnumber=plural</nowiki> | 28 | ATT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 28 | átütemezését | átütemezés | N | Nc | <nowiki>n=singular|case=accusative|proper=no|pperson=3rd|pnumber=singular</nowiki> | 21 | OBJ | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
-| 29 | <nowiki>.</nowiki> | <nowiki>_</nowiki> | SPUNCT | SPUNCT | <nowiki>_</nowiki> |  | PUNCT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 1 | <nowiki>_</nowiki> | ötele | Verb | Verb | Pos | 2 | DERIV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 2 | Öteleme | <nowiki>_</nowiki> | Noun | NInf | <nowiki>A3sg|Pnon|Nom</nowiki> | 3 | CLASSIFIER | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 3 | işleminde | işlem | Noun | Noun | <nowiki>A3sg|P3sg|Loc</nowiki> | 10 | <nowiki>LOCATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 4 | kuyrukta | kuyruk | Noun | Noun | <nowiki>A3sg|Pnon|Loc</nowiki> |  | <nowiki>LOCATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 5 | <nowiki>_</nowiki> | bekle | Verb | Verb | Pos |  | DERIV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 6 | bekleyen | <nowiki>_</nowiki> | Adj | APresPart | <nowiki>_</nowiki> |  | MODIFIER | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 7 | eleman | eleman | Noun | Noun | <nowiki>A3sg|Pnon|Nom</nowiki> | 10 | SUBJECT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 8 | yığına | yığın | Noun | Noun | <nowiki>A3sg|Pnon|Dat</nowiki> | 10 | <nowiki>DATIVE.ADJUNCT</nowiki> | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 9 | <nowiki>_</nowiki> | it | Verb | Verb | <nowiki>_</nowiki> | 10 | DERIV | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 10 | itilir | <nowiki>_</nowiki> | Verb | Verb | <nowiki>Pass|Pos|Aor|A3sg</nowiki> | 11 | SENTENCE | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
+| 11 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Punc | Punc | <nowiki>_</nowiki> |  | ROOT | <nowiki>_</nowiki> | <nowiki>_</nowiki> |
  
 ==== Parsing ====