
Institute of Formal and Applied Linguistics Wiki




user:zeman:treebanks:el [2011/11/20 19:41] (current)
created by zeman
 +===== Greek (el) =====
 +
 +Greek Dependency Treebank (GDT)
 +
 +==== Versions ====
 +
 +  * CoNLL 2007
 +
 +==== Obtaining and License ====
 +
 +There does not seem to be any regular distribution channel for the Greek Dependency Treebank. The CoNLL 2007 version was released under a restricted license valid only for the duration of the shared task. Republication of the CoNLL version through the LDC is planned but has not happened yet. In the meantime, one can ask Prokopis Prokopidis (prokopis (at) ilsp (dot) gr) about the availability of the corpus.
 +
 +GDT was created by members of the [[http://www.ilsp.gr/|Institute for Language and Speech Processing]] (Ινστιτούτο Επεξεργασίας του Λόγου, ILSP/ΙΕΛ), Επιδαύρου & Αρτέμιδος 6, Παράδεισος Αμαρουσίου, GR-15125 Αθήνα, Greece.
 +
 +==== References ====
 +
 +  * Website
 +    * //no website dedicated to the treebank//
 +  * Data
 +    * //no separate citation//
 +  * Principal publications
 +    * Prokopis Prokopidis, Elina Desipri, Maria Koutsombogera, Harris Papageorgiou, Stelios Piperidis: [[http://www.ilsp.gr/homepages/prokopidis/documents/gdt_tlt2005.pdf|Theoretical and Practical Issues in the Construction of a Greek Dependency Corpus]]. In: Montserrat Civit, Sandra Kübler, Ma. Antònia Martí (eds.), Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pp. 149-160, Barcelona, Spain, 2005.
 +  * Documentation
 +    * A description of the tags and feature values is provided in the ''doc/README'' file in the CoNLL 2007 data distribution.
 +
 +==== Domain ====
 +
 +Mixed (“GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel.”)
 +
 +==== Size ====
 +
 +The CoNLL 2007 version contains 70223 tokens in 2902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65419 tokens / 2705 sentences training, 4804 tokens / 197 sentences test).
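The size figures above can be cross-checked with a few lines of arithmetic. This is only a sanity check on the numbers quoted on this page, not a count over the actual data files; the variable names are illustrative.

```python
# Figures quoted above for the CoNLL 2007 data split.
train_tokens, train_sentences = 65419, 2705
test_tokens, test_sentences = 4804, 197

# Training and test parts should add up to the reported totals.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens.
average_length = total_tokens / total_sentences

print(total_tokens, total_sentences, f"{average_length:.2f}")
```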
 +
 +==== Inside ====
 +
 +The syntactic annotation style and the tagset for dependency relations (analytical functions) in GDT have been modeled after the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Prague Dependency Treebank]].
 +
 +==== Sample ====
 +
 +The first sentence of the CoNLL 2007 training data:
 +
 +| 1 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ |
 +| 2 | Τα | ο | At | AtDf | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Nm | 3 | Atr | _ | _ |
 +| 3 | αντισώματα | αντίσωμα | No | NoCm | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Nm | 5 | Sb | _ | _ |
 +| 4 | IgG | IgG | Rg | RgFwOr | _ | 3 | Atr | _ | _ |
 +| 5 | είναι | είμαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 10 | Obj_Co | _ | _ |
 +| 6 | σαν | σαν | Ad | Ad | Ba | 5 | Adv | _ | _ |
 +| 7 | μακροπρόθεσμη | μακροπρόθεσμος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 8 | Atr | _ | _ |
 +| 8 | μνήμη | μνήμη | No | NoCm | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 6 | Adv | _ | _ |
 +| 9 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ |
 +| 10 | ενώ | ενώ | Cj | CjCo | _ | 26 | Coord | _ | _ |
 +| 11 | το | ο | At | AtDf | Ne<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 12 | Atr | _ | _ |
 +| 12 | IgA | IgA | Rg | RgFwOr | _ | 15 | Sb | _ | _ |
 +| 13 | πιστεύεται | πιστεύεται | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 10 | Obj_Co | _ | _ |
 +| 14 | ότι | ότι | Cj | CjSb | _ | 13 | AuxC | _ | _ |
 +| 15 | είναι | είμαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 14 | Sb | _ | _ |
 +| 16 | ένας | ένας | At | AtId | Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 18 | Atr | _ | _ |
 +| 17 | συγκεκριμένος | συγκεκριμένος | Aj | Aj | Ba<nowiki>|</nowiki>Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 18 | Atr | _ | _ |
 +| 18 | δείκτης | δείκτης | No | NoCm | Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 15 | Pnom | _ | _ |
 +| 19 | για | για | AsPp | AsPpSp | _ | 18 | AuxP | _ | _ |
 +| 20 | πρόσφατες | πρόσφατος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 21 | Atr_Co | _ | _ |
 +| 21 | ή | ή | Cj | CjCo | _ | 23 | Coord | _ | _ |
 +| 22 | χρόνιες | χρόνιος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 21 | Atr_Co | _ | _ |
 +| 23 | λοιμώξεις | λοίμωξη | No | NoCm | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 19 | Atr | _ | _ |
 +| 24 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ |
 +| 25 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ |
 +| 26 | εξηγεί | εξηγώ | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Av<nowiki>|</nowiki>Xx | 0 | Pred | _ | _ |
 +| 27 | η | ο | At | AtDf | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 28 | Atr | _ | _ |
 +| 28 | Δρ | Δρ | Rg | RgFwTr | _ | 26 | Sb | _ | _ |
 +| 29 | Αρκάρι | Αρκάρι | No | NoCm | Ne<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 28 | Atr | _ | _ |
 +| 30 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ |
 +
 +The first sentence of the CoNLL 2007 test data:
 +
 +| 1 | Η | ο | At | AtDf | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 2 | Atr | _ | _ |
 +| 2 | Σίφνος | Σίφνος | No | NoPr | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 3 | Sb | _ | _ |
 +| 3 | φημίζεται | φημίζομαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 0 | Pred | _ | _ |
 +| 4 | και | και | Cj | CjCo | _ | 5 | AuxY | _ | _ |
 +| 5 | για | για | AsPp | AsPpSp | _ | 3 | AuxP | _ | _ |
 +| 6 | τα | ο | At | AtDf | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 8 | Atr | _ | _ |
 +| 7 | καταγάλανα | καταγάλανος | Aj | Aj | Ba<nowiki>|</nowiki>Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 8 | Atr | _ | _ |
 +| 8 | νερά | νερό | No | NoCm | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 5 | Obj | _ | _ |
 +| 9 | των | ο | At | AtDf | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge | 11 | Atr | _ | _ |
 +| 10 | πανέμορφων | πανέμορφος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge | 11 | Atr | _ | _ |
 +| 11 | ακτών | ακτή | No | NoCm | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge | 8 | Atr | _ | _ |
 +| 12 | της | μου | Pn | PnPo | Fe<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Ge<nowiki>|</nowiki>Xx | 11 | Atr | _ | _ |
 +| 13 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ |
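The sample rows above follow the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The following is a minimal sketch of a reader, assuming the data files are tab-separated with one token per line and a blank line between sentences, as is usual for CoNLL shared task data. The names ''Token'', ''parse_token'' and ''read_sentences'' are hypothetical and not part of any distribution.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass(frozen=True)
class Token:
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: List[str]   # e.g. ["Fe", "Sg", "Nm"]; empty when the column is "_"
    head: int          # 0 means attachment to the artificial root
    deprel: str


def parse_token(line: str) -> Token:
    """Parse one tab-separated CoNLL-X token line (assumed 10 columns)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    feats = [] if cols[5] == "_" else cols[5].split("|")
    # PHEAD and PDEPREL (columns 9 and 10) are "_" in the sample, so they are ignored.
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group token lines into sentences; blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


# Second token of the test sentence shown above, written as a raw data line.
sample = "2\tΣίφνος\tΣίφνος\tNo\tNoPr\tFe|Sg|Nm\t3\tSb\t_\t_"
token = parse_token(sample)
print(token.form, token.feats, token.head, token.deprel)
```

Splitting FEATS on the vertical bar gives the individual feature values that the wiki table has to escape with ''nowiki'' tags.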
 +
 +==== Parsing ====
 +
 +Nonprojective attachments are rare in GDT: only 823 of the 70223 tokens in the CoNLL 2007 version (1.17%) are attached nonprojectively.
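A token is usually called nonprojectively attached when some token lying between it and its head is not dominated by that head. The sketch below implements this common definition; whether the count quoted above was obtained with exactly this definition is an assumption, and the function names are hypothetical.

```python
from typing import List


def is_dominated_by(node: int, ancestor: int, heads: List[int]) -> bool:
    """True if `ancestor` is reached by following head links upward from `node`."""
    steps = 0
    # The step limit guards against malformed input that contains a cycle.
    while node != 0 and steps <= len(heads):
        node = heads[node - 1]
        if node == ancestor:
            return True
        steps += 1
    return False


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the 1-based ids of tokens whose attachment is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # Every token strictly between dep and head must be dominated by head.
        if any(not is_dominated_by(k, head, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first test sentence shown in the Sample section.
test_sentence_heads = [2, 3, 0, 5, 3, 8, 8, 5, 11, 11, 8, 11, 0]
print(nonprojective_tokens(test_sentence_heads))

# Toy example (not from the treebank): token 2 attaches to 4 across the root token 3.
print(nonprojective_tokens([3, 4, 0, 3]))

# Proportion quoted above.
print(f"{100 * 823 / 70223:.2f}%")
```

Under this definition the test sentence is fully projective, while the toy tree has one nonprojective token.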
 +
 +The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]] and have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. Compared with the CoNLL 2006 shared task, the evaluation procedure was changed to include punctuation tokens. These are the best results for Greek, sorted by UAS:
 +
 +^ Parser (Authors) ^ LAS ^ UAS ^
 +| Nakagawa | 76.31 | 84.08 |
 +| Keith Hall et al. | 74.21 | 82.04 |
 +| Carreras | 73.56 | 81.37 |
 +| Malt (Nilsson et al.) | 74.65 | 81.22 |
 +| Titov et al. | 73.52 | 81.20 |
 +| Chen | 74.42 | 81.16 |
 +| Duan | 74.29 | 80.77 |
 +| Attardi et al. | 73.92 | 80.75 |
 +| Malt (J. Hall et al.) | 74.21 | 80.66 |
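The two scores in the table are the labeled and unlabeled attachment scores: UAS is the percentage of tokens that receive the correct head, and LAS the percentage that receive both the correct head and the correct dependency relation. A minimal sketch follows, assuming every token is scored, punctuation included, as in the 2007 evaluation. It is not the official evaluation script, and the predicted analysis is invented for illustration.

```python
from typing import List, Tuple

Arc = Tuple[int, str]  # (HEAD, DEPREL) of one token


def attachment_scores(gold: List[Arc], predicted: List[Arc]) -> Tuple[float, float]:
    """Return (LAS, UAS) in percent over all tokens, punctuation included."""
    if not gold or len(gold) != len(predicted):
        raise ValueError("gold and predicted must be non-empty and of equal length")
    # UAS counts matching heads; LAS additionally requires a matching label.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    n = len(gold)
    return 100.0 * las_hits / n, 100.0 * uas_hits / n


# Gold arcs of the first four tokens of the test sentence in the Sample section.
gold = [(2, "Atr"), (3, "Sb"), (0, "Pred"), (5, "AuxY")]
# Hypothetical parser output: one wrong label, one wrong head.
predicted = [(2, "Atr"), (3, "Obj"), (0, "Pred"), (3, "AuxY")]

las, uas = attachment_scores(gold, predicted)
print(f"LAS {las:.2f}  UAS {uas:.2f}")
```

Because a correct label only counts when the head is also correct, LAS can never exceed UAS, which is consistent with every row of the table.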
 +
 +The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]], and the details of the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]].
  
