Differences
This shows you the differences between two versions of the page.
user:zeman:treebanks [2011/11/18 18:38] zeman German TIGER-XML sample. |
user:zeman:treebanks [2011/11/19 13:08] zeman Greek sample. |
||
---|---|---|---|
Line 1407: | Line 1407: | ||
It is not clear what the // | It is not clear what the // | ||
+ | |||
+ | The original treebank is phrase-based. The dependencies in the CoNLL versions must therefore have been derived using a head-selection procedure. Besides the CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format. | ||
==== Sample ==== | ==== Sample ==== | ||
Line 1535: | Line 1537: | ||
==== Parsing ==== | ==== Parsing ==== | ||
- | Nonprojectivities in AnCora-CA are very rare. Only 487 of the 435,860 tokens in the CoNLL 2007 version | + | TIGER is a mildly nonprojective treebank. 15875 of the 680,710 tokens in the CoNLL 2009 training+development datasets |
- | The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http:// | + | The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http:// |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ | ||
- | | Titov et al. | 87.40 | 93.40 | | + | | MST (McDonald |
- | | Sagae | 88.16 | 93.34 | | + | | Riedel |
- | | Malt (Nilsson | + | | Basis (O' |
- | | Nakagawa | + | | Malt (Nivre et al.) | 85.82 | 88.76 | |
- | | Carreras | 87.60 | 92.46 | | + | |
- | | Malt (Hall et al.) | 87.74 | 92.20 | | + | |
- | The two Malt parser results of 2007 (single malt and blended) are described in [[http:// | + | The results of the CoNLL 2009 shared task are [[http:// |
- | + | ||
- | The results of the CoNLL 2009 shared task are [[http:// | + | |
^ Parser (Authors) ^ LAS ^ | ^ Parser (Authors) ^ LAS ^ | ||
- | | Merlo | 87.86 | | + | | Bohnet | 87.48 | |
- | | Che | 86.56 | | + | | Merlo | 87.29 | |
- | | Bohnet | + | | Chen | 86.24 | |
- | | Chen | 85.88 | | + | | Che | 86.19 | |
+ | |||
+ | ===== Greek (el) ===== | ||
+ | |||
+ | Greek Dependency Treebank (GDT) | ||
+ | |||
+ | ==== Versions ==== | ||
+ | |||
+ | * CoNLL 2007 | ||
+ | |||
+ | ==== Obtaining and License ==== | ||
+ | |||
+ | There does not seem to be any regular distribution channel for the Greek Dependency Treebank. The CoNLL 2007 version had a restricted license for the duration of the shared task only. Republication of the CoNLL version in LDC is planned, but it has not happened yet. In the meantime, one can ask Prokopis Prokopidis (prokopis (at) ilsp (dot) gr) about availability of the corpus. | ||
+ | |||
+ | GDT was created by members of the [[http:// | ||
+ | |||
+ | ==== References ==== | ||
+ | |||
+ | * Website | ||
+ | * //no website dedicated to the treebank// | ||
+ | * Data | ||
+ | * //no separate citation// | ||
+ | * Principal publications | ||
+ | * Prokopis Prokopidis, Elina Desipri, Maria Koutsombogera, | ||
+ | * Documentation | ||
+ | * Description of tags and feature values is provided in the '' | ||
+ | |||
+ | ==== Domain ==== | ||
+ | |||
+ | Mixed (“GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel.”) | ||
+ | |||
+ | ==== Size ==== | ||
+ | |||
+ | The CoNLL 2007 version contains 70223 tokens in 2902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65419 tokens / 2705 sentences training, 4804 tokens / 197 sentences test). | ||
+ | |||
+ | ==== Inside ==== | ||
+ | |||
+ | The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http:// | ||
+ | |||
+ | The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. | ||
+ | |||
+ | The guidelines for syntactic annotation are documented in the other [[http:// | ||
+ | |||
+ | ==== Sample ==== | ||
+ | |||
+ | The first sentence of the CoNLL 2007 training data: | ||
+ | |||
+ | | 1 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ | | ||
+ | | 2 | Τα | ο | At | AtDf | Ne< | ||
+ | | 3 | αντισώματα | αντίσωμα | No | NoCm | Ne< | ||
+ | | 4 | IgG | IgG | Rg | RgFwOr | _ | 3 | Atr | _ | _ | | ||
+ | | 5 | είναι | είμαι | Vb | VbMn | Id< | ||
+ | | 6 | σαν | σαν | Ad | Ad | Ba | 5 | Adv | _ | _ | | ||
+ | | 7 | μακροπρόθεσμη | μακροπρόθεσμος | Aj | Aj | Ba< | ||
+ | | 8 | μνήμη | μνήμη | No | NoCm | Fe< | ||
+ | | 9 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | | ||
+ | | 10 | ενώ | ενώ | Cj | CjCo | _ | 26 | Coord | _ | _ | | ||
+ | | 11 | το | ο | At | AtDf | Ne< | ||
+ | | 12 | IgA | IgA | Rg | RgFwOr | _ | 15 | Sb | _ | _ | | ||
+ | | 13 | πιστεύεται | πιστεύεται | Vb | VbMn | Id< | ||
+ | | 14 | ότι | ότι | Cj | CjSb | _ | 13 | AuxC | _ | _ | | ||
+ | | 15 | είναι | είμαι | Vb | VbMn | Id< | ||
+ | | 16 | ένας | ένας | At | AtId | Ma< | ||
+ | | 17 | συγκεκριμένος | συγκεκριμένος | Aj | Aj | Ba< | ||
+ | | 18 | δείκτης | δείκτης | No | NoCm | Ma< | ||
+ | | 19 | για | για | AsPp | AsPpSp | _ | 18 | AuxP | _ | _ | | ||
+ | | 20 | πρόσφατες | πρόσφατος | Aj | Aj | Ba< | ||
+ | | 21 | ή | ή | Cj | CjCo | _ | 23 | Coord | _ | _ | | ||
+ | | 22 | χρόνιες | χρόνιος | Aj | Aj | Ba< | ||
+ | | 23 | λοιμώξεις | λοίμωξη | No | NoCm | Fe< | ||
+ | | 24 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ | | ||
+ | | 25 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | | ||
+ | | 26 | εξηγεί | εξηγώ | Vb | VbMn | Id< | ||
+ | | 27 | η | ο | At | AtDf | Fe< | ||
+ | | 28 | Δρ | Δρ | Rg | RgFwTr | _ | 26 | Sb | _ | _ | | ||
+ | | 29 | Αρκάρι | Αρκάρι | No | NoCm | Ne< | ||
+ | | 30 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ | | ||
+ | |||
+ | The first sentence of the CoNLL 2007 test data: | ||
+ | |||
+ | | 1 | Η | ο | At | AtDf | Fe< | ||
+ | | 2 | Σίφνος | Σίφνος | No | NoPr | Fe< | ||
+ | | 3 | φημίζεται | φημίζομαι | Vb | VbMn | Id< | ||
+ | | 4 | και | και | Cj | CjCo | _ | 5 | AuxY | _ | _ | | ||
+ | | 5 | για | για | AsPp | AsPpSp | _ | 3 | AuxP | _ | _ | | ||
+ | | 6 | τα | ο | At | AtDf | Ne< | ||
+ | | 7 | καταγάλανα | καταγάλανος | Aj | Aj | Ba< | ||
+ | | 8 | νερά | νερό | No | NoCm | Ne< | ||
+ | | 9 | των | ο | At | AtDf | Fe< | ||
+ | | 10 | πανέμορφων | πανέμορφος | Aj | Aj | Ba< | ||
+ | | 11 | ακτών | ακτή | No | NoCm | Fe< | ||
+ | | 12 | της | μου | Pn | PnPo | Fe< | ||
+ | | 13 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ | | ||
+ | |||
+ | ==== Parsing ==== | ||
+ | |||
+ | Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%). | ||
+ | |||
+ | The results of the CoNLL 2006 shared task are [[http:// | ||
+ | |||
+ | ^ Parser (Authors) ^ LAS ^ UAS ^ | ||
+ | | MST (McDonald et al.) | 87.57 | 92.04 | | ||
+ | | Malt (Nivre et al.) | 87.41 | 91.72 | | ||
+ | | Nara (Yuchang Cheng) | 86.34 | 91.30 | | ||
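
The Parsing sections above quote how many tokens are "attached nonprojectively". The page does not say how those counts were obtained. The following is only a minimal sketch under one common definition: a token is nonprojective if some token lying strictly between it and its head is not a descendant of that head. The function names `nonprojective_tokens` and `percentage` are hypothetical, and the head vectors in the usage note are toy examples, not treebank data. Heads use the CoNLL convention of 1-based token ids with 0 as the artificial root.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens attached nonprojectively.

    heads[i] is the head id of token i+1; 0 denotes the artificial root.
    Assumed definition: the arc head -> dep is nonprojective if any token
    strictly between them is not dominated by the head.
    """
    n = len(heads)

    def descends_from(node, ancestor):
        # Follow the head chain upwards from node until the root is reached.
        steps = 0
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
            if steps > n:
                raise ValueError("cycle in head vector")
        # Only the artificial root (0) dominates every token.
        return ancestor == 0

    flagged = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        left, right = min(dep, head), max(dep, head)
        if any(not descends_from(k, head) for k in range(left + 1, right)):
            flagged.append(dep)
    return flagged


def percentage(part, whole):
    """Share of part in whole, as a percentage rounded to two decimals."""
    return round(100.0 * part / whole, 2)
```

For the toy head vector `[0, 4, 1, 1]`, `nonprojective_tokens` flags token 2, because token 3 lies between token 2 and its head 4 without being dominated by 4. Calling `percentage(747, 196151)` reproduces the 0.38% quoted in the paragraph above. Other definitions of nonprojectivity (for example, counting crossing arcs, or treating arcs from the artificial root differently) can give slightly different counts.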