user:zeman:treebanks:ta [2012/03/22 10:37] zeman |
user:zeman:treebanks:ta [2012/03/22 11:01] (current) zeman Nonprojectivity and parsing. |
* //no separate citation// | * //no separate citation// |
* Principal publications | * Principal publications |
* Loganathan Ramasamy, Zdeněk Žabokrtský: Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches. In: //Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011) – Volume Part I//, pages 82-95, Tokyo, Japan, 2011, published by Springer Berlin / Heidelberg, ISBN 978-3-642-19399-6. | * Loganathan Ramasamy, Zdeněk Žabokrtský: [[http://www.springerlink.com/content/w18v7621070h51g1/|Tamil Dependency Parsing: Results using Rule Based and Corpus Based Approaches]]. In: //Proceedings of the 12th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2011) – Volume Part I//, pages 82-95, Tokyo, Japan, 2011, published by Springer Berlin / Heidelberg, ISBN 978-3-642-19399-6. |
* Loganathan Ramasamy, Zdeněk Žabokrtský: Prague Dependency Style Treebank for Tamil. In: //Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)//, İstanbul, Turkey, 2012 | * Loganathan Ramasamy, Zdeněk Žabokrtský: Prague Dependency Style Treebank for Tamil. In: //Proceedings of the 8th International Conference on Language Resources and Evaluation (LREC 2012)//, İstanbul, Turkey, 2012 |
* Documentation | * Documentation |
* [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/morph_annotation.html|Morphological annotation]] | * [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/morph_annotation.html|Morphological annotation]] |
* [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/dependency_annotation.html|Syntactic annotation]] | * [[http://ufal.mff.cuni.cz/~ramasamy/tamiltb/0.1/dependency_annotation.html|Syntactic annotation]] |
| * Loganathan Ramasamy, Zdeněk Žabokrtský: [[http://ufal.mff.cuni.cz/~ramasamy/papers/2011-TamilTB-TR.pdf|Tamil Dependency Treebank (TamilTB) – 0.1 Annotation Manual]]. Technical Report TR-2011-42, ÚFAL MFF UK, Praha, Czechia, 2011 |
| |
==== Domain ==== | ==== Domain ==== |
==== Sample ==== | ==== Sample ==== |
| |
The first two sentences of the CoNLL 2006 training data: | The first sentence of the CoNLL version of the training data: |
| |
| 1 | غِيابُ_giyAbu | غِياب_giyAb | N | N | case=1<nowiki>|</nowiki>def=R | 0 | ExD | _ | _ | | | 1 | cennai | cennai | N | <nowiki>NEN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 2 | AAdjn | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | فُؤاد_fu&Ad | فُؤاد_fu&Ad | Z | Z | _ | 3 | Atr | _ | _ | | | 2 | arukE | arukE | P | <nowiki>PP-------</nowiki> | <nowiki>_</nowiki> | 18 | AuxP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | كَنْعان_kanoEAn | كَنْعان_kanoEAn | Z | Z | _ | 1 | Atr | _ | _ | | | 3 | sri | sri | N | <nowiki>NEN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |||||||||| | | 4 | perumpuTUril | perumpuTUr | N | <nowiki>NEL-3SN--</nowiki> | <nowiki>Cas=L|Per=3|Num=S|Gen=N</nowiki> | 18 | AAdjn | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 1 | فُؤاد_fu&Ad | فُؤاد_fu&Ad | Z | Z | _ | 2 | Atr | _ | _ | | | 5 | kirIn | kirIn | N | <nowiki>NEN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 6 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | كَنْعان_kanoEAn | كَنْعان_kanoEAn | Z | Z | _ | 9 | Sb | _ | _ | | | 6 | pIltu | pIltu | N | <nowiki>NEN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 11 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | ،_, | ،_, | G | G | _ | 2 | AuxG | _ | _ | | | 7 | <nowiki>(</nowiki> | <nowiki>(</nowiki> | Z | <nowiki>Z:-------</nowiki> | <nowiki>_</nowiki> | 6 | AuxG | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | رائِد_rA}id | رائِد_rA}id | N | N | _ | 2 | Atr | _ | _ | | | 8 | wavIna | wavInam | J | <nowiki>JJ-------</nowiki> | <nowiki>_</nowiki> | 6 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | القِصَّة_AlqiS~ap | قِصَّة_qiS~ap | N | N | gen=F<nowiki>|</nowiki>num=S<nowiki>|</nowiki>def=D | 4 | Atr | _ | _ | | | 9 | <nowiki>)</nowiki> | <nowiki>)</nowiki> | Z | <nowiki>Z:-------</nowiki> | <nowiki>_</nowiki> | 6 | AuxG | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | القَصِيرَةِ_AlqaSiyrapi | قَصِير_qaSiyr | A | A | gen=F<nowiki>|</nowiki>num=S<nowiki>|</nowiki>case=2<nowiki>|</nowiki>def=D | 5 | Atr | _ | _ | | | 10 | vimAna | vimAnam | N | <nowiki>NO--3SN--</nowiki> | <nowiki>Per=3|Num=S|Gen=N</nowiki> | 11 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | فِي_fiy | فِي_fiy | P | P | _ | 4 | AuxP | _ | _ | | | 11 | wilaiyaTTukkukk | wilaiyam | N | <nowiki>NND-3SN--</nowiki> | <nowiki>Cas=D|Per=3|Num=S|Gen=N</nowiki> | 12 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | لُبْنانِ_lubonAni | لُبْنان_lubonAn | Z | Z | case=2<nowiki>|</nowiki>def=R | 7 | Atr | _ | _ | | | 12 | Ana | Aku | T | <nowiki>Tg-------</nowiki> | <nowiki>_</nowiki> | 13 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | رَحَلَ_raHala | رَحَل-َ_raHal-a | V | VP | pers=3<nowiki>|</nowiki>gen=M<nowiki>|</nowiki>num=S | 0 | Pred | _ | _ | | | 13 | wilam | wilam | N | <nowiki>NNN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 18 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | مَساءَ_masA'a | مَساء_masA' | D | D | _ | 9 | Adv | _ | _ | | | 14 | yArukkum | yAr | R | <nowiki>RBD-3SA--</nowiki> | <nowiki>Cas=D|Per=3|Num=S|Gen=A</nowiki> | 15 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | أَمْسِ_>amosi | أَمْسِ_>amosi | D | D | _ | 10 | Atr | _ | _ | | | 15 | pATippu | pATippu | N | <nowiki>NNN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 16 | Comp | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | عَن_Ean | عَن_Ean | P | P | _ | 9 | AuxP | _ | _ | | | 16 | illATa | il | P | <nowiki>PP-------</nowiki> | <nowiki>_</nowiki> | 17 | AuxP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 13 | 81_81 | 81_81 | Q | Q | _ | 12 | Adv | _ | _ | | | 17 | vakaiyil | vakai | N | <nowiki>NNL-3SN--</nowiki> | <nowiki>Cas=L|Per=3|Num=S|Gen=N</nowiki> | 18 | AAdjn | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 14 | عاماً_EAmAF | عام_EAm | N | N | gen=M<nowiki>|</nowiki>num=S<nowiki>|</nowiki>case=4<nowiki>|</nowiki>def=I | 13 | Atr | _ | _ | | | 18 | etukkap | etu | V | <nowiki>Vu-T---AA</nowiki> | <nowiki>Ten=T|Voi=A|Neg=A</nowiki> | 20 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 15 | ._. | ._. | G | G | _ | 0 | AuxK | _ | _ | | | 19 | patum | patu | V | <nowiki>VR-F3SNPA</nowiki> | <nowiki>Ten=F|Per=3|Num=S|Gen=N|Voi=P|Neg=A</nowiki> | 18 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 20 | enRu | en | T | <nowiki>Tt-T----A</nowiki> | <nowiki>Ten=T|Neg=A</nowiki> | 23 | AuxC | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 21 | muTalvar | muTalvar | N | <nowiki>NNN-3SH--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=H</nowiki> | 22 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 22 | karuNAwiTi | karuNAwiTi | N | <nowiki>NEN-3SH--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=H</nowiki> | 23 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 23 | uRuTiyaLiTT | uRuTiyaLi | V | <nowiki>Vt-T---AA</nowiki> | <nowiki>Ten=T|Voi=A|Neg=A</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 24 | uLLAr | uL | V | <nowiki>VR-T3SHAA</nowiki> | <nowiki>Ten=T|Per=3|Num=S|Gen=H|Voi=A|Neg=A</nowiki> | 23 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| | 25 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | <nowiki>Z#-------</nowiki> | <nowiki>_</nowiki> | 0 | AuxK | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| |
The first sentence of the CoNLL 2006 test data: | The first sentence of the CoNLL version of the test data: |
| |
| 1 | اِتِّفاقٌ_Ait~ifAqN | اِتِّفاق_Ait~ifAq | N | N | case=1<nowiki>|</nowiki>def=I | 0 | ExD | _ | _ | | | 1 | pikAr | pikAr | N | <nowiki>NEN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 2 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 2 | بَيْنَ_bayona | بَيْنَ_bayona | P | P | _ | 1 | AuxP | _ | _ | | | 2 | iliruwTu | iliruwTu | P | <nowiki>PP-------</nowiki> | <nowiki>_</nowiki> | 4 | AuxP | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 3 | لُبْنانِ_lubonAni | لُبْنان_lubonAn | Z | Z | case=2<nowiki>|</nowiki>def=R | 4 | Atr | _ | _ | | | 3 | ErALamAna | ErALamAna | J | <nowiki>JJ-------</nowiki> | <nowiki>_</nowiki> | 4 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 4 | وَ_wa | وَ_wa | C | C | _ | 2 | Coord | _ | _ | | | 4 | iLainjarkaL | iLainjar | N | <nowiki>NNN-3PA--</nowiki> | <nowiki>Cas=N|Per=3|Num=P|Gen=A</nowiki> | 9 | Sb | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 5 | سُورِيَّةٍ_suwriy~apK | سُورِيا_suwriyA | Z | Z | gen=F<nowiki>|</nowiki>num=S<nowiki>|</nowiki>case=2<nowiki>|</nowiki>def=I | 4 | Atr | _ | _ | | | 5 | vElai | vElai | N | <nowiki>NNN-3SN--</nowiki> | <nowiki>Cas=N|Per=3|Num=S|Gen=N</nowiki> | 6 | Obj | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 6 | عَلَى_EalaY | عَلَى_EalaY | P | P | _ | 1 | AuxP | _ | _ | | | 6 | TEti | TEtu | V | <nowiki>Vt-T---AA</nowiki> | <nowiki>Ten=T|Voi=A|Neg=A</nowiki> | 9 | AAdjn | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 7 | رَفْعِ_rafoEi | رَفْع_rafoE | N | N | case=2<nowiki>|</nowiki>def=R | 6 | Atr | _ | _ | | | 7 | veLi | veLi | J | <nowiki>JJ-------</nowiki> | <nowiki>_</nowiki> | 8 | Atr | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 8 | مُسْتَوَى_musotawaY | مُسْتَوَى_musotawaY | N | N | _ | 7 | Atr | _ | _ | | | 8 | mAwilangkaLukku | mAwilam | N | <nowiki>NND-3PN--</nowiki> | <nowiki>Cas=D|Per=3|Num=P|Gen=N</nowiki> | 9 | AAdjn | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 9 | التَبادُلِ_AltabAduli | تَبادُل_tabAdul | N | N | case=2<nowiki>|</nowiki>def=D | 8 | Atr | _ | _ | | | 9 | kutipeyarwTu | kutipeyar | V | <nowiki>Vt-T---AA</nowiki> | <nowiki>Ten=T|Voi=A|Neg=A</nowiki> | 0 | Pred | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 10 | التِجارِيِّ_AltijAriy~i | تِجارِيّ_tijAriy~ | A | A | case=2<nowiki>|</nowiki>def=D | 9 | Atr | _ | _ | | | 10 | varukinRanar | varu | V | <nowiki>VR-P3PHAA</nowiki> | <nowiki>Ten=P|Per=3|Num=P|Gen=H|Voi=A|Neg=A</nowiki> | 9 | AuxV | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 11 | إِلَى_<ilaY | إِلَى_<ilaY | P | P | _ | 7 | AuxP | _ | _ | | | 11 | <nowiki>.</nowiki> | <nowiki>.</nowiki> | Z | <nowiki>Z#-------</nowiki> | <nowiki>_</nowiki> | 0 | AuxK | <nowiki>_</nowiki> | <nowiki>_</nowiki> | |
| 12 | 500_500 | 500_500 | Q | Q | _ | 11 | Atr | _ | _ | | |
| 13 | مِلْيُونِ_miloyuwni | مِلْيُون_miloyuwn | N | N | case=2<nowiki>|</nowiki>def=R | 12 | Atr | _ | _ | | |
| 14 | دُولارٍ_duwlArK | دُولار_duwlAr | N | N | case=2<nowiki>|</nowiki>def=I | 13 | Atr | _ | _ | | |
| |
The first sentence of the CoNLL 2007 training data: | |
| |
| 1 | تَعْدادُ | تَعْداد_1 | N | N- | Case=1<nowiki>|</nowiki>Defin=R | 7 | Sb | _ | _ | | |
| 2 | سُكّانِ | ساكِن_1 | N | N- | Case=2<nowiki>|</nowiki>Defin=R | 1 | Atr | _ | _ | | |
| 3 | 22 | [DEFAULT] | Q | Q- | _ | 2 | Atr | _ | _ | | |
| 4 | دَوْلَةً | دَوْلَة_1 | N | N- | Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=4<nowiki>|</nowiki>Defin=I | 3 | Atr | _ | _ | | |
| 5 | عَرَبِيَّةً | عَرَبِيّ_1 | A | A- | Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=4<nowiki>|</nowiki>Defin=I | 4 | Atr | _ | _ | | |
| 6 | سَ | سَ_FUT | F | F- | _ | 7 | AuxM | _ | _ | | |
| 7 | يَرْتَفِعُ | اِرْتَفَع_1 | V | VI | Mood=I<nowiki>|</nowiki>Voice=A<nowiki>|</nowiki>Person=3<nowiki>|</nowiki>Gender=M<nowiki>|</nowiki>Number=S | 0 | Pred | _ | _ | | |
| 8 | إِلَى | إِلَى_1 | P | P- | _ | 7 | AuxP | _ | _ | | |
| 9 | 654 | [DEFAULT] | Q | Q- | _ | 8 | Adv | _ | _ | | |
| 10 | مِلْيُونَ | مِلْيُون_1 | N | N- | Case=4<nowiki>|</nowiki>Defin=R | 9 | Atr | _ | _ | | |
| 11 | نَسَمَةٍ | نَسَمَة_1 | N | N- | Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=2<nowiki>|</nowiki>Defin=I | 10 | Atr | _ | _ | | |
| 12 | فِي | فِي_1 | P | P- | _ | 7 | AuxP | _ | _ | | |
| 13 | مُنْتَصَفِ | مُنْتَصَف_1 | N | N- | Case=2<nowiki>|</nowiki>Defin=R | 12 | Adv | _ | _ | | |
| 14 | القَرْنِ | قَرْن_1 | N | N- | Case=2<nowiki>|</nowiki>Defin=D | 13 | Atr | _ | _ | | |
| |
The first sentence of the CoNLL 2007 test data: | |
| |
| 1 | مُقاوَمَةُ | مُقاوَمَة_1 | N | N- | Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=1<nowiki>|</nowiki>Defin=R | 0 | ExD | _ | _ | | |
| 2 | زَواجِ | زَواج_1 | N | N- | Case=2<nowiki>|</nowiki>Defin=R | 1 | Atr | _ | _ | | |
| 3 | الطُلّابِ | طالِب_1 | N | N- | Case=2<nowiki>|</nowiki>Defin=D | 2 | Atr | _ | _ | | |
| 4 | العُرْفِيِّ | عُرْفِيّ_1 | A | A- | Case=2<nowiki>|</nowiki>Defin=D | 2 | Atr | _ | _ | | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in PADT are rare. Only 431 of the 116,793 tokens in the CoNLL 2007 version are attached nonprojectively (0.37%). | Nonprojectivities in TamilTB are very rare. Only 15 of the 9581 tokens are attached nonprojectively (0.16%). |
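The share of nonprojectively attached tokens can be recomputed from the HEAD column (the seventh field) of the CoNLL files. The following Python sketch assumes one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head, with the artificial root 0 dominating everything. The function names are illustrative, and the script that produced the figure above may have treated root attachments or punctuation differently.

```python
def read_heads(lines):
    """Yield one list of HEAD values per sentence from CoNLL-X lines.

    Assumes tab-separated columns and a blank line between sentences.
    """
    heads = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if heads:
                yield heads
                heads = []
            continue
        heads.append(int(line.split("\t")[6]))  # 7th column = HEAD
    if heads:
        yield heads


def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens attached nonprojectively.

    heads[i] is the head of token i+1; 0 denotes the artificial root.
    """
    n = len(heads)
    parent = [0] + list(heads)  # parent[i] is the head of token i

    def dominated_by(node, ancestor):
        # Walk up the tree from node until ancestor or the root is reached.
        steps = 0
        while node != ancestor:
            if node == 0 or steps > n:  # reached root, or a cycle in bad data
                return False
            node = parent[node]
            steps += 1
        return True

    result = []
    for dep in range(1, n + 1):
        head = parent[dep]
        lo, hi = sorted((dep, head))
        # Every token strictly between dependent and head must hang under head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Constructed example (not from the treebank): token 2 depends on token 4,
# but token 3 between them is attached to the root.
example = [3, 4, 0, 3]
print(nonprojective_tokens(example))  # [2]
```

Summing ''len(nonprojective_tokens(h))'' over all sentences and dividing by the total number of tokens gives a percentage comparable to the one quoted above, under the stated definition.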
| |
The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Arabic: | Initial parsing results were published by [[http://ufal.mff.cuni.cz/~ramasamy/papers/2011-pres-CICLing.pdf|(Ramasamy and Žabokrtský, 2011)]]. They used a smaller data set and a different training/test split than the one defined here (2008 tokens for training, 953 tokens for testing). |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | ^ Parser (Authors) ^ LAS ^ UAS ^ |
| MST (McDonald et al.) | 66.91 | 79.34 | | | Malt (Nivre et al.) | 65.69 | 75.03 | |
| Basis (O'Neil) | 66.71 | 78.54 | | | MST (McDonald et al.) | 65.69 | 74.92 | |
| Malt (Nivre et al.) | 66.71 | 77.52 | | |
| Edinburgh (Riedel et al.) | 66.65 | 78.62 | | |
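For reference, the two scores in the tables can be sketched as follows: UAS is the percentage of tokens that receive the correct head, and LAS the percentage that receive both the correct head and the correct dependency label. This minimal Python sketch counts every token, including punctuation; whether a particular published result excluded punctuation has to be checked in the respective paper. The function name and the predicted values in the example are hypothetical.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) as percentages.

    gold and predicted are equally long lists of (head, deprel) pairs,
    one pair per token, in the same order.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = len(gold)
    if total == 0:
        raise ValueError("cannot score an empty token list")
    # UAS: head matches; LAS: head and label both match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Gold values: first four tokens of the test sentence in the Sample section.
gold = [(2, "Atr"), (4, "AuxP"), (4, "Atr"), (9, "Sb")]
# Hypothetical parser output: one wrong label, one wrong head.
predicted = [(2, "Atr"), (4, "AuxP"), (4, "Adv"), (3, "Sb")]
las, uas = attachment_scores(gold, predicted)
print(f"LAS = {las:.2f}, UAS = {uas:.2f}")  # LAS = 50.00, UAS = 75.00
```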
| |
The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Arabic: | |
| |
^ Parser (Authors) ^ LAS ^ UAS ^ | |
| Malt (Nilsson et al.) | 76.52 | 85.81 | | |
| Nakagawa | 75.08 | 86.09 | | |
| Malt (Hall et al.) | 74.75 | 84.21 | | |
| Sagae | 74.71 | 84.04 | | |
| Chen | 74.65 | 83.49 | | |
| Titov et al. | 74.12 | 83.18 | | |
| |
The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. | |
| |