Differences
This shows you the differences between two versions of the page.
| | user:zeman:treebanks [2011/11/03 11:16] zeman Parsing results. | | user:zeman:treebanks [2014/07/17 17:43] (current) zeman Croatian. |
|---|---|---|---|
| Line 1: | Line 1: | ||
| ====== Treebanks for Various Languages ====== | ====== Treebanks for Various Languages ====== | ||
| - | ===== Arabic (ar) ===== | + | http:// |
| - | Prague | + | * [[user: |
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| + | * [[user: | ||
| - | ==== Versions | + | ===== To Process ===== |
| - | * Original PADT 1.0 as distributed by the LDC | + | Hi, |
| - | * CoNLL 2006 | + | I downloaded the new Spanish dependency corpus IULA (larger than AnCora) |
| - | * CoNLL 2007 | + | / |
| - | The CoNLL 2007 version reportedly improves on CoNLL 2006 in the quality of morphological annotation. Both CoNLL versions lack important parts of the original PADT annotation | + | License: |
| + | Web: http:// | ||
| + | Doc: http:// | ||
| + | Download: http:// | ||
| + | Parsing: | ||
| + | state-of-the-art LAS score is 94.7 using Mate-C | ||
| + | sentences | ||
| + | tokens | ||
| - | ==== Obtaining and License ==== | + | The sentences have been chosen |
| - | + | ||
| - | The original PADT 1.0 is distributed by the LDC under the catalogue number [[http:// | + | |
| - | + | ||
| - | * non-commercial research usage | + | |
| - | * no redistribution | + | |
| - | * cite [[http:// | + | |
| - | + | ||
| - | The CoNLL 2006 and 2007 versions are obtainable upon request under similar license terms. Their publication in the LDC together with the other CoNLL treebanks is being prepared. | + | |
| - | + | ||
| - | PADT was created by members of the [[http:// | + | |
| - | + | ||
| - | ==== Domain ==== | + | |
| - | + | ||
| - | Newswire text from press agencies (Agence France Presse, Ummah, Al Hayat, An Nahar, Xinhua 2001-2003). | + | |
| - | + | ||
| - | ==== Size ==== | + | |
| - | + | ||
| - | According to their website, | + | |
| - | + | ||
| - | ==== References ==== | + | |
| - | + | ||
| - | * Website | + | |
| - | * http:// | + | |
| - | * Data | + | |
| - | * Jan Hajič, Otakar Smrž, Petr Zemánek, Petr Pajas, Jan Šnaidauf, Emanuel Beška, Jakub Kráčmar, Kamila Hassanová: //Prague Arabic Dependency Treebank 1.0// (LDC2004T23). Linguistic Data Consortium, Philadelphia, | + | |
| - | * Principal publications | + | |
| - | * Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, Emanuel Beška: [[http:// | + | |
| - | + | ||
| - | ==== Inside ==== | + | |
| - | + | ||
| - | The original PADT 1.0 is distributed in the [[: | + | |
| - | + | ||
| - | Word forms and lemmas are vocalized, i.e. they contain diacritics for short vowels as well as consonant gemination and a few other things. The CoNLL 2006 version includes [[http:// | + | |
| - | + | ||
| - | Note that tokenization of Arabic typically includes splitting original words (inserting spaces between letters), not just separating punctuation from words. Example: وبالفالوجة = wabiālfālūjah = wa/CONJ + bi/PREP + AlfAlwjp/ | + | |
| - | + | ||
| - | The original PADT 1.0 uses 10-character positional morphological tags whose documentation is hard to find. The CoNLL 2006 version converts the tags to the three CoNLL columns, CPOS, POS and FEAT, most of the information | + | |
| - | + | ||
| - | The guidelines for syntactic | + | |
| - | + | ||
| - | ==== Sample ==== | + | |
| - | + | ||
| - | The first two sentences of the CoNLL 2006 training data: | + | |
| - | + | ||
| - | | 1 | غِيابُ_giyAbu | غِياب_giyAb | N | N | case=1< | + | |
| - | | 2 | فُؤاد_fu& | + | |
| - | | 3 | كَنْعان_kanoEAn | كَنْعان_kanoEAn | Z | Z | _ | 1 | Atr | _ | _ | | + | |
| - | | |||||||||| | + | |
| - | | 1 | فُؤاد_fu& | + | |
| - | | 2 | كَنْعان_kanoEAn | كَنْعان_kanoEAn | Z | Z | _ | 9 | Sb | _ | _ | | + | |
| - | | 3 | ،_, | ،_, | G | G | _ | 2 | AuxG | _ | _ | | + | |
| - | | 4 | رائِد_rA}id | رائِد_rA}id | N | N | _ | 2 | Atr | _ | _ | | + | |
| - | | 5 | القِصَّة_AlqiS~ap | قِصَّة_qiS~ap | N | N | gen=F< | + | |
| - | | 6 | القَصِيرَةِ_AlqaSiyrapi | قَصِير_qaSiyr | A | A | gen=F< | + | |
| - | | 7 | فِي_fiy | فِي_fiy | P | P | _ | 4 | AuxP | _ | _ | | + | |
| - | | 8 | لُبْنانِ_lubonAni | لُبْنان_lubonAn | Z | Z | case=2< | + | |
| - | | 9 | رَحَلَ_raHala | رَحَل-َ_raHal-a | V | VP | pers=3< | + | |
| - | | 10 | مَساءَ_masA' | + | |
| - | | 11 | أَمْسِ_> | + | |
| - | | 12 | عَن_Ean | عَن_Ean | P | P | _ | 9 | AuxP | _ | _ | | + | |
| - | | 13 | 81_81 | 81_81 | Q | Q | _ | 12 | Adv | _ | _ | | + | |
| - | | 14 | عاماً_EAmAF | عام_EAm | N | N | gen=M< | + | |
| - | | 15 | ._. | ._. | G | G | _ | 0 | AuxK | _ | _ | | + | |
| - | + | ||
| - | The first sentence of the CoNLL 2006 test data: | + | |
| - | + | ||
| - | | 1 | اِتِّفاقٌ_Ait~ifAqN | اِتِّفاق_Ait~ifAq | N | N | case=1< | + | |
| - | | 2 | بَيْنَ_bayona | بَيْنَ_bayona | P | P | _ | 1 | AuxP | _ | _ | | + | |
| - | | 3 | لُبْنانِ_lubonAni | لُبْنان_lubonAn | Z | Z | case=2< | + | |
| - | | 4 | وَ_wa | وَ_wa | C | C | _ | 2 | Coord | _ | _ | | + | |
| - | | 5 | سُورِيَّةٍ_suwriy~apK | سُورِيا_suwriyA | Z | Z | gen=F< | + | |
| - | | 6 | عَلَى_EalaY | عَلَى_EalaY | P | P | _ | 1 | AuxP | _ | _ | | + | |
| - | | 7 | رَفْعِ_rafoEi | رَفْع_rafoE | N | N | case=2< | + | |
| - | | 8 | مُسْتَوَى_musotawaY | مُسْتَوَى_musotawaY | N | N | _ | 7 | Atr | _ | _ | | + | |
| - | | 9 | التَبادُلِ_AltabAduli | تَبادُل_tabAdul | N | N | case=2< | + | |
| - | | 10 | التِجارِيِّ_AltijAriy~i | تِجارِيّ_tijAriy~ | A | A | case=2< | + | |
| - | | 11 | إِلَى_< | + | |
| - | | 12 | 500_500 | 500_500 | Q | Q | _ | 11 | Atr | _ | _ | | + | |
| - | | 13 | مِلْيُونِ_miloyuwni | مِلْيُون_miloyuwn | N | N | case=2< | + | |
| - | | 14 | دُولارٍ_duwlArK | دُولار_duwlAr | N | N | case=2< | + | |
| - | + | ||
| - | The first sentence of the CoNLL 2007 training data: | + | |
| - | + | ||
| - | | 1 | تَعْدادُ | تَعْداد_1 | N | N- | Case=1< | + | |
| - | | 2 | سُكّانِ | ساكِن_1 | N | N- | Case=2< | + | |
| - | | 3 | 22 | [DEFAULT] | Q | Q- | _ | 2 | Atr | _ | _ | | + | |
| - | | 4 | دَوْلَةً | دَوْلَة_1 | N | N- | Gender=F< | + | |
| - | | 5 | عَرَبِيَّةً | عَرَبِيّ_1 | A | A- | Gender=F< | + | |
| - | | 6 | سَ | سَ_FUT | F | F- | _ | 7 | AuxM | _ | _ | | + | |
| - | | 7 | يَرْتَفِعُ | اِرْتَفَع_1 | V | VI | Mood=I< | + | |
| - | | 8 | إِلَى | إِلَى_1 | P | P- | _ | 7 | AuxP | _ | _ | | + | |
| - | | 9 | 654 | [DEFAULT] | Q | Q- | _ | 8 | Adv | _ | _ | | + | |
| - | | 10 | مِلْيُونَ | مِلْيُون_1 | N | N- | Case=4< | + | |
| - | | 11 | نَسَمَةٍ | نَسَمَة_1 | N | N- | Gender=F< | + | |
| - | | 12 | فِي | فِي_1 | P | P- | _ | 7 | AuxP | _ | _ | | + | |
| - | | 13 | مُنْتَصَفِ | مُنْتَصَف_1 | N | N- | Case=2< | + | |
| - | | 14 | القَرْنِ | قَرْن_1 | N | N- | Case=2< | + | |
| - | + | ||
| - | The first sentence of the CoNLL 2007 test data: | + | |
| - | + | ||
| - | | 1 | مُقاوَمَةُ | مُقاوَمَة_1 | N | N- | Gender=F< | + | |
| - | | 2 | زَواجِ | زَواج_1 | N | N- | Case=2< | + | |
| - | | 3 | الطُلّابِ | طالِب_1 | N | N- | Case=2< | + | |
| - | | 4 | العُرْفِيِّ | عُرْفِيّ_1 | A | A- | Case=2< | + | |
| - | + | ||
| - | ==== Parsing ==== | + | |
| - | + | ||
| - | Nonprojectivities in PADT are rare. Only 431 of the 116,793 tokens in the CoNLL 2007 version are attached nonprojectively (0.37%). | + | |
| - | + | ||
| - | The results of the CoNLL 2006 shared task are [[http:// | + | |
| - | + | ||
| - | | Parser (Authors) | LAS | UAS | | + | |
| - | | MST (McDonald et al.) | 66.91 | 79.34 | | + | |
| - | | Basis (O' | + | |
| - | | Malt (Nivre et al.) | 66.71 | 77.52 | | + | |
| - | | Edinburgh (Riedel et al.) | 66.65 | 78.62 | | + | |
| - | + | ||
| - | The results of the CoNLL 2007 shared task are [[http:// | + | |
| - | + | ||
| - | | Parser (Authors) | LAS | UAS | | + | |
| - | | Malt (Nilsson et al.) | 76.52 | 85.81 | | + | |
| - | | Nakagawa | 75.08 | 86.09 | | + | |
| - | | Malt (Hall et al.) | 74.75 | 84.21 | | + | |
| - | | Sagae | 74.71 | 84.04 | | + | |
| - | | Chen | 74.65 | 83.49 | | + | |
| - | | Titov et al. | 74.12 | 83.18 | | + | |
| + | Martin | ||
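
The removed Parsing section reports that 431 of the 116,793 tokens in the CoNLL 2007 version are attached nonprojectively (0.37%). As a rough illustration of how such a count could be obtained, the following Python sketch flags nonprojective attachments given only the HEAD column of one sentence. The function name `nonprojective_tokens` and the toy head lists are hypothetical and are not taken from PADT. The sketch assumes the common definition in which an arc is projective when every token strictly between the head and the dependent is a descendant of the head, with the artificial root placed at position 0; other definitions may give slightly different counts.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens whose incoming arc is nonprojective.

    heads[i] is the 1-based head id of token i+1, as in the CoNLL HEAD
    column; 0 denotes the artificial root (assumed to sit at position 0).
    """
    n = len(heads)

    def is_ancestor(ancestor, node):
        # Walk from node up towards the root; the step limit guards
        # against malformed input that contains a cycle.
        steps = 0
        while node != 0 and steps <= n:
            node = heads[node - 1]
            if node == ancestor:
                return True
            steps += 1
        return False

    flagged = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        left, right = min(dep, head), max(dep, head)
        # The arc head -> dep is nonprojective if some token strictly
        # between them is not dominated by the head.
        for between in range(left + 1, right):
            if not is_ancestor(head, between):
                flagged.append(dep)
                break
    return flagged


# Hypothetical toy sentences (not PADT data).
projective = [2, 0, 2]        # tokens 1 and 3 both attach to token 2
crossing = [3, 4, 0, 3]       # arc 4 -> 2 spans token 3, which 4 does not dominate

print(nonprojective_tokens(projective))
print(nonprojective_tokens(crossing))

# The proportion quoted in the text, recomputed from the two counts.
print(round(100 * 431 / 116793, 2))
```

Summing the lengths of the returned lists over all sentences of a treebank and dividing by the token count would give a figure comparable to the percentage quoted above, under the stated definition.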
