Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/03/31 22:14] zeman de::conll |
user:zeman:interset:drivers [2009/09/08 18:16] zeman |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== Tag Set Drivers ====== | ====== Tag Set Drivers ====== | ||
- | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | + | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. I also try to keep track of the work time needed for particular drivers because the original motivation behind DZ Interset was to save time and effort. |
+ | |||
+ | ===== Arabic (ar) ===== | ||
+ | |||
+ | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
+ | |||
+ | Created in 2006-2007. | ||
+ | Total work time: 13 hours | ||
+ | |||
+ | ===== Bulgarian (bg) ===== | ||
+ | |||
+ | The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that includes interrogative adverbs and numerals. | ||
+ | |||
+ | Created in 2007. | ||
+ | Total work time: 35 hours | ||
+ | |||
+ | The main reasons why the implementation took so long: | ||
+ | * Necessity to re-work the system of main word classes, especially pronouns. | ||
+ | * Necessity to separate morphological and lexical definiteness (some indefinite pronouns are morphologically definite, and vice versa). | ||
+ | * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense). | ||
+ | * The driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, | ||
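The round-trip requirement from the last item can be illustrated with a minimal Python sketch. The `decode` and `encode` functions and both tag tables are hypothetical stand-ins, not the DZ Interset driver interface; the point is only to show why an inconsistent tag set (here, two spellings of one tag with identical feature values) cannot pass such a test.

```python
# Toy tag tables (hypothetical); each maps a tag to its feature values.
CONSISTENT = {
    "Nc": {"pos": "noun", "nountype": "common"},
    "Np": {"pos": "noun", "nountype": "proper"},
}
# An inconsistent table: two spellings of one tag share the same features.
INCONSISTENT = dict(CONSISTENT, NC={"pos": "noun", "nountype": "common"})


def decode(table, tag):
    """Return a copy of the feature values stored for a tag."""
    return dict(table[tag])


def encode(table, features):
    """Return the first tag in the table whose features match exactly."""
    for tag, feats in table.items():
        if feats == features:
            return tag
    raise ValueError(f"no tag for {features!r}")


def roundtrip_failures(table):
    """List every tag x for which encode(decode(x)) != x."""
    return [tag for tag in table if encode(table, decode(table, tag)) != tag]


print(roundtrip_failures(CONSISTENT))
print(roundtrip_failures(INCONSISTENT))
```

In the second table the tag `NC` decodes to the same features as `Nc`, so encoding can return only one of them and the other fails the round trip.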
===== Chinese (zh) ===== | ===== Chinese (zh) ===== | ||
Line 27: | Line 47: | ||
Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product while I was working on this): about 2 days, so let's say 18 hours. I spent a further 11:09 hours, | Czech PDT tags (over 4000 tags; the core of Interset emerged as a by-product while I was working on this): about 2 days, so let's say 18 hours. I spent a further 11:09 hours, | ||
- | ==== CoNLL (derived from PDT) ==== | + | ==== CoNLL 2006 ==== |
The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | The CoNLL 2006 and 2007 Czech treebanks are data from PDT converted to the CoNLL format. The PDT morphological tags have been decomposed into coarse-grained part of speech, detailed part of speech, and a set of feature values. All PDT tags have unique equivalents in CoNLL. However, the mapping to the original PDT tags is not one-to-one. Some information, | ||
Line 38: | Line 58: | ||
More than half of the time was consumed during testing for tuning tags containing the Sem feature. | More than half of the time was consumed during testing for tuning tags containing the Sem feature. | ||
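The decomposition of a positional PDT tag into a coarse-grained part of speech, a detailed part of speech, and a set of feature values could look roughly like the following Python sketch. The feature names in `PDT_POSITIONS`, the `Name=value` notation, and the function name are assumptions made for illustration, not taken from this page; the Sem feature is not covered because it cannot be read from the tag positions used here.

```python
# Assumed names for the positions of a 15-character PDT tag; None marks
# positions that this sketch does not turn into features.
PDT_POSITIONS = [
    None, None,            # 1-2: part of speech, detailed part of speech
    "Gen", "Num", "Cas",   # 3-5: gender, number, case
    "PGe", "PNu", "Per",   # 6-8: possessor's gender and number, person
    "Ten", "Gra", "Neg",   # 9-11: tense, degree of comparison, negation
    "Voi", None, None,     # 12-14: voice, two reserved positions
    "Var",                 # 15: variant
]


def pdt_to_conll(tag):
    """Split a PDT tag into (coarse POS, detailed POS, feature string)."""
    if len(tag) != len(PDT_POSITIONS):
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    feats = [
        f"{name}={value}"
        for name, value in zip(PDT_POSITIONS, tag)
        if name is not None and value != "-"
    ]
    # "_" stands for an empty feature set.
    return tag[0], tag[1], "|".join(feats) or "_"


print(pdt_to_conll("NNMS1-----A----"))
```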
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | The [[: | ||
+ | |||
+ | The '' | ||
+ | |||
+ | Work started: 24.3.2009 | ||
+ | Work finished: 24.3.2009 | ||
+ | Total work time: 1:10 h | ||
+ | |||
+ | ==== Multext ==== | ||
+ | |||
+ | The tagset of the MULTEXT-EAST project and corpora. The file '' | ||
+ | |||
+ | Work started: 16.2.2009 | ||
+ | Work finished: 18.2.2009 | ||
+ | Total work time: 16:36 h | ||
+ | |||
+ | Czech tagsets are notoriously complex. This one maps quite nicely to DZ Interset features. However, the few distinctions that are not (yet) represented in DZ Interset made debugging difficult. Clitic_s and generic numerals represented using the '' | ||
+ | |||
+ | ===== Danish (da) ===== | ||
+ | |||
+ | Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish. | ||
+ | |||
+ | Total work time: about 7 hours | ||
+ | |||
+ | ===== English (en) ===== | ||
+ | |||
+ | ==== Penn Treebank Tagset ==== | ||
+ | |||
+ | Penn Treebank (45 atomic tags). Detailed classification of punctuation. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== CoNLL 2006 ==== | ||
+ | |||
+ | The driver is just an envelope around the '' | ||
+ | |||
+ | Total work time: 48 minutes | ||
+ | |||
+ | ==== CoNLL 2009 ==== | ||
+ | |||
+ | Another envelope around the '' | ||
+ | |||
+ | Work started: 25.3.2009 | ||
+ | Work finished: 25.3.2009 | ||
+ | Total work time: 2:57 h | ||
===== German (de) ===== | ===== German (de) ===== | ||
Line 51: | Line 119: | ||
Total work time: 4:00 h | Total work time: 4:00 h | ||
- | ==== CoNLL (derived from STTS) ==== | + | ==== CoNLL 2006 ==== |
Only a simple envelope around the STTS driver was needed. | Only a simple envelope around the STTS driver was needed. | ||
Line 59: | Line 127: | ||
Total work time: 10 min | Total work time: 10 min | ||
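An envelope driver of this kind can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the two-tag `STTS` table stands in for the real STTS driver, a CoNLL tag is taken to be the three columns coarse POS, POS and features joined by tabs, and both POS columns are assumed to carry the STTS tag with an empty feature column.

```python
# Hypothetical stand-in for an STTS driver that knows just two tags.
STTS = {"NN": {"pos": "noun"}, "ADJA": {"pos": "adj"}}


def stts_decode(tag):
    """Inner driver: tag to feature values."""
    return dict(STTS[tag])


def stts_encode(features):
    """Inner driver: feature values to tag."""
    for tag, feats in STTS.items():
        if feats == features:
            return tag
    raise ValueError(f"no STTS tag for {features!r}")


def conll_decode(tag):
    """Envelope: unwrap the three columns and delegate to the inner driver."""
    _cpos, pos, _feats = tag.split("\t")
    return stts_decode(pos)


def conll_encode(features):
    """Envelope: let the inner driver pick the tag, then wrap it again."""
    pos = stts_encode(features)
    return "\t".join([pos, pos, "_"])


print(conll_decode("NN\tNN\t_"))
print(repr(conll_encode({"pos": "adj"})))
```

Because all linguistic decisions stay in the inner driver, the envelope only has to deal with the column layout, which is consistent with the short work time recorded for it.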
- | ===== Time needed for tag set conversion ===== | ||
- | I am noting how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, although I know that I will actually save time only when a driver is reused. | + | ==== CoNLL 2009 ==== |
- | Russian treebank (not just the tags but the format conversion as a whole): | + | This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort. |
- | 12:36 | + | |
- | Arabic tags (both Ota's and Buckwalter's, | + | Work started: 5.4.2009 |
- | 4:45+1+1:40 = 7:25 | + | Work finished: 6.4.2009 |
+ | Total work time: 9:39 h | ||
- | Danish DDT/Parole tags (144 tags with an elaborate description) | ||
- | about 7 hours | ||
- | Swedish Mamba tags (48 tags) | ||
- | about 3 hours | ||
- | Penn Treebank | + | ===== Polish |
- | about 3 hours, but I was not yet measuring at that point, so this is only a rough retrospective estimate | + | |
- | Hajič's | + | Based on the [[http:// |
- | 0:32 - complete statistics are evidently missing here | + | |
- | Arabic CoNLL tags | + | Work started: |
- | 4:33+5:19+3:16 = 13:08 | + | Work finished: 8.9.2009 |
+ | Total work time: 9:54 h | ||
- | Bulgarian CoNLL tags | + | ===== Portuguese |
- | 0: | + | |
- | (but with Bulgarian I struggled quite a bit with phenomena that Interset had not covered until then) | + | |
- | English tags | + | The Portuguese |
- | 0:48 - statistics may be missing here, but perhaps not, since it was enough to adapt the existing Penn Treebank driver, right? | + | |
- | None of the above | + | http:// |
+ | http:// | ||
+ | |||
+ | Work started: 2.4.2008 | ||
+ | Work finished: 24.4.2008 | ||
+ | Total work time: 28:18 h | ||
+ | |||
+ | The CoNLL version of the Floresta tagset was a real pain. Not only is the tagset complex with many features, some of them strangely overlapping, | ||
+ | |||
+ | | **Feature** | **Explanation** | **Examples** | | ||
+ | | _ | no features | prepositions, | ||
+ | | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | ||
+ | | 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | ||
+ | | 1P | 1st person plural | tomámos, vamos, vemos, víamos | | ||
+ | | 2S | 2nd person singular | compreendeste, | ||
+ | | 2P | 2nd person plural | chamais, vós | | ||
+ | | 3S | 3rd person singular | viu, viva | | ||
+ | | 3S/P | 3rd person singular or plural | se, si | | ||
+ | | 3P | 3rd person plural | vivem | | ||
+ | | ACC | pronoun as direct accusative object | se, te, vos | | ||
+ | | ACC/DAT | pronouns in accusative or dative | nos, se | | ||
+ | | COND | verb in conditional mood | precisariam, | ||
+ | | DAT | pronoun as dative object | lhe, lhes, me, no, nos, se, vos | | ||
+ | | F | feminine | | | ||
+ | | F/M | feminine or masculine | | | ||
+ | | FUT | future tense of verbs | tenderão, tomará, usará | | ||
+ | | IMP | imperative mood of verbs | chega, move, olha, sê | | ||
+ | | IMPF | imperfect tense of verbs | abandonasse, | ||
+ | | IND | indicative mood of verbs | abafaram, abandonam, abate, abateu | | ||
+ | | M | masculine | açúcar, adepto, adiantado | | ||
+ | | M/F | masculine or feminine | Abidjan, cada, Chaves, especial | | ||
+ | | MQP | pluperfect past tense of verbs | acabara, defendera, existira, foram, quisera, viram | | ||
+ | | NOM | personal pronoun in nominative | ela, elas, ele, eles, eu, nós, vocês, você, vós | | ||
+ | | NOM/PIV | personal pronoun in nominative or prepositional object | ela, elas, ele, eles, nós, você | | ||
+ | | P | plural | 0,92, 14h00, africanos, águas, Amigos_da_Ilha_de_Santos | | ||
+ | | PIV | pronoun in prepositional object | ela, elas, ele, eles, mim, nós, si, ti, vós | | ||
+ | | PR | present tense of verbs | abandonam, abate, abonam, abordo, abra | | ||
+ | | PR/PS | present or past tense of verbs | conhecemos, conseguimos, | ||
+ | | PS | perfect past tense of verbs | abalou, abandonaram, | ||
+ | | PS/MQP | perfect or pluperfect past tense of verbs | abafaram, abriram, acabaram, aceitaram | | ||
+ | | S | singular | 1992, adicional, aditamento, aduaneira | | ||
+ | | S/P | singular or plural | capaz, Chaves, mais | | ||
+ | | SUBJ | subjunctive mood of verbs | abandonasse, | ||
+ | | <ALT> | indicates typo in word | | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <SUP> | superlative of adjectives and adverbs | inferior, máximo, melhor, mínimo, ótimo, péssimo, pior | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <dem> | demonstrative pronoun or adverb | este, isso, isto, o, os, tais, tal, tão | | ||
+ | | <det> | determiner usage / inflection of adverb | algo, meio, nada, quase, todo, um_tanto | | ||
+ | | < | ||
+ | | < | ||
+ | | <fmc> | verb heading finite main clause | | | ||
+ | | <foc> | focus marker, adverb or pronoun | é_que, foi, fomos, que, são, será | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <kc> | conjunctional adverb | agora, aí, bem_como, como, ora, tal_como, todavia | | ||
+ | | <ks> | adverb or preposition used like a subordinating conjunction | como, enquanto, onde, quando, segundo | | ||
+ | | <n> | other word class used as noun, typically as head of noun phrase | anglo-americano, | ||
+ | | <poss | possessive determiner pronoun | meu, meus, minha, minhas, nossa, nossas, nosso, nossos, seu, seus, sua | | ||
+ | | < | ||
+ | | <prp> | other word class used as preposition | como, conforme, consoante, embora, segundo | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | <rel> | relative pronoun or adverb | à_medida_que, | ||
+ | | < | ||
+ | | < | ||
+ | | <si> | reflexive usage of 3rd person possessive | seu, seus, sua, suas | | ||
+ | | <eg> | undocumented feature | 2 occurrences with cardinal numbers | | ||
+ | | <Eg> | undocumented feature | occurs with numbers, adjectives and pronouns | | ||
+ | | <Em> | undocumented feature | 6 occurrences with adjectives | | ||
+ | | <Es> | undocumented feature | 3 occurrences with adverbs and prepositions | | ||
+ | | <ink> | undocumented feature of finite verbs | está, havia, pode, tentou | | ||
+ | | < | ||
+ | | < | ||
+ | | N | undocumented feature of nouns and articles | 15 occurrences | | ||
+ | | <new> | undocumented feature | | | ||
+ | | <nil> | undocumented feature | | | ||
+ | | <obj> | undocumented feature | se | | ||
+ | | <p> | undocumented feature | 1 occurrence | | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | < | ||
+ | | > | noise; should be ignored | | | ||
+ | | 0/1/3S | noise; should probably be 1/3S | | | ||
+ | | 1 | noise; should be 1S | aproveitaria, | ||
+ | | 1S> | noise; should be 1S | meu, meus, minha, minhas | | ||
+ | | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | ||
+ | | 2S> | noise; should be 2S | seu, teu | | ||
+ | | 2P> | noise; should be 2P | vossa, vosso | | ||
+ | | 3S> | noise; should be 3S | seu, seus, sua, suas | | ||
+ | | 3S/P> | noise; should be 3S/P | seu, seus, sua | | ||
+ | | 3P> | noise; should be 3P | seu, seus, sua | | ||
+ | | <adv> | noise? | fundo | | ||
+ | | < | ||
+ | | < | ||
+ | | > | ||
+ | | < | ||
+ | | convidado-> | ||
+ | | < | ||
+ | | < | ||
+ | | <corr | noise; should be <ALT> | | | ||
+ | | < | ||
+ | | <Eg>F | noise; should be two features | | | ||
+ | | <Eg>M | noise; should be two features | | | ||
+ | | <F | noise; should be F | | | ||
+ | | GER | noise; redundant gerund marker | 1 occurrence with v-ger | | ||
+ | | < | ||
+ | | INF | noise; redundant infinitive marker | 2 occurrences with < | ||
+ | | 'Maio | noise | Maio | | ||
+ | | MVF | noise; should be MV and F | motivada | | ||
+ | | NUM | noise; redundant numeral marker | 1994 | | ||
+ | | pasando> | noise; should be <ALT> | passando | | ||
+ | | PCP | noise; redundant participle marker | 2 occurrences | | ||
+ | | < | ||
+ | | < | ||
+ | | PROP | noise | 2 occurrences | | ||
+ | | < | ||
+ | | < | ||
+ | | R | noise; should be PR | 2 occurrences | | ||
+ | | recohidas> | ||
+ | | < | ||
+ | | s | noise; should be S | | | ||
+ | | saiem> | noise; should be <ALT> | saem | | ||
+ | | < | ||
+ | | < | ||
+ | | <sc> | noise; should be < | ||
+ | | subordinanda> | ||
+ | | V | noise; redundant verb marker | | | ||
+ | | < | ||
+ | | VFIN | noise | há od haver | | ||
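The rows marked as noise in the table suggest a clean-up step before the features are interpreted. A minimal Python sketch follows; it covers only those corrections that the table states explicitly, and it assumes that the features of one token arrive as a single string separated by `|`, with `_` standing for no features. The names `NOISE_FIXES` and `normalize_feats` are made up for this example.

```python
# Corrections stated in the table above; each noisy value maps to the list
# of values that should replace it (an empty list drops the value).
NOISE_FIXES = {
    ">": [],
    "0/1/3S": ["1/3S"],  # the table only says "should probably be 1/3S"
    "1": ["1S"],
    "1S>": ["1S"],
    "1P>": ["1P"],
    "2S>": ["2S"],
    "2P>": ["2P"],
    "3S>": ["3S"],
    "3S/P>": ["3S/P"],
    "3P>": ["3P"],
    "<corr": ["<ALT>"],
    "<F": ["F"],
    "MVF": ["MV", "F"],
    "R": ["PR"],
    "s": ["S"],
}


def normalize_feats(feats):
    """Clean a '|'-separated feature string; '_' stands for no features."""
    if feats == "_":
        return "_"
    cleaned = []
    for value in feats.split("|"):
        # Values without a known correction are passed through unchanged.
        cleaned.extend(NOISE_FIXES.get(value, [value]))
    return "|".join(cleaned) or "_"


print(normalize_feats("<poss|1S>|M|S"))
print(normalize_feats("R|3S|IND"))
```

Keeping the corrections in one table makes it easy to add further rows, for example the misspelled word forms that should become `<ALT>`, without touching the function.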
+ | |||
+ | ===== Swedish (sv) ===== | ||
+ | |||
+ | ==== Mamba and CoNLL ==== | ||
+ | |||
+ | Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but detailed classification of auxiliary and modal verbs and punctuation. The CoNLL driver is just an envelope around Mamba. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== Tags of Hajič' | ||
+ | |||
+ | Based on the PAROLE Swedish tagset, but some characters differ (@ => W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature). | ||
+ | |||
+ | No reliable statistics of work time; estimated 8 hours | ||
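The two surface differences from PAROLE described above (the `@ => W` substitution and the dash padding to 9 characters) can be sketched in Python as follows. The function name is hypothetical, the padding is assumed to go on the right, and `@ => W` is the only substitution applied because it is the only one named here; the sample tags are illustrative inputs.

```python
HAJIC_TAG_LENGTH = 9  # uniform tag length stated in the description above


def parole_to_padded(tag):
    """Replace '@' with 'W' and pad with dashes to 9 characters.

    Assumption: the dashes are appended on the right.
    """
    if len(tag) > HAJIC_TAG_LENGTH:
        raise ValueError(f"tag longer than {HAJIC_TAG_LENGTH} characters: {tag}")
    return tag.replace("@", "W").ljust(HAJIC_TAG_LENGTH, "-")


print(parole_to_padded("NCUSN@IS"))
print(parole_to_padded("V@IPAS"))
```

Since the i-th position does not always encode the same feature, the padded tags still cannot be decoded purely by position; the sketch normalizes only the surface form.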
+ | |||
+ | ===== Time needed for tag set conversion ===== | ||
+ | |||
+ | Some records about targeted tagset conversion for given tagset pairs, done in early 2006: | ||
+ | |||
+ | Russian treebank | ||
+ | 12:36 | ||
+ | |||
+ | Arabic tags (both Ota's and Buckwalter's, | ||
+ | 4:45+1+1:40 = 7:25 | ||
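The work-time sums recorded on this page can be rechecked with a small Python helper. It assumes that each term is written as hours:minutes and that a bare number such as the `1` above means whole hours; the function names are made up for this example.

```python
def to_minutes(term):
    """Convert 'H:MM' or a bare number of hours to minutes."""
    hours, _, minutes = term.strip().partition(":")
    return int(hours) * 60 + int(minutes or 0)


def sum_durations(expression):
    """Add up terms separated by '+' and format the total as H:MM."""
    total = sum(to_minutes(term) for term in expression.split("+"))
    return f"{total // 60}:{total % 60:02d}"


print(sum_durations("4:45+1+1:40"))
print(sum_durations("4:33+5:19+3:16"))
```

Both sums quoted on the page (7:25 and 13:08) come out as recorded.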