Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/03/31 22:14] zeman de::conll |
user:zeman:interset:drivers [2008/04/03 14:49] zeman |
||
---|---|---|---|
Line 2: | Line 2: | ||
This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | ||
+ | |||
+ | ===== Arabic (ar) ===== | ||
+ | |||
+ | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
+ | |||
+ | Created in 2006-2007. | ||
+ | Total work time: 13 hours | ||
+ | |||
+ | ===== Bulgarian (bg) ===== | ||
+ | |||
+ | The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Notable feature: a sophisticated system of pronouns that also includes interrogative adverbs and numerals. | ||
+ | |||
+ | Created in 2007. | ||
+ | Total work time: 35 hours | ||
+ | |||
+ | The main reasons why the implementation took so long: | ||
+ | * Necessity to re-work the system of main word classes, especially pronouns. | ||
+ | * Necessity to separate morphological and lexical definiteness (some indefinite pronouns are morphologically definite, and vice versa). | ||
+ | * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense). | ||
+ | * Driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, | ||
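The round-trip requirement encode(decode(x))=x mentioned in the list above can be illustrated with a minimal sketch. This is not the actual driver tester: the function `find_round_trip_failures`, the toy driver, and the feature names `pos` and `number` are all hypothetical and serve only to show the shape of the check.

```python
def find_round_trip_failures(tags, decode, encode):
    """Return (tag, re-encoded tag) pairs for which encode(decode(tag)) != tag."""
    failures = []
    for tag in tags:
        result = encode(decode(tag))
        if result != tag:
            failures.append((tag, result))
    return failures


# Hypothetical toy driver: two Penn-style tags mapped to illustrative feature dicts.
TOY_TABLE = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plu"},
}


def toy_decode(tag):
    """Map a tag to a copy of its feature dictionary."""
    return dict(TOY_TABLE[tag])


def toy_encode(features):
    """Map a feature dictionary back to the tag that carries exactly these features."""
    for tag, tag_features in TOY_TABLE.items():
        if tag_features == features:
            return tag
    raise ValueError(f"no tag for features {features!r}")


if __name__ == "__main__":
    print(find_round_trip_failures(["NN", "NNS"], toy_decode, toy_encode))
```

An inconsistent tag inventory shows up in such a check as a non-empty list of failures, which is the kind of problem the last bullet describes.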
===== Chinese (zh) ===== | ===== Chinese (zh) ===== | ||
Line 38: | Line 58: | ||
More than half of the time was spent in testing, tuning tags that contain the Sem feature. | More than half of the time was spent in testing, tuning tags that contain the Sem feature. | ||
+ | |||
+ | ===== Danish (da) ===== | ||
+ | |||
+ | Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish. | ||
+ | |||
+ | Total work time: about 7 hours | ||
+ | |||
+ | ===== English (en) ===== | ||
+ | |||
+ | ==== Penn Treebank Tagset ==== | ||
+ | |||
+ | Penn Treebank (45 atomic tags). Detailed classification of punctuation. | ||
+ | |||
+ | Total work time: about 3 hours | ||
+ | |||
+ | ==== CoNLL Tagset (derived from Penn tags) ==== | ||
+ | |||
+ | The driver is just an envelope around the '' | ||
+ | |||
+ | Total work time: 48 minutes | ||
===== German (de) ===== | ===== German (de) ===== | ||
Line 59: | Line 99: | ||
Total work time: 10 min | Total work time: 10 min | ||
- | ===== Time needed for tag set conversion ===== | ||
- | I am noting how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for a plain conversion is interesting, although I know that in reality I will save time only when a driver is reused. | + | ===== Portuguese (pt) ===== |
- | Russian | + | The Portuguese CoNLL treebank |
- | 12:36 | + | |
- | Arabic tags (both Ota's and Buckwalter's, still without Interset, 22 March 2006): | + | | **Feature** | **Explanation** | **Examples** |
- | 4:45+1+1:40 = 7:25 | + | | _ | no features | prepositions, punctuation etc. | |
+ | | 1 | 1st person | | | ||
+ | | 1/3S | 1st person or 3rd person singular | leia, disse, seria, prefira | | ||
+ | | 1S | 1st person singular | tenho, tinha, usei, vivo, vou | | ||
+ | | 1P | 1st person plural | tomámos, vamos, vemos, víamos | | ||
+ | | 2S | 2nd person singular | compreendeste, | ||
+ | | 2P | 2nd person plural | chamais, vós | | ||
+ | | 3S | 3rd person singular | viu, viva | | ||
+ | | 3S/P | 3rd person singular or plural | se, si | | ||
+ | | 3P | 3rd person plural | vivem | | ||
+ | | ACC | pronoun as direct accusative object | se, te, vos | | ||
+ | | ACC/DAT | pronouns in accusative or dative | nos, se | | ||
+ | | > | noise; should be ignored | | | ||
+ | | 0/1/3S | noise; should probably be 1/3S | | | ||
+ | | 1S> | noise; should be 1S | meu, meus, minha, minhas | | ||
+ | | 1P> | noise; should be 1P | nossa, nossas, nosso, nossos | | ||
+ | | 2S> | noise; should be 2S | seu, teu | | ||
+ | | 2P> | noise; should be 2P | vossa, vosso | | ||
+ | | 3S> | noise; should be 3S | seu, seus, sua, suas | | ||
+ | | 3S/P> | noise; should be 3S/P | seu, seus, sua | | ||
+ | | 3P> | noise; should be 3P | seu, seus, sua | | ||
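The noise rows of the table above suggest a simple clean-up step. The following is a minimal sketch under stated assumptions, not the driver's actual code: the function names are hypothetical, and dropping a bare `>` entirely is one possible reading of "should be ignored".

```python
def normalize_feature(feature):
    """Return the cleaned feature value, or None if the value is pure noise."""
    # A bare '>' is noise and should be ignored.
    if feature == ">":
        return None
    # A trailing '>' is noise: '1S>' should be '1S', '3S/P>' should be '3S/P'.
    cleaned = feature[:-1] if feature.endswith(">") else feature
    # '0/1/3S' is noise as well; the table says it should probably be '1/3S'.
    if cleaned == "0/1/3S":
        return "1/3S"
    return cleaned


def normalize_features(features):
    """Clean a list of feature values, dropping the ones that are pure noise."""
    cleaned = (normalize_feature(feature) for feature in features)
    return [feature for feature in cleaned if feature is not None]


if __name__ == "__main__":
    print(normalize_features(["3S", ">", "1P>", "0/1/3S"]))
```

Values that are not listed as noise, such as `ACC/DAT` or `_`, pass through unchanged.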
- | Danish tags DDT/ | + | ===== Swedish |
- | about 7 hours | + | |
- | Swedish tags | + | ==== Mamba and CoNLL ==== |
- | about 3 hours | + | |
- | Penn Treebank (36 tags) | + | Mamba tagset of Talbanken05. 48 tags; no morphosyntactic categories, but a detailed classification of auxiliary and modal verbs and of punctuation. The CoNLL driver is just an envelope around Mamba. |
- | about 3 hours, but I was not measuring the time yet, so this is only a rough retrospective estimate | + | |
- | Hajič's Swedish tags | + | Total work time: about 3 hours |
- | 0:32 - complete statistics are evidently missing here | + | |
- | Arabic CoNLL tags | + | ==== Tags of Hajič's Swedish tagger ==== |
- | 4: | + | |
- | Bulgarian CoNLL tags | + | Based on PAROLE Swedish tagset but some characters different |
- | 0: | + | |
- | (but with Bulgarian I struggled quite a bit with phenomena that Interset had not covered until then) | + | |
- | English CoNLL tags | + | No reliable statistics of work time; estimated 8 hours |
- | 0:48 - statistics may be missing here, or maybe not, since it was enough to adapt the existing Penn Treebank driver, right? | + | |
- | None of the conversions listed above | + | ===== Time needed for tag set conversion ===== |
+ | |||
+ | Some records of the time spent on targeted tagset conversions between specific tagset pairs, done in early 2006: | ||
+ | |||
+ | Russian treebank | ||
+ | 12:36 | ||
+ | |||
+ | Arabic tags (both Ota's and Buckwalter's, | ||
+ | 4:45+1+1:40 = 7:25 | ||