Differences
This shows you the differences between two versions of the page.
user:zeman:interset:drivers [2008/04/03 11:49] zeman Restructuralization. |
user:zeman:interset:drivers [2008/04/03 14:27] zeman Restructuralization. |
||
---|---|---|---|
Line 2: | Line 2: | ||
This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | This is an overview of existing tag set drivers. Tag-set or language specific issues are described here. | ||
+ | |||
+ | ===== Arabic (ar) ===== | ||
+ | |||
+ | The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank. | ||
+ | |||
+ | Created in 2006-2007. | ||
+ | Total work time: 13 hours | ||
+ | |||
+ | ===== Bulgarian (bg) ===== | ||
+ | |||
+ | The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that includes interrogative adverbs and numerals. | ||
+ | |||
+ | Created in 2007. | ||
+ | Total work time: 35 hours | ||
+ | |||
+ | The main reasons why the implementation took so long: | ||
+ | * Necessity to re-work the system of main word classes, especially pronouns. | ||
+ | * Necessity to separate morphological and lexical definiteness (some indefinite pronouns are morphologically definite, and vice versa). | ||
+ | * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in the imperfect tense). | ||
+ | * The driver tester required that encode(decode(x))=x. However, the CoNLL incarnation of the tags was inconsistent, | ||
===== Chinese (zh) ===== | ===== Chinese (zh) ===== | ||
Line 46: | Line 66: | ||
===== English (en) ===== | ===== English (en) ===== | ||
+ | |||
+ | ==== Penn Treebank Tagset ==== | ||
Penn Treebank (45 atomic tags). Detailed classification of punctuation. | Penn Treebank (45 atomic tags). Detailed classification of punctuation. | ||
Total work time: about 3 hours | Total work time: about 3 hours | ||
+ | |||
+ | ==== CoNLL Tagset (derived from Penn tags) ==== | ||
+ | |||
+ | The driver is just an envelope around the '' | ||
+ | |||
+ | Total work time: 48 minutes | ||
===== German (de) ===== | ===== German (de) ===== | ||
Line 73: | Line 101: | ||
===== Swedish (sv) ===== | ===== Swedish (sv) ===== | ||
- | Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but detailed classification of auxiliary and modal verbs and punctuation. | + | ==== Mamba and CoNLL ==== |
+ | |||
+ | Mamba tagset of Talbanken05. 48 tags, no morphosyntactic categories but detailed classification of auxiliary and modal verbs and punctuation. The CoNLL driver is just an envelope around Mamba. | ||
Total work time: about 3 hours | Total work time: about 3 hours | ||
+ | |||
+ | ==== Tags of Hajič' | ||
+ | |||
+ | Based on the PAROLE Swedish tagset, but some characters differ (@ => W), and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature). | ||
+ | |||
+ | No reliable statistics of work time; estimated 8 hours | ||
===== Time needed for tag set conversion ===== | ===== Time needed for tag set conversion ===== | ||
- | I am noting how much time each driver took me so that I can publish it. Comparing that time with the time needed for an ordinary one-off conversion is interesting, although I know that I only really save time once a driver is reused. | + | Some records about targeted tagset conversion for given tagset pairs, done in early 2006: |
Russian treebank (not only the tags but the format conversion as a whole): | Russian treebank (not only the tags but the format conversion as a whole): | ||
Line 86: | Line 122: | ||
Arabic tags (both Ota's and Buckwalter's, | Arabic tags (both Ota's and Buckwalter's, | ||
4:45+1+1:40 = 7:25 | 4:45+1+1:40 = 7:25 | ||
- | |||
- | Hajič's Swedish tags | ||
- | 0:32 - the full statistics are evidently missing here | ||
- | |||
- | Arabic CoNLL tags | ||
- | 4: | ||
- | |||
- | Bulgarian CoNLL tags | ||
- | 0: | ||
- | (but with Bulgarian I struggled quite a lot with phenomena that Interset had not covered until then) | ||
- | |||
- | English CoNLL tags | ||
- | 0:48 - the statistics may be incomplete here, but perhaps not, since it was enough to adapt the existing Penn Treebank driver, right? | ||
- | |||
- | None of the conversions listed above (i.e. everything written before October 2007) yet had the smart functions for replacing disallowed values available. | ||
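
The Bulgarian section mentions that the driver tester required encode(decode(x))=x. A minimal sketch of such a round-trip check follows, in Python purely for illustration. The tag table, the feature names and the helper `round_trip_failures` are hypothetical and are not taken from Interset; the duplicate entry shows one way a tag set could fail the check, not necessarily the inconsistency that occurred in the Bulgarian CoNLL tags.

```python
# Toy tag table. Tags and feature names are illustrative assumptions,
# not the real Interset feature inventory.
TAG_TO_FEATURES = {
    "NN": {"pos": "noun", "number": "sing"},
    "NNS": {"pos": "noun", "number": "plur"},
    "VB": {"pos": "verb"},
    # Hypothetical inconsistent variant: same features as "NN".
    "NN-": {"pos": "noun", "number": "sing"},
}


def decode(tag):
    """Map a tag string to a copy of its feature structure."""
    return dict(TAG_TO_FEATURES[tag])


def encode(features):
    """Map a feature structure back to the first tag that matches it."""
    for tag, tag_features in TAG_TO_FEATURES.items():
        if tag_features == features:
            return tag
    raise ValueError(f"no tag for features {features!r}")


def round_trip_failures(tags):
    """Return the tags x for which encode(decode(x)) != x."""
    return [tag for tag in tags if encode(decode(tag)) != tag]


failures = round_trip_failures(TAG_TO_FEATURES)
print(failures)
```

In this toy table "NN-" decodes to the same features as "NN", so encoding cannot recover it and the check reports it. A real driver would have to either distinguish the two tags by an extra feature or accept that one of them is normalized away.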