Differences

This shows you the differences between two versions of the page.

--- user:zeman:interset:drivers [2008/03/31 22:14]
zeman de::conll
+++ user:zeman:interset:drivers [2008/04/03 14:27]
zeman Restructuralization.
@@ Line 2: / Line 2: @@
 This is an overview of existing tag set drivers. Tag-set or language specific issues are described here.
+===== Arabic (ar) =====
+The Arabic CoNLL tags are derived from the tags of the Prague Arabic Dependency Treebank.
+Created in 2006-2007.
+Total work time: 13 hours
+===== Bulgarian (bg) =====
+The Bulgarian CoNLL tags are derived from the tags of BulTreeBank. Speciality: a sophisticated system of pronouns that also includes interrogative adverbs and numerals.
+Created in 2007.
+Total work time: 35 hours
+The main reasons why the implementation took so long:
+  * Necessity to re-work the system of main word classes, especially pronouns.
+  * Necessity to separate morphological and lexical definiteness (some indefinite pronouns are morphologically definite, and vice versa).
+  * Necessity to separate morphological and lexical aspect (aorist vs. imperfect tense; there are perfective verbs that can occur in imperfect tense).
+  * The driver tester required that encode(decode(x))=x, but the CoNLL incarnation of the tags was inconsistent in the order and form in which it presented features.
 ===== Chinese (zh) =====
@@ Line 38: / Line 58: @@
 More than half of the time was consumed during testing for tuning tags containing the Sem feature.
+===== Danish (da) =====
+Tags of the Danish Dependency Treebank converted to CoNLL format. 144 tags with complex documentation in Danish.
+Total work time: about 7 hours
+===== English (en) =====
+==== Penn Treebank Tagset ====
+Penn Treebank (45 atomic tags). Detailed classification of punctuation.
+Total work time: about 3 hours
+==== CoNLL Tagset (derived from Penn tags) ====
+The driver is just an envelope around the ''en::penn'' driver.
+Total work time: 48 minutes
 ===== German (de) =====
@@ Line 59: / Line 99: @@
 Total work time: 10 min
-===== Time needed for tag set conversion =====
+===== Swedish (sv) =====
-I am recording how much time each driver took me so that I can publish it. Comparing the time needed with the time needed for an ordinary conversion is interesting, although I know that the real savings come only when a driver is reused.
+==== Mamba and CoNLL ====
-Russian treebank (not only the tags but the whole format conversion):
+Mamba tagset of Talbanken05: 48 tags, no morphosyntactic categories, but a detailed classification of auxiliary and modal verbs and of punctuation. The CoNLL driver is just an envelope around the Mamba driver.
-:36
-Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
+Total work time: about 3 hours
-:45+1+1:40 = 7:25
-Danish DDT/Parole tags (144 tags with elaborate documentation)
+==== Tags of Hajič's Swedish tagger ====
-about 7 hours
-Swedish Mamba tags (48 tags)
+Based on the Swedish PAROLE tagset, but some characters differ (@ => W) and the tags are padded with dashes to a uniform length of 9 characters (although the i-th position does not always encode the same feature).
-about 3 hours
-Penn Treebank (36 tags)
+No reliable statistics of work time; estimated 8 hours
-about 3 hours, but I was not yet measuring the time back then, so this is only a rough retrospective estimate
-Hajič's Swedish tags
+===== Time needed for tag set conversion =====
-:32 - complete statistics are evidently missing here
-Arabic CoNLL tags
+Some records of targeted tagset conversions for specific tagset pairs, done in early 2006:
-:33+5:19+3:16 = 13:08
-Bulgarian CoNLL tags
+Russian treebank (not only the tags but the whole format conversion):
-:20+1:00+0:26+5:44+2:00+6:15+1:20+0:46+1:26+2:30+0:48+12:44 = 35:19
+:36
-(but with Bulgarian I struggled a lot with phenomena that Interset had not covered until then)
-English CoNLL tags
+Arabic tags (both Ota's and Buckwalter's, still without Interset, 22.3.2006):
-:48 - statistics may be missing here, but maybe not, since it was enough to adapt the existing Penn Treebank driver, right?
+:45+1+1:40 = 7:25
-None of the conversions listed above (i.e. everything written before October 2007) had the smart functions for replacing disallowed values available yet.
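
The Bulgarian section above mentions that the driver tester required encode(decode(x))=x and that tags with an inconsistent feature order were a problem. The following is a minimal sketch of such a round-trip check, not the actual Interset tester: the tag format `POS|feat=val|...`, the functions `decode`, `encode` and `failed_round_trips`, and the sample tags are all hypothetical and chosen only for illustration.

```python
# Hypothetical toy driver: tags look like "POS|feat=val|feat=val".
# Nothing here is taken from Interset; it only illustrates the round-trip test.

def decode(tag):
    """Split a tag into a part of speech and a dict of features."""
    pos, *pairs = tag.split("|")
    features = dict(pair.split("=", 1) for pair in pairs)
    return pos, features

def encode(decoded):
    """Rebuild a tag, always emitting features in alphabetical order."""
    pos, features = decoded
    parts = [pos] + [f"{name}={value}" for name, value in sorted(features.items())]
    return "|".join(parts)

def failed_round_trips(tags):
    """Return the tags for which encode(decode(tag)) differs from tag."""
    return [tag for tag in tags if encode(decode(tag)) != tag]

# Two spellings of the same analysis; only one matches the encoder's order.
tags = ["N|case=nom|num=sg", "N|num=sg|case=nom"]
print(failed_round_trips(tags))
```

In this sketch both tags decode to the same feature structure, but only the first one survives the round trip, because the encoder can emit just one feature order. A tagset whose source data mixes several orders therefore cannot pass the test without either normalizing the data or listing the variants explicitly.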


Institute of Formal and Applied Linguistics Wiki
