user:zeman:interset:drivers [2009/03/24 12:14] zeman CoNLL 2009 finished and tested. |
user:zeman:interset:drivers [2009/09/08 18:18] zeman Summary of pl::ipipan. |
| |
The [[:format-conll|CoNLL data format]] has changed. Formerly (2006 and 2007) there were three relevant columns (coarse-grained part of speech, fine-grained part of speech, and features) that we combined (using tabs) into one tag string. As of 2009, there are only two columns left, namely part of speech and features. For the Czech tags this further means that there is a new feature in the 'features' column. It is called ''SubPOS'', it is present in all tags and its value is one character, copied from the second position of the standard PDT tag. Otherwise, the tags should be identical to those of CoNLL 2007, including the ''Sem'' feature. | The [[:format-conll|CoNLL data format]] has changed. Formerly (2006 and 2007) there were three relevant columns (coarse-grained part of speech, fine-grained part of speech, and features) that we combined (using tabs) into one tag string. As of 2009, there are only two columns left, namely part of speech and features. For the Czech tags this further means that there is a new feature in the 'features' column. It is called ''SubPOS'', it is present in all tags and its value is one character, copied from the second position of the standard PDT tag. Otherwise, the tags should be identical to those of CoNLL 2007, including the ''Sem'' feature. |
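The column handling described above can be illustrated with a minimal Python sketch. It is not the code of the actual driver. The function names are hypothetical, and the feature layout (''Name=Value'' pairs separated by ''|'', with ''SubPOS'' placed first) and the use of a tab to join the two 2009 columns are assumptions made for illustration only.

```python
def tag_string_2006(cpos: str, pos: str, feats: str) -> str:
    """Join the three CoNLL 2006/2007 columns into one tab-separated tag string."""
    return "\t".join((cpos, pos, feats))


def tag_string_2009(pos: str, feats: str) -> str:
    """Join the two CoNLL 2009 columns (assumed to be tab-separated as well)."""
    return "\t".join((pos, feats))


def subpos_feature(pdt_tag: str) -> str:
    """Build the SubPOS feature from the second position of a PDT positional tag."""
    if len(pdt_tag) < 2:
        raise ValueError("PDT tag too short to contain a SubPOS position")
    return "SubPOS=" + pdt_tag[1]


def features_2009(pdt_tag: str, other_feats: str) -> str:
    """Combine SubPOS with the remaining features.

    Assumption: features are Name=Value pairs separated by '|', and an
    empty features column is written as '' or '_'.
    """
    subpos = subpos_feature(pdt_tag)
    if other_feats in ("", "_"):
        return subpos
    return subpos + "|" + other_feats


if __name__ == "__main__":
    pdt = "NNMS1-----A----"  # illustrative PDT positional tag
    feats = features_2009(pdt, "Gen=M|Num=S|Cas=1|Neg=A")
    print(repr(tag_string_2009(pdt[0], feats)))
```

The point of the sketch is only that ''SubPOS'' is a single character taken from position 2 of the PDT tag and that the combined tag string shrinks from three fields to two.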
| |
| The ''Sem'' feature can have more values than previously. This is caused by the extension of the [[http://ufal.mff.cuni.cz/~zeman/publikace/2005-01/mmanual.html#lemma-term|term value set]] in the Prague Dependency Treebank 2.0 (in contrast to 1.0), so the change actually applied already to CoNLL 2007 data. However, CoNLL 2007 uses the older ''cs::conll'' driver. |
| |
Work started: 24.3.2009 | Work started: 24.3.2009 |
Total work time: about 3 hours | Total work time: about 3 hours |
| |
==== CoNLL Tagset (derived from Penn tags) ==== | ==== CoNLL 2006 ==== |
| |
The driver is just an envelope around the ''en::penn'' driver. | The driver is just an envelope around the ''en::penn'' driver. |
| |
Total work time: 48 minutes | Total work time: 48 minutes |
| |
| ==== CoNLL 2009 ==== |
| |
| Another envelope around the ''en::penn'' driver. However, three new tags required changes even in the older drivers: ''HYPH'', ''AFX'' (''PRF'') and ''NIL''. |
| |
| Work started: 25.3.2009 |
| Work finished: 25.3.2009 |
| Total work time: 2:57 h |
| |
===== German (de) ===== | ===== German (de) ===== |
Total work time: 4:00 h | Total work time: 4:00 h |
| |
==== CoNLL (derived from STTS) ==== | ==== CoNLL 2006 ==== |
| |
Only a simple envelope around the STTS driver was needed. | Only a simple envelope around the STTS driver was needed. |
Work finished: 31.3.2008 | Work finished: 31.3.2008 |
Total work time: 10 min | Total work time: 10 min |
| |
| |
| ==== CoNLL 2009 ==== |
| |
| This tagset is derived from the STTS, too. Unlike CoNLL 2006, there are also morphological features this time, which required additional processing effort. |
| |
| Work started: 5.4.2009 |
| Work finished: 6.4.2009 |
| Total work time: 9:39 h |
| |
| |
| |
| |
| ===== Polish (pl) ===== |
| |
| Based on the [[http://korpus.pl/index.php|Korpus Języka Polskiego IPI PAN]]. (Saša needs to process these tags in Intercorp.) A moderate amount of new material, but this is one of the fairly complex Slavic tagsets. It also contributed to the [[how-to-write-a-driver#replacing-and-the-other-feature|new treatment of o-tags]] (those setting the ''other'' feature) when learning permitted feature-value combinations. |
| |
| Work started: 4.9.2009 |
| Work finished: 8.9.2009 |
| Total work time: 9:54 h |
| |
===== Portuguese (pt) ===== | ===== Portuguese (pt) ===== |