
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


user:zeman:interset:versions [2011/06/27 15:56]
zeman Changes since 1.1.
user:zeman:interset:versions [2014/06/11 10:52]
zeman Interset 2.0.
Line 22: Line 22:
 ! 8 September 2009. Three new incarnations of Czech, English and German CoNLL tagsets, reflecting the 2009 changes in format. Most interestingly, German tags now contain morphosyntactic features. Thanks to Saša Rosen, who tries to use DZ Interset together with a multi-language parallel corpus called Intercorp, we also created a driver for the IPI PAN Polish corpus, which in turn caused one systemic change: o-tags (those setting the ''other'' feature) [[how-to-write-a-driver#replacing-and-the-other-feature|can now be ignored]] when the driver is scanning the possible feature-value combinations. And there is a new [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl|web interface]] to DZ Interset.
  
-Changes since then
+1.2
-! New drivers: Prague Spoken Corpus (Pražský mluvený korpus, PMK) long and short tags (''cs::pmkdl'' and ''cs::pmkkr''). Arabic CoNLL 2007 slightly differs from CoNLL 2006, so there is now ''ar::conll2007''.
+27 June 2011. New drivers: Prague Spoken Corpus (Pražský mluvený korpus, PMK) long and short tags (''cs::pmkdl'' and ''cs::pmkkr''). Arabic CoNLL 2007 slightly differs from CoNLL 2006, so there is now ''ar::conll2007''.
 ! New test: For every tag in every driver, deleting the value of the ''other'' feature must not lead to an unknown tag. This should greatly improve the chances of finding permitted feature combinations when converting from one tagset to another.
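The idea behind this test can be illustrated with a minimal sketch. It is written in Python rather than Perl and does not use the actual Interset API: the toy tagset, the ''decode''/''encode'' helpers and the function name ''tags_broken_without_other'' are all hypothetical, and ''encode'' here only does exact matching, which is a simplification of what a real driver does.

```python
from typing import Dict, List, Optional

FeatureStructure = Dict[str, str]
Tagset = Dict[str, FeatureStructure]

# Hypothetical toy tagset: tag -> feature structure (not a real Interset driver).
TOY_TAGSET: Tagset = {
    "NN": {"pos": "noun"},
    "NNx": {"pos": "noun", "other": "x-variant"},
    "VB": {"pos": "verb"},
}


def decode(tagset: Tagset, tag: str) -> FeatureStructure:
    """Return a copy of the feature structure for a known tag."""
    return dict(tagset[tag])


def encode(tagset: Tagset, features: FeatureStructure) -> Optional[str]:
    """Return the tag whose features match exactly, or None if there is none.

    Assumption: exact matching only; a real driver is more elaborate.
    """
    for tag, known in tagset.items():
        if known == features:
            return tag
    return None


def tags_broken_without_other(tagset: Tagset) -> List[str]:
    """List tags that become unknown once the 'other' feature is deleted."""
    broken = []
    for tag in tagset:
        features = decode(tagset, tag)
        features.pop("other", None)  # delete the value of 'other', if set
        if encode(tagset, features) is None:
            broken.append(tag)
    return broken
```

In this sketch, a tagset passes the test when ''tags_broken_without_other'' returns an empty list, as it does for ''TOY_TAGSET''.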
 +
 ! New usage: Interset in Treex (TectoMT).
 +
 +? Changes since then
 +! I am working on Interset 2.0, to be released in the second half of 2014. It will be a complete rewrite of Interset, using Moose, the object-oriented extension of Perl 5. I also plan exportable conversion tables that will bring Interset functionality to programming languages other than Perl.
 +
 +! Feature changes:
 +  * The ''prep'' value of the ''pos'' feature (preposition) will be renamed to ''adp'' (adposition) because it covers prepositions, postpositions and circumpositions.
 +  * The ''subpos'' feature will be partially split into several new features that reflect the main part of speech: ''nountype'', ''adjtype'', ''verbtype'' and ''conjtype''. This is a logical extension of the previously created ''prontype'', ''advtype'' etc. I have not yet decided whether ''subpos'' will disappear completely or whether a small set of values will remain in ''subpos''.
 +  * I am considering removing the feature ''synpos''. I need to investigate to what extent it is actually used, in which tagsets, and whether it overlaps with information stored elsewhere.
 +  * The features ''tense'' and ''subtense'' have been merged. Their separation in the early years of Interset was driven by problems with encoding tagsets that lacked specialized tenses; later, however, Interset acquired the algorithms for strict encoding and feature replacement. There are now other features whose values form a hierarchy, so it seems logical to treat tenses the same way.
 +  * I am considering further changes to the partitioning of numerals, in a similar spirit to the changes made for pronouns. Many words that are considered numerals in Czech are tagged as nouns, adjectives, pronouns, determiners or adverbs in other tagsets. I may decide to keep a separate part of speech for cardinal numbers, but I have not arrived at a clear opinion yet.
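Two of the changes listed above, the ''prep'' to ''adp'' rename and the merging of ''subtense'' into ''tense'', can be illustrated with a small Python sketch. This is not Interset code: the function name ''upgrade_features'' is hypothetical, the value ''aor'' is used only as an illustrative ''subtense'' value, and the rule that a specific ''subtense'' value replaces the more general ''tense'' value is an assumption about how the merge might work.

```python
from typing import Dict

FeatureStructure = Dict[str, str]


def upgrade_features(old: FeatureStructure) -> FeatureStructure:
    """Return a copy of a 1.x-style feature structure with 2.0-style names.

    Hypothetical helper; it covers only the rename and the merge and leaves
    'subpos' and 'synpos' untouched, since their fate is still undecided.
    """
    new = dict(old)  # never modify the caller's dictionary

    # pos=prep is renamed to pos=adp (adposition).
    if new.get("pos") == "prep":
        new["pos"] = "adp"

    # Assumption: a non-empty 'subtense' is the more specific value,
    # so it replaces whatever 'tense' held before.
    subtense = new.pop("subtense", None)
    if subtense:
        new["tense"] = subtense

    return new
```

For example, ''upgrade_features({"pos": "prep"})'' yields ''{"pos": "adp"}'', and a structure with ''tense'' set to ''past'' and ''subtense'' set to ''aor'' ends up with ''tense'' set to ''aor'' and no ''subtense'' key.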
