Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:interset:versions [2009/09/08 18:09]
zeman Link.
user:zeman:interset:versions [2014/06/16 21:47] (current)
zeman Version 2 is out.
Line 20: Line 20:
  
 ? 1.1 ? 1.1
-! 8 September 2009. Three new incarnations of Czech, English and German CoNLL tagsets, reflecting the 2009 changes in format. Most interestingly, German tags now contain morphosyntactic features. Thanks to Saša Rosen, who tries to use DZ Interset together with a multi-language parallel corpus called Intercorp, we also created a driver for the IPI PAN Polish corpus, which in turn caused one systemic change: o-tags (those setting the ''other'' feature) [[how-to-write-a-driver#replacing-and-the-other-feature|can now be ignored]] when the driver is scanning the possible feature-value combinations.
+! 8 September 2009. Three new incarnations of Czech, English and German CoNLL tagsets, reflecting the 2009 changes in format. Most interestingly, German tags now contain morphosyntactic features. Thanks to Saša Rosen, who tries to use DZ Interset together with a multi-language parallel corpus called Intercorp, we also created a driver for the IPI PAN Polish corpus, which in turn caused one systemic change: o-tags (those setting the ''other'' feature) [[how-to-write-a-driver#replacing-and-the-other-feature|can now be ignored]] when the driver is scanning the possible feature-value combinations. And there is a new [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl|web interface]] to DZ Interset.
 + 
 +? 1.2 
 +! 27 June 2011. New drivers: Prague Spoken Corpus (Pražský mluvený korpus, PMK) long and short tags (''cs::pmkdl'' and ''cs::pmkkr''). Arabic CoNLL 2007 differs slightly from CoNLL 2006, so there is now a separate ''ar::conll2007'' driver.
 + 
 +! New test: For every tag in every driver, deleting the value of the ''other'' feature must not lead to an unknown tag. This should greatly improve the chances of finding permitted feature combinations when converting from one tagset to another.
 + 
 +! New usage: Interset in Treex (TectoMT). 
 + 
 +? 2.001 
 +! 13 June 2014. Complete rewrite of Interset. The old Perl interface was not object-oriented, and its modules resided under the “tagset” namespace (yes, all lowercase). The new modules are object-oriented (using Moose) and the new namespace is [[http://search.cpan.org/search?query=Lingua%3A%3AInterset&mode=all|Lingua::Interset]]. Interset is now also available on CPAN.
 +  * Drivers will be ported gradually, but Interset 2.0 can still work with old drivers that you have installed in ''lib/tagset''. Initially, only the ''en::penn'' driver has been ported.
 +  * Project development has left our [[https://svn.ms.mff.cuni.cz/trac/interset/timeline|SVN server]] and landed on our [[https://redmine.ms.mff.cuni.cz/projects/interset/repository|Redmine server]]. Version control is now performed by Git. 
 +  * For the record: The project also has [[http://ufal.mff.cuni.cz/interset|its page on the main ÚFAL website]]. It is pretty much empty at the moment. It may eventually become the main website of the project, but not before the webmaster fixes HTML entities being damaged by Drupal.
 + 
 +! Feature changes: 
 +  * Several new features were split from the subpos feature: nountype, adjtype, verbtype and conjtype. This is a logical extension of the previously created prontype, advtype etc. 
 +  * The features tense and subtense have been merged. Their separation in the early years of Interset was driven by problems with encoding tagsets that lacked specialized tenses; later on, however, Interset gained algorithms for strict encoding and feature replacement. There are now other features whose values form a hierarchy, so it seems logical to treat tenses the same way.
 + 
 +! **For a more detailed list of changes, see either the ''Changes'' file in the distribution, or the revision history in [[https://redmine.ms.mff.cuni.cz/projects/interset/repository|Redmine]].**
  
 ? Changes since then ? Changes since then
--
+I also plan exportable conversion tables that will bring Interset functionality to programming languages other than Perl.
 + 
 +! Feature changes: 
 +  * I am considering removing the feature ''synpos''. It needs to be investigated to what extent it is actually used, in which tagsets, and whether it overlaps with information stored elsewhere.
 +  * I am considering further changes to the partitioning of numerals, in a spirit similar to what was done with pronouns. Many words that are considered numerals in Czech are tagged as nouns, adjectives, pronouns, determiners or adverbs in other tagsets. I may decide to keep a separate part of speech for cardinal numbers, but I have not arrived at a clear opinion yet.
