Version History of DZ Interset
Like many other projects, DZ Interset went through a “Dark Age” when it was not yet clear whether it would ever be published. There was no distinction between versions and releases, and versions were not numbered anyway. However, there were some milestones, and I am now going to number them for convenience.
? 0.1
! Summer 2006. My first unified approach to conversions among the Prague Dependency Treebank tagset, Penn TreeBank tagset, Swedish Mamba tagset (CoNLL 2006 treebank), Danish Parole tagset (CoNLL 2006 treebank), and the tagset of the Swedish adaptation of Jan Hajič's tagger. Tag conversions were a crucial part of my experiments with cross-language parser adaptation (see References). My thanks go to Philip Resnik for his early comments during our discussions at the University of Maryland.
? 0.2
! Spring 2007. I struggled to convert tagsets of several CoNLL shared task treebanks in order to improve the accuracy of a parser that relied on understanding the information in the tags. It became apparent how big the differences between various tagging approaches are. Also, some corpora contained tags that were noisy or not very well defined. Arabic, Bulgarian, Chinese, Czech and English CoNLL tagsets were added (Czech and English are just reformatted PDT and Penn tags, respectively).
? 0.5
! May 2008. DZ Interset was first presented at the Language Resources and Evaluation Conference (LREC) in Marrakech, Morocco (see References). By that time, new drivers for the German Stuttgart-Tübingen Tagset and the Portuguese Floresta/CoNLL tagset (extremely noisy, huh!) were present.
! Around the time of LREC, a major change in the feature pool started to crystallize. The diametrically different approaches to tagging pronouns and determiners led me to remove these categories from the top-level part-of-speech set and to treat them as special cases of nouns and adjectives. Such an approach had already been taken a year earlier for Bulgarian, but now I wanted to unify it across languages. By the end of 2008, all drivers reflected the changed policy. The status of pronouns may change further in the future, as this is a rather controversial issue. On the other hand, a similar change may be needed for numerals, too.
? 1.0
! February 2009. Petr Pořízka and Markus Schäfer use DZ Interset in MorphCon, a GUI tool for conversion of Czech morphological tags. They wrote a driver for the Czech ajka tagset (ajka is a morphological analyzer from Masaryk University, Brno). MorphCon was presented at a conference on Czech studies in Brno (see References). Dan added a driver for the Czech tags of the Multext East multilingual corpus.
! Various maintenance changes took place, too. Version control has been migrated to a network-accessible (though not publicly accessible) SVN repository, together with a Trac project management interface. The website now includes information on licensing, references and this version history. From now on, I intend to distinguish revisions from numbered releases.
? 1.1
! 8 September 2009. Three new incarnations of Czech, English and German CoNLL tagsets, reflecting the 2009 changes in format. Most interestingly, German tags now contain morphosyntactic features. Thanks to Saša Rosen, who tries to use DZ Interset together with a multi-language parallel corpus called Intercorp, we also created a driver for the IPI PAN Polish corpus, which in turn caused one systemic change: o-tags (those setting the other feature) can now be ignored when the driver is scanning the possible feature-value combinations. And there is a new web interface to DZ Interset.
? 1.2
! 27 June 2011. New drivers: Prague Spoken Corpus (Pražský mluvený korpus, PMK) long and short tags (cs::pmkdl and cs::pmkkr). Arabic CoNLL 2007 slightly differs from CoNLL 2006, so there is now ar::conll2007.
! New test: For every tag in every driver, deleting the value of the other feature must not lead to an unknown tag. This should greatly improve the chances of finding permitted feature combinations when converting from one tagset to another.
! New usage: Interset in Treex (TectoMT).
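As an illustration only, the test added in 1.2 can be sketched in Python rather than in Interset's Perl. Everything here is hypothetical: the toy driver, its tags, the value stored in other, and the exact-match encode function are stand-ins and not the real Interset drivers or API. The sketch only shows the property being checked: removing other from a decoded tag must still yield a tag that the driver knows.

```python
# Hypothetical toy driver: tag -> decoded feature structure.
# These tags and values are invented for illustration; they are not a real tagset.
TOY_DRIVER = {
    "NN": {"pos": "noun"},
    "NNx": {"pos": "noun", "other": "x-variant"},
    "VB": {"pos": "verb"},
}


def encode(features, driver):
    """Return the tag whose features match exactly, or None if there is none.

    Simplification: exact matching stands in for whatever search a real
    driver performs when encoding a feature structure.
    """
    for tag, tag_features in driver.items():
        if tag_features == features:
            return tag
    return None


def tags_failing_other_test(driver):
    """List the tags that become unknown once the 'other' feature is deleted."""
    failures = []
    for tag, features in driver.items():
        stripped = {name: value for name, value in features.items() if name != "other"}
        if encode(stripped, driver) is None:
            failures.append(tag)
    return failures
```

In this toy driver, NNx passes because dropping other leaves the features of NN. A driver whose only noun tag carried an other value would fail, since the stripped feature structure would match no tag.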
? Changes since then
! I am working on Interset 2.0, to be released in the second half of 2014. It will be a complete rewrite of Interset, using Moose, the object-oriented extension of Perl 5; it will be published at CPAN as Lingua::Interset. I also plan exportable conversion tables that will bring Interset functionality to programming languages other than Perl.
! Feature changes:
- The prep value of the pos feature (preposition) will be renamed to adp (adposition) because it covers prepositions, postpositions and circumpositions.
- The subpos feature will be partially split into several new features that reflect the main part of speech: nountype, adjtype, verbtype and conjtype. This is a logical extension of the previously created prontype, advtype etc. I have not yet decided whether subpos will disappear completely or whether a small set of values will remain in it.
- I am considering removing the synpos feature. It needs to be investigated to what extent it is actually used, in which tagsets, and whether it overlaps with information stored elsewhere.
- The features tense and subtense have been merged. Their separation in the early years of Interset was driven by problems with encoding tagsets that lacked specialized tenses; later on, however, Interset got the algorithms for strict encoding and feature replacement. There are now other features whose values form a hierarchy, so it seems logical to treat tenses the same way.
- I am considering further changes to the partitioning of numerals, in a spirit similar to that of pronouns. Many words that are considered numerals in Czech are tagged as nouns, adjectives, pronouns, determiners or adverbs in other tagsets. I may decide to keep a separate part of speech for cardinal numbers, but I have not arrived at a clear opinion yet.
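For readers who hold feature structures produced by older versions, two of the changes above (the prep to adp rename and the tense/subtense merge) can be sketched as a small migration step. This is a Python illustration under stated assumptions, not code from Interset: the function name migrate_features is hypothetical, feature structures are modelled as plain dictionaries, and the rule that a non-empty subtense value replaces the tense value is an assumption about how the merge works. The value names in the usage note are placeholders as well.

```python
def migrate_features(features):
    """Return a copy of an old-style feature dict using the planned names.

    Hypothetical helper; only two of the planned changes are covered.
    """
    migrated = dict(features)  # work on a copy, leave the input untouched

    # pos=prep is renamed to pos=adp (adposition).
    if migrated.get("pos") == "prep":
        migrated["pos"] = "adp"

    # Assumption: subtense holds the more specific value, so when it is
    # non-empty it replaces tense; the subtense key itself is dropped.
    subtense = migrated.pop("subtense", None)
    if subtense:
        migrated["tense"] = subtense

    return migrated
```

For example, migrate_features({"pos": "prep"}) gives {"pos": "adp"}, and a structure with tense "past" and a placeholder subtense "imp" comes back with tense "imp" and no subtense key. The subpos split is left out on purpose, because the text above says its final shape is still undecided.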