user:zeman:interset:how-to-use [ufal wiki]


Manual
- Installation
- How to use the Interset

Manual

Installation

If you are on the ÚFAL network, you can use Dan's version directly here. Otherwise, you need to ask Dan for a zipped package of the current drivers. (I intend to make it available for download here at some later time.) Unzip it to a convenient place; below, we assume it is in /home/zeman/lib/perl.

Contributions welcome! If you write your own driver, please share it with others! If you send it to me, I will add it to the package for download here.

Existing drivers

tagset::ar::conll - Arabic CoNLL treebank (coarse, fine and feat fields in one string, delimited by tabs)
tagset::bg::conll - Bulgarian CoNLL treebank
tagset::cs::pdt - Czech positional tags of the Prague Dependency Treebank
tagset::da::conll - Danish CoNLL treebank
tagset::en::conll - English CoNLL treebank
tagset::en::penn - English Penn Treebank
tagset::sv::hajic - Tags output by Swedish tagger by Jan Hajič
tagset::sv::mamba - Swedish Mamba tags from Talbanken05 (CoNLL treebank)
tagset::sv::svdahybrid - Dan's tagset, aiming at making distribution of tags from sv::hajic and da::conll as close as possible

How to use the Interset

You can write your own tag conversion script in Perl using the Interset driver library. First, you have to tell Perl where to find the drivers, for example by setting the PERLLIB environment variable (csh/tcsh syntax shown):

setenv PERLLIB /home/zeman/lib/perl:$PERLLIB
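A script can also add the driver directory to Perl's search path itself with the standard lib pragma. This does not depend on the shell, and it works even when PERL5LIB is set (Perl ignores PERLLIB when PERL5LIB is defined). A minimal sketch, assuming the drivers were unzipped to the path used above:

```perl
use strict;
use warnings;

# Assumed installation path, taken from the text above;
# change it to wherever you unzipped the drivers.
use lib '/home/zeman/lib/perl';

# "use lib" puts the directory at the front of @INC,
# so it is searched before the standard library locations.
print "First search directory: $INC[0]\n";
```

With this line at the top of a conversion script, the subsequent use tagset::... statements should be resolved from that directory without any environment variable.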

Once Perl can find the drivers, writing a conversion script is very easy. For instance, my csts-cs-pdt-en-penn.pl script (meaning “read and write CSTS format, read Czech PDT tags, write English Penn tags”) essentially looks like this:

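use strict;
use warnings;
use tagset::cs::pdt;
use tagset::en::penn;

while(<>)
{
    # Cut the tag out and leave a placeholder in its place.
    # The line break is excluded so that it does not become part of the tag.
    if(s/<t>([^<\r\n]+)/<_tag_to_convert_>/)
    {
        my $tag0 = $1;
        # Decode the Czech PDT tag into a set of features.
        my $features = tagset::cs::pdt::decode($tag0);
        # Encode the features as an English Penn tag.
        my $tag1 = tagset::en::penn::encode($features);
        # Put the converted tag where the placeholder is.
        s/<_tag_to_convert_>/<t>$tag1/;
    }
    print;
}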

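Note the two-step replacement of the original tag: the tag is first replaced by a placeholder, and the placeholder is then replaced by the converted tag. I do not dare to use the original tag in a regular expression because it could contain characters that have a special meaning there.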
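A different way to deal with special characters, not the one used in my script, is to quote the tag with Perl's \Q...\E escape, which makes every character of the interpolated string match literally. The sketch below illustrates this with the Penn tag PRP$, whose $ would otherwise be read as an end-of-line anchor. The convert_tag function is a hypothetical stand-in for the decode and encode calls so that the example runs without the drivers, and the sample line is made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for decode() + encode(): it only lowercases
# the tag, so the example can run without the Interset drivers.
sub convert_tag {
    my ($tag) = @_;
    return lc $tag;
}

# Replace the tag in one line of text in a single substitution.
# \Q...\E quotes the original tag so that "$", ".", "(" and other
# metacharacters are matched as ordinary characters.
sub convert_line {
    my ($line) = @_;
    if ($line =~ /<t>([^<\r\n]+)/) {
        my $tag0 = $1;
        my $tag1 = convert_tag($tag0);
        $line =~ s/<t>\Q$tag0\E/<t>$tag1/;
    }
    return $line;
}

# Made-up sample line in a CSTS-like layout.
print convert_line('<f>its<l>its<t>PRP$'), "\n";   # prints <f>its<l>its<t>prp$
```

Without \Q...\E the second pattern would be <t>PRP$, which looks for "<t>PRP" at the end of the line and therefore does not match. The placeholder approach avoids the question altogether, at the cost of a second substitution.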


Institute of Formal and Applied Linguistics Wiki
