How to write a driver
Drivers are written in Perl. A driver is a simple Perl module (.pm) that should implement the following functions: decode(), encode(), list().
The input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, separated by a unique delimiter. We recommend “\t” (horizontal tab, ASCII 9) as the delimiter. If desirable, the input/output tag can even be multi-line XML!
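As a minimal sketch of the tab-delimited convention, the following packs two kinds of tags into one string and splits them again. The helper names pack_tag() and unpack_tag() and the sample tag parts are hypothetical and not part of Interset.

```perl
use strict;
use warnings;

# Hypothetical example: a tagset that stores a part-of-speech tag and a
# separate feature string. Both are packed into one driver-level tag,
# separated by a horizontal tab.
sub pack_tag {
    my @parts = @_;
    return join("\t", @parts);
}

sub unpack_tag {
    my $tag = shift;
    # The limit -1 keeps trailing empty fields, so "NN\t" yields two parts.
    return split(/\t/, $tag, -1);
}

my $tag   = pack_tag("NN", "gen=f|num=sg");
my @parts = unpack_tag($tag);
print "part: $_\n" foreach @parts;
```

The tab works as a delimiter only if it can never occur inside any of the packed parts, which is why a character that the tagset itself does not use is recommended.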
An empty feature value means “unknown”: it is not even known whether the feature would be relevant. If we know that a feature is irrelevant, we can set it to “n/a” (not applicable; although not mentioned explicitly, this value is allowed for all features). However, I am not sure whether this should be used at all. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal “tagset”, it is probably better to leave the value empty or even set it to an appropriate default.
decode()
This function has one string argument, the tag. The function returns a reference to a hash of features (feature names are hash keys to the feature values).
The decoder is not obliged to set any feature. If the decoder decides to set a feature, the value should be one of the pre-defined values. This can be checked by a central procedure. However, this is not mandatory: if no appropriate value is available, you can use your own, but please do let me know so I can update the central value pool accordingly. (If you set a value that is not documented as a part of the universal set, no one else can benefit from it. If you combine your driver with another driver to convert from your tag set to the other, the other driver's encode() will not take your invented value into account. It may even behave worse than if the value were empty.)
If the tagset encodes features separately (e.g., each character is a value of a particular feature), the decoder should be tolerant of unexpected combinations of features (or should be able to be tolerant if asked to).
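The following is a minimal sketch of a tolerant decode() for a hypothetical two-character positional tagset. The tag letters, the feature names pos and gender, and the values used here are assumptions for illustration; check real values against the central value pool.

```perl
use strict;
use warnings;

# Hypothetical toy tagset: character 0 is part of speech, character 1 is gender.
my %POS    = (N => "noun", V => "verb", A => "adj");
my %GENDER = (M => "masc", F => "fem", N => "neut");

sub decode {
    my $tag = shift;
    my %f;
    my @chars = split(//, $tag);
    # Each position is decoded on its own, so an unexpected combination
    # (e.g. a verb with gender) is still decoded rather than rejected.
    if (defined($chars[0]) && exists($POS{$chars[0]})) {
        $f{pos} = $POS{$chars[0]};
    }
    if (defined($chars[1]) && exists($GENDER{$chars[1]})) {
        $f{gender} = $GENDER{$chars[1]};
    }
    # Unrecognized characters leave the feature unset, i.e. "unknown".
    return \%f;
}

my $features = decode("NF");
print "$_ = $features->{$_}\n" foreach sort keys %{$features};
```

Because no position depends on another, the sketch never fails on a tag it has not seen before; it simply sets fewer features.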
encode()
This function has one argument, a reference to a hash of features (feature names are hash keys to the feature values). The function returns a string - the tag.
The encoder should be able to process all possible values from the central pool. If the tagset does not recognize a value, the most appropriate substitute should be chosen.
Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared for arrays (more precisely: array references) anywhere, or call tagset::single_values() to get rid of the arrays (or some of them).
WARNING: Before modifying the contents of %f, you should make a deep copy of it. You cannot assume that the user of the driver will not need the values in %f after encoding.
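One way to make such a deep copy is dclone() from the core Storable module. The sketch below assumes a hypothetical one-character gender tagset; only the copying step is the point.

```perl
use strict;
use warnings;
use Storable qw(dclone);

# Hypothetical toy tagset: the whole tag is one gender character.
my %GENDER_CODE = (masc => "M", fem => "F", neut => "N");

sub encode {
    my $f_original = shift;
    # dclone() copies nested array references too, so nothing done below
    # can leak back into the caller's hash.
    my %f = %{ dclone($f_original) };
    if (ref($f{gender}) eq "ARRAY") {
        # shift() modifies the nested array in place; this is safe only
        # because %f is a deep copy.
        my $first = shift @{ $f{gender} };
        $f{gender} = $first;
    }
    my $gender = defined($f{gender}) ? $f{gender} : "";
    return exists($GENDER_CODE{$gender}) ? $GENDER_CODE{$gender} : "-";
}

my %features = (gender => ["fem", "neut"]);
my $tag = encode(\%features);
print "tag: $tag, caller still has ", scalar(@{ $features{gender} }), " values\n";
```

A plain `my %f = %{$f};` copies only the top level of the hash: the array references inside it would still point to the caller's arrays.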
list()
The list() function returns a reference to an array of all plausible tags in the set. An undefined return value means that no list is available and thus any tag is plausible. If there is a list, strict encoding should conform to it.
The list is not necessary for the driver to work. However, it can be useful for testing the driver. If no list is distributed along with the tagset description, you may still be able to acquire a partial list from a corpus.
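A partial list could be collected from a corpus along the following lines. The input format (one token per line, word and tag separated by a tab) and the helper name collect_tags() are assumptions; adapt the parsing to your corpus.

```perl
use strict;
use warnings;

# Collect the distinct tags seen in a corpus given as a list of lines.
sub collect_tags {
    my @lines = @_;
    my %seen;
    foreach my $line (@lines) {
        chomp(my $copy = $line);
        next if $copy eq "";
        my ($word, $tag) = split(/\t/, $copy, 2);
        $seen{$tag}++ if defined($tag) && $tag ne "";
    }
    return [sort keys %seen];
}

# Tiny in-memory sample standing in for a real corpus file.
my @corpus = ("dog\tNN\n", "barks\tVBZ\n", "\n", "cat\tNN\n");
my $partial_list = collect_tags(@corpus);

# A driver could then return the collected tags from list().
sub list {
    return $partial_list;
}

print "$_\n" foreach @{ list() };
```

A list obtained this way covers only the tags that happen to occur in the corpus, so a tag missing from it is not necessarily invalid.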
Alternative values
A feature can have two or more alternative values at the same time. This may be necessary because some tags in some tagsets are defined that way. For instance, the character H on position 8 in the PDT Czech tagset means “tense = past or present.” Alternative values are represented by a single reference to an array of values.
The corresponding decode() code may look like this:
elsif($tense eq "H") { $f{tense} = ["past", "pres"]; }
Note that every encode() implementation should be prepared to work with multiple values of features! Even if your own driver does not create arrays in decode(), you should be able to deal with arrays coming from other tagsets. However, if you do not create arrays, your tagset is probably not capable of storing alternative feature values and you cannot do much more than just pick the first value from the array and use it. In that case it is annoying to have to check for arrays in every single feature. Fortunately, you can get rid of all arrays at once by calling the single_values() function at the beginning of your encode():
use tagset;
sub encode
{
    my $f = shift;
    # Replace any array values by single values.
    my $single = tagset::single_values($f, "gender", "animateness", "number", "tense");
    my %f = %{$single};
The single_values() function will replace any array by its first member value. Note that you can specify features that are exceptions to this rule. In the above example, the features gender, animateness, number and tense will retain multiple values, if present. Thus if you are able to deal with arrays, this is how you specify where you can do so and where you cannot.
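To make the described behavior concrete, here is a stand-in that mimics it. It is not the real tagset::single_values(); the name single_values_sketch() is hypothetical, and returning a new hash rather than modifying the input is an assumption.

```perl
use strict;
use warnings;

# Stand-in for illustration only; NOT the real tagset::single_values().
# Returns a new hash in which every array value is replaced by its first
# member, except for the features named in @keep.
sub single_values_sketch {
    my ($f, @keep) = @_;
    my %keep = map { $_ => 1 } @keep;
    my %single;
    foreach my $feature (keys %{$f}) {
        my $value = $f->{$feature};
        if (ref($value) eq "ARRAY" && !$keep{$feature}) {
            $single{$feature} = $value->[0];
        } else {
            $single{$feature} = $value;
        }
    }
    return \%single;
}

my %input  = (gender => ["masc", "fem"], case => ["nom", "acc"], number => "sing");
my $single = single_values_sketch(\%input, "gender");
foreach my $feature (sort keys %{$single}) {
    my $value = $single->{$feature};
    print "$feature = ", (ref($value) eq "ARRAY" ? join("|", @{$value}) : $value), "\n";
}
```

Here gender is listed as an exception and keeps both alternatives, while case is reduced to its first value and number is passed through unchanged.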
Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to grep your values rather than trying an exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters.
If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you take the first one) and proceed with default single-value processing.
if(ref($f{gender}) eq "ARRAY")
{
    # Processing of any combined values goes here.
    my @values = @{$f{gender}};
    if(scalar(grep{m/^masc$/}(@values)) && scalar(grep{m/^fem$/}(@values)))
    {
        $tag[2] = "T";
    }
    elsif(...)
    {
        ...
    }
    else
    {
        $f{gender} = $f{gender}[0];
    }
}
if(ref($f{gender}) eq "")
{
    if($f{gender} eq "masc")
    {
        ...
Note: This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, and if NNQW means a logical disjunction of the tags NNFS and NNNP, then you cannot encode the situation in DZ Interset precisely. If you do not want to discard either NNFS or NNNP (by storing the other only), you can say that gender = F or N and number = S or P but by that you have also introduced NNFP and NNNS as possibilities. The approach may be revised in future.
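The loss of information can be demonstrated with the tags from the note. The sketch below stores the alternatives per feature and then expands them again; the variable names are illustrative only.

```perl
use strict;
use warnings;

# NNQW stands for exactly these two tags.
my @intended = ("NNFS", "NNNP");

# Storing the alternatives per feature loses the pairing between them.
my %f = (gender => ["F", "N"], number => ["S", "P"]);

# Expanding the features again yields the full cross product.
my @expanded;
foreach my $g (@{ $f{gender} }) {
    foreach my $n (@{ $f{number} }) {
        push(@expanded, "NN" . $g . $n);
    }
}

# Everything in the cross product that was not intended is spurious.
my %intended = map { $_ => 1 } @intended;
my @spurious = grep { !$intended{$_} } @expanded;
print "expanded: @expanded\n";
print "spurious: @spurious\n";
```

Two independent arrays of two values each always describe four combinations, so the two-tag disjunction cannot be recovered from them.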
Common problems
See Common Problems for a list of suggestions for phenomena difficult to match between tagsets and the Interset.
Test your driver
When you have written a driver for a new tagset, you should test it. An Interset service module can perform the following tests:
- Get the list of possible tags by calling list(). Go through the list and check for each tag that encode(decode($tag)) eq $tag. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to access the tags, instead of converting them.
To perform the test, run the script driver-test.pl in the tagset root folder. Note that the name of the driver to test is currently hard-coded into the source. In future, it will be changed to a command-line argument.
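The round-trip check itself can be sketched as follows. This is not the code of driver-test.pl; the toy driver (one-character gender tags) and the helper roundtrip_failures() are hypothetical and only illustrate the idea.

```perl
use strict;
use warnings;

# Hypothetical toy driver: one-character tags for gender.
my %DECODE = (M => "masc", F => "fem", N => "neut");
my %ENCODE = reverse %DECODE;

sub decode {
    my $tag = shift;
    my %f;
    $f{gender} = $DECODE{$tag} if exists($DECODE{$tag});
    return \%f;
}

sub encode {
    my $f = shift;
    my $gender = defined($f->{gender}) ? $f->{gender} : "";
    return exists($ENCODE{$gender}) ? $ENCODE{$gender} : "-";
}

sub list {
    return [sort keys %DECODE];
}

# Check for every listed tag that encode(decode($tag)) eq $tag.
sub roundtrip_failures {
    my @failures;
    foreach my $tag (@{ list() }) {
        my $result = encode(decode($tag));
        push(@failures, "$tag -> $result") if $result ne $tag;
    }
    return @failures;
}

my @failures = roundtrip_failures();
if (@failures) {
    print "FAILED:\n", join("\n", @failures), "\n";
} else {
    print "All tags survived the round trip.\n";
}
```

Reporting each failing tag together with the tag it was turned into makes it easier to see which feature was lost on the way.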