user:zeman:interset:how-to-write-a-driver [2007/10/01 14:29] zeman Replacing values. |
user:zeman:interset:how-to-write-a-driver [2017/01/16 13:08] (current) zeman Warning: obsolete. |
====== How to write a driver ====== | ====== How to write a driver (THIS PAGE IS OBSOLETE—IT IS ABOUT INTERSET VERSION 1) ====== |
| |
Drivers are written in Perl. A driver is a simple Perl module (.pm). It should implement the following functions: ''decode()'', ''encode()'', ''list()''. The ''tagset/common.pm'' module contains code you may find useful, so your driver module should start with | Drivers are written in Perl. A driver is a simple Perl module (.pm). It should implement the following functions: ''decode()'', ''encode()'', ''list()''. The ''tagset/common.pm'' module contains code you may find useful, so your driver module should start with |
The input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, using some unique delimiters. We recommend "\t" (horizontal tab, ASCII 9) as delimiter. If desirable, the input/output tag can be even a multi-line XML! | The input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, using some unique delimiters. We recommend "\t" (horizontal tab, ASCII 9) as delimiter. If desirable, the input/output tag can be even a multi-line XML! |
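A minimal sketch of the tab-delimited convention, assuming a hypothetical tagset whose information is split between a part-of-speech tag and a morphological feature string. The helper names ''join_tag()'' and ''split_tag()'' are illustrative only; they are not part of ''tagset::common''.

<code perl>
use strict;
use warnings;

# Hypothetical helpers: pack several partial tags into one string and back.
sub join_tag
{
    my @parts = @_;
    return join("\t", @parts);
}

sub split_tag
{
    my $tag = shift;
    # The -1 limit keeps trailing empty fields, so an empty last part survives.
    return split(/\t/, $tag, -1);
}

my $tag = join_tag('NN', 'Case=Nom|Num=Sing');
my ($pos, $morph) = split_tag($tag);
print("$pos / $morph\n");
</code>

The tab is a reasonable delimiter because it rarely occurs inside a tag itself; any other character that is guaranteed not to appear in the partial tags would work the same way.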
| |
An empty feature value means "unknown". It is not even known whether this feature would be relevant. If we know that a feature is irrelevant, we can set it to "n/a" (not applicable; although not mentioned explicitly, this value is allowed for all features). However, I am not sure whether this should be used at all. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal "tagset", we had probably better leave the value empty or even set it to an appropriate default. | An empty feature value means "unknown". It is not even known whether this feature would be relevant. Some tagsets distinguish between unknown values and irrelevant features. This is not the case in Interset. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal "tagset", we had probably better leave the value empty or even set it to an appropriate default. |
| |
===== decode() ===== | ===== decode() ===== |
| |
If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should be tolerant to unexpected combinations of features (or should be able to be tolerant if asked for it). | If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should be tolerant to unexpected combinations of features (or should be able to be tolerant if asked for it). |
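The following sketch shows one way a tolerant positional decoder might look. The two-character tagset, its lookup tables and the tagset name ''xx::toy'' are invented for illustration, and returning a hash reference is an assumption about the driver interface rather than something this page specifies.

<code perl>
use strict;
use warnings;

# Hypothetical positional tagset: character 1 encodes part of speech,
# character 2 encodes case. These tables are invented for illustration.
my %pos_map  = ('N' => 'noun', 'A' => 'adj', 'V' => 'verb');
my %case_map = ('1' => 'nom', '2' => 'gen', '-' => '');

sub decode
{
    my $tag = shift;
    my %f = ('tagset' => 'xx::toy');
    my @chars = split(//, $tag);
    # Each position is decoded on its own, so an unexpected combination
    # (e.g. a verb with case) is still decoded instead of being rejected.
    # Unknown characters simply leave the feature empty.
    $f{pos}  = $pos_map{$chars[0]}  if(defined($chars[0]) && exists($pos_map{$chars[0]}));
    $f{case} = $case_map{$chars[1]} if(defined($chars[1]) && exists($case_map{$chars[1]}));
    return \%f;
}

my $f = decode('V2');
print("pos=$f->{pos} case=$f->{case}\n");
</code>

Because no position depends on another, the decoder never needs a table of valid combinations; strictness, if wanted, can be added as a separate check.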
| |
| |
| |
===== encode() ===== | ===== encode() ===== |
Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared for arrays (more precisely: array references) anywhere, or call ''tagset::single_values()'' to get rid of the arrays (or some of them). See [[#Alternative values]] for more details. | Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared for arrays (more precisely: array references) anywhere, or call ''tagset::single_values()'' to get rid of the arrays (or some of them). See [[#Alternative values]] for more details. |
| |
**WARNING:** Before modifying the contents of ''%f'', you should make a //deep// copy of it. You cannot assume that the user of the driver will not need the values in ''%f'' after encoding. If you have called ''single_values()'', it made the copy for you. | **WARNING:** Before modifying the contents of the feature structure, you should make a //deep// copy of it. You cannot assume that the user of the driver will not need the original values of the features after encoding. If you have called ''single_values()'', it made the copy for you. |
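To illustrate why the copy has to be deep, here is a small sketch that uses ''dclone()'' from the core ''Storable'' module as a stand-in. This is not necessarily how ''single_values()'' or other Interset functions make their copies; the feature structure is made up.

<code perl>
use strict;
use warnings;
use Storable qw(dclone);

my $f0 = {'pos' => 'noun', 'case' => ['nom', 'acc']};
# dclone() copies nested array references too; a shallow copy such as
# my %f = %{$f0} would still share the inner array with the caller.
my $f = dclone($f0);
$f->{case}[0] = 'gen';
print("original: $f0->{case}[0], copy: $f->{case}[0]\n");
</code>

With a shallow copy, the assignment to ''$f->{case}[0]'' would silently change the caller's structure as well, because both hashes would point to the same array.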
| |
===== list() ===== | ===== list() ===== |
Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to ''grep'' your values rather than trying exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters. | Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to ''grep'' your values rather than trying exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters. |
| |
If the arrays turns out to be incompatible with what you expect, you should pick one value (we suggest you take the first one) and proceed with default single-value processing. | If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you take the first one) and proceed with default single-value processing. |
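The two paragraphs above can be sketched as follows. The function ''encode_gender()'', the tag characters and the combined code ''X'' are hypothetical; a real driver would use whatever its tagset provides.

<code perl>
use strict;
use warnings;

# Hypothetical encoder fragment: map the 'gender' feature to one tag character.
sub encode_gender
{
    my $value = shift;
    if(ref($value) eq 'ARRAY')
    {
        # Look for the values we care about; ignore order and extra values.
        my $masc = grep {$_ eq 'masc'} @{$value};
        my $fem  = grep {$_ eq 'fem'}  @{$value};
        return 'X' if($masc && $fem);   # invented code for "masculine or feminine"
        # Incompatible array: fall back to its first value.
        $value = $value->[0];
    }
    $value = '' unless(defined($value));
    my %map = ('masc' => 'M', 'fem' => 'F', 'neut' => 'N');
    return exists($map{$value}) ? $map{$value} : '-';
}

print(encode_gender(['fem', 'masc', 'neut']), "\n");
print(encode_gender(['neut', 'com']), "\n");
print(encode_gender('fem'), "\n");
</code>

The first call matches despite the different order and the extra value; the second cannot be expressed as an array in this toy tagset, so only its first value is encoded.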
| |
<code perl> | <code perl> |
| |
Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once when your driver initializes. In your ''BEGIN'' block, you should call ''tagset::common::get_permitted_values()'' and store the hash reference it returns. The hash (of arrays) will contain a list of permitted values for every feature. When you later need to check a value and replace it if necessary, you pass the hash reference back to ''tagset::common'': | Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once when your driver initializes. In your ''BEGIN'' block, you should call ''tagset::common::get_permitted_values()'' and store the hash reference it returns. The hash (of arrays) will contain a list of permitted values for every feature. When you later need to check a value and replace it if necessary, you pass the hash reference back to ''tagset::common'': |
| |
| (Note that the ''list()'' function must be defined before the ''BEGIN'' block that uses it.) |
| |
<code perl> | <code perl> |
use tagset::common; | use tagset::common; |
| ... |
| sub list { ... } |
| ... |
BEGIN | BEGIN |
{ | { |
# Store the hash reference in a global variable. | # Store the hash reference in a global variable. |
$permitvals = tagset::common::get_permitted_values(list()); | $permitvals = tagset::common::get_permitted_values(list(), \&decode); |
} | } |
... | ... |
$replacement = tagset::common::check_value($feature, $value, $permitvals); | $replacement = tagset::common::check_value($feature, $value, $permitvals); |
| </code> |
| |
| Alternatively, the following checks **and replaces** values of all features in a feature structure: |
| |
| <code perl> |
| tagset::common::enforce_permitted_values($fstruct, $permitvals); |
| </code> |
| |
| If an array is a permitted value, all member values are permitted. |
| |
| If an array is checked, all member values must be permitted in order for the array to be permitted. Otherwise, the array is pruned and the replacement is a subarray where only permitted values are kept. If no member values are permitted (hence the pruned subarray would be empty), the replacement is a single value, the highest-priority replacement of the first element of the array. If the original array was empty (which should never happen but we ought to be careful anyway), the single empty value is checked and possibly replaced. |
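The pruning rule can be sketched roughly as below. This is an illustration of the described behavior, not the code of ''tagset::common''; the function ''check_array()'' and the ''$check_single'' callback (standing in for the library's check-and-replace of a single value) are hypothetical.

<code perl>
use strict;
use warnings;

# Illustrative only: $permitted is a list of allowed values for one feature,
# $check_single is a code reference that checks one value and returns it
# or its highest-priority replacement.
sub check_array
{
    my ($array, $permitted, $check_single) = @_;
    my %ok = map {$_ => 1} @{$permitted};
    # An empty array is treated as a single empty value.
    return $check_single->('') unless(@{$array});
    my @kept = grep {$ok{$_}} @{$array};
    # Nothing survived: replace the first element by a single value.
    return $check_single->($array->[0]) unless(@kept);
    # Otherwise return the (possibly pruned) subarray.
    return \@kept;
}

# Toy replacement rule: 'dual' is replaced by 'plu', anything else by ''.
my $check_single = sub { my $v = shift; return $v eq 'dual' ? 'plu' : ''; };
my $r1 = check_array(['sing', 'dual'], ['sing', 'plu'], $check_single);
my $r2 = check_array(['dual'], ['sing', 'plu'], $check_single);
print(join(',', @{$r1}), " $r2\n");
</code>

In the first call the array is pruned to its permitted member; in the second nothing is permitted, so the result is a single replacement value instead of an array.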
| |
| ===== Replacing whole feature structures with defaults ===== |
| |
| The above technique does not guarantee that the encoder will only see feature //combinations// that it normally gets from its own decoder. For instance, the ''case'' feature may be known in the given tagset but it only occurs with nouns. When encoding features produced by a different driver, there could be non-empty case also for adjectives, pronouns etc. Even though the encoder may be able to encode the new feature combination (using the tagset's usual way of expressing the values "adjective" and, say, "genitive") it is not always desirable (depending on the application) to invent new tags. Default encoding should be **strict,** meaning that only the tags returned by ''list()'' can be returned by the encoder. |
| |
| Similarly to the replacement of separate values, the encoder can ask the Interset common library to replace the whole structure by something the encoder is used to (i.e. by a structure that results from decoding of a tag known by the driver). This usually removes from the encoder the burden of thinking about exotic features and values. |
| |
| The correcting function tries to lie as little as possible. There is a priority value associated with every known feature. Feature values are checked (and possibly altered) in the order of feature priorities. In the above example (adjective cannot have case), the part of speech would keep its "adjective" value, and the case value would be removed. If the case feature had higher priority than the pos feature (which is not the case), a non-empty value of ''case'' would force part of speech to be changed from adjective to noun. |
| |
| <code perl> |
| use tagset::common; |
| ... |
| sub list { ... } |
| ... |
| BEGIN |
| { |
| # Store the hash reference in a global variable. |
| $permitted = tagset::common::get_permitted_structures_joint(list(), \&decode); |
| } |
| ... |
| # Give reference to feature structure. Get reference to a new one (deep copy). |
| $fs1 = tagset::common::enforce_permitted_joint($fs0, $permitted); |
| </code> |
| |
| |
| |
| ===== Replacing and the other feature ===== |
| |
| Replacing feature values with defaults has its limitations. It only works with pre-known feature values. It does not touch the features ''tagset'' and ''other''. Such behavior is indeed correct as these features only serve to preserve exotic information when encoding back into the original tagset. However, it also has its downside. |
| |
| The key problem lies in the method we use to obtain permitted combinations of feature values. All tags of the tagset are decoded into feature structures, which subsequently represent the permitted combinations. Values of ''tagset'' and ''other'' are disregarded. Now what happens if the resulting feature structure (after erasing ''tagset'' and ''other'') is unique, i.e. if it never occurs without the particular value of ''other''<nowiki>?</nowiki> The encoder cannot create the tag corresponding to the ''other'' value (“o-tag”) because the information contained in ''other'' has been erased. It will thus create a similar tag (“s-tag”). Unfortunately, the s-tag will be constructed using feature values that can only occur with the o-tag. Thus, the s-tag will not be valid.
| |
| **Example:** Let's assume that the target tagset contains a narrow-purpose tag called ''pred''. It is not intended for verbs, but the words tagged with it often have a predicative function and work instead of verbs, so we decided to decode it as a strange type of verb. We set ''pos = "verb"'' and ''other = "pred"''. Unfortunately, all verb tags in that particular tagset have the feature ''aspect'' set, which is not true for ''pred''. The part of the encoder that is responsible for replacing does not see the difference between ''pred'' and ''verb'' because it does not read the value of ''other''. It will learn (wrongly) that aspectless verbs exist. And it will start creating invalid aspectless verb tags once it runs across a feature structure from a tagset that does not know about aspect.
| |
| The example is a realistic one. O-tags (tags setting the ''other'' feature) are often minor parts of speech. They are used for tokens that hide under broader parts of speech in other tagsets. The specific usage of the o-tags however makes many features of the broader tags unnecessary. Such features are empty in o-tags while they always must be non-empty in corresponding s-tags. |
| |
| **A possible solution** would be not to use any o-tags when scanning the possible feature value combinations. This would work for numerous tagset drivers that only resort to ''other'' when dealing with a “strange” tag. When distinguishing a strange tag from its normal counterpart, one would have to make sure that only the strange tag has ''other'' set, and that the normal tag has it empty (in other words, we cannot set ''other'' for both, say, ''other = "strange"'' for the former and ''other = "normal"'' for the latter). Nevertheless, there are instances where most or all the tags of a tagset are o-tags. A good example is ''zh::conll'': a poorly documented set of 294 tags, with most distinctions unrepresentable in DZ Interset. Its decoder only sets ''pos'' and copies the whole tag into ''other''. Excluding o-tags (meaning all tags here) would not work with this tagset.
| |
| **Another possible solution** is to implement a new subroutine that returns the list of the tags that can be used for scanning of permitted feature value combinations. By default, the subroutine would return the list of non-o-tags. For tagsets such as ''zh::conll'', it could create a tailored list of tags.
| |
| **Current solution:** A new (third) parameter to ''tagset::common::get_permitted_structures_joint()'', when set to 1 (''true''), says that o-tags shall be ignored. Note that it requires changing the beginning of ''encode()''. We must not enforce permitted combinations if the source tagset is the same as the target (because we would be violating o-tags without knowing their permitted values). Instead, we should just do a simple deep copy of the feature structure to protect it from incidental modifications during encoding. The following example is taken from the ''pl::ipipan'' driver, which was the first place where the solution was applied.
| |
| <code perl> my $f0 = shift; |
| # Modify the feature structure so that it contains values expected by this |
| # driver. Do not do that if this was also the source tagset (because the |
| # modification would damage tags using 'other'). However, in any case |
| # create a deep copy of the original feature structure so that it is |
| # protected from changes during encoding. |
| my $f; |
| if($f0->{tagset} eq 'pl::ipipan') |
| { |
| $f = tagset::common::duplicate($f0); |
| } |
| else |
| { |
| $f = tagset::common::enforce_permitted_joint($f0, $permitted); |
| } |
| my %f = %{$f}; |
</code> | </code> |
| |
===== Test your driver ===== | ===== Test your driver ===== |
| |
When you have written a driver for a new tagset, you should test it. An Interset service module can perform the following tests: | When you have written a driver for a new tagset, you should test it. The driver package contains a test script called ''driver-test.pl''. When running it, give the driver name as argument, without the ''tagset::'' prefix. You can also use the ''-d'' option to turn on debug messages (list of tags being tested). |
| |
| <code>driver-test.pl ar::conll |
| driver-test.pl bg::conll cs::pdt |
| driver-test.pl -a |
| driver-test.pl -A</code> |
| |
| Running ''driver-test.pl'' without arguments will list the drivers available on the system. Two (or more) arguments test both drivers separately and then conversions from driver A to driver B and vice versa. Running it with the ''-a'' option will test all the drivers but no conversions. The ''-A'' option tests all drivers and all conversions between all pairs of drivers. |
| |
| Note that only drivers implementing the ''list()'' function can be tested. Most testing involves generating the list of all possible tags and testing the driver on each tag separately. |
| |
| The following tests will be performed for a single driver: |
| |
| * Decode each tag and check that only known features and values are set. In addition to a built-in list, every feature can have an empty value, and the features "tagset" and "other" can have any value. |
| * Check for each tag that ''encode(decode($tag)) eq $tag''. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to //access// the tags, instead of //converting// them. |
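As a rough sketch of the second single-driver test, the round-trip check can be written in a few lines. The toy driver below is invented for illustration, and it assumes that ''list()'' returns an array reference and that ''decode()'' returns a hash reference; this is not the actual code of ''driver-test.pl''.

<code perl>
use strict;
use warnings;

# Toy driver invented for illustration; a real driver lives in its own module.
my %decode_map = ('N' => 'noun', 'V' => 'verb', 'A' => 'adj');
my %encode_map = reverse(%decode_map);

sub list   { return [sort(keys(%decode_map))]; }
sub decode { my $tag = shift; return {'tagset' => 'xx::toy', 'pos' => $decode_map{$tag}}; }
sub encode { my $f = shift; return $encode_map{$f->{pos}}; }

# Round-trip check over every known tag: collect the tags that do not survive.
my @failed = grep {encode(decode($_)) ne $_} @{list()};
print(@failed ? "FAILED: @failed\n" : "OK\n");
</code>

Any tag listed as failed points to information that ''decode()'' drops or that ''encode()'' cannot reconstruct, which is usually where the ''other'' feature comes in.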
| |
| The following tests will be performed for a pair of drivers: |
| |
* Get list of possible tags by calling list(). Go through the list and check for each tag that encode(decode($tag))=$tag. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to //access// the tags, instead of //converting// them. | * Decode every tag of the first driver, encode it using the second driver and check whether the result is a known tag in the second tagset. |
| |
To perform the test, run the script ''driver-test.pl'' in the ''tagset'' root folder. Note that the name of the driver to test is currently hard-coded into the source. In future, it will be changed to a command-line argument. | |