Differences
This shows you the differences between two versions of the page.
user:zeman:interset:how-to-write-a-driver [2007/03/07 10:33] zeman |
user:zeman:interset:how-to-write-a-driver [2008/03/14 10:09] zeman Typo. |
||
---|---|---|---|
Line 1: | Line 1: | ||
====== How to write a driver ====== | ====== How to write a driver ====== | ||
- | Perl is the language to write a driver. A driver is a simple Perl module (.pm). It should implement the following functions: decode(), encode(), list(). | + | Perl is the language to write a driver. A driver is a simple Perl module (.pm). It should implement the following functions: |
- | Input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, using some unique delimiters. We recommend " | + | <code perl>use tagset:: |
- | Empty feature value means " | + | The input/ |
+ | |||
+ | Empty feature value means " | ||
===== decode() ===== | ===== decode() ===== | ||
Line 11: | Line 13: | ||
This function has one string argument, the tag. The function returns a reference to a hash of features (feature names are hash keys to the feature values). | This function has one string argument, the tag. The function returns a reference to a hash of features (feature names are hash keys to the feature values). | ||
- | The decoder is not obliged to set any feature. If the decoder decides to set a feature, it should be one of the pre-defined values. This can be checked by a central procedure. However, it is not mandatory, so if the appropriate value is not available, you can use your own, but please do **[[zeman@ufal.mff.cuni.cz|let me know]]** so I can update the central value pool accordingly. | + | The decoder is not obliged to set any feature. If the decoder decides to set a feature, it should be one of the pre-defined values. This can be checked by a central procedure. However, it is not mandatory, so if the appropriate value is not available, you can use your own, but please do **[[zeman@ufal.mff.cuni.cz|let me know]]** so I can update the [[features|central value pool]] accordingly. |
If the tagset encodes features separately (e.g., each character is a value of a particular feature), the decoder should tolerate unexpected combinations of features (or at least be able to tolerate them on request). | If the tagset encodes features separately (e.g., each character is a value of a particular feature), the decoder should tolerate unexpected combinations of features (or at least be able to tolerate them on request). | ||
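To illustrate the contract described above, here is a minimal sketch of a ''decode()'' function. It assumes a hypothetical two-character tagset (position 0 is the part of speech, position 1 is the number); the lookup tables and the feature names and values used here are illustrative assumptions, not taken from any real driver. Unrecognized characters simply leave the feature unset, which is one way to stay tolerant of unexpected combinations.

```perl
use strict;
use warnings;

# Hypothetical two-character tagset. Both tables are assumptions
# made up for this sketch.
my %POS    = ('N' => 'noun', 'V' => 'verb');
my %NUMBER = ('S' => 'sing', 'P' => 'plu');

# Takes a tag (string), returns a reference to a hash of features.
sub decode
{
    my $tag = shift;
    my %f;
    my @chars = split(//, $tag);
    # Set a feature only if the character is recognized;
    # otherwise leave it unset instead of failing.
    $f{pos}    = $POS{$chars[0]}    if(defined($chars[0]) && exists($POS{$chars[0]}));
    $f{number} = $NUMBER{$chars[1]} if(defined($chars[1]) && exists($NUMBER{$chars[1]}));
    return \%f;
}
```

Calling ''decode('NP')'' in this sketch yields a hash reference with ''pos'' set to ''noun'' and ''number'' set to ''plu''; a tag made of unknown characters yields an empty hash.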
+ | |||
+ | |||
+ | |||
===== encode() ===== | ===== encode() ===== | ||
Line 19: | Line 24: | ||
This function has one argument, a reference to a hash of features (feature names are hash keys to the feature values). The function returns a string - the tag. | This function has one argument, a reference to a hash of features (feature names are hash keys to the feature values). The function returns a string - the tag. | ||
- | The encoder should be able to process all possible values from the central pool. If the tagset does not recognize a value, the most appropriate substitute should be chosen. | + | The encoder should be able to process all possible values from the [[features|central pool]]. If the tagset does not recognize a value, the most appropriate substitute should be chosen. |
- | Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared for arrays (more precisely: array references) anywhere, or call tagset:: | + | Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared for arrays (more precisely: array references) anywhere, or call '' |
- | **WARNING: | + | **WARNING: |
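As a rough illustration of the encoder requirements above, the following sketch encodes the same hypothetical two-character tagset. The helper ''single_value()'' is a name invented for this example (it is not an Interset function); it reduces an array reference to its first member so that the rest of the encoder can assume single values. Values the tagset does not recognize fall back to a substitute character, which is an assumption of this sketch rather than a prescribed behavior.

```perl
use strict;
use warnings;

# Hypothetical mapping from feature values to tag characters.
my %POS_CHAR    = ('noun' => 'N', 'verb' => 'V');
my %NUMBER_CHAR = ('sing' => 'S', 'plu' => 'P');

# Hypothetical helper: if the value is an array reference,
# keep only its first member; otherwise return it unchanged.
sub single_value
{
    my $value = shift;
    return ref($value) eq 'ARRAY' ? $value->[0] : $value;
}

# Takes a reference to a hash of features, returns a tag (string).
sub encode
{
    my $f = shift;
    my $pos    = single_value($f->{pos});
    my $number = single_value($f->{number});
    # Unknown or empty values are replaced by 'X' and '-' respectively.
    my $tag = (defined($pos) && exists($POS_CHAR{$pos})) ? $POS_CHAR{$pos} : 'X';
    $tag .= (defined($number) && exists($NUMBER_CHAR{$number})) ? $NUMBER_CHAR{$number} : '-';
    return $tag;
}
```

Taking the first member of an array loses information; a real driver may prefer to handle arrays explicitly for the features where the tagset can express alternatives.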
===== list() ===== | ===== list() ===== | ||
Line 30: | Line 35: | ||
The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, | The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, | ||
+ | |||
===== Alternative values ===== | ===== Alternative values ===== | ||
Line 60: | Line 66: | ||
Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to '' | Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to '' | ||
- | If the arrays | + | If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you take the first one) and proceed with default single-value processing. |
<code perl> | <code perl> | ||
Line 89: | Line 95: | ||
**Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, | **Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | ===== Replacing feature values with defaults ===== | ||
+ | |||
+ | The encoder' | ||
+ | |||
+ | - A table of replacement values for each value, ordered by precedence. There is a default table in '' | ||
+ | - The list of all tags in the tag set (implemented by the '' | ||
+ | |||
+ | Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once when your driver initializes. In your '' | ||
+ | |||
+ | <code perl> | ||
+ | use tagset:: | ||
+ | BEGIN | ||
+ | { | ||
+ | # Store the hash reference in a global variable. | ||
+ | $permitvals = tagset:: | ||
+ | } | ||
+ | ... | ||
+ | $replacement = tagset:: | ||
+ | </ | ||
+ | |||
+ | Alternatively, | ||
+ | |||
+ | <code perl> | ||
+ | tagset:: | ||
+ | </ | ||
+ | |||
+ | If an array is a permitted value, all member values are permitted. | ||
+ | |||
+ | If an array is checked, all member values must be permitted in order for the array to be permitted. Otherwise, the array is pruned and the replacement is a subarray where only permitted values are kept. If no member values are permitted (hence the pruned subarray would be empty), the replacement is a single value, the highest-priority replacement of the first element of the array. If the original array was empty (which should never happen but we ought to be careful anyway), the single empty value is checked and possibly replaced. | ||
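The array rules above can be sketched as follows. This is not the library's implementation: the function names ''replace_single()'' and ''check_value()'' and the data layout (a hash of permitted values and a hash mapping each value to an array of substitutes ordered by precedence) are assumptions made for this illustration. The sketch also assumes that the empty string is returned when no substitute is permitted.

```perl
use strict;
use warnings;

# $permitted:    hash reference whose keys are the permitted values (assumed layout).
# $replacements: hash reference, value => array reference of substitutes
#                ordered by precedence (assumed layout).
sub replace_single
{
    my ($value, $permitted, $replacements) = @_;
    return $value if(exists($permitted->{$value}));
    # Take the first permitted substitute in order of precedence.
    foreach my $candidate (@{$replacements->{$value} || []})
    {
        return $candidate if(exists($permitted->{$candidate}));
    }
    # Assumption of this sketch: fall back to the empty value.
    return '';
}

sub check_value
{
    my ($value, $permitted, $replacements) = @_;
    # Single values are checked directly.
    return replace_single($value, $permitted, $replacements) unless(ref($value) eq 'ARRAY');
    # An empty array is treated as the single empty value.
    return replace_single('', $permitted, $replacements) unless(@{$value});
    # Prune the array: keep only the permitted members.
    my @kept = grep {exists($permitted->{$_})} @{$value};
    return \@kept if(@kept);
    # Nothing survived: replace the first member as a single value.
    return replace_single($value->[0], $permitted, $replacements);
}
```

With permitted values ''sing'' and ''plu'' and the substitutes ''plu'', ''sing'' registered for ''dual'', the array ''['sing', 'dual']'' is pruned to ''['sing']'', while ''['dual']'' collapses to the single value ''plu''.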
===== Common problems ===== | ===== Common problems ===== | ||
See [[user: | See [[user: | ||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
+ | |||
===== Test your driver ===== | ===== Test your driver ===== | ||
- | When you have written a driver for a new tagset, you should test it. An Interset service module | + | When you have written a driver for a new tagset, you should test it. The driver package contains a test script called '' |
+ | |||
+ | < | ||
+ | driver-test.pl -a</ | ||
+ | |||
+ | Running '' | ||
+ | |||
+ | Note that only drivers implementing the '' | ||
- | * Get list of possible tags by calling list(). Go through the list and check for each tag that encode(decode($tag))=$tag. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to //access// the tags, instead of // | + | The following tests will be performed: |
- | To perform the test, run the script | + | * Decode each tag and check that only known features and values are set. In addition to a built-in list, every feature can have an empty value, and the features " |
+ | * Check for each tag that '' | ||
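For orientation, the round-trip check from the list above can be sketched in a few lines. The toy driver below (a one-character tagset with ''list()'', ''decode()'' and ''encode()'') and the function ''round_trip_failures()'' are invented for this example; the actual test script may work differently.

```perl
use strict;
use warnings;

# Toy one-character tagset, invented for this sketch.
my %POS      = ('N' => 'noun', 'V' => 'verb');
my %POS_CHAR = reverse(%POS);

# Returns a reference to the list of all known tags.
sub list
{
    return ['N', 'V'];
}

sub decode
{
    my $tag = shift;
    my %f;
    $f{pos} = $POS{$tag} if(exists($POS{$tag}));
    return \%f;
}

sub encode
{
    my $f = shift;
    my $pos = $f->{pos};
    return 'X' unless(defined($pos) && exists($POS_CHAR{$pos}));
    return $POS_CHAR{$pos};
}

# Returns a description of every tag for which
# encode(decode($tag)) differs from $tag.
sub round_trip_failures
{
    my @failures;
    foreach my $tag (@{list()})
    {
        my $result = encode(decode($tag));
        push(@failures, "$tag -> $result") if($result ne $tag);
    }
    return @failures;
}

my @failures = round_trip_failures();
print(scalar(@failures) ? join("\n", @failures)."\n" : "All tags survived the round trip.\n");
```

A tag that appears in the failure list points to information that ''decode()'' drops or that ''encode()'' cannot restore.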