Differences
This shows you the differences between two versions of the page.
user:zeman:interset:how-to-write-a-driver [2008/03/14 10:06] zeman %f was not defined here. |
user:zeman:interset:how-to-write-a-driver [2008/03/14 12:16] zeman driver-test.pl made more consistent. |
||
---|---|---|---|
Line 16: | Line 16: | ||
If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should tolerate unexpected combinations of features (or should at least be able to tolerate them when asked to). | If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should tolerate unexpected combinations of features (or should at least be able to tolerate them when asked to). | ||
- | |||
- | |||
- | |||
===== encode() ===== | ===== encode() ===== | ||
Line 65: | Line 62: | ||
Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to '' | Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to '' | ||
- | If the arrays | + | If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you take the first one) and proceed with default single-value processing. |
<code perl> | <code perl> | ||
Line 94: | Line 91: | ||
**Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, | **Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, | ||
- | |||
- | |||
- | |||
- | |||
- | |||
===== Replacing feature values with defaults ===== | ===== Replacing feature values with defaults ===== | ||
Line 108: | Line 100: | ||
Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once when your driver initializes. In your '' | Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once when your driver initializes. In your '' | ||
+ | |||
+ | (Note that the '' | ||
<code perl> | <code perl> | ||
use tagset:: | use tagset:: | ||
+ | ... | ||
+ | sub list { ... } | ||
+ | ... | ||
BEGIN | BEGIN | ||
{ | { | ||
Line 129: | Line 126: | ||
If an array is checked, all member values must be permitted in order for the array to be permitted. Otherwise, the array is pruned and the replacement is a subarray where only permitted values are kept. If no member values are permitted (hence the pruned subarray would be empty), the replacement is a single value, the highest-priority replacement of the first element of the array. If the original array was empty (which should never happen but we ought to be careful anyway), the single empty value is checked and possibly replaced. | If an array is checked, all member values must be permitted in order for the array to be permitted. Otherwise, the array is pruned and the replacement is a subarray where only permitted values are kept. If no member values are permitted (hence the pruned subarray would be empty), the replacement is a single value, the highest-priority replacement of the first element of the array. If the original array was empty (which should never happen but we ought to be careful anyway), the single empty value is checked and possibly replaced. | ||
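The pruning rule described above can be sketched as follows. This is only an illustration under stated assumptions, not the code of the Interset library: the function name ''check_value'', the shape of ''%permitted'' and ''%fallbacks'', and the sample gender values are all made up for the example.

```perl
use strict;
use warnings;

# Hypothetical helper, NOT part of the Interset API.
# $permitted: hash reference, permitted value => 1.
# $fallbacks: hash reference, value => array of replacement candidates,
#             the highest-priority candidate first.
sub check_value
{
    my ($value, $permitted, $fallbacks) = @_;
    if(ref($value) eq 'ARRAY')
    {
        # Prune the array: keep only the permitted members.
        my @kept = grep { $permitted->{$_} } @{$value};
        # Every member is permitted: the array is permitted as it is.
        return $value if(@{$value} && scalar(@kept) == scalar(@{$value}));
        # Some members survived: the replacement is the pruned subarray.
        return \@kept if(@kept);
        # Nothing survived: continue with the first element as a single value.
        # An empty array is treated as the single empty value.
        $value = @{$value} ? $value->[0] : '';
    }
    return $value if($permitted->{$value});
    # Take the highest-priority replacement that is itself permitted.
    foreach my $candidate (@{$fallbacks->{$value} || []})
    {
        return $candidate if($permitted->{$candidate});
    }
    # Assumption of this sketch: give up with the empty value.
    return '';
}

# Illustrative data only.
my %permitted = map { $_ => 1 } ('', 'masc', 'fem');
my %fallbacks = ('neut' => ['com', 'masc'], 'com' => ['masc']);
my $pruned   = check_value(['masc', 'neut'], \%permitted, \%fallbacks);
my $replaced = check_value(['neut', 'com'], \%permitted, \%fallbacks);
print(join('|', @{$pruned}), "\n");
print("$replaced\n");
```

In the first call one member survives, so the result is still an array. In the second call nothing survives, so the first element (''neut'') is replaced as if it were a single value.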
+ | |||
+ | ===== Replacing whole feature structures with defaults ===== | ||
+ | |||
+ | The above technique does not guarantee that the encoder will only see feature // | ||
+ | |||
+ | As with the replacement of individual values, the encoder can ask the Interset common library to replace the whole structure with something the encoder is used to (i.e., a structure that results from decoding a tag known to the driver). This usually relieves the encoder of the burden of thinking about exotic features and values. | ||
+ | |||
+ | The correcting function tries to lie as little as possible. There is a priority value associated with every known feature. Feature values are checked (and possibly altered) in the order of feature priorities. In the above example (adjective cannot have case), the part of speech would keep its " | ||
+ | |||
+ | <code perl> | ||
+ | use tagset:: | ||
+ | ... | ||
+ | sub list { ... } | ||
+ | ... | ||
+ | BEGIN | ||
+ | { | ||
+ | # Store the hash reference in a global variable. | ||
+ | $permitted = tagset:: | ||
+ | } | ||
+ | ... | ||
+ | # Give reference to feature structure. Get reference to a new one (deep copy). | ||
+ | $fs1 = tagset:: | ||
+ | </ | ||
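To illustrate the idea of checking features in priority order, here is a minimal sketch. It is not the library's correcting function: the name ''closest_known_structure'', the helper ''value_of'', the priority list and the toy structures are assumptions made for this example, and only scalar values are handled.

```perl
use strict;
use warnings;

# Hypothetical helper: value of a feature, the empty string if it is not set.
sub value_of
{
    my ($fs, $feature) = @_;
    return defined($fs->{$feature}) ? $fs->{$feature} : '';
}

# Hypothetical sketch, NOT the Interset implementation.
# $known:    array of feature structures obtained by decoding all known tags.
# $priority: array of feature names, the most important feature first.
# Features that are not listed in $priority are dropped from the result.
sub closest_known_structure
{
    my ($fs, $known, $priority) = @_;
    my @candidates = @{$known};
    # Without any known structure there is nothing to compare to.
    return { %{$fs} } unless(@candidates);
    my %result;
    foreach my $feature (@{$priority})
    {
        # Try to keep the original value.
        my $value = value_of($fs, $feature);
        my @matching = grep { value_of($_, $feature) eq $value } @candidates;
        unless(@matching)
        {
            # No remaining candidate has this value: borrow the value
            # of the first candidate that is still in play.
            $value = value_of($candidates[0], $feature);
            @matching = grep { value_of($_, $feature) eq $value } @candidates;
        }
        # Lower-priority features may only choose among these candidates.
        @candidates = @matching;
        $result{$feature} = $value;
    }
    return \%result;
}

# Illustrative data only: in this toy tagset adjectives have no case.
my @known =
(
    { 'pos' => 'adj',  'gender' => 'masc' },
    { 'pos' => 'noun', 'gender' => 'masc', 'case' => 'nom' },
    { 'pos' => 'noun', 'gender' => 'fem',  'case' => 'acc' }
);
my @priority = ('pos', 'gender', 'case');
my $fs0 = { 'pos' => 'adj', 'gender' => 'masc', 'case' => 'nom' };
my $fs1 = closest_known_structure($fs0, \@known, \@priority);
print(join(' ', map { "$_=$fs1->{$_}" } @priority), "\n");
```

Because the part of speech is checked first in this toy priority list, its value survives and the lower-priority case is the one that gets emptied. The original structure ''$fs0'' is not modified.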
===== Common problems ===== | ===== Common problems ===== | ||
See [[user: | See [[user: | ||
+ | |||
Line 145: | Line 166: | ||
< | < | ||
- | driver-test.pl -a</ | + | driver-test.pl bg::conll cs::pdt |
+ | driver-test.pl -a | ||
+ | driver-test.pl -A</ | ||
- | Running '' | + | Running '' |
Note that only drivers implementing the '' | Note that only drivers implementing the '' | ||
- | The following tests will be performed: | + | The following tests will be performed |
* Decode each tag and check that only known features and values are set. In addition to a built-in list, every feature can have an empty value, and the features " | * Decode each tag and check that only known features and values are set. In addition to a built-in list, every feature can have an empty value, and the features " | ||
* Check for each tag that '' | * Check for each tag that '' | ||
+ | |||
+ | The following tests will be performed for a pair of drivers: | ||
+ | |||
+ | * Decode every tag of the first driver, encode it using the second driver and check whether the result is a known tag in the second tagset. | ||
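The pair test can be pictured with the following self-contained sketch. It does not show how ''driver-test.pl'' is implemented: the two toy drivers are plain hashes of code references invented for this example, and the function name ''test_pair'' is an assumption as well.

```perl
use strict;
use warnings;

# Toy stand-ins for two drivers (hypothetical, for illustration only).
my %driver1 =
(
    'list'   => sub { return ['N', 'V', 'X']; },
    'decode' => sub
    {
        my $tag = shift;
        my %map = ('N' => 'noun', 'V' => 'verb', 'X' => 'part');
        return { 'pos' => $map{$tag} };
    }
);
my %driver2 =
(
    'list'   => sub { return ['NN', 'VB']; },
    'encode' => sub
    {
        my $fs = shift;
        my %map = ('noun' => 'NN', 'verb' => 'VB');
        # This toy encoder does not fall back to a default, so it can fail.
        return exists($map{$fs->{'pos'}}) ? $map{$fs->{'pos'}} : 'UNK';
    }
);

# Decode every tag of the first driver, encode it with the second driver
# and collect the cases where the result is not a known tag of the second tagset.
sub test_pair
{
    my ($d1, $d2) = @_;
    my %known = map { $_ => 1 } @{$d2->{'list'}->()};
    my @errors;
    foreach my $tag (@{$d1->{'list'}->()})
    {
        my $fs = $d1->{'decode'}->($tag);
        my $tag2 = $d2->{'encode'}->($fs);
        push(@errors, "$tag -> $tag2") unless($known{$tag2});
    }
    return @errors;
}

my @errors = test_pair(\%driver1, \%driver2);
print(scalar(@errors), " unknown tag(s): @errors\n");
```

An encoder that replaces unexpected values with defaults, as described in the sections above, is what keeps this list of errors empty.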