Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-write-a-driver [2007/09/26 20:32]
zeman Inventing new feature values.
user:zeman:interset:how-to-write-a-driver [2008/03/14 10:15]
zeman Mutual positions of list() and BEGIN.
Line 1: Line 1:
 ====== How to write a driver ====== ====== How to write a driver ======
  
-Perl is the language to write a driver. A driver is a simple Perl module (.pm). It should implement the following functions: decode(), encode(), list().+Perl is the language to write a driver. A driver is a simple Perl module (.pm). It should implement the following functions: ''decode()'', ''encode()'', ''list()''. The ''tagset/common.pm'' module contains some code you may find useful, so your driver module should start with
  
-Input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, using some unique delimiters. We recommend "\t" (horizontal tab, ASCII 9) as delimiter. If desirable, the input/output tag can be even a multi-line XML!+<code perl>use tagset::common;</code>
  
-Empty feature value means "unknown". It is even not known, whether this feature would be relevant. If we know that a feature is irrelevant, we can set it to "n/a" (not applicable; although not mentioned explicitly, this value is allowed for all features). However, I am not sure whether this should be used at all. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal "tagset", we probably better leave the value empty or even set it to an appropriate default.+The input/output tag can be any string. If the information is stored in several kinds of tags, they can be passed in one string, using some unique delimiters. We recommend "\t" (horizontal tab, ASCII 9) as the delimiter. If desirable, the input/output tag can even be multi-line XML! 
 + 
 +An empty feature value means "unknown". It is not even known whether this feature would be relevant. Some tagsets distinguish between unknown values and irrelevant features. This is not the case in Interset. While something can be irrelevant in one tagset, we can hardly say that it is not relevant in any tagset. So, since we are setting a value in a universal "tagset", we had probably better leave the value empty or even set it to an appropriate default.
  
 ===== decode() ===== ===== decode() =====
Line 11: Line 13:
 This function has one string argument, the tag. The function returns a reference to a hash of features (feature names are hash keys to the feature values). This function has one string argument, the tag. The function returns a reference to a hash of features (feature names are hash keys to the feature values).
  
-The decoder is not obliged to set any feature. If the decoder decides to set a feature, it should be one of the pre-defined values. This can be checked by a central procedure. However, it is not mandatory, so if the appropriate value is not available, you can use your own, but please do **[[zeman@ufal.mff.cuni.cz|let me know]]** so I can update the central value pool accordingly. (If you set a value that is not documented as a part of the universal set, no one else can benefit from it. If you combine your driver with another driver to convert from your tag set to the other, the other driver's encode() will not take your invented value into account. It may even behave worse than if the value was empty.)+The decoder is not obliged to set any feature. If the decoder decides to set a feature, it should be one of the pre-defined values. This can be checked by a central procedure. However, it is not mandatory, so if the appropriate value is not available, you can use your own, but please do **[[zeman@ufal.mff.cuni.cz|let me know]]** so I can update the [[features|central value pool]] accordingly. (If you set a value that is not documented as a part of the universal set, no one else can benefit from it. If you combine your driver with another driver to convert from your tag set to the other, the other driver's encode() will not take your invented value into account. It may even behave worse than if the value was empty.)
  
 If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should be tolerant to unexpected combinations of features (or should be able to be tolerant if asked for it). If the tagset encodes features separately (e.g., each character is a value of a particular feature): The decoder should be tolerant to unexpected combinations of features (or should be able to be tolerant if asked for it).
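
A minimal sketch of a tolerant ''decode()'' for a positional tagset follows. The three-character tag format, the lookup tables and the feature names and values are assumptions made for illustration only; real values should be checked against the central value pool. Each position is decoded on its own, so an unexpected combination or an unknown character simply leaves the corresponding feature unset (empty, i.e. "unknown").

<code perl>
use strict;
use warnings;

# Hypothetical positional tagset: [0] part of speech, [1] gender, [2] number.
# Feature names and values are illustrative assumptions, not the official pool.
my %POS    = (N => 'noun', V => 'verb', A => 'adj');
my %GENDER = (M => 'masc', F => 'fem',  N => 'neut');
my %NUMBER = (S => 'sing', P => 'plu');

sub decode
{
    my $tag = shift;
    my @chars = split(//, $tag);
    my @tables = (['pos', \%POS], ['gender', \%GENDER], ['number', \%NUMBER]);
    my %f;
    for(my $i = 0; $i <= $#tables; $i++)
    {
        my ($feature, $table) = @{$tables[$i]};
        my $char = $chars[$i];
        # Positions are decoded independently; unknown characters set nothing.
        $f{$feature} = $table->{$char} if(defined($char) && exists($table->{$char}));
    }
    # Return a reference to the hash of features.
    return \%f;
}

my $f = decode('NFS');
print(join(', ', map { $_.'='.$f->{$_} } sort(keys(%{$f}))), "\n");
</code>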
 +
 +
 +
  
 ===== encode() ===== ===== encode() =====
Line 19: Line 24:
 This function has one argument, a reference to a hash of features (feature names are hash keys to the feature values). The function returns a string - the tag. This function has one argument, a reference to a hash of features (feature names are hash keys to the feature values). The function returns a string - the tag.
  
-The encoder should be able to process all possible values from the central pool. If the tagset does not recognize a value, the most appropriate substitute should be chosen.+The encoder should be able to process all possible values from the [[features|central pool]]. If the tagset does not recognize a value, the most appropriate substitute should be chosen.
  
-Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared to arrays (more precisely: array references) anywhere, or call tagset::single_values() to get rid of the arrays (or some of them).+Since any feature can in theory have an array of values instead of a single value, the encoder should either be prepared for arrays (more precisely: array references) anywhere, or call ''tagset::single_values()'' to get rid of the arrays (or some of them). See [[#Alternative values]] for more details.
  
-**WARNING:** Before modifying the contents of ''%f'', you should make a //deep// copy of it. You cannot assume that the user of the driver will not need the values in ''%f'' after encoding.+**WARNING:** Before modifying the contents of the feature structure, you should make a //deep// copy of it. You cannot assume that the user of the driver will not need the original values of the features after encoding. If you have called ''single_values()'', it made the copy for you.
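
One way to make the deep copy yourself is ''dclone()'' from the core ''Storable'' module. The sketch below is not the code of any actual driver: the ''feature=value|...'' output format is invented for the example. It sorts array values in place, which would leak back to the caller if only a shallow copy of the hash had been made.

<code perl>
use strict;
use warnings;
use Storable qw(dclone);

sub encode
{
    my $f0 = shift;
    # Deep copy: nested array references are duplicated as well.
    my $f = dclone($f0);
    my @parts;
    foreach my $feature (sort(keys(%{$f})))
    {
        my $value = $f->{$feature};
        if(ref($value) eq 'ARRAY')
        {
            # In-place modification, safe only because we work on the copy.
            @{$value} = sort(@{$value});
            $value = join(',', @{$value});
        }
        push(@parts, $feature.'='.$value);
    }
    # Hypothetical tag format, for illustration only.
    return join('|', @parts);
}

my %features = (pos => 'noun', gender => ['neut', 'fem']);
my $tag = encode(\%features);
print("$tag\n");
print("caller still sees: @{$features{gender}}\n");
</code>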
  
 ===== list() ===== ===== list() =====
Line 30: Line 35:
  
 The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, you may still be able to acquire a partial list from a corpus. The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, you may still be able to acquire a partial list from a corpus.
 +
  
 ===== Alternative values ===== ===== Alternative values =====
Line 60: Line 66:
 Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to ''grep'' your values rather than trying exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters. Now, what do you do with features where you want to encode arrays? You should first check whether the value is an array or not. If it is an array, you may want to ''grep'' your values rather than trying exact match, because you do not know what is going to come from other drivers, and the ordering or additional values may not be what matters.
  
-If the arrays turns out to be incompatible with what you expect, you should pick one value (we suggest you  take the first one) and proceed with default single-value processing.+If the array turns out to be incompatible with what you expect, you should pick one value (we suggest you  take the first one) and proceed with default single-value processing.
  
 <code perl> <code perl>
Line 89: Line 95:
  
 **Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, and if ''NNQW'' means a logical disjunction of the tags ''NNFS'' and ''NNNP'', then you cannot encode the situation in DZ Interset precisely. If you do not want to discard either ''NNFS'' or ''NNNP'' (by storing the other only), you can say that gender = ''F'' or ''N'' and number = ''S'' or ''P'' but by that you have also introduced ''NNFP'' and ''NNNS'' as possibilities. The approach may be revised in future. **Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, and if ''NNQW'' means a logical disjunction of the tags ''NNFS'' and ''NNNP'', then you cannot encode the situation in DZ Interset precisely. If you do not want to discard either ''NNFS'' or ''NNNP'' (by storing the other only), you can say that gender = ''F'' or ''N'' and number = ''S'' or ''P'' but by that you have also introduced ''NNFP'' and ''NNNS'' as possibilities. The approach may be revised in future.
 +
 +
 +
 +
 +
 +
 +
 +===== Replacing feature values with defaults =====
 +
 +The encoder's problem is that there are more feature values on input than can be encoded on output. If a value cannot be encoded, the encoder must replace it with a suitable default. Although it can control the replacement completely by its own means (e.g. by a system of ''if''-''else'' statements), there is a central system of defaults that can take care of it. The central system however needs the following:
 +
 +  - A table of replacement values for each value, ordered by precedence. There is a default table in ''tagset::common''. A driver can supply its own, if needed.
 +  - The list of all tags in the tag set (implemented by the ''list()'' driver function). Then the central system will return the highest-priority //permitted// value. A value is permitted if the tag set contains a tag that yields the value when decoded.
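
To make the notion of a //permitted// value concrete, here is a rough sketch of how such a table could be collected. It is not the actual code of ''tagset::common::get_permitted_values()''; the function name ''collect_permitted_values()'', the toy tag table and the assumption that the tag list arrives as an array reference are all made up for the example.

<code perl>
use strict;
use warnings;

# Hypothetical helper: decode every tag and remember each value seen per feature.
sub collect_permitted_values
{
    my ($tags, $decode) = @_;
    my %seen;
    foreach my $tag (@{$tags})
    {
        my $f = $decode->($tag);
        foreach my $feature (keys(%{$f}))
        {
            my $value = $f->{$feature};
            next unless(defined($value));
            # If an array is a permitted value, all its members are permitted.
            my @values = ref($value) eq 'ARRAY' ? @{$value} : ($value);
            foreach my $v (@values)
            {
                $seen{$feature}{$v} = 1;
            }
        }
    }
    # Result: hash of arrays, one list of permitted values per feature.
    my %permitted;
    foreach my $feature (keys(%seen))
    {
        $permitted{$feature} = [sort(keys(%{$seen{$feature}}))];
    }
    return \%permitted;
}

# Toy tagset standing in for a real driver's list() and decode().
my %toy =
(
    'NFS' => {pos => 'noun', gender => 'fem', number => 'sing'},
    'NXP' => {pos => 'noun', gender => ['masc', 'fem'], number => 'plu'}
);
my $permitted = collect_permitted_values([sort(keys(%toy))], sub { return $toy{$_[0]}; });
foreach my $feature (sort(keys(%{$permitted})))
{
    print("$feature: @{$permitted->{$feature}}\n");
}
</code>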
 +
 +Building the list of permitted values is expensive (all tags must be decoded!) and you should do it only once, when your driver initializes. In your ''BEGIN'' block, you should call ''tagset::common::get_permitted_values()'' and store the hash reference it returns. The hash (of arrays) will contain a list of permitted values for every feature. Note that the ''list()'' function must be defined before the ''BEGIN'' block that uses it; the same holds for ''decode()'', because all tags are decoded while the block runs.
 +
 +When you later need to check a value and replace it if necessary, you pass the hash reference back to ''tagset::common'':
 +
 +<code perl>
 +use tagset::common;
 +...
 +sub list { ... }
 +...
 +BEGIN
 +{
 +    # Store the hash reference in a global variable.
 +    $permitvals = tagset::common::get_permitted_values(list(), \&decode);
 +}
 +...
 +$replacement = tagset::common::check_value($feature, $value, $permitvals);
 +</code>
 +
 +Alternatively, the following checks **and replaces** values of all features in a feature structure:
 +
 +<code perl>
 +tagset::common::enforce_permitted_values($fstruct, $permitvals);
 +</code>
 +
 +If an array is a permitted value, all member values are permitted.
 +
 +If an array is checked, all member values must be permitted in order for the array to be permitted. Otherwise, the array is pruned and the replacement is a subarray where only permitted values are kept. If no member values are permitted (hence the pruned subarray would be empty), the replacement is a single value, the highest-priority replacement of the first element of the array. If the original array was empty (which should never happen but we ought to be careful anyway), the single empty value is checked and possibly replaced.
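
The rules for single values and arrays can be sketched as follows. This is an illustration of the described behaviour, not the implementation in ''tagset::common'': the function names, the layout of the replacement table (feature, then value, then an ordered list of candidates), the fallback to an empty value when no candidate is permitted, and the sample values are all assumptions.

<code perl>
use strict;
use warnings;

# Return the value itself if permitted, else the highest-priority permitted replacement.
sub replace_single
{
    my ($feature, $value, $permitted, $defaults) = @_;
    $value = '' unless(defined($value));
    my %ok = map { $_ => 1 } @{$permitted->{$feature} || []};
    return $value if($ok{$value});
    my $table = $defaults->{$feature} || {};
    foreach my $candidate (@{$table->{$value} || []})
    {
        return $candidate if($ok{$candidate});
    }
    # Assumption: nothing suitable found, fall back to empty ("unknown").
    return '';
}

sub check_value_sketch
{
    my ($feature, $value, $permitted, $defaults) = @_;
    if(ref($value) eq 'ARRAY')
    {
        my %ok = map { $_ => 1 } @{$permitted->{$feature} || []};
        my @kept = grep { $ok{$_} } @{$value};
        # All members permitted: the array is permitted as it is.
        return $value if(@kept && @kept == @{$value});
        # Some members permitted: prune the array.
        return \@kept if(@kept);
        # No member permitted (or empty array): replace the first element.
        my $first = @{$value} ? $value->[0] : '';
        return replace_single($feature, $first, $permitted, $defaults);
    }
    return replace_single($feature, $value, $permitted, $defaults);
}

my $permitted = {gender => ['fem', 'masc']};
my $defaults  = {gender => {neut => ['com', 'masc', 'fem']}};
my $pruned = check_value_sketch('gender', ['fem', 'neut'], $permitted, $defaults);
my $single = check_value_sketch('gender', ['neut'], $permitted, $defaults);
print("pruned: @{$pruned}\n");
print("single: $single\n");
</code>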
  
 ===== Common problems ===== ===== Common problems =====
  
 See [[user:zeman:interset:Common Problems]] for a list of suggestions for phenomena difficult to match between tagsets and the Interset. See [[user:zeman:interset:Common Problems]] for a list of suggestions for phenomena difficult to match between tagsets and the Interset.
 +
 +
 +
 +
 +
 +
  
 ===== Test your driver ===== ===== Test your driver =====
  
-When you have written a driver for a new tagset, you should test it. An Interset service module can perform the following tests:+When you have written a driver for a new tagset, you should test it. The driver package contains a test script called ''driver-test.pl''. When running it, give the driver name as an argument, without the ''tagset::'' prefix. You can also use the ''-d'' option to turn on debug messages (a list of the tags being tested). 
 + 
 +<code>driver-test.pl ar::conll 
 +driver-test.pl -a</code> 
 + 
 +Running ''driver-test.pl'' without arguments will list the drivers available on the system. Running it with the ''-a'' option will test all the drivers. 
 + 
 +Note that only drivers implementing the ''list()'' function can be tested. Most testing involves generating the list of all possible tags and testing the driver on each tag separately. 
 + 
 +The following tests will be performed:
  
-  * Get list of possible tags by calling list(). Go through the list and check for each tag that encode(decode($tag))=$tag. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to //access// the tags, instead of //converting// them.+  * Decode each tag and check that only known features and values are set. In addition to a built-in list, every feature can have an empty value, and the features "tagset" and "other" can have any value. 
 +  * Check for each tag that ''encode(decode($tag)) eq $tag''. While sometimes it can be annoying to try to preserve some obscure information hidden in the tags, this test can also reveal many unwanted bugs. Besides, you should preserve information of your own tagset because people may want to use your driver merely to //access// the tags, instead of //converting// them.
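
The round-trip check in the last item can be sketched as below. This is not the code of ''driver-test.pl''; ''roundtrip_failures()'' and the toy two-character driver are hypothetical, and the sketch assumes that ''list()'' returns a reference to an array of tags.

<code perl>
use strict;
use warnings;

# Return a description of every tag for which encode(decode($tag)) ne $tag.
sub roundtrip_failures
{
    my ($list, $decode, $encode) = @_;
    my @failures;
    foreach my $tag (@{$list->()})
    {
        my $result = $encode->($decode->($tag));
        push(@failures, "$tag -> $result") unless($result eq $tag);
    }
    return @failures;
}

# Toy driver for illustration: [0] part of speech, [1] number.
my %POS = (N => 'noun', V => 'verb');
my %NUM = (S => 'sing', P => 'plu');
my %POS_R = reverse(%POS);
my %NUM_R = reverse(%NUM);

sub toy_list { return ['NS', 'NP', 'VS']; }
sub toy_decode
{
    my @c = split(//, shift);
    return {pos => $POS{$c[0]}, number => $NUM{$c[1]}};
}
sub toy_encode
{
    my $f = shift;
    return $POS_R{$f->{pos}}.$NUM_R{$f->{number}};
}

my @failures = roundtrip_failures(\&toy_list, \&toy_decode, \&toy_encode);
print(@failures ? join("\n", @failures)."\n" : "All tags survived the round trip.\n");
</code>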
  
-To perform the test, run the script ''driver-test.pl'' in the ''tagset'' root folder. Note that the name of the driver to test is currently hard-coded into the source. In future, it will be changed to a command-line argument. 
