Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:how-to-write-a-driver [2008/03/14 10:21]
zeman Conversion testing.
user:zeman:interset:how-to-write-a-driver [2008/03/14 10:56]
zeman Enforcing permitted feature structures.
Line 16: Line 16:
  
 If the tagset encodes features separately (e.g., each character is the value of a particular feature), the decoder should tolerate unexpected combinations of features (or should be able to tolerate them when asked to).
- 
- 
- 
  
 ===== encode() =====
Line 35: Line 32:
  
 The list is not necessary for the driver to work. However, it can be useful for [[#Test your driver|testing]] the driver. If no list is distributed along with the tagset description, you may still be able to acquire a partial list from a corpus.
- 
  
 ===== Alternative values =====
Line 95: Line 91:
  
 **Note:** This approach cannot encode situations where some combinations of feature values are plausible and some are not! For instance, if positions [2] and [3] in a tag encode gender and number, respectively, and if ''NNQW'' means a logical disjunction of the tags ''NNFS'' and ''NNNP'', then you cannot encode the situation in DZ Interset precisely. If you do not want to discard either ''NNFS'' or ''NNNP'' (by storing only the other one), you can say that gender = ''F'' or ''N'' and number = ''S'' or ''P'', but by doing so you also introduce ''NNFP'' and ''NNNS'' as possibilities. The approach may be revised in the future.
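
The over-generation described in the note can be reproduced with a few lines of Perl. This is only an illustrative sketch: the hash ''%fs'' and the way the tags are assembled are assumptions made for the example, not the actual DZ Interset data structures.

<code perl>
use strict;
use warnings;

# Hypothetical feature structure: each feature holds an array of
# alternative values, stored independently of the other features.
my %fs = (gender => ['F', 'N'], number => ['S', 'P']);

# Expanding independent alternatives yields every combination,
# including the two that the original tag NNQW never meant.
my @tags;
for my $g (@{ $fs{gender} }) {
    for my $n (@{ $fs{number} }) {
        push @tags, "NN$g$n";
    }
}
print join(' ', @tags), "\n";   # prints "NNFS NNFP NNNS NNNP"
</code>

Only ''NNFS'' and ''NNNP'' were intended; ''NNFP'' and ''NNNS'' appear because the link between the two positions is lost.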
- 
- 
- 
- 
- 
- 
  
 ===== Replacing feature values with defaults =====
Line 136: Line 126:
  
 If an array is checked, all member values must be permitted in order for the array to be permitted. Otherwise, the array is pruned and the replacement is a subarray in which only the permitted values are kept. If no member values are permitted (so the pruned subarray would be empty), the replacement is a single value: the highest-priority replacement of the first element of the array. If the original array was empty (which should never happen, but we ought to be careful anyway), the single empty value is checked and possibly replaced.
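
The array rule above can be sketched as follows. The functions ''check_value()'' and ''check_array()'', the layout of the replacement table and the case values are all hypothetical and chosen for illustration; they are not the functions of the Interset common library. The sketch also assumes that a value with no permitted replacement falls back to the empty value.

<code perl>
use strict;
use warnings;

# Hypothetical helper: return the value itself if it is permitted,
# otherwise its first permitted replacement (the list is in priority order).
sub check_value {
    my ($value, $permitted, $replacements) = @_;
    return $value if $permitted->{$value};
    for my $r (@{ $replacements->{$value} || [] }) {
        return $r if $permitted->{$r};
    }
    return '';   # assumption: fall back to the empty value
}

# Hypothetical helper implementing the array rule described above.
sub check_array {
    my ($values, $permitted, $replacements) = @_;
    # Empty array: check the single empty value instead.
    return check_value('', $permitted, $replacements) unless @$values;
    # Keep only the permitted members (all of them if the array is permitted).
    my @kept = grep { $permitted->{$_} } @$values;
    return \@kept if @kept;
    # Nothing survived: replace the array by a single value derived
    # from its first element.
    return check_value($values->[0], $permitted, $replacements);
}

my %permitted    = map { $_ => 1 } ('', 'nom', 'acc');
my %replacements = (gen => ['dat', 'acc']);

my $pruned = check_array(['nom', 'gen'], \%permitted, \%replacements);   # ['nom']
my $single = check_array(['gen', 'ins'], \%permitted, \%replacements);   # 'acc'
</code>

Note that the result is an array reference when at least one member survives and a plain scalar otherwise, which mirrors the wording of the rule.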
 +
 +===== Replacing whole feature structures with defaults =====
 +
 +The above technique does not guarantee that the encoder will only see feature //combinations// that it normally gets from its own decoder. For instance, the ''case'' feature may be known in the given tagset but occur only with nouns. When encoding features produced by a different driver, there could be a non-empty case for adjectives, pronouns, etc. as well. Even though the encoder may be able to encode the new feature combination (using the tagset's usual way of expressing the values "adjective" and, say, "genitive"), it is not always desirable (depending on the application) to invent new tags. Default encoding should be **strict**, meaning that the encoder can return only the tags returned by ''list()''.
 +
 +As with the replacement of separate values, the encoder can ask the Interset common library to replace the whole structure with one that the encoder is used to (i.e., a structure that results from decoding a tag known to the driver). This usually relieves the encoder of the burden of handling exotic features and values.
 +
 +The correcting function tries to change as little as possible. There is a priority value associated with every known feature. Feature values are checked (and possibly altered) in the order of feature priorities. In the above example (an adjective cannot have case), the part of speech would keep its "adjective" value, and the case value would be removed. If the ''case'' feature had higher priority than the ''pos'' feature (it does not), a non-empty value of ''case'' would force the part of speech to be changed from adjective to noun.
 +
 +<code perl>
 +use tagset::common;
 +...
 +sub list { ... }
 +...
 +BEGIN
 +{
 +    # Store the hash reference in a global variable.
 +    $permitted = tagset::common::get_permitted_structures_joint(list(), \&decode);
 +}
 +...
 +# Give reference to feature structure. Get reference to a new one (deep copy).
 +$fs1 = tagset::common::enforce_permitted_joint($fs0, $permitted);
 +</code>
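
The effect of feature priorities can be illustrated with a much simplified stand-in for the enforcing function. Everything in this sketch is an assumption made for the example: the function ''enforce()'', the flat list of permitted structures, the two-feature priority list and the fallback to the first remaining candidate. The real library functions may work differently.

<code perl>
use strict;
use warnings;

# Permitted structures, as they might result from decoding the tag list.
# Every structure must define every feature (empty string = no value).
my @permitted = (
    { pos => 'noun', case => 'nom' },
    { pos => 'noun', case => 'gen' },
    { pos => 'adj',  case => ''    },
);

# Hypothetical enforcer: visit features in priority order and keep
# narrowing the set of permitted structures that are still compatible.
sub enforce {
    my ($fs, $priority, $permitted) = @_;
    my @candidates = @$permitted;
    my %result;
    for my $f (@$priority) {
        my $wanted = defined $fs->{$f} ? $fs->{$f} : '';
        my @match  = grep { $_->{$f} eq $wanted } @candidates;
        if (!@match) {
            # The wanted value is impossible here: take the value of the
            # first remaining candidate instead (assumed fallback policy).
            my $fallback = $candidates[0]{$f};
            @match = grep { $_->{$f} eq $fallback } @candidates;
        }
        @candidates = @match;
        $result{$f} = $candidates[0]{$f};
    }
    return \%result;   # a new hash, the input is left untouched
}

my $fs0 = { pos => 'adj', case => 'gen' };
my $a = enforce($fs0, ['pos', 'case'], \@permitted);   # pos=adj,  case=''
my $b = enforce($fs0, ['case', 'pos'], \@permitted);   # pos=noun, case=gen
</code>

With ''pos'' checked first, the adjective survives and the case is dropped; with the order reversed, the genitive survives and the part of speech becomes noun, as described above.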
  
 ===== Common problems =====