Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:interset:common-problems [2008/04/04 15:48]
zeman Proposal for future versions.
user:zeman:interset:common-problems [2008/04/04 15:59]
zeman Refined future model.
Line 52: Line 52:
  
 Chinese 的 (de) has a part of speech of its own in the Sinica treebank, ''DE''. The easiest approach is to decode it as a particle and remember its special nature using a new ''subpos'' value, or just storing it in the ''other'' feature. However, //de//'s usage could be compared to that of conjunctions. (It's not a coordinative conjunction, though. It connects two elements with different roles: often a possessor and the possessed object, e.g. 我的腦海.)
 +
  
  
Line 73: Line 74:
 The correct solution would be to decode such a tag into multiple parallel feature structures. Every structure would contain only single values, no arrays. This would remove one level of complexity inside the structures but add another level around them. We can consider making this change in a future version of Interset. There could be two interfaces to the decoding function: one that would output an array of (references to) feature structures, and another that would output (a reference to) just one feature structure, with an additional feature holding a reference to the next feature structure. The encoder would select the structure that requires the least modification to fit the target tagset. If the user can deal with more than one target tag, they would ask for each structure to be encoded separately. If the target tagset could accommodate alternate values in some features, the encoder could look at multiple structures at a time; it is unclear how this would be done.
  
-Multiple alternate feature structures can also be stored in a packed form. There is only one structure. It has an additional feature called ''clusters''. Its value is an array of substructures (clusters) C1..Cn. If a feature is set in a cluster, it must be empty in the main structure (or there can be the reference to the clusters or another special value). If there are two disjunct independent subsets of features (there are dependencies between the features in one subset but not between the subsets), we must write all combinations of possible values from all affected subsets (as if there were dependencies between the subsets). Adding a layer to separate the independent subsets would only complicate processing.
+Multiple alternate feature structures can also be stored in a packed form. There is only one structure. It has an additional feature called ''_clusters_''. Its value is an array of substructures (clusters) C1..Cn. If a feature is set in a cluster, it must be empty in the main structure (or there can be the reference to the clusters or another special value). If there are two disjunct independent subsets of features (there are dependencies between the features in one subset but not between the subsets), we must write all combinations of possible values from all affected subsets (as if there were dependencies between the subsets). Adding a layer to separate the independent subsets would only complicate processing.
  
 <code perl>%f =
Line 84: Line 85:
     'degree'       => "pos",
     'negativeness' => "pos",
-    '_clusters'    =>
+    '_clusters_'   =>
     [
         {'gender' => "masc", 'animateness' => "inan"},
         {'gender' => "fem"}
     ]
 +);</​code>​
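
The packed form can be unpacked again into parallel structures that hold only single values. The following is a minimal sketch, not part of Interset: the helper name ''expand_clusters'' is hypothetical, and the values of ''pos'', ''number'' and ''case'' are assumed for illustration. It assumes that every cluster is a hash reference and that a feature set in a cluster is empty in the main structure.

```perl
use strict;
use warnings;

# Hypothetical helper (not an Interset function): expands a packed feature
# structure into a list of flat structures, one per cluster.
sub expand_clusters
{
    my $f = shift; # reference to the packed feature structure
    my %main = %{$f}; # shallow copy, the caller's hash stays untouched
    my $clusters = delete($main{'_clusters_'});
    my @structures;
    if(ref($clusters) eq 'ARRAY' && scalar(@{$clusters}) > 0)
    {
        foreach my $cluster (@{$clusters})
        {
            # Shared features first, then the values specific to this cluster.
            my %flat = (%main, %{$cluster});
            push(@structures, \%flat);
        }
    }
    else
    {
        # No clusters: the main structure is the only reading.
        push(@structures, \%main);
    }
    return @structures;
}

# Assumed example values, chosen only to illustrate the packed form.
my %f =
(
    'pos'          => "adj",
    'number'       => "plu",
    'case'         => "nom",
    'degree'       => "pos",
    'negativeness' => "pos",
    '_clusters_'   =>
    [
        {'gender' => "masc", 'animateness' => "inan"},
        {'gender' => "fem"}
    ]
);
my @structures = expand_clusters(\%f);
foreach my $s (@structures)
{
    my @pairs;
    foreach my $feature (sort(keys(%{$s})))
    {
        push(@pairs, "$feature=$s->{$feature}");
    }
    print(join(', ', @pairs), "\n");
}
```

Each returned structure contains only single values, so an encoder could process the structures one at a time.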
 +
 +Alternatively, the alternating features could directly reference the array of substructures containing their alternating values. No additional technical feature would be needed. There could be more than one independent array of substructures if needed. The main structure would look the same as today, including the possibility that a feature value is an array reference. However, the referenced array would not directly contain the values of the referencing feature. Instead, it would contain (references to) substructures (hashes), each of them defining a single value for this and possibly other features.
 +
 +<code perl>@alt =
 +(
 +    {'gender' => "masc", 'animateness' => "inan"},
 +    {'gender' => "fem"}
 +);
 +%f =
 +(
 +    'pos'          => "adj",
 +    'gender'       => \@alt,
 +    'animateness'  => \@alt,
 +    'number'       => "plu",
 +    'case'         => "nom",
 +    'degree'       => "pos",
 +    'negativeness' => "pos",
  );</code>
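
A structure whose features share a referenced array of substructures could be expanded in a similar way. The sketch below is only an illustration under assumptions: ''expand_alternatives'' is a hypothetical name, every array-valued feature is assumed to reference an array of hashes (not plain values), and features that reference the same array are recognized by the address of that array. Independent arrays are combined with each other, which yields all combinations.

```perl
use strict;
use warnings;
use Scalar::Util qw(refaddr);

# Hypothetical helper (not an Interset function): expands a structure whose
# features reference shared arrays of substructures into flat structures.
sub expand_alternatives
{
    my $f = shift; # reference to the main feature structure
    my %fixed;     # features with a single value
    my %arrays;    # array address => reference to the array of substructures
    my @order;     # array addresses in the order of first occurrence
    foreach my $feature (sort(keys(%{$f})))
    {
        my $value = $f->{$feature};
        if(ref($value) eq 'ARRAY')
        {
            # Features sharing one array are alternated together.
            my $id = refaddr($value);
            push(@order, $id) unless(exists($arrays{$id}));
            $arrays{$id} = $value;
        }
        else
        {
            $fixed{$feature} = $value;
        }
    }
    # Start from the fixed features and multiply by each independent array.
    my @structures = (\%fixed);
    foreach my $id (@order)
    {
        my @next;
        foreach my $base (@structures)
        {
            foreach my $alt (@{$arrays{$id}})
            {
                my %merged = (%{$base}, %{$alt});
                push(@next, \%merged);
            }
        }
        @structures = @next;
    }
    return @structures;
}

my @alt =
(
    {'gender' => "masc", 'animateness' => "inan"},
    {'gender' => "fem"}
);
my %f =
(
    'pos'          => "adj",
    'gender'       => \@alt,
    'animateness'  => \@alt,
    'number'       => "plu",
    'case'         => "nom",
    'degree'       => "pos",
    'negativeness' => "pos",
);
my @structures = expand_alternatives(\%f);
```

With the example above, ''gender'' and ''animateness'' point to the same array, so they alternate together instead of being combined freely.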
