Multilingual Noise-Robust Supervised Morphological Analysis using the WordFrame Model
Summary
In this paper the author presents a new supervised method for lemmatization, called the WordFrame model.
The new method is compared to the existing End-of-String method and is shown to perform better in most cases.
The results are evaluated on 30 different languages, with a median accuracy of 97.5%.
The WordFrame model trains well on noisy data, so it can be used in co-training with unsupervised methods.
Described models
Both models described in this paper are meant to decompose a word into basic parts (not morphemes, but something similar).
Extended End-of-String model
Decomposition of inflection into
prefix - concatenation of all prefixes
primary common substring - the stem
point of suffixation change - phonologically induced letter change on the boundary between the stem and the suffix
suffix/ending - concatenation of all suffixes of the word
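The four-part decomposition above can be sketched as a small data structure. This is only an illustration of the list, not the paper's implementation: the class name, the field names, the `lemma_suffix` field (an assumed canonical ending added to the lemma) and the example words are our own assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EOSAnalysis:
    """Hypothetical container for an extended End-of-String decomposition."""
    prefix: str                 # concatenation of all prefixes of the inflection
    stem: str                   # primary common substring
    change_inflected: str = ""  # letters at the stem/suffix boundary in the inflection
    change_lemma: str = ""      # the corresponding letters in the lemma
    suffix: str = ""            # concatenation of all suffixes of the inflection
    lemma_suffix: str = ""      # assumption: canonical ending of the lemma

    def inflection(self) -> str:
        # Rebuild the inflected form from its parts.
        return self.prefix + self.stem + self.change_inflected + self.suffix

    def lemma(self) -> str:
        # Drop the prefix, undo the boundary change, swap the ending.
        return self.stem + self.change_lemma + self.lemma_suffix


# Illustrative examples chosen by us, not taken from the paper.
tried = EOSAnalysis(prefix="", stem="tr", change_inflected="i",
                    change_lemma="y", suffix="ed")
gespielt = EOSAnalysis(prefix="ge", stem="spiel", suffix="t", lemma_suffix="en")

print(tried.inflection(), "->", tried.lemma())
print(gespielt.inflection(), "->", gespielt.lemma())
```

In this sketch the stem is one unbroken string, so a form whose stem vowel differs from the lemma's (for example "kept" and "keep") has to push everything from the vowel onwards into the suffixation change.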
WordFrame model
Decomposition of inflection into
prefix - concatenation of all prefixes
point of prefixation change - phonologically induced letter change on the boundary between the prefix and the first part of the stem
secondary common substring - the part of the stem before the stem vowel change
vowel change - the vowel change inside the stem
primary common substring - the part of the stem after the vowel change
point of suffixation change - phonologically induced letter change on the boundary between the stem and the suffix
suffix/ending - concatenation of all suffixes of the word
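The seven-part decomposition above can be sketched in the same way. Again this is a minimal illustration under our own assumptions: the class name, the field names, the `lemma_suffix` field (an assumed canonical ending of the lemma) and the example words are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WordFrameAnalysis:
    """Hypothetical container for a WordFrame decomposition."""
    prefix: str = ""                   # concatenation of all prefixes
    prefix_change_inflected: str = ""  # letters at the prefix/stem boundary in the inflection
    prefix_change_lemma: str = ""      # the corresponding letters in the lemma
    secondary_substring: str = ""      # part of the stem before the vowel change
    vowel_inflected: str = ""          # stem vowel(s) in the inflection
    vowel_lemma: str = ""              # stem vowel(s) in the lemma
    primary_substring: str = ""        # part of the stem after the vowel change
    suffix_change_inflected: str = ""  # letters at the stem/suffix boundary in the inflection
    suffix_change_lemma: str = ""      # the corresponding letters in the lemma
    suffix: str = ""                   # concatenation of all suffixes
    lemma_suffix: str = ""             # assumption: canonical ending of the lemma

    def inflection(self) -> str:
        # Rebuild the inflected form from its parts, left to right.
        return (self.prefix + self.prefix_change_inflected
                + self.secondary_substring + self.vowel_inflected
                + self.primary_substring + self.suffix_change_inflected
                + self.suffix)

    def lemma(self) -> str:
        # Drop the prefix, undo the three changes, swap the ending.
        return (self.prefix_change_lemma + self.secondary_substring
                + self.vowel_lemma + self.primary_substring
                + self.suffix_change_lemma + self.lemma_suffix)


# Illustrative examples chosen by us, not taken from the paper.
kept = WordFrameAnalysis(secondary_substring="k", vowel_inflected="e",
                         vowel_lemma="ee", primary_substring="p", suffix="t")
gesungen = WordFrameAnalysis(prefix="ge", secondary_substring="s",
                             vowel_inflected="u", vowel_lemma="i",
                             primary_substring="ng", suffix="en",
                             lemma_suffix="en")

print(kept.inflection(), "->", kept.lemma())
print(gesungen.inflection(), "->", gesungen.lemma())
```

When the prefixation change, the secondary common substring and the vowel change are all left empty, the remaining fields have the same shape as the End-of-String decomposition.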
Suggested Additional Reading
What do we like about the paper
What do we dislike about the paper
Doesn't do morphological analysis, only lemmatization
Experiments done only on verbs
The paper doesn't say which option the algorithm selects if there is more than one possible correct result
The algorithm uses only features based on the word itself; it doesn't use context
With the information given in this paper, we wouldn't be able to write a program to verify the results
Questions
Does the term point of prefixation mean the same as the term morpheme boundary?
In section 4 of the paper, the experimental results presented were obtained using 10-fold cross-validation
In section 4.1, Table 5: shouldn't the WF model always give better results than the EOS model? The division of the word in the EOS model is a simplified version of the division in the WF model, isn't it?
Written by Martin Kirschner