====== Overcoming Vocabulary Sparsity in MT Using Lattices ======
=== Steve DeNeefe and Ulf Hermjakob and Kevin Knight ===

===== Overview of the article =====

1. Introduction
2. Related work
7. Conclusion

===== Introduction =====

The article is about overcoming the problem of vocabulary sparsity in SMT. Sparsity occurs because many words can be inflected or take different affixes, and the parallel training data might not contain all of those forms.
The authors of the article introduce three problems and their methods to overcome these challenges:

1. common stems are fragmented into many different forms in the training data;
2. rare and unknown words are frequent in the test data;
3. spelling variation creates additional sparseness problems.

To solve these problems, the authors modify the training and test aligned bilingual data.
> I think, training data are modified only in the first challenge/

The strong side of the proposed approaches is that these techniques work for large training data.

==== Remarks by Martin Kirschner ====

===== Challenge 1 =====

For the first challenge of vocabulary sparsity the authors do not do a complex morphological analysis; instead they apply a lightweight technique.
They split off the w- prefix when this is motivated by the aligned English words, and they remove the sentence-initial w- prefix based on corpus statistics.
Two lists are used.
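To make this concrete, here is a minimal sketch of such alignment-motivated splitting (the function name, the vocabulary check and the "list of alternatives" representation are our own illustrative assumptions, not the authors' implementation):

<code python>
# Toy sketch: decide whether to offer a split "w- <stem>" alternative
# for one Arabic token, based on the aligned English words.
# The data and the decision rule are invented for illustration.

def w_alternatives(token, aligned_english, vocab):
    """Return the alternative source forms for one token (a tiny lattice)."""
    alts = [token]                          # always keep the surface form
    if token.startswith("w") and len(token) > 1:
        stem = token[1:]
        # split only if the English side suggests a conjunction
        # and the remainder is itself a known word
        if "and" in aligned_english and stem in vocab:
            alts.append("w- " + stem)       # split alternative
    return alts

# "wktAb" ~ "and the book": both the joined and the split form are kept
print(w_alternatives("wktAb", {"and", "the", "book"}, {"ktAb", "qAl"}))
# -> ['wktAb', 'w- ktAb']
</code>

In the system described in the article, such alternatives end up as parallel paths in the input lattice, and the decoder chooses between them.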
We are not absolutely sure about the terminology of the article.
In mathematics, a lattice is a partially ordered set in which every two elements have a unique supremum and a unique infimum.
The lattice in Figure 1(b) seems to have a direction.
> Some paths are longer than other paths in 1(b), so it is not a confusion network as I defined it above. -MP-
A Confusion Network is a special kind of lattice in which every path goes through all the nodes, so all paths have the same length.
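As a toy illustration of this structural difference (made-up words, our own representation):

<code python>
# A confusion network: a fixed sequence of slots, each holding alternatives.
# Every path picks one word per slot, so all paths have the same length.
confusion_network = [["the"], ["lecture", "lectures"], ["was", "were"], ["good"]]

# A general word lattice: a directed acyclic graph; paths may differ in length.
# Edges are (from_state, to_state, word); one path uses the unsplit token,
# the other uses the split "w- qAl" alternative.
lattice_edges = [
    (0, 2, "wqAl"),    # unsplit token jumps over state 1
    (0, 1, "w-"),      # split alternative: conjunction ...
    (1, 2, "qAl"),     # ... followed by the stem
    (2, 3, "Alrjl"),
]

def paths(edges, state=0, final=3):
    """Enumerate all word sequences from `state` to `final`."""
    if state == final:
        return [[]]
    return [[w] + rest
            for s, t, w in edges if s == state
            for rest in paths(edges, t, final)]

print(paths(lattice_edges))
# -> [['wqAl', 'Alrjl'], ['w-', 'qAl', 'Alrjl']]  (lengths 2 and 3)
</code>

This matches the remark above: in Figure 1(b) some paths are longer than others, so it is a general lattice rather than a confusion network.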
Also we think that lattices may produce additional errors. It would be good to build two systems, one with lattices and one without lattices, and compare the results.
It is not really clear why they don't use a lemmatizer instead of splitting. The number of rules they might need for dealing with all the affixes, along with the w- prefix, might be about the same as if they wrote a lemmatizer.
> In addition to a lemmatizer, they could also use a morphological analyzer.

===== Challenge 2 =====

To translate rare and unknown words that are not in the dictionary, the authors use 193 hand-written linguistic rules that describe how to cut off affixes and remove inflection. The word obtained after cutting off an affix might be in the dictionary; if not, the algorithm tries to apply further rules until it reaches a word that is in the dictionary.
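A rough sketch of how such rule application could work (the rule format, the recursion depth, the example rules and the toy dictionary are our own assumptions, not the 193 rules from the article):

<code python>
# Toy sketch: strip affixes from an out-of-vocabulary word until an
# in-vocabulary form is reached.  Rules and dictionary are illustrative only.

RULES = [
    ("prefix", "Al"),   # definite article Al-
    ("prefix", "w"),    # conjunction w-
    ("suffix", "At"),   # plural -At
    ("suffix", "p"),    # ta marbuta -p
]

def in_vocab_variants(word, vocab, depth=3):
    """Return dictionary words reachable by stripping up to `depth` affixes."""
    if word in vocab:
        return {word}
    if depth == 0:
        return set()
    found = set()
    for kind, affix in RULES:
        if kind == "prefix" and word.startswith(affix) and len(word) > len(affix):
            found |= in_vocab_variants(word[len(affix):], vocab, depth - 1)
        if kind == "suffix" and word.endswith(affix) and len(word) > len(affix):
            found |= in_vocab_variants(word[:-len(affix)], vocab, depth - 1)
    return found

print(in_vocab_variants("wAlmHAdvAt", {"mHAdv", "ktAb"}))   # -> {'mHAdv'}
</code>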

There is no information in the article about how a rule is selected when there are several suitable rules for one affix. Probably they use a uniform distribution over the rules and leave it to the language model to choose one.

===== Challenge 3 =====

The third challenge is to correct spelling mistakes. If the word has one spelling
It is not clear from the article how exactly they correct the mistakes, for example:
mHAd__t__At - mHAd__v__At
It might be that they have rules such that, for example, the probability of substituting __t__ by __v__ is bigger than the probability of substituting __t__ by __a__.
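If so, a minimal sketch of such weighted substitutions could look like this (the confusion pairs, their costs and the threshold are invented for illustration; the article does not spell this out):

<code python>
# Toy sketch: propose in-vocabulary spelling variants of an unknown word
# using a table of character substitutions with hand-assigned costs.
# The pairs and costs below are invented for illustration.

CONFUSIONS = {          # substitution "cost": lower = more plausible
    ("t", "v"): 0.5,    # e.g. mHAdtAt -> mHAdvAt
    ("y", "Y"): 0.5,
    ("t", "a"): 5.0,    # much less plausible substitution
}

def spelling_alternatives(word, vocab, max_cost=1.0):
    """Return in-vocabulary variants reachable by one cheap substitution."""
    variants = []
    for (a, b), cost in CONFUSIONS.items():
        if cost > max_cost:
            continue
        for i, ch in enumerate(word):
            if ch == a:
                candidate = word[:i] + b + word[i + 1:]
                if candidate in vocab:
                    variants.append((candidate, cost))
    return variants

print(spelling_alternatives("mHAdtAt", {"mHAdvAt"}))   # -> [('mHAdvAt', 0.5)]
</code>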

===== Evaluation =====

The typo-correction BLEU score on the newswire development set is 54.4, which is lower than the baseline in Table 5. So it would be better to add an additional evaluation line: all features without typo correction. This could show whether typo correction really helps to improve the final result. Or at least they should have provided a human evaluation: for example, although the BLEU score for a sentence without typo correction may be the same as for the same sentence with typo correction, a human judge might still prefer one of the two translations.
They aligned their data using the LEAF alignment method. We discussed whether it would be possible to produce the same alignment with GIZA++ but came to the conclusion that it is not.

===== Future work =====

They can work with the prefixes b-, l-, Al- and k- using a similar approach as for the w- prefix.
They can look at the context for spelling correction.