Differences
This shows you the differences between two versions of the page.
courses:rg:overcoming_vocabulary_sparsity_in_mt_using_lattices [2010/11/29 23:44] ivanova |
courses:rg:overcoming_vocabulary_sparsity_in_mt_using_lattices [2010/11/29 23:49] ivanova |
||
---|---|---|---|
Line 16: | Line 16: | ||
The article is about overcoming the problem of vocabulary sparsity in SMT. The sparsity arises because many words are inflected or take different affixes, while the vocabulary may not contain all of those forms. | The article is about overcoming the problem of vocabulary sparsity in SMT. The sparsity arises because many words are inflected or take different affixes, while the vocabulary may not contain all of those forms. | ||
The authors of the article introduce three problems and their methods to overcome these challenges: | The authors of the article introduce three problems and their methods to overcome these challenges: | ||
- | (1) common stems are fragmented into many different forms in training data; | + | |
- | (2) rare and unknown words are frequent in test data; | + | 1. common stems are fragmented into many different forms in training data; |
- | (3) spelling variation creates additional sparseness problems. | + | 2. rare and unknown words are frequent in test data; |
+ | 3. spelling variation creates additional sparseness problems. | ||
To solve the indicated problems, the authors modify the training and test aligned bilingual data. | To solve the indicated problems, the authors modify the training and test aligned bilingual data. | ||
Line 24: | Line 25: | ||
The strength of the proposed approaches is that these techniques work with large training data. | The strength of the proposed approaches is that these techniques work with large training data. | ||
- | ===== Challenge | + | ===== Challenge 1 ===== |
For the first challenge of vocabulary sparsity, the authors do not attempt complex morphological analysis; instead, they apply a lightweight technique. | For the first challenge of vocabulary sparsity, the authors do not attempt complex morphological analysis; instead, they apply a lightweight technique. | ||
Line 43: | Line 44: | ||
- | ===== Challenge | + | ===== Challenge 2 ===== |
To translate rare and unknown words that are not in the dictionary, the authors use 193 hand-written linguistic rules that describe how to cut off affixes and remove inflection. The word obtained after cutting off an affix might be in the dictionary; if it is not, the algorithm tries to apply more rules until it reaches a word that is in the dictionary. | To translate rare and unknown words that are not in the dictionary, the authors use 193 hand-written linguistic rules that describe how to cut off affixes and remove inflection. The word obtained after cutting off an affix might be in the dictionary; if it is not, the algorithm tries to apply more rules until it reaches a word that is in the dictionary. | ||
- | There is no information in the article about how the rule is selected in case there are suitable rules for one affix. Probably they have uniform distribution of rules and they leave to a language model to choose one. | + | There is no information in the article about how the rule is selected in case there are several |
- | ===== Challenge | + | ===== Challenge 3 ===== |
- | The third challenge is to correct spelling mistakes. If the word has one spelling | + | The third challenge is to correct spelling mistakes. If the word has one spelling |
It is not clear from the article how exactly they correct the mistakes, for example | It is not clear from the article how exactly they correct the mistakes, for example |
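
The affix-stripping procedure summarized under Challenge 2 can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the five suffix rules stand in for the article's 193 hand-written rules (which are not listed in this summary), and the function name `strip_to_known`, the `max_steps` limit, and the breadth-first search order are hypothetical. Because the summary does not say how one rule is chosen when several apply, the sketch simply returns every in-vocabulary form it reaches, which would leave the choice to a later component such as the language model.

```python
from collections import deque

# Hypothetical (suffix, replacement) rules, used only for illustration.
# The article's 193 hand-written rules are not reproduced in this summary.
RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("es", ""), ("s", "")]


def strip_to_known(word, vocabulary, rules=RULES, max_steps=3):
    """Return in-vocabulary forms reachable by applying at most max_steps rules.

    Assumption: all reachable known forms are kept as alternatives,
    since the summary does not specify how a single rule is selected.
    """
    found = []
    seen = {word}
    queue = deque([(word, 0)])
    while queue:
        current, depth = queue.popleft()
        if current in vocabulary:
            # A known form was reached; do not strip it any further.
            found.append(current)
            continue
        if depth == max_steps:
            continue
        for suffix, replacement in rules:
            # Require a non-empty remainder after removing the suffix.
            if current.endswith(suffix) and len(current) > len(suffix):
                candidate = current[: -len(suffix)] + replacement
                if candidate not in seen:
                    seen.add(candidate)
                    queue.append((candidate, depth + 1))
    return found


if __name__ == "__main__":
    vocabulary = {"walk", "study", "house"}
    for unknown in ["walked", "studies", "houses", "xyz"]:
        print(unknown, strip_to_known(unknown, vocabulary))
```

A word that is already in the vocabulary is returned unchanged, and a word for which no rule sequence leads to a known form yields an empty list, so the caller can fall back to treating it as unknown.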