
Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:rg:overcoming_vocabulary_sparsity_in_mt_using_lattices [2010/11/29 22:44]
ivanova created
courses:rg:overcoming_vocabulary_sparsity_in_mt_using_lattices [2010/11/29 23:32]
ivanova
Line 1: Line 1:
-Overview of the article:
+====== Overcoming Vocabulary Sparsity in MT Using Lattices ======
 +=== Steve DeNeefe and Ulf Hermjakob and Kevin Knight === 
 + 
 +===== Overview of the article ===== 
 + 
 1. Introduction
 2. Related work
Line 5: Line 10:
 6. Experiment
 7. Conclusion
 +
 +
 +===== Introduction =====
  
 The article addresses the problem of vocabulary sparsity in SMT. Sparsity arises because many words are inflected or take different affixes, and the vocabulary may not contain all of those forms.
Line 12: Line 20:
 (3) spelling variation creates additional sparseness problems.
  
-To solve the indicated problems, the authors modify the aligned bilingual training data and the test data.
+To solve the indicated problems, the authors modify the aligned bilingual training data and the test data. A strength of the proposed approaches is that they work with large training data.
-For the first challenge they don't intend to do complex morphological analysis, but they apply a lightweight technique.
  
 +===== Challenge (1) =====
  
 +For the first vocabulary-sparsity challenge, the authors do not attempt complex morphological analysis; they apply a lightweight technique instead.
 +They split off the w- prefix when this is motivated by the aligned English words, and they remove a sentence-initial w- prefix based on corpus statistics.
 +Two lists are used:
 +1. A list of English words that correspond to cases where w- is functional (Table 1);
 +2. A list of Arabic words that start with the prefix w-.
 +The resulting modified training data is used to train the MT system. The input data is transformed into a lattice containing all possible variants of morphological processing for the w- prefix.
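
To make the input transformation concrete for ourselves, here is a minimal sketch of how such a lattice could be built. It is our own illustration, not the authors' implementation: the function names `build_w_lattice` and `paths`, the `w+` label for the split-off prefix, and the transliterated example tokens are all assumptions. Every token starting with w- gets two alternatives, the unsplit token and the prefix followed by the remainder.

```python
def build_w_lattice(tokens):
    """Build a word lattice as a list of (from_node, to_node, label) edges.

    Nodes 0..len(tokens) form the main path; extra nodes are created for
    the split reading of every token that starts with "w".
    Hypothetical sketch: the label "w+" is our own notation.
    """
    edges = []
    next_free = len(tokens) + 1  # ids for intermediate nodes
    for i, tok in enumerate(tokens):
        start, end = i, i + 1
        edges.append((start, end, tok))            # unsplit reading
        if tok.startswith("w") and len(tok) > 1:
            mid = next_free
            next_free += 1
            edges.append((start, mid, "w+"))       # split-off prefix
            edges.append((mid, end, tok[1:]))      # remainder of the word
    return edges, len(tokens)


def paths(edges, final, node=0):
    """Enumerate all label sequences from `node` to the final node."""
    if node == final:
        return [[]]
    result = []
    for a, b, label in edges:
        if a == node:
            for rest in paths(edges, final, b):
                result.append([label] + rest)
    return result


edges, final = build_w_lattice(["wktb", "Alwld"])
for p in paths(edges, final):
    print(" ".join(p))
```

The decoder would then choose among these paths; how the paths are weighted is not covered by this sketch.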
  
 We are not entirely sure about the article's terminology.
 In mathematics, a lattice is a partially ordered set in which any two elements have a unique supremum (least upper bound, called their join) and a unique infimum (greatest lower bound, called their meet).
-The lattice on the figure 1(b) seems to have a direction, so it might be Confusion Network, rather than lattice.
+The lattice in Figure 1(b) seems to be directed, so it might be a confusion network rather than a lattice.
 A Confusion Network (CN), also known as a sausage, is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes. Each edge is labeled with a word and a (posterior) probability.
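
The definition of a confusion network above can be checked mechanically. The sketch below is our own and ignores edge weights; the function name `is_confusion_network` and both example graphs are assumptions made for illustration. It relies on the observation that if every start-to-end path visits all nodes, then every node except the end node has exactly one distinct successor node.

```python
def is_confusion_network(edges, start, end):
    """Return True if every start-to-end path visits all nodes.

    `edges` is a list of (from_node, to_node, label) triples. Parallel
    edges between the same pair of nodes are allowed (they are the
    alternatives of a "sausage"); weights are ignored in this sketch.
    """
    successors = {}
    nodes = {start, end}
    for a, b, _label in edges:
        successors.setdefault(a, set()).add(b)
        nodes.update((a, b))

    visited = [start]
    node = start
    while node != end:
        targets = successors.get(node, set())
        if len(targets) != 1:      # dead end, or a path that skips a node
            return False
        node = next(iter(targets))
        if node in visited:        # cycle
            return False
        visited.append(node)
    return len(visited) == len(nodes) and end not in successors


# A sausage: two alternatives between nodes 0 and 1, then one edge.
sausage = [(0, 1, "a"), (0, 1, "b"), (1, 2, "c")]
# A graph where one path (0 -> 1) bypasses the intermediate node 3.
bypass = [(0, 1, "wktb"), (0, 3, "w+"), (3, 1, "ktb"), (1, 2, "Alwld")]

print(is_confusion_network(sausage, 0, 2))  # True
print(is_confusion_network(bypass, 0, 2))   # False
```

Under this strict definition, a graph in which an unsplit word competes with a two-edge split reading is not a confusion network unless the shorter path is padded with an empty (epsilon) edge.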
 +
 +We also think that lattices may introduce additional errors, so it would be good to run two experiments,
 +one with lattices and one without, and compare the results.
 +
 +===== Challenge (2) =====
 +
 +To translate rare and unknown words that are not in the dictionary, the authors use 193 hand-written linguistic rules that cut off affixes and remove inflection. The word obtained after cutting off an affix may be in the dictionary; if it is not, the algorithm tries to apply further rules until it reaches a word that is.
 +
 +The article gives no information about how a rule is selected when several rules are applicable to the same affix. Probably the rules are given a uniform distribution and the choice is left to the language model.
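
The procedure as we understand it can be sketched as repeated affix stripping with a dictionary lookup after each step. This is a minimal sketch under our own assumptions: the four rules in `RULES` are made up for illustration and are not taken from the authors' 193 rules, and the names `strip_once`, `known_stems` and `max_steps` are ours. Since we do not know how competing rules are ranked, the sketch returns every dictionary word it can reach and leaves the choice to a later component.

```python
from collections import deque

# Hypothetical rules for illustration only: (kind, affix).
RULES = [("prefix", "Al"), ("prefix", "b"), ("suffix", "At"), ("suffix", "p")]


def strip_once(word, rule):
    """Apply one rule; return the shortened word, or None if it does not apply."""
    kind, affix = rule
    if len(word) <= len(affix):
        return None
    if kind == "prefix" and word.startswith(affix):
        return word[len(affix):]
    if kind == "suffix" and word.endswith(affix):
        return word[:-len(affix)]
    return None


def known_stems(word, vocabulary, rules=RULES, max_steps=3):
    """Breadth-first search for dictionary words reachable by stripping affixes."""
    if word in vocabulary:
        return [word]
    found = []
    seen = {word}
    queue = deque([(word, 0)])
    while queue:
        current, steps = queue.popleft()
        if steps == max_steps:
            continue
        for rule in rules:
            stem = strip_once(current, rule)
            if stem is None or stem in seen:
                continue
            seen.add(stem)
            if stem in vocabulary:
                found.append(stem)                 # stop stripping here
            else:
                queue.append((stem, steps + 1))    # try more rules
    return found


print(known_stems("bAlktAb", {"ktAb"}))  # ['ktAb']
```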
 +
 +===== Challenge (3) =====
 +
 +The third challenge is to correct spelling mistakes. If a word contains one spelling mistake, the authors try to correct it. They do not remove the original word, however; they only add the corrections they find as alternatives. If a word contains more than one spelling mistake, they do not handle it.
 +
 +It is not clear from the article how exactly the mistakes are corrected, for example
 +mHAd__t__At - mHAd__v__At
 +Do they have rules stating, for example, that the probability of substituting __t__ by __v__ is higher than the probability of substituting __t__ by __a__?
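
One simple reading of "one spelling mistake" is an edit distance of one (a single substitution, insertion or deletion) to a known word. The sketch below follows that reading, which is our assumption and not something the article states; the names `within_one_edit` and `spelling_variants` are ours, and all edits are treated as equally likely, so it does not answer the question about substitution probabilities. The original word is kept and the candidates are added next to it.

```python
def within_one_edit(a, b):
    """True if a and b differ by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                      # make `a` the shorter (or equal) word
    i = 0
    while i < len(a) and a[i] == b[i]:   # skip the common prefix
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]    # one substitution at position i
    return a[i:] == b[i + 1:]            # one extra character in b at position i


def spelling_variants(word, vocabulary):
    """Keep the original word and add known words one edit away."""
    return [word] + sorted(v for v in vocabulary if within_one_edit(word, v))


print(spelling_variants("mHAdtAt", {"mHAdvAt", "ktAb"}))
# ['mHAdtAt', 'mHAdvAt']
```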
 +
 +
 +===== Evaluation =====
 +
 +
 +===== Future work =====
 + 
 +They could handle the prefixes b-, l-, Al- and k- using an approach similar to the one used for the w- prefix.
 +They could take context into account for spelling correction.
 + 
 +
 +   
 +
 +
 +
 +
