====== Overcoming Vocabulary Sparsity in MT Using Lattices ======
===== Introduction =====
  
The article is about overcoming the problem of vocabulary sparsity in SMT. The sparsity occurs because many words can be inflected or take different affixes, while the parallel training data might not contain all those forms.
The authors of the article introduce three problems and their methods to overcome these challenges:
  
1. common stems are fragmented into many different forms in training data;
2. rare and unknown words are frequent in test data;
3. spelling variation creates additional sparseness problems.
  
To solve the indicated problems, the authors modify the aligned bilingual training and test data.
> I think the training data are modified only in the first challenge/task (common stems). -MP-
  
The strong side of the proposed approaches is that these techniques work for large training data.

==== Remarks by Martin Kirschner ====
  * The experiments and evaluation are done on the Arabic-English translation pair.
  * Section 2 of the paper: //Many of the above works use morphological toolkits, while in this work we explore lightweight techniques that use the parallel data as the main source of information. We are able to combine both linguistic and statistical sources of knowledge and then train the system to select which information it will use at decoding time.// - What is the difference between morphological toolkits and linguistic and statistical sources of knowledge? Is that the reason why they don't use a lemmatizer (as mentioned above) - to keep the approach more lightweight?
  
===== Challenge 1 =====
We are not absolutely sure about the terminology of the article.
In mathematics, a lattice is a partially ordered set in which any two elements have a unique supremum (the elements' least upper bound, called their join) and an infimum (their greatest lower bound, called their meet).
> "Lattice" is an overloaded term. There are at least [[http://en.wikipedia.org/wiki/Lattice_%28mathematics%29|three definitions in mathematics]], but the lattice used in automatic speech recognition and machine translation is still something different -- a connected directed acyclic graph, with edges labelled by words and possibly weighted.
> A [[http://www.statmt.org/moses/?n=Moses.ConfusionNetworks|confusion network]] is a special case (or simplification) of a lattice, where each path from the start node to the end node goes through all the other nodes. -MP-

The lattice in Figure 1(b) seems to have a direction, so it might be a confusion network rather than a lattice.
> Some paths are longer than other paths in 1(b), so it is not a confusion network as I defined it above. -MP-

A Confusion Network (CN), also known as a sausage, is a weighted directed graph with the peculiarity that each path from the start node to the end node goes through all the other nodes. Each edge is labeled with a word and a (posterior) probability.
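
To make the distinction concrete, here is a minimal Python sketch (our own illustration, not from the paper): a lattice represented as a directed acyclic graph with word-labelled edges, plus a check of the confusion-network property. The example edges are a hypothetical w- splitting in the spirit of Figure 1.

<code python>
# A word lattice as a DAG with word-labelled, weighted edges, plus a check of
# the confusion-network property (every start-to-end path visits all nodes).
from collections import defaultdict

class Lattice:
    def __init__(self):
        self.edges = defaultdict(list)   # node -> [(next_node, word, weight)]

    def add_edge(self, src, dst, word, weight=1.0):
        self.edges[src].append((dst, word, weight))

    def paths(self, start, end):
        """Yield (word sequence, visited nodes) for every start-to-end path."""
        if start == end:
            yield [], {end}
            return
        for dst, word, _ in self.edges[start]:
            for words, nodes in self.paths(dst, end):
                yield [word] + words, {start} | nodes

def is_confusion_network(lat, start, end):
    all_nodes = {start} | set(lat.edges) | \
                {dst for outs in lat.edges.values() for dst, _, _ in outs}
    return all(nodes == all_nodes for _, nodes in lat.paths(start, end))

# Hypothetical w- splitting: the split form and the original word form are
# kept as alternative paths in the lattice.
lat = Lattice()
lat.add_edge(0, 1, "w-")
lat.add_edge(1, 2, "AlktAb")
lat.add_edge(0, 2, "wAlktAb")            # skips node 1, so path lengths differ
print(is_confusion_network(lat, 0, 2))   # False -- exactly MP's point about 1(b)
</code>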
  
Also, we think that lattices may introduce additional errors, and it is computationally more expensive to work with lattices than with plain strings, so it would be good to run two experiments, one with lattices and one without, and compare the results.
  
It is not really clear why they don't use a lemmatizer instead of splitting. The number of rules they might need for dealing with all the affixes, along with the w- prefix, might be the same as if they wrote a lemmatizer.
> In addition to a lemmatizer, they could also use a morphological analyzer (they don't have to do the disambiguation; they can store multiple tags (+ probabilities) for one word in the lattice). I see the advantage of this approach in the possibility of using [[http://www.aclweb.org/anthology/D/D07/D07-1091.pdf|factored translation models]]. -MP-
  
===== Challenge 2 =====
  
To translate rare and unknown words that are not in the dictionary, the authors use 193 hand-written linguistic rules describing how to cut off affixes and get rid of inflection. The word we get after cutting off an affix might be in the dictionary; if not, the algorithm will try to apply more rules until it reaches a word that is in the dictionary.
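
The following is a minimal Python sketch of how we understand this loop; the rules and the vocabulary are invented toy examples, since the article does not list the actual 193 rules.

<code python>
# Iteratively rewrite an out-of-vocabulary word until a known form appears.
# RULES and vocab are invented for illustration; the paper's rules are
# linguistically informed and Arabic-specific.
RULES = [
    ("prefix", "wAl", "Al"),   # drop the conjunction w- before the article Al-
    ("suffix", "At",  "p"),    # feminine plural -At -> feminine singular -p
    ("suffix", "yn",  ""),     # plural/dual suffix -yn -> bare stem
]

def recover_known_forms(word, vocabulary, max_steps=3):
    """Return in-vocabulary forms derived from `word` by applying the rules."""
    candidates = {word}
    for _ in range(max_steps):
        derived = set()
        for w in candidates:
            for kind, pattern, repl in RULES:
                if kind == "prefix" and w.startswith(pattern):
                    derived.add(repl + w[len(pattern):])
                elif kind == "suffix" and w.endswith(pattern):
                    derived.add(w[:len(w) - len(pattern)] + repl)
        known = {w for w in derived if w in vocabulary}
        if known:
            return known   # these become alternative edges in the lattice
        candidates |= derived
    return set()

vocab = {"ktAb", "mktbp"}
print(recover_known_forms("mktbAt", vocab))   # {'mktbp'}
</code>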
  
There is no information in the article about how a rule is selected in case there are several suitable rules for one affix. Probably they use a uniform distribution over the rules and leave it to the language model to choose one.
  
  
===== Challenge 3 =====
  
The third challenge is to correct spelling mistakes. If the word has one spelling mistake, they try to correct it. But they don't remove the original word; they just add the found options. If the word has more than one spelling mistake, they do not deal with it.
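
A minimal Python sketch of one possible reading (our assumption; the article does not specify the mechanism): generate all strings within edit distance 1 of the unknown word and keep those found in the vocabulary.

<code python>
# Candidates within edit distance 1 of an unknown word, filtered by the
# vocabulary; the original word is kept and the hits are only added as
# alternatives, as the article describes.
import string

ALPHABET = string.ascii_lowercase   # stand-in for the Arabic alphabet

def edits1(word):
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {l + r[1:] for l, r in splits if r}
    transposes = {l + r[1] + r[0] + r[2:] for l, r in splits if len(r) > 1}
    replaces = {l + c + r[1:] for l, r in splits if r for c in ALPHABET}
    inserts = {l + c + r for l, r in splits for c in ALPHABET}
    return deletes | transposes | replaces | inserts

def typo_alternatives(word, vocabulary):
    """In-vocabulary words exactly one edit away from `word`."""
    return sorted(w for w in edits1(word) if w in vocabulary and w != word)

vocab = {"government", "governor"}
print(typo_alternatives("goverment", vocab))   # ['government']
</code>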
  
It is not clear from the article how exactly they correct the mistakes, for example
===== Evaluation =====
  
The typo-correction BLEU score on the newswire development set is 54.4, which is lower than the baseline in Table 5. So it would be better to add an additional evaluation line: all features without typo correction. This could show whether typo correction really helps to improve the final result. Or at least they should have provided a human evaluation, e.g. although the BLEU score for a sentence without typo correction may be the same as for the same sentence with typo correction, the sentence looks better to a human when the typo is corrected.
  
They aligned their data using the LEAF alignment method. We discussed whether it would be possible to produce the same alignment with GIZA++ but came to the conclusion that it is not.
They can work with the prefixes b-, l-, Al- and k- using a similar approach as for the w- prefix.
They can look at the context for spelling correction.
  
