Deciphering Foreign Language
Scribe: Ke. T
The talk is about how to tackle machine translation (MT) without parallel training data.
Section 1
Given sentence pairs (e, f), where e is an English sentence and f is a foreign sentence, the translation model estimates the parameter <latex>\theta</latex> such that
<latex>
\mathop {\arg \max }\limits_\theta \prod\limits_{(e,f)} {p_\theta (f|e)}
</latex>
If we do not have parallel data, we observe only foreign text and try to maximize its likelihood:
<latex>
\mathop {\arg \max }\limits_\theta \prod\limits_f {p_\theta (f)}
</latex>
Treating the English translation e and the alignment a as hidden variables, with P(e) given by an English language model trained on monolingual data, our task is to find the parameter <latex>\theta</latex> that maximizes
<latex>
\mathop {\arg \max }\limits_\theta \prod\limits_f {\sum\limits_e {P(e) \times \sum\limits_a {P_\theta (f,a|e)} } }
</latex>
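To make the objective concrete, consider a toy case (illustrative, not from the paper): a single foreign sentence f with only two candidate English translations <latex>e_1</latex> and <latex>e_2</latex>. The inner sum over English sentences then expands to
<latex>
P_\theta (f) = P(e_1 ) \times \sum\limits_a {P_\theta (f,a|e_1 )} + P(e_2 ) \times \sum\limits_a {P_\theta (f,a|e_2 )}
</latex>
and EM raises this likelihood by re-estimating <latex>\theta</latex> from the expected counts of the hidden e and a.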
Section 2
Section 2 deals with a simplified version of translation, Word Substitution Decipherment, where there is a one-to-one mapping between source tokens and cipher tokens (word order does not change).
The formulation of this problem is simple: given a sequence of English tokens <latex>e=e_1,e_2,\ldots,e_n</latex> and the corresponding sequence of cipher tokens <latex>c=c_1,c_2,\ldots,c_n</latex>, we need to estimate the parameter <latex>\theta</latex>:
<latex>
\mathop {\arg \max }\limits_\theta \prod\limits_c {P_\theta (c)} = \mathop {\arg \max }\limits_\theta \prod\limits_c {\sum\limits_e {P(e) \times \prod\limits_{i = 1}^n {P_\theta (c_i |e_i )} } }
</latex>
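To illustrate how <latex>\theta</latex> can be estimated with EM for this objective, here is a minimal Python sketch. It assumes, for simplicity, a unigram English language model P(e) (the paper uses n-gram LMs, in which case the E-step becomes a forward-backward pass over each sentence); the names em_decipher and english_unigram are illustrative, not from the paper.
<code python>
from collections import defaultdict

def em_decipher(cipher_sents, english_unigram, n_iters=20):
    """Minimal EM sketch for word substitution decipherment.

    cipher_sents: list of cipher sentences (lists of cipher tokens).
    english_unigram: dict mapping English words to unigram probabilities P(e).
    Assumes a unigram English LM for simplicity; the paper uses n-gram LMs.
    """
    eng_vocab = list(english_unigram)
    cipher_vocab = {c for sent in cipher_sents for c in sent}

    # Uniform initialization of the channel model P(c | e).
    p_c_given_e = {e: {c: 1.0 / len(cipher_vocab) for c in cipher_vocab}
                   for e in eng_vocab}

    for _ in range(n_iters):
        counts = defaultdict(lambda: defaultdict(float))
        # E-step: posterior over the hidden English word behind each cipher
        # token, P(e | c) proportional to P(e) * P(c | e).
        for sent in cipher_sents:
            for c in sent:
                scores = {e: english_unigram[e] * p_c_given_e[e][c]
                          for e in eng_vocab}
                z = sum(scores.values()) or 1.0
                for e, s in scores.items():
                    counts[e][c] += s / z
        # M-step: renormalize the expected counts into a new channel model.
        for e in eng_vocab:
            total = sum(counts[e].values())
            if total > 0:
                for c in cipher_vocab:
                    p_c_given_e[e][c] = counts[e][c] / total
    return p_c_given_e
</code>
Note that with a unigram LM the posterior over e does not depend on which cipher token it explains, so this simplified model is degenerate; in the paper it is the n-gram LM context that actually drives decipherment, but the E-step/M-step structure is the same.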
The key idea of Section 2 is the Iterative EM algorithm, which estimates <latex>\theta</latex> more efficiently in terms of memory and running time.
If we use traditional EM, every time we update <latex>\theta</latex> we also need to update the pseudo-counts (in this case, the conditional probabilities <latex>P_\theta (c_i |e_i )</latex>), which takes <latex>O(|V|^2)</latex> time. The heart of Iterative EM is that at each iteration the algorithm runs on only a subset of the most frequent words in the vocabulary, and whenever it estimates <latex>P(c_i|e_i) > 0.5</latex>, it fixes that probability to 1 in the following iterations; hence the number of free parameters to be estimated shrinks after each iteration.
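A rough Python sketch of that control flow is below (not the paper's exact algorithm: it reuses the simplified unigram E-step from the previous sketch, and the name iterative_em and the frac schedule are illustrative assumptions).
<code python>
from collections import Counter, defaultdict

def iterative_em(cipher_sents, english_unigram, n_rounds=5, em_iters=10, frac=0.2):
    """Sketch of Iterative EM: each round runs EM over only the most frequent
    cipher words, and any channel probability estimated above 0.5 is fixed
    (treated as 1) in later rounds, so the number of free parameters shrinks.
    The unigram-LM E-step and the scheduling are illustrative only.
    """
    freq = Counter(c for sent in cipher_sents for c in sent)
    fixed = {}  # cipher word -> English word for locked-in pairs
    eng_vocab = list(english_unigram)

    for r in range(1, n_rounds + 1):
        # Work on a growing slice of the most frequent, not-yet-fixed cipher words.
        k = max(1, int(len(freq) * frac * r))
        active = [c for c, _ in freq.most_common(k) if c not in fixed]
        if not active:
            continue

        # Uniform initialization of P(c | e) over the active cipher words only.
        p = {e: {c: 1.0 / len(active) for c in active} for e in eng_vocab}
        for _ in range(em_iters):
            counts = defaultdict(lambda: defaultdict(float))
            # E-step: expected counts, weighting each cipher word by its frequency.
            for c in active:
                scores = {e: english_unigram[e] * p[e][c] for e in eng_vocab}
                z = sum(scores.values()) or 1.0
                for e, s in scores.items():
                    counts[e][c] += freq[c] * s / z
            # M-step: renormalize into a new channel model.
            for e in eng_vocab:
                tot = sum(counts[e].values())
                if tot > 0:
                    for c in active:
                        p[e][c] = counts[e][c] / tot

        # Fix confident decisions: P(c | e) > 0.5 becomes a locked mapping.
        for e in eng_vocab:
            for c in active:
                if p[e][c] > 0.5:
                    fixed[c] = e
    return fixed
</code>
Because fixed pairs drop out of the active set, each later round estimates fewer free parameters, which is where the memory and running-time savings come from.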
Practical questions: How do we initialize EM? How do we start the first iteration?
Some other notes related to this paper:
- Generative story: the generative process assumed to produce the observed data from hidden variables.
- Gibbs sampling: an algorithm that generates a sequence of samples from the joint distribution of two or more random variables by repeatedly resampling each variable conditioned on the current values of the others (a minimal sketch is given below).
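For the Gibbs sampling note, here is a minimal self-contained sketch for two discrete variables (purely illustrative; this is not the sampler used in the paper):
<code python>
import random

def gibbs_sample(joint, n_samples=1000, burn_in=100):
    """Minimal Gibbs sampling sketch for two discrete variables X and Y.

    joint[(x, y)] holds the (possibly unnormalized) joint probability
    P(X=x, Y=y) for every pair. Each step resamples one variable from its
    conditional given the other, which only needs the joint up to a constant.
    """
    xs = sorted({x for x, _ in joint})
    ys = sorted({y for _, y in joint})
    x, y = random.choice(xs), random.choice(ys)

    samples = []
    for t in range(n_samples + burn_in):
        # Sample x ~ P(X | Y = y), proportional to P(X, y).
        x = random.choices(xs, weights=[joint[(xi, y)] for xi in xs])[0]
        # Sample y ~ P(Y | X = x), proportional to P(x, Y).
        y = random.choices(ys, weights=[joint[(x, yi)] for yi in ys])[0]
        if t >= burn_in:
            samples.append((x, y))
    return samples
</code>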