===== Deciphering Foreign Language =====

Scriber: Ke. T

The talk is about how to tackle MT without parallel training data.

==== Section 1 ====
Given sentence pairs (e,f) where e is an English sentence and f is a foreign sentence, the translation model estimates parameter 
<latex>\theta</latex> such that
<latex>
\mathop {\arg \max }\limits_\theta  \prod\limits_\theta  {p_\theta  (f|e)} 
</latex>

In case we do not have parallel data, we observe foreign text and try to maximize likelihood 
<latex>
\mathop {\arg \max }\limits_\theta  \prod\limits_f {p_\theta  (f)} 
</latex>

Treating English translation as hidden alignment, our task is to find the parameter <latex>\theta</latex> that
<latex>
\mathop {\arg \max }\limits_\theta  \prod\limits_f {\sum\limits_e {P(e) \times \sum\limits_a {P_\theta  (f,a|e)} } } 
</latex>

==== Section 2 ====
Section 2 deals with a simple version of translation, Word Substitution Decipherment, where there is only one-to-one mapping between source string and cipher string (the position of string does not change.)

The solution for this problem is pretty simple: Given a sequence of English tokens <latex>e=e_1,e_2,...,e_n</latex>, and the corresponding sequence of cipher tokens <latex>c=c_1,c_2,...,c_n</latex>, we need to estimate parameter <latex>\theta</latex>:
<latex>
\mathop {\arg \max }\limits_\theta  \prod\limits_c {P_\theta  (c)}  = \mathop {\arg \max }\limits_\theta  \prod\limits_c {\sum\limits_e {P(e) \times \prod\limits_{i = 1}^n {P_\theta  (c_i |e_i )} } } 
</latex>

The key idea of section 2 is the Iterative EM algorithm, which is used to estimate <latex>\theta</latex> more effectively in term of saving memory and running time (complexity).

If we use traditional EM, every time we update <latex>\theta</latex>, we also need to update pseudo-counts (in this case, conditional probabilities <latex>{P_\theta  (c_i |e_i )</latex>.) It leads to O(|V|<sup>2</sup>) time. The heart of Iterative EM is that at every iteration, the algorithm run on a proportion of the most frequent words in vocabulary, and whenever the algorithm estimates <latex>P(c_i|e_i) > 0.5 </latex>, it fixes that probability equal to 1 in the following iteration, hence, the number of free parameters need to be estimated reduce after each iteration.

__**Practical questions:**__ How to initiate EM? How to start the first iteration?

**Some other notes related to this paper:**
  - Generative story: The generative process that generates data given some hidden variables.
  - [[http://en.wikipedia.org/wiki/Chinese_restaurant_process|Chinese restaurant process]]
  - Gibbs sampling:  an algorithm that generates a sequence of samples from the joint probability distribution of two or more random variables.

Why did they experiment with Temporal expression corpus? This corpus has relatively small word types, it makes easier to compare Iterative EM with full EM.

==== Section 3 ====
Not many details of this section was presented, however, there are few discussions around this.

How to choose the best translation? After finishing parameter estimation, pick the final sample and extract the corresponding English translations for every for- eign sentence. This yields the final decipherment output.

Given another text (which is not in training data), how to translate it? Use MLE to find the best translation from the model.

==== Conclusion ====
This is an interesting paper, however, there is a lot of maths behind.