====== Alignment by Agreement ======

Percy Liang, Ben Taskar, Dan Klein, [[http://www.seas.upenn.edu/~taskar/pubs/naacl06_aba.pdf|link]]

===== Section 2 -- discussion about previous alignment models =====

IBM Models 1, 2 and the HMM alignment model:
  * all decompose into a product of p_d (distortion probability) and p_t (translation probability); see the model formula at the end of this page
  * p_t is the lexical translation probability, i.e. it describes how likely the French word f_j is to be a translation of the English word e_i
  * p_d is the probability of the French word f_j being aligned to position a_j, given the sentence length and the alignment of the previous (non-null) word; but we **never condition on both** in these models:
    * IBM Model 1 ignores distortion, so p_d is constant (uniform)
    * the HMM alignment model conditions only on the alignment of the previous word
    * IBM Model 2 ignores the alignment of the previous word but takes the sentence length into account (it uses the position relative to the "diagonal" as an indicator, weighted by c)
  * c is a **function** which assigns weights to different distortion values
    * its arguments mean different things depending on the model (see the formula in the top right of page 2)
    * we plotted its values for IBM Model 2 and the HMM
  * general discussion on word alignment
    * the models in Section 2 are asymmetric; they are usually run in both directions and then symmetrized
    * symmetrization is a set of heuristic rules for adding alignment points from the union of both alignment directions (we start with the intersection and //grow// towards a less sparse alignment); a sketch of the grow heuristic is at the end of this page
    * a word alignment can be viewed as a matrix

We can have zero distortion. If the final symmetrization is the intersection, we cannot get an alignment like 1-1 1-2 (and with the joint model introduced in the paper, such an alignment should at least become unlikely). IBM Model 2 has its maximum at zero distortion.

Figure 2 -- a higher threshold gives higher precision and therefore lower 100-precision (= FPR). A small sketch of threshold-based posterior decoding is at the end of this page.

Sum of c over positive distortions minus sum over negative ones? Tested on CzEng: mean(c_HMM) = 1.45; mean(c_IBM2) = 0.4 (disregarding bucketing), -0.01 (after removing the floor from the formula).

===== Main algorithm -- Section 3 =====

Garbage-collector words: if a word is itself rare, whatever is left over tends to get aligned to it.

Instead of a_j, we look for z: a mapping from the space of all alignments to probabilities. This representation is universal -- it can be used for both directions of the alignment.

Define a joint objective: instead of
  log p_1(x; theta_1) + log p_2(x; theta_2)
(the thetas are all the model parameters), use formula (3) of the paper; see the reconstruction at the end of this page. Its extra term is the inner product of the two models' posterior probability vectors (each of which sums to one), so it behaves like a cosine similarity: we want the forward and backward alignments to agree. The latent variable is the alignment.

Apply EM, as formulated in Section 3.2. The updates are, however, not guaranteed to maximize (or even to increase) the value of the objective function in the M step.

We should try the Berkeley aligner. It ignores uppercase letters :)
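
===== Formulas and sketches =====

The following is a reconstruction (from the paper and the notes above, not a verbatim copy) of the decomposition used by the asymmetric models and of the joint agreement objective, the paper's formula (3), together with the product-of-posteriors E-step from Section 3.2:

<code latex>
% Asymmetric models (IBM 1/2, HMM): translation probability times distortion
% probability, never conditioning on both the sentence length and the
% previous alignment at the same time.
p(\mathbf{f}, \mathbf{a} \mid \mathbf{e})
  = \prod_{j=1}^{J} p_d(a_j \mid a_{j-1}, j)\; p_t(f_j \mid e_{a_j})

% Joint "agreement" objective (formula (3)): the two usual log-likelihoods
% plus a term rewarding similar posteriors over the shared alignment z.
\max_{\theta_1, \theta_2} \sum_{\mathbf{x}}
  \Big[ \log p_1(\mathbf{x}; \theta_1) + \log p_2(\mathbf{x}; \theta_2)
        + \log \sum_{\mathbf{z}} p_1(\mathbf{z} \mid \mathbf{x}; \theta_1)\,
                                 p_2(\mathbf{z} \mid \mathbf{x}; \theta_2) \Big]

% E-step of Section 3.2: a single q built from the product of the posteriors.
q(\mathbf{z}; \mathbf{x}) \propto
  p_1(\mathbf{z} \mid \mathbf{x}; \theta_1)\, p_2(\mathbf{z} \mid \mathbf{x}; \theta_2)
</code>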
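
A minimal sketch (not the authors' code) of how the agreement idea can be used at decoding time: multiply the two directional posterior matrices element-wise and keep the cells above a threshold, which is the knob behind the precision trade-off in Figure 2. The function name, the matrices and the threshold value below are made-up examples.

<code python>
import numpy as np

def agreement_decode(post_ef, post_fe, threshold=0.5):
    """Combine two directional posterior matrices into one alignment set.

    post_ef[i, j] ~ posterior that English word i aligns to French word j
    under the first model; post_fe is the same posterior under the second
    model, already transposed so both matrices have shape (len_e, len_f).
    Cells where the product of the two posteriors exceeds the threshold are
    kept as alignment points; a higher threshold trades recall for precision.
    """
    combined = post_ef * post_fe  # element-wise product of posteriors
    return {(int(i), int(j)) for i, j in zip(*np.nonzero(combined > threshold))}

# Toy example with hypothetical 2x3 posterior matrices (values are made up).
p1 = np.array([[0.9, 0.05, 0.05],
               [0.1, 0.80, 0.10]])
p2 = np.array([[0.8, 0.10, 0.10],
               [0.2, 0.70, 0.10]])
print(agreement_decode(p1, p2, threshold=0.5))  # keeps (0, 0) and (1, 1)
</code>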
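
For comparison with the heuristic symmetrization mentioned under Section 2, here is a minimal sketch of the "start from the intersection and grow towards the union" idea. It is a simplified illustration of the grow(-diag) family of heuristics, not the exact rules of any particular toolkit; the function name and the toy alignments are made up.

<code python>
def grow_symmetrize(align_ef, align_fe):
    """Symmetrize two directional alignments given as sets of (e, f) pairs.

    Start from the intersection and repeatedly add neighbouring points from
    the union whose English or French word is not yet aligned -- a simplified
    variant of the grow(-diag) heuristic.
    """
    union = align_ef | align_fe
    alignment = align_ef & align_fe
    neighbours = [(-1, 0), (1, 0), (0, -1), (0, 1),
                  (-1, -1), (-1, 1), (1, -1), (1, 1)]
    added = True
    while added:
        added = False
        for (e, f) in sorted(alignment):
            for (de, df) in neighbours:
                cand = (e + de, f + df)
                if cand in union and cand not in alignment:
                    e_new, f_new = cand
                    e_free = all(ep != e_new for ep, _ in alignment)
                    f_free = all(fp != f_new for _, fp in alignment)
                    if e_free or f_free:
                        alignment.add(cand)
                        added = True
    return alignment

# Toy example: the two directions disagree on one alignment point.
forward  = {(0, 0), (1, 1), (2, 2)}
backward = {(0, 0), (1, 1), (1, 2)}
print(sorted(grow_symmetrize(forward, backward)))
# -> [(0, 0), (1, 1), (1, 2), (2, 2)]
</code>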