Reading Group Presentation Report from 26.03.2012
- Paper
Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation by Jakob Uszkoreit and Thorsten Brants
- Presented by
Long DT
- Report by
Joachim Daiber
Overview and Notes from the Paper
Difference between the predictive and the two-sided class-based model
Two-sided class-based model:
P(w_i ∣ w_1^{i-1}) ≈ p_0(w_i ∣ c(w_i)) ⋅ p_1(c(w_i) ∣ c(w_{i-n+1}^{i-1}))
Predictive class-based model:
P(w_i ∣ w_1^{i-1}) ≈ p_0(w_i ∣ c(w_i)) ⋅ p_1(c(w_i) ∣ w_{i-n+1}^{i-1})
The main difference is the use of words instead of classes for the history of p_1.
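To make the factorization concrete, here is a minimal sketch of a predictive class-based bigram model (the n = 2 case) in Python, using a hypothetical toy corpus, a hand-made clustering and unsmoothed relative frequencies; the paper works with higher-order n-grams and properly estimated distributions, so this only illustrates the p_0 ⋅ p_1 decomposition.

    from collections import defaultdict

    # Toy data: a hypothetical corpus and a hand-made clustering (not from the paper).
    corpus = "the cat sat on the mat the dog sat on the rug".split()
    cluster = {"the": 0, "cat": 1, "dog": 1, "mat": 1, "rug": 1, "sat": 2, "on": 3}

    word_count = defaultdict(int)        # N(w)      : w as a bigram successor
    class_count = defaultdict(int)       # N(c(w))   : class of the successor
    hist_count = defaultdict(int)        # N(w_{i-1}): w as a history word
    hist_class_count = defaultdict(int)  # N(w_{i-1}, c(w_i))

    for prev, curr in zip(corpus, corpus[1:]):
        word_count[curr] += 1
        class_count[cluster[curr]] += 1
        hist_count[prev] += 1
        hist_class_count[(prev, cluster[curr])] += 1

    def predictive_prob(w, prev):
        """P(w_i | w_{i-1}) ≈ p_0(w_i | c(w_i)) * p_1(c(w_i) | w_{i-1})."""
        c = cluster[w]
        p0 = word_count[w] / class_count[c]                  # p_0(w | c(w))
        p1 = hist_class_count[(prev, c)] / hist_count[prev]  # p_1(c(w) | w_{i-1})
        return p0 * p1

    print(predictive_prob("cat", "the"))  # 0.25 with these toy counts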
Why does this improve the algorithm?
- improves complexity, easier to distribute
- after an inconclusive discussion in class, we assumed that the choice was empirical
Exchange clustering
- why do we want to use classes at all? sparseness of the data!
- move word from one cluster to another
- recompute perplexity
- choose the exchange that minimizes perplexity (i.e., maximizes the likelihood of the training data)
- Can one word be assigned to multiple classes? No, this would be fuzzy clustering.
- recomputing the perplexity from scratch for every candidate move is too expensive! Improvement: update it incrementally and maintain an array of counts (dynamic programming); see the sketch after this list
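The following is a minimal sketch of the exchange loop, assuming a hypothetical helper objective_delta(word, c_from, c_to) that returns the change in the optimization criterion (in a real implementation it would consult the maintained count arrays, which would also be updated after each accepted move); the point here is the control flow, not the scoring.

    def exchange_clustering(vocabulary, num_classes, objective_delta, max_iterations=20):
        """Greedy exchange clustering skeleton (hard clustering: one class per word)."""
        # initial assignment, e.g. round-robin over the (frequency-sorted) vocabulary
        cluster = {w: i % num_classes for i, w in enumerate(vocabulary)}
        for _ in range(max_iterations):
            moved = 0
            for w in vocabulary:
                c_old = cluster[w]
                best_c, best_delta = c_old, 0.0
                # try moving w to every other class, keep the best improving move
                for c_new in range(num_classes):
                    if c_new == c_old:
                        continue
                    delta = objective_delta(w, c_old, c_new)
                    if delta > best_delta:
                        best_c, best_delta = c_new, delta
                if best_c != c_old:
                    cluster[w] = best_c   # the count arrays would be updated here
                    moved += 1
            if moved == 0:                # converged: no exchange improves the objective
                break
        return cluster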
Complexity of Exchange Clustering
O(I ⋅ (2 ⋅ B + N_v ⋅ N_c ⋅ (N_c^{pre} + N_c^{suc})))
- N_c^{pre} and N_c^{suc} are the average numbers of clusters preceding and succeeding another cluster, respectively
- B is the number of distinct bigrams. We maintain an array (dynamic programming); in the formula, the 2 ⋅ B term is the cost of maintaining this array (2 because of the previous word and the next word)
- I is the number of iterations
Predictive Exchange Clustering
(for formula see PDF version or paper)
- equations (6)-(10) in the paper demonstrate the new way to calculate the perplexity: when moving a word from c to c′, the last part of (10), −∑_{c ∈ C} N(c) ⋅ log N(c), must be recalculated, while the first part, ∑_{v ∈ V, c ∈ suc(v)} N(v, c) ⋅ log N(v, c), can be quickly updated with an additional array
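A minimal sketch of this incremental update, assuming hypothetical count dictionaries N_vc (predecessor word followed by a word of class c), N_c (class as a successor) and pred_counts (bigram counts ending in the moved word); it shows why only the predecessors of the moved word and the two classes involved have to be touched.

    from math import log

    def xlogx(x):
        """x * log(x) with the convention 0 * log(0) = 0."""
        return x * log(x) if x > 0 else 0.0

    def move_delta(w, c_old, c_new, pred_counts, N_vc, N_c):
        """Change in F(C) = sum_{v,c} N(v,c)*log N(v,c) - sum_c N(c)*log N(c)
        when word w is moved from class c_old to class c_new.

        pred_counts: predecessor word v -> bigram count N(v, w)
        N_vc:        (v, c) -> count of v followed by any word of class c
        N_c:         c -> count of class c as a successor
        """
        n_w = sum(pred_counts.values())   # occurrences of w as a bigram successor
        delta = 0.0
        # first sum: only N(v, c_old) and N(v, c_new) change, for predecessors v of w
        for v, n_vw in pred_counts.items():
            delta -= xlogx(N_vc.get((v, c_old), 0)) + xlogx(N_vc.get((v, c_new), 0))
            delta += xlogx(N_vc.get((v, c_old), 0) - n_vw) + xlogx(N_vc.get((v, c_new), 0) + n_vw)
        # second sum: only N(c_old) and N(c_new) change -- only two classes are affected
        delta += xlogx(N_c[c_old]) + xlogx(N_c[c_new])
        delta -= xlogx(N_c[c_old] - n_w) + xlogx(N_c[c_new] + n_w)
        return delta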
New complexity
O(I ⋅ N_c ⋅ (B + N_v))
- B is the number of distinct bigrams
- I is the number of iterations
- N_c is the number of clusters
- N_v is the size of the vocabulary
The advantage: only two classes are affected when a word is moved from one class to another.
Distributed clustering
- divide vocabulary into subsets
- each subset is given to one worker
- each worker has the counts from the previous iteration; workers must synchronize after each iteration (see the sketch after this list)
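A minimal sketch of one distributed iteration, assuming a hypothetical helper best_class(word, counts, cluster) that picks the best target class for a word given the frozen counts of the previous iteration; for brevity only the per-class totals are updated here, whereas the full algorithm also maintains the word-class bigram counts.

    from collections import Counter

    def map_worker(vocab_shard, cluster, counts, word_counts, best_class):
        """Each worker processes its own subset of the vocabulary and emits count deltas."""
        deltas = Counter()
        new_assignments = {}
        for w in vocab_shard:
            c_old, c_new = cluster[w], best_class(w, counts, cluster)
            if c_new != c_old:
                new_assignments[w] = c_new
                deltas[c_old] -= word_counts[w]   # mass leaving the old class
                deltas[c_new] += word_counts[w]   # mass entering the new class
        return new_assignments, deltas

    def reduce_and_sync(results, cluster, counts):
        """Synchronization step: merge the workers' deltas into the global state."""
        for new_assignments, deltas in results:
            cluster.update(new_assignments)
            for c, d in deltas.items():
                counts[c] += d
        return cluster, counts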
Experiments
- experiments were run using the LM in a phrase-based machine translation setup
- computed BLEU score with different LMs (word-based only and class-based with different numbers of clusters, see table (1) in the paper)
- computed BLEU scores for Arabic-English translation with 5-gram predictive class-based model with different inputs (see table (2))
- computed BLEU scores for English-Arabic translation with 5-gram predictive class-based model with different inputs (see table (3))
Conclusion
- the changes to the algorithm yield large performance improvements: the complexity is reduced and the clustering can be run in a distributed setting
- the model improves the quality of state-of-the-art machine translation
Questions discussed in the Reading Group
- Do the reported improved results come from a model that also includes a purely word-based model?
- Yes, the class-based and the word-based model are always combined (as two parts of a log-linear model)
- a possible improvement would be to merge the two models within the LM itself, because whether or not to use the class-based model may depend on the history
- How exactly does the method distribute data/computations to workers?
- before first iteration
- sort words, assign to clusters
- compute counts from clustering
- distribute vocabulary to workers (1/10 each)
- map: each worker takes out 1/3 of its part of the vocabulary (this is an empirical choice! The algorithm may not converge when this fraction is > 1/2) and computes updates for it
- there is a tradeoff between the number of iterations and the size of the data you give the workers
- more iterations: you may need to wait for workers, and there may be overhead in initializing the workers
- reduce: combine the count differences from the workers, sum them up, then map again (see the sketch after this list)
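The loop just described could look roughly like the following sketch, assuming hypothetical helpers map_worker(shard, cluster, counts) and reduce_counts(results, cluster, counts); the fraction of each worker's vocabulary processed per map step (1/3 here) is the empirical choice mentioned above.

    import random

    def run_distributed_clustering(vocabulary, num_workers, iterations,
                                   map_worker, reduce_counts, cluster, counts,
                                   fraction=1 / 3):
        """Driver loop: shard the vocabulary, then alternate map and reduce steps."""
        # before the first iteration: give every worker 1/num_workers of the vocabulary
        shards = [vocabulary[i::num_workers] for i in range(num_workers)]
        for _ in range(iterations):
            results = []
            for shard in shards:                   # in reality the workers run in parallel
                sample = random.sample(shard, max(1, int(len(shard) * fraction)))
                results.append(map_worker(sample, cluster, counts))
            # reduce: combine the count differences, then the next map step starts
            cluster, counts = reduce_counts(results, cluster, counts)
        return cluster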
Checkpoint questions
- What is the predictive class-based model?
- similar to the two-sided class-based model, but with words instead of classes in the history of p_1, see above
- Why do we only consider clusterings for which N(w, c) > 0 or N(c, w) > 0 in the exchange algorithm?
- because only the clusters that satisfy this condition are affected
- What’s better in predictive exchange clustering?
- observation: better results for some data sets
- better complexity
- easier to distribute