Reading Group Presentation Report from 26.03.2012
- Paper
Distributed Word Clustering for Large Scale Class-Based Language Modeling
in Machine Translation by Jakob Uszkoreit and Thorsten Brants
- Presented by
Long DT
- Report by
Joachim Daiber
A PDF version of this report (with better display of formulas) can be found here.
Overview and Notes from the Paper
Difference between the predictive and the two-sided class-based model
Two-sided class-based model:
$P(w_i \mid w_1^{i-1}) \approx p_0(w_i \mid c(w_i)) \cdot p_1(c(w_i) \mid c(w_{i-n+1}^{i-1}))$
Predictive class-based model:
$P(w_i \mid w_1^{i-1}) \approx p_0(w_i \mid c(w_i)) \cdot p_1(c(w_i) \mid w_{i-n+1}^{i-1})$
The main difference is the use of words instead of classes in the history of $p_1$.
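To make the difference concrete, here is a minimal sketch (not from the paper; the table names and toy numbers are made up) of how the two factorizations would be evaluated from precomputed probability tables:

```python
# Minimal sketch: c maps words to class ids, p0 holds the emission part
# p0(w | c(w)), and p1 holds the class-prediction part with either a
# class history or a word history.

def p_two_sided(w, history, c, p0, p1):
    """P(w | history) ~= p0(w | c(w)) * p1(c(w) | c(history))"""
    hist = tuple(c[v] for v in history)              # map history words to classes
    return p0[(w, c[w])] * p1[(c[w], hist)]

def p_predictive(w, history, c, p0, p1):
    """P(w | history) ~= p0(w | c(w)) * p1(c(w) | history)"""
    return p0[(w, c[w])] * p1[(c[w], tuple(history))]  # history stays word-based

# Toy numbers, purely illustrative:
c = {"the": 0, "cat": 1, "dog": 1}
p0 = {("cat", 1): 0.6, ("dog", 1): 0.4}
p1_cls = {(1, (0,)): 0.3}         # p1(class 1 | preceding class 0)
p1_wrd = {(1, ("the",)): 0.35}    # p1(class 1 | preceding word "the")
print(p_two_sided("cat", ["the"], c, p0, p1_cls))    # 0.6 * 0.30 = 0.18
print(p_predictive("cat", ["the"], c, p0, p1_wrd))   # 0.6 * 0.35 = 0.21
```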
Why does this improve the algorithm?
- improves complexity, easier to distribute
- after a discussion in class which led to no conclusion, we assumed that the choice was empirical
Exchange clustering
- why do we want to use classes at all? sparseness of the data!
- move word from one cluster to another
- recompute perplexity
- choose the exchange that minimizes perplexity (i.e., maximizes training-data likelihood)
- Can one word be assigned to multiple classes? No, this would be fuzzy clustering.
- recomputing everything for each move is too expensive! Improvement: update the value step-by-step and maintain an array of partial results (dynamic programming); see the sketch after this list
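A naive sketch of the exchange loop for a two-sided class bigram model (hypothetical data structures, not the paper's implementation; it recomputes the full criterion for every candidate move, which is exactly the cost the incremental arrays avoid):

```python
import math
from collections import defaultdict

def class_bigram_ll(bigrams, cls):
    """Class-bigram log-likelihood used as the exchange criterion; perplexity is
    a monotone transform of it, so maximizing this minimizes perplexity.
    bigrams: {(w1, w2): count}, cls: {word: class id}."""
    n_cc, n_l, n_r = defaultdict(int), defaultdict(int), defaultdict(int)
    for (w1, w2), n in bigrams.items():
        n_cc[(cls[w1], cls[w2])] += n   # N(c1, c2)
        n_l[cls[w1]] += n               # class counts in the history position
        n_r[cls[w2]] += n               # class counts in the predicted position
    ll = sum(n * math.log(n) for n in n_cc.values())
    ll -= sum(n * math.log(n) for n in n_l.values())
    ll -= sum(n * math.log(n) for n in n_r.values())
    return ll  # word-count terms are constant under exchanges and omitted

def exchange_clustering(vocab, bigrams, num_classes, iterations=5):
    cls = {w: i % num_classes for i, w in enumerate(vocab)}  # arbitrary init
    for _ in range(iterations):
        for w in vocab:
            best_c, best_ll = cls[w], class_bigram_ll(bigrams, cls)
            for c in range(num_classes):            # tentatively move w to every class
                cls[w] = c
                ll = class_bigram_ll(bigrams, cls)  # naive full recomputation
                if ll > best_ll:
                    best_c, best_ll = c, ll
            cls[w] = best_c                         # keep the best exchange
    return cls
```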
Complexity of Exchange Clustering
$O(I \cdot (2 \cdot B + N_v \cdot N_c \cdot (N_c^{\text{pre}} + N_c^{\text{suc}})))$
- $N_c^{\text{pre}}$ and $N_c^{\text{suc}}$ are the average numbers of clusters preceding and succeeding another cluster
- B is the number of distinct bigrams. We maintain an array (dynamic programming); the $2 \cdot B$ term is the cost of keeping this array up to date (the factor 2 comes from handling both the previous and the next word).
- I is the number of iterations
Predictive Exchange Clustering
(for formula see PDF version or paper)
- equations (6)-(10) in the paper show the new way to calculate the perplexity: when moving a word from c to c', the last part of (10), $- \sum_{c \in C} N(c) \cdot \log N(c)$, must be recalculated, while the first part, $\sum_{v \in V,\, c \in \mathrm{suc}(v)} N(v, c) \cdot \log N(v, c)$, can be computed quickly with an additional array
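Putting the two parts together, the quantity maximized by predictive exchange clustering can be written as follows (a reconstruction from the description above; the paper's equations (6)-(10) are authoritative):

```latex
F(C) \;=\; \sum_{v \in V} \sum_{c \in \mathrm{suc}(v)} N(v, c) \cdot \log N(v, c)
      \;-\; \sum_{c \in C} N(c) \cdot \log N(c)
```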
New complexity
$O(I \cdot N_c \cdot (B + N_v))$
- B is the number of distinct bigrams
- I is the number of iterations
- Nc is the number of clusters
- Nv is the size of the vocabulary
The advantage: only two classes are affected when a word is moved from one class to another.
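A small sketch of the resulting cheap delta computation (hypothetical data structures, not the paper's implementation): when a word w moves from c to c', only the terms for those two classes and for w's predecessors change.

```python
import math

def xlogx(n):
    return n * math.log(n) if n > 0 else 0.0

def move_delta(c_old, c_new, n_vc, n_c, n_vw, n_w):
    """Change in the predictive-exchange objective when a word w moves from
    c_old to c_new.
    n_vc[(v, c)]: count of word v followed by a word of class c
    n_c[c]:       total count of class c in the predicted position
    n_vw[v]:      bigram count N(v, w) for each predecessor v of w
    n_w:          total count of w in the predicted position"""
    delta = 0.0
    for v, n in n_vw.items():                            # first sum: N(v,c) log N(v,c)
        for cls, change in ((c_old, -n), (c_new, +n)):
            old = n_vc.get((v, cls), 0)
            delta += xlogx(old + change) - xlogx(old)
    for cls, change in ((c_old, -n_w), (c_new, +n_w)):   # second sum: -N(c) log N(c)
        delta -= xlogx(n_c[cls] + change) - xlogx(n_c[cls])
    return delta
```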
Distributed clustering
- divide vocabulary into subsets
- each subset is given to one worker
- each worker starts from the counts of the previous iteration; workers must synchronize after each iteration (see the sketch below)
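A minimal sketch of the data distribution (the round-robin split is an assumption; the paper does not prescribe a particular partitioning scheme):

```python
def partition_vocab(vocab, num_workers):
    """Split the vocabulary into one shard per worker."""
    shards = [[] for _ in range(num_workers)]
    for i, w in enumerate(sorted(vocab)):
        shards[i % num_workers].append(w)
    return shards

# In every iteration, each worker starts from the same global counts of the
# previous iteration, proposes moves only for words in its own shard, and the
# resulting count changes are merged before the next iteration begins.
```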
Experiments
- experiments were run using the LM in a phrase-based machine translation setup
- computed BLEU score with different LMs (word-based only and class-based with different numbers of clusters, see table (1) in the paper)
- computed BLEU scores for Arabic-English translation with 5-gram predictive class-based model with different inputs (see table (2))
- computed BLEU scores for English-Arabic translation with 5-gram predictive class-based model with different inputs (see table (3))
Conclusion
- the changes to the algorithm yield big performance improvements: the complexity is reduced and the model can be used in a distributed setting
- the model improves the quality of state-of-the-art machine translation
Questions discussed in the Reading Group
- Do the reported improved results come from a setup that also includes a purely word-based model?
- Yes, the class-based and the word-based model are always combined (as two parts of a log-linear model); see the small illustration after this item
- a possible improvement would be to merge the two models in the LM itself, because whether or not to use the class-based model may depend on the history
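As a small illustration (assumed setup; the feature weights are placeholders), the two language models simply enter the decoder's log-linear model as separate features:

```python
def combined_lm_score(log_p_word_lm, log_p_class_lm, w_word=1.0, w_class=1.0):
    """Log-linear combination of the word-based and the class-based LM scores;
    in practice the weights are tuned together with the other features of the
    translation system."""
    return w_word * log_p_word_lm + w_class * log_p_class_lm
```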
- How exactly does the method distribute data/computations to workers?
- before first iteration
- sort words, assign to clusters
- compute counts from clustering
- distribute vocabulary to workers (1/10 each)
- map: each worker takes out 1/3 of its part of the vocabulary (an empirical choice! The algorithm may not converge when this value is > 1/2) and computes updates for it
- there is a tradeoff between the number of iterations and the size of the data you give the workers
- more iterations: workers may have to wait for each other, and there may be overhead in initializing the workers
- reduce: combine the count differences from all workers, sum them up, and map again (see the sketch below)
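A sketch of one such iteration (plain Python, with a hypothetical propose_moves callback standing in for the per-worker exchange computation; the real setup runs the workers in parallel):

```python
import random

def clustering_iteration(shards, global_counts, propose_moves, fraction=1/3):
    """One distributed iteration: every worker sees the same counts from the
    previous iteration, picks a fraction of its shard (1/3 here; values above
    1/2 reportedly may not converge), and returns count deltas for the moves
    it would make. The reduce step sums all deltas into the new global counts."""
    deltas = []
    for shard in shards:                                        # "map" phase
        subset = random.sample(shard, max(1, int(len(shard) * fraction)))
        deltas.append(propose_moves(subset, global_counts))     # worker-local work
    new_counts = dict(global_counts)                            # "reduce" phase
    for delta in deltas:
        for key, d in delta.items():
            new_counts[key] = new_counts.get(key, 0) + d
    return new_counts
```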
Checkpoint questions
- What is the predictive class-based model?
- similar to the two-sided class-based model but with words instead of classes in the history of $p_1$, see above
- Why do we only consider clusterings for which $N(w, c) > 0$ or $N(c, w) > 0$ in the exchange algorithm?
- because only the clusters that satisfy this condition are affected
- What’s better in predictive exchange clustering?
- observation: better results for some data sets
- better complexity
- easier to distribute