# Reading Group Presentation Report from 26.03.2012

Paper

Distributed Word Clustering for Large Scale Class-Based Language Modeling
in Machine Translation by Jakob Uszkoreit and Thorsten Brants

Presented by

Long DT

Report by

Joachim Daiber

A PDF version of this report (with better display of formulas) is also available.

# Overview and Notes from the Paper

## Difference between the predictive and the two-sided class-based model

Two-sided class-based model:

$$P(w_i \mid w_1^{i-1}) \approx p_0(w_i \mid c(w_i)) \cdot p_1(c(w_i) \mid c(w_{i-n+1}^{i-1}))$$

Predictive class-based model:

$$P(w_i \mid w_1^{i-1}) \approx p_0(w_i \mid c(w_i)) \cdot p_1(c(w_i) \mid w_{i-n+1}^{i-1})$$

The main difference is the use of words instead of classes in the history of $p_1$.
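To make the difference concrete, here is a minimal sketch (not from the paper; the toy corpus, class assignment, and all names are invented) that estimates both factorizations from raw counts for a bigram history ($n = 2$), without any smoothing:

```python
from collections import Counter

# Toy corpus and a fixed, invented word-to-class assignment.
corpus = "the cat sat on the mat the dog sat on the rug".split()
cls = {"the": "DET", "cat": "N", "dog": "N", "mat": "N", "rug": "N",
       "sat": "V", "on": "P"}

unigram = Counter(corpus)
class_count = Counter(cls[w] for w in corpus)
bigram_cc = Counter((cls[a], cls[b]) for a, b in zip(corpus, corpus[1:]))
bigram_wc = Counter((a, cls[b]) for a, b in zip(corpus, corpus[1:]))
hist_w = Counter(corpus[:-1])                   # word histories
hist_c = Counter(cls[w] for w in corpus[:-1])   # class histories

def p0(w):
    """Emission probability p0(w | c(w))."""
    return unigram[w] / class_count[cls[w]]

def p_two_sided(w, prev):
    """p0(w | c(w)) * p1(c(w) | c(prev)): the history is the CLASS of prev."""
    return p0(w) * bigram_cc[(cls[prev], cls[w])] / hist_c[cls[prev]]

def p_predictive(w, prev):
    """p0(w | c(w)) * p1(c(w) | prev): the history is the WORD prev itself."""
    return p0(w) * bigram_wc[(prev, cls[w])] / hist_w[prev]
```

The only change between the two functions is the conditioning event of $p_1$: the surface word `prev` instead of its class.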

### Why does this improve the algorithm?

• improves complexity and makes the algorithm easier to distribute
• after a discussion in class that led to no conclusion, we assumed that the choice was empirical

## Exchange clustering

• why do we want to use classes at all? sparseness of the data!
• move a word from one cluster to another
• recompute the perplexity
• choose the exchange that minimizes perplexity (i.e., maximizes the likelihood of the training data)
• Can one word be assigned to multiple classes? No, this would be fuzzy clustering.
• recomputing from scratch is too expensive! Improvement: recalculate step by step and maintain an array of intermediate results (dynamic programming)
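The loop above can be sketched as follows; this is a deliberately brute-force illustration (no dynamic-programming speed-ups, all function names invented) that scores each candidate exchange by the class-bigram log-likelihood, so maximizing it is the same as minimizing perplexity:

```python
import math
from collections import Counter

def log_likelihood(corpus, assign):
    """Log-likelihood of the corpus under the two-sided class bigram model."""
    cc = Counter((assign[a], assign[b]) for a, b in zip(corpus, corpus[1:]))
    c_hist = Counter(assign[w] for w in corpus[:-1])
    wc = Counter(corpus)
    c_tot = Counter(assign[w] for w in corpus)
    ll = 0.0
    for (c1, c2), n in cc.items():
        ll += n * math.log(n / c_hist[c1])        # p1(c2 | c1)
    for w, n in wc.items():
        ll += n * math.log(n / c_tot[assign[w]])  # p0(w | c(w))
    return ll

def exchange_pass(corpus, assign, classes):
    """One pass: for each word, try every class and keep the best exchange."""
    for w in set(corpus):
        best_c, best_ll = assign[w], log_likelihood(corpus, assign)
        for c in classes:                 # tentatively move w to class c
            assign[w] = c
            ll = log_likelihood(corpus, assign)
            if ll > best_ll:
                best_c, best_ll = c, ll
        assign[w] = best_c                # keep the best move found
    return assign
```

A real implementation would update only the affected counts incrementally instead of recomputing `log_likelihood` from scratch; that is exactly the dynamic-programming trick the bullet list mentions.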

## Complexity of Exchange Clustering

$$O(I \cdot (2 \cdot B + N_v \cdot N_c \cdot (N_c^{pre} + N_c^{suc})))$$

• $N_c^{pre}$ and $N_c^{suc}$ are the average numbers of clusters preceding and succeeding another cluster
• $B$ is the number of distinct bigrams. We maintain an array (dynamic programming); the $2 \cdot B$ term is the cost of maintaining this array (the factor 2 accounts for the previous word and the next word).
• $I$ is the number of iterations

## Predictive Exchange Clustering

• equations (6)–(10) in the paper demonstrate the new way to calculate the perplexity: when moving a word from $c$ to $c'$, the last part of (10), $-\sum_{c \in C} N(c) \cdot \log N(c)$, must be recalculated, while the first part, $\sum_{v \in V,\, c \in suc(v)} N(v, c) \cdot \log N(v, c)$, can be quickly updated with an additional array
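A sketch of that bookkeeping, for the objective $F = \sum_{v,c} N(v,c) \log N(v,c) - \sum_c N(c) \log N(c)$: when a word moves between classes, only the terms touching the source and target class change, so the delta can be computed from those counts alone. This is a reconstruction, not the paper's code, and all names (`move_delta`, `pred_counts`, …) are invented:

```python
import math

def xlogx(n):
    """n * log(n), with the usual 0 * log(0) = 0 convention."""
    return n * math.log(n) if n > 0 else 0.0

def move_delta(w, c_from, c_to, N_wc, N_c, pred_counts):
    """Change in the objective F if word w moves from c_from to c_to.

    pred_counts maps each predecessor v of w to N(v, w), i.e. how often
    v precedes w; those bigram counts shift from class c_from to c_to.
    """
    delta = 0.0
    for v, n in pred_counts.items():
        old_f, old_t = N_wc.get((v, c_from), 0), N_wc.get((v, c_to), 0)
        delta += xlogx(old_f - n) - xlogx(old_f)   # first sum, source class
        delta += xlogx(old_t + n) - xlogx(old_t)   # first sum, target class
    n_w = sum(pred_counts.values())                # total mass that moves
    delta -= xlogx(N_c[c_from] - n_w) - xlogx(N_c[c_from])  # second sum
    delta -= xlogx(N_c[c_to] + n_w) - xlogx(N_c[c_to])      # (note the minus)
    return delta
```

Because only the two classes involved appear in the delta, an exchange step is cheap; this is the advantage noted in the complexity section.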

## New complexity

$$O(I \cdot N_c \cdot (B + N_v))$$

• $B$ is the number of distinct bigrams
• $I$ is the number of iterations
• $N_c$ is the number of clusters
• $N_v$ is the size of the vocabulary

The advantage: only two classes are affected when a word is moved from one class to another.

## Distributed clustering

• divide vocabulary into subsets
• each subset is given to one worker
• each worker has counts from the previous iteration, workers must synchronize after each iteration

## Experiments

• experiments were run using the LM in a phrase-based machine translation setup
• computed BLEU score with different LMs (word-based only and class-based with different numbers of clusters, see table (1) in the paper)
• computed BLEU scores for Arabic-English translation with 5-gram predictive class-based model with different inputs (see table (2))
• computed BLEU scores for English-Arabic translation with 5-gram predictive class-based model with different inputs (see table (3))

## Conclusion

• the changes to the algorithm yield large performance improvements: the complexity is reduced and the model can be used in a distributed setting
• the model improves the quality of state-of-the-art machine translation

# Questions discussed in the Reading Group

• Do the reported improved results use a model that also includes a purely word-based model?
  • Yes, the class-based and the word-based model are always combined (as two parts of a log-linear model)
  • a possible improvement would be to merge the two models in the LM itself, because whether or not to use the class-based model may depend on the history
• How exactly does the method distribute data/computations to workers?
  • before the first iteration:
    • sort the words and assign them to initial clusters
    • compute counts from the clustering
    • distribute the vocabulary to the workers (1/10 each)
  • map: each worker takes 1/3 of its part of the vocabulary and computes updates for it (this fraction is an empirical choice! the algorithm may not converge when this value is > 1/2)
  • there is a trade-off between the number of iterations and the amount of data given to the workers: with more iterations, you may need to wait for workers, and there is overhead in initializing them
  • reduce: combine the count differences from the workers, sum them up, and map again
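The steps above can be sketched as one map/reduce round; this is a simplified illustration (all names invented) that exchanges proposed cluster moves rather than the raw count differences the discussion describes:

```python
def map_worker(words, assign, propose_move, fraction=1/3):
    """Map step: each worker proposes moves for `fraction` of its words
    against the frozen counts of the previous iteration. The fraction is
    an empirical choice; values above 1/2 may prevent convergence."""
    k = max(1, int(len(words) * fraction))
    deltas = {}
    for w in words[:k]:
        new_c = propose_move(w, assign)   # best class under frozen counts
        if new_c != assign[w]:
            deltas[w] = new_c
    return deltas

def reduce_step(assign, all_deltas):
    """Reduce step: merge the workers' proposed moves into the shared
    assignment, which is then redistributed for the next map round."""
    merged = dict(assign)
    for deltas in all_deltas:
        merged.update(deltas)
    return merged
```

The synchronization point mentioned above corresponds to `reduce_step`: no worker can start the next iteration until the merged assignment (and its counts) is available.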

# Checkpoint questions

• What is the predictive class-based model?
  • similar to the two-sided class-based model, but with words instead of classes in the history of $p_1$ (see above)
• Why do we only consider clusterings for which N(w, c) > 0 or N(c, w) > 0 in the exchange algorithm?
  • because only the clusters that satisfy this condition are affected by the exchange
• What is better about predictive exchange clustering?
  • observation: better results for some data sets
  • better complexity
  • easier to distribute
