# Reading Group Presentation Report from 26.03.2012

- Paper: Distributed Word Clustering for Large Scale Class-Based Language Modeling in Machine Translation by Jakob Uszkoreit and Thorsten Brants
- Presented by: Long DT
- Report by: Joachim Daiber

# Overview and Notes from the Paper

## Difference between the predictive and the two-sided class-based model

Two-sided class-based model: *P*(*w*_{i}∣*w*_{1}^{i - 1}) ≈ *p*_{0}(*w*_{i}∣*c*(*w*_{i}))*p*_{1}(*c*(*w*_{i})∣*c*(*w*_{i - n + 1}^{i - 1}))

Predictive class-based model: *P*(*w*_{i}∣*w*_{1}^{i - 1}) ≈ *p*_{0}(*w*_{i}∣*c*(*w*_{i}))*p*_{1}(*c*(*w*_{i})∣*w*_{i - n + 1}^{i - 1})

The main difference is the use of words instead of classes in the history of *p*_{1}.
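To make the difference concrete, here is a minimal sketch that evaluates both approximations for a bigram history (*n* = 2). All counts, probabilities, and class assignments are made-up toy values, not numbers from the paper:

```python
# Toy class assignment c(w) for a handful of words.
class_of = {"the": 0, "a": 0, "cat": 1, "dog": 1}

# Emission model p0(w | c(w)): probability of w among the words of its class.
p0 = {("cat", 1): 0.6, ("dog", 1): 0.4}

# Two-sided transition model p1(c | c_prev): class given the previous *class*.
p1_two_sided = {(1, 0): 0.5}

# Predictive transition model p1(c | w_prev): class given the previous *word*.
p1_predictive = {(1, "the"): 0.7}

def two_sided(w, w_prev):
    c = class_of[w]
    return p0[(w, c)] * p1_two_sided[(c, class_of[w_prev])]

def predictive(w, w_prev):
    c = class_of[w]
    return p0[(w, c)] * p1_predictive[(c, w_prev)]

print(two_sided("cat", "the"))   # 0.6 * 0.5 = 0.3
print(predictive("cat", "the"))  # 0.6 * 0.7 = 0.42
```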

### Why does this improve the algorithm?

- it improves the complexity of the algorithm and makes it easier to distribute
- after an inconclusive discussion in class, we assumed that the choice was empirical

## Exchange clustering

- why do we want to use classes at all? sparseness of the data!
- move a word from one cluster to another
- recompute the perplexity
- choose the exchange that minimizes the perplexity
- Can one word be assigned to multiple classes? No, this would be fuzzy clustering.
- recomputing from scratch is too expensive! Improvement: recalculate the perplexity step-by-step and maintain an array of counts (dynamic programming); see the sketch below
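A minimal, naive sketch of this loop on a toy corpus (not the paper's implementation): the objective is fully recomputed for every candidate move, which is exactly the cost that the step-by-step recalculation with a count array avoids. The objective is the count-based form of the two-sided class-bigram log-likelihood, with clustering-independent terms dropped:

```python
from collections import Counter
from math import log

corpus = "the cat sat on the mat the dog sat on the mat".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)
vocab = sorted(unigrams)

NUM_CLUSTERS = 3
cluster = {w: i % NUM_CLUSTERS for i, w in enumerate(vocab)}  # initial assignment

def objective(cluster):
    # sum_{c,c'} N(c,c')·log N(c,c')  -  2 * sum_c N(c)·log N(c)
    # (terms that do not depend on the clustering are omitted)
    cc, c_counts = Counter(), Counter()
    for (a, b), n in bigrams.items():
        cc[cluster[a], cluster[b]] += n
    for w, n in unigrams.items():
        c_counts[cluster[w]] += n
    return (sum(n * log(n) for n in cc.values())
            - 2 * sum(n * log(n) for n in c_counts.values()))

for _ in range(5):                      # a few exchange iterations
    for w in vocab:
        best_c, best_score = cluster[w], objective(cluster)
        for cand in range(NUM_CLUSTERS):
            cluster[w] = cand           # tentatively move w to cluster cand
            score = objective(cluster)  # naive full recomputation
            if score > best_score:
                best_c, best_score = cand, score
        cluster[w] = best_c             # keep the best move for w

print(cluster)
```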

## Complexity of Exchange Clustering

*O*(*I* ⋅ (2 ⋅ *B* + *N*_{v} ⋅ *N*_{c} ⋅ (*N*_{c}^{pre} + *N*_{c}^{suc})))

- *N*_{c}^{pre} and *N*_{c}^{suc} are the average numbers of clusters preceding and succeeding another cluster
- *B* is the number of distinct bigrams
- *I* is the number of iterations

We maintain an array of counts (dynamic programming). The 2 ⋅ *B* term in the formula is the cost of maintaining this array (the factor 2 accounts for the previous word and the next word).

## Predictive Exchange Clustering

(for formula see PDF version or paper)

- equations (6)-(10) in the paper show the new way to calculate the perplexity: when moving a word from *c* to *c*′, the last part of (10), − ∑_{c ∈ C} *N*(*c*) ⋅ log *N*(*c*), must be recalculated, while the first part, ∑_{v ∈ V, c ∈ suc(v)} *N*(*v*, *c*) ⋅ log *N*(*v*, *c*), can be quickly calculated with an additional array (see the sketch below)
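A sketch of this incremental computation under assumed data structures (the table names `N_vc`, `N_c`, and `pred_counts` are hypothetical, chosen for the sketch, not from the paper). Only the terms touching the two clusters involved in the move are recomputed:

```python
from math import log

def xlogx(n):
    return n * log(n) if n > 0 else 0.0

def move_delta(w, c_old, c_new, pred_counts, N_vc, N_c):
    """Change in F = sum_{v, c in suc(v)} N(v,c)·log N(v,c) - sum_c N(c)·log N(c)
    if word w moves from cluster c_old to c_new.

    pred_counts maps each predecessor v of w to N(v, w); N_vc holds the
    word-to-cluster counts N(v, c); N_c holds the cluster counts N(c)."""
    delta = 0.0
    for v, n_vw in pred_counts.items():
        # N(v, c_old) loses n_vw occurrences, N(v, c_new) gains them.
        for c, d in ((c_old, -n_vw), (c_new, +n_vw)):
            old = N_vc.get((v, c), 0)
            delta += xlogx(old + d) - xlogx(old)
    n_w = sum(pred_counts.values())  # occurrences of w as a bigram successor
    for c, d in ((c_old, -n_w), (c_new, +n_w)):
        # The second sum enters F with a minus sign.
        delta -= xlogx(N_c[c] + d) - xlogx(N_c[c])
    return delta
```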

## New complexity

*O*(*I* ⋅ *N*_{c} ⋅ (*B* + *N*_{v}))

- *B* is the number of distinct bigrams
- *I* is the number of iterations
- *N*_{c} is the number of clusters
- *N*_{v} is the size of the vocabulary

The advantage: only two classes are affected when a word is moved from one class to another.

## Distributed clustering

- divide the vocabulary into subsets
- each subset is given to one worker
- each worker works with the counts from the previous iteration, so the workers must synchronize after each iteration (see the sketch below)
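A sequential simulation of one distributed iteration, assuming hypothetical helpers `propose_moves` (a worker's map step) and `apply_moves` (the synchronization step); neither name comes from the paper:

```python
def run_iteration(vocab, num_workers, shared_counts, propose_moves, apply_moves):
    # Partition the vocabulary: one subset per worker.
    subsets = [vocab[i::num_workers] for i in range(num_workers)]
    # Map step: each worker sees only the frozen counts from the last iteration
    # and proposes moves for its own subset of the vocabulary.
    all_moves = []
    for subset in subsets:
        all_moves.extend(propose_moves(subset, shared_counts))
    # Synchronization step: apply all proposed moves at once, producing the
    # shared counts for the next iteration.
    return apply_moves(shared_counts, all_moves)
```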

## Experiments

- experiments were run using the LM in a phrase-based machine translation setup
- computed BLEU scores with different LMs (word-based only, and class-based with different numbers of clusters; see table (1) in the paper)
- computed BLEU scores for Arabic-English translation with 5-gram predictive class-based model with different inputs (see table (2))
- computed BLEU scores for English-Arabic translation with 5-gram predictive class-based model with different inputs (see table (3))

## Conclusion

- the changes to the algorithm yield large performance improvements: the complexity is reduced and the model can be used in a distributed setting
- the model improves the quality of state-of-the-art machine translation

# Questions discussed in the Reading Group

- Do the reported improved results come from a system that also includes a purely word-based model?
- Yes, the class-based and the word-based model are always combined (as two parts of a log-linear model)
- an improvement could be to merge the two models within the LM itself, because whether or not to use the class-based model may depend on the history

- How exactly does the method distribute data/computations to workers?
- before the first iteration:
  - sort the words and assign them to clusters
  - compute the counts from this clustering
- distribute the vocabulary to the workers (1/10 each)
- map: each worker takes out 1/3 of its part of the vocabulary and computes updates for it (the fraction is an empirical choice! The algorithm may not converge when this value is > 1/2)
  - there is a tradeoff between the number of iterations and the amount of data given to the workers
  - with more iterations, one may have to wait for workers, and there is overhead in initializing the workers
- reduce: combine the count differences from the workers, sum them up, and map again (see the sketch below)
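A small sketch of the reduce step under assumed data structures: each worker reports the *differences* its moves make to the shared counts as a `Counter`, and the reducer sums these deltas into the shared counts before the next map step. The helper name and all values are toy examples:

```python
from collections import Counter

def reduce_deltas(base_counts, worker_deltas):
    merged = Counter(base_counts)
    for delta in worker_deltas:   # one Counter of count changes per worker
        merged.update(delta)      # Counter.update adds values key by key
    return +merged                # unary + drops entries that fell to zero

base = Counter({("the", 0): 5, ("the", 1): 2})
deltas = [Counter({("the", 0): -1, ("the", 2): 1}),
          Counter({("the", 1): -2, ("the", 2): 2})]
print(reduce_deltas(base, deltas))
# Counter({('the', 0): 4, ('the', 2): 3})
```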


# Checkpoint questions

- What is the predictive class-based model?
- similar to the two-sided class-based model, but with words instead of classes in the history of *p*_{1}; see above

- Why do we only consider clusterings for which N(w, c) > 0 or N(c, w) > 0 in the exchange algorithm?
- because only the clusters satisfying this condition are affected by the exchange; all other terms of the perplexity remain unchanged

- What’s better in predictive exchange clustering?
- observation: better results for some data sets
- better complexity
- easier to distribute