Institute of Formal and Applied Linguistics Wiki

courses:rg:2012:jodaiberreport (joda, 2012/03/26 18:51, current)
</dd>
</dl>

<p>A PDF version of this report (with better display of formulas) can be found <a href="http://jodaiber.de/prague/rg_report.pdf">here</a>.</p>
<h1 id="overview-and-notes-from-the-paper">Overview and Notes from the Paper</h1>
<h2 id="difference-between-predictive-and-two-side-class-based-model">Difference between predictive and two-side class-based model</h2>
<p>The main difference is the use of words instead of classes for the history of <span class="math"><em>p</em><sub>1</sub></span>.</p>
<h3 id="why-does-this-improve-the-algorithm">Why does this improve the algorithm?</h3>
<ul style="color: black;">
<li>improves complexity and makes the computation easier to distribute</li>
<li>after a discussion in class which led to no conclusion, we assumed that the choice was empirical</li>
</ul>
<h2 id="exchange-clustering">Exchange clustering</h2>
<ul style="color: black;">
<li>why do we want to use classes at all? sparseness of the data!</li>
<li>move a word from one cluster to another</li>
<li>recompute perplexity</li>
<li>choose the exchange that minimizes perplexity</li>
<li>Can one word be assigned to multiple classes? No, this would be fuzzy clustering.</li>
<li>too complicated! Improvement: recalculate step by step, maintaining an array (dynamic programming)</li>
</ul>
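<p>The exchange procedure in the bullets above can be sketched as follows. This is a minimal illustration under assumptions, not the paper's implementation: the objective is recomputed from scratch after every tentative move, whereas the real algorithm maintains count arrays and updates the objective incrementally.</p>

```python
import math
from collections import defaultdict

def neg_log_likelihood(corpus, assign):
    """Objective the exchange algorithm minimizes (a perplexity proxy):
    negative log-likelihood of a two-sided class bigram model.
    Naive full recomputation; the paper updates this incrementally."""
    n_w = defaultdict(int)    # N(w): word unigram counts
    n_c = defaultdict(int)    # N(c): class unigram counts
    n_cc = defaultdict(int)   # N(c1, c2): class bigram counts
    for tok in corpus:
        n_w[tok] += 1
        n_c[assign[tok]] += 1
    for prev, cur in zip(corpus, corpus[1:]):
        n_cc[(assign[prev], assign[cur])] += 1
    nll = 0.0
    for prev, cur in zip(corpus, corpus[1:]):
        # p(w | c(w)) * p(c(w) | c(w_prev)), maximum-likelihood estimates
        p = (n_w[cur] / n_c[assign[cur]]) * \
            (n_cc[(assign[prev], assign[cur])] / n_c[assign[prev]])
        nll -= math.log(p)
    return nll

def exchange_clustering(corpus, num_clusters, iterations=5):
    vocab = sorted(set(corpus))
    # initial clustering: round-robin assignment of words to clusters
    assign = {w: i % num_clusters for i, w in enumerate(vocab)}
    best = neg_log_likelihood(corpus, assign)
    for _ in range(iterations):
        moved = False
        for w in vocab:                       # each word belongs to exactly one cluster
            home = assign[w]
            for cand in range(num_clusters):  # tentatively move it to every other cluster
                if cand == home:
                    continue
                assign[w] = cand
                score = neg_log_likelihood(corpus, assign)
                if score < best:              # keep only exchanges that improve the objective
                    best, home, moved = score, cand, True
                else:
                    assign[w] = home
        if not moved:                         # converged: no helpful exchange left
            break
    return assign
```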
<h2 id="complexity-of-exchange-clustering">Complexity of Exchange Clustering</h2>
<p><br /><span class="math"><em>O</em>(<em>I</em> ⋅ (2 ⋅ <em>B</em> + <em>N</em><sub><em>v</em></sub> ⋅ <em>N</em><sub><em>c</em></sub> ⋅ (<em>N</em><sub><em>c</em></sub><sup><em>p</em><em>r</em><em>e</em></sup> + <em>N</em><sub><em>c</em></sub><sup><em>s</em><em>u</em><em>c</em></sup>)))</span><br /></p>
<ul style="color: black;">
<li><span class="math"><em>N</em><sub><em>c</em></sub><sup><em>p</em><em>r</em><em>e</em></sup></span> and <span class="math"><em>N</em><sub><em>c</em></sub><sup><em>s</em><em>u</em><em>c</em></sup></span> are the average numbers of clusters preceding and succeeding another cluster</li>
<li><span class="math"><em>B</em></span> is the number of distinct bigrams. We maintain an array (dynamic programming). The formula contains <span class="math">2 ⋅ <em>B</em></span> as the cost of maintaining this array (2 because of the previous word and the next word).</li>
<li><span class="math"><em>I</em></span> is the number of iterations</li>
</ul>
<h2 id="predicitive-exchange-clustering">Predictive Exchange Clustering</h2>
<p><br /><span class="math"><em>P</em>(<em>w</em><sub><em>i</em></sub>|<em>w</em><sub>1</sub><sup><em>i</em> − 1</sup>) ≈ <em>p</em><sub>0</sub>(<em>w</em><sub><em>i</em></sub>|<em>c</em>(<em>w</em><sub><em>i</em></sub>)) ⋅ <em>p</em><sub>1</sub>(<em>c</em>(<em>w</em><sub><em>i</em></sub>)|<em>w</em><sub><em>i</em> − 1</sub>) = <em>N</em>(<em>w</em><sub><em>i</em></sub>)/<em>N</em>(<em>c</em>(<em>w</em><sub><em>i</em></sub>)) ⋅ <em>N</em>(<em>w</em><sub><em>i</em> − 1</sub>, <em>c</em>(<em>w</em><sub><em>i</em></sub>))/<em>N</em>(<em>w</em><sub><em>i</em> − 1</sub>)</span><br /></p>
<ul style="color: black;">
<li>equations (6)-(10) in the paper demonstrate the new way to calculate the perplexity: when moving a word from <em>c</em> to <em>c</em>’, the last part of (10) ( <span class="math"> - ∑ <sub><em>c</em> ∈ <em>C</em></sub><em>N</em>(<em>c</em>) ⋅ log<em>N</em>(<em>c</em>)</span> ) must be recalculated; the first part ( <span class="math">∑ <sub><em>v</em> ∈ <em>V</em>, <em>c</em> ∈ <em>s</em><em>u</em><em>c</em>(<em>v</em>)</sub><em>N</em>(<em>v</em>, <em>c</em>) ⋅ log<em>N</em>(<em>v</em>, <em>c</em>)</span> ) can be quickly calculated with an additional array</li>
</ul>
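<p>Under the predictive model, a conditional probability can be read directly off corpus counts. The sketch below is an unsmoothed illustration under assumed count definitions (a production LM would smooth and handle unseen events):</p>

```python
from collections import defaultdict

def build_predictive_model(corpus, assign):
    """Estimate p(w | w_prev) ≈ p0(w | c(w)) * p1(c(w) | w_prev)
                              = N(w)/N(c(w)) * N(w_prev, c(w))/N(w_prev)
    from raw counts. `assign` maps each word to its class.
    Unsmoothed maximum-likelihood estimates, for illustration only."""
    n_w = defaultdict(int)    # N(w): word unigram counts
    n_c = defaultdict(int)    # N(c): class unigram counts
    n_wc = defaultdict(int)   # N(w_prev, c): word-to-class bigram counts
    for tok in corpus:
        n_w[tok] += 1
        n_c[assign[tok]] += 1
    for prev, cur in zip(corpus, corpus[1:]):
        n_wc[(prev, assign[cur])] += 1

    def prob(prev, cur):
        p0 = n_w[cur] / n_c[assign[cur]]            # p0(w | c(w))
        p1 = n_wc[(prev, assign[cur])] / n_w[prev]  # p1(c(w) | w_prev)
        return p0 * p1

    return prob
```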
<h2 id="new-complexity">New complexity</h2>
<p><br /><span class="math"><em>O</em>(<em>I</em> ⋅ <em>N</em><sub><em>c</em></sub> ⋅ (<em>B</em> + <em>N</em><sub><em>v</em></sub>))</span><br /></p>
<ul style="color: black;">
<li><span class="math"><em>B</em></span> is the number of distinct bigrams</li>
<li><span class="math"><em>I</em></span> is the number of iterations</li>
<li><span class="math"><em>N</em><sub><em>c</em></sub></span> is the number of clusters</li>
<li><span class="math"><em>N</em><sub><em>v</em></sub></span> is the size of the vocabulary</li>
</ul>
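<p>To get a feel for the gap between the two bounds, one can plug in sizes. Every number below is an illustrative assumption, not a figure from the paper:</p>

```python
# Plugging assumed sizes into the two complexity bounds.
I = 20                 # iterations (assumed)
N_v = 100_000          # vocabulary size (assumed)
N_c = 512              # number of clusters (assumed)
B = 5_000_000          # distinct bigrams (assumed)
N_pre = N_suc = 300    # avg. clusters preceding/succeeding a cluster (assumed)

two_sided = I * (2 * B + N_v * N_c * (N_pre + N_suc))  # O(I*(2B + N_v*N_c*(N_pre+N_suc)))
predictive = I * N_c * (B + N_v)                       # O(I*N_c*(B + N_v))

print(f"two-sided exchange:  {two_sided:.1e} operations")
print(f"predictive exchange: {predictive:.1e} operations")
print(f"ratio: {two_sided / predictive:.0f}x")
```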
<p>The advantage: only two classes are affected by moving a word from one class to another.</p>
<h2 id="distributed-clustering">Distributed clustering</h2>
<ul style="color: black;">
<li>divide the vocabulary into subsets</li>
<li>each subset is given to one worker</li>
<li>each worker has the counts from the previous iteration; the workers must synchronize after each iteration</li>
</ul>
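<p>The three bullets above amount to one synchronous map/reduce-style round per iteration. A minimal sketch under assumptions: the <code>propose_move</code> callback is a hypothetical stand-in for the per-word exchange computation, and the workers are simulated sequentially rather than run in parallel.</p>

```python
def one_distributed_iteration(vocab, assign, propose_move, num_workers=4):
    """One synchronous iteration of distributed exchange clustering.

    Each worker owns a slice of the vocabulary, proposes moves for its
    words against the (stale) assignments from the previous iteration,
    and the resulting deltas are combined before the next iteration.
    A sketch of the scheme, not the paper's implementation."""
    # divide the vocabulary into subsets, one per worker
    shards = [vocab[i::num_workers] for i in range(num_workers)]

    # "map": each worker computes updates for its shard independently
    all_deltas = []
    for shard in shards:
        deltas = {}
        for word in shard:
            target = propose_move(word)        # hypothetical per-word exchange step
            if target is not None and target != assign[word]:
                deltas[word] = target
        all_deltas.append(deltas)

    # synchronize: merge every worker's deltas into the shared assignment
    for deltas in all_deltas:
        assign.update(deltas)
    return assign
```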
<h2 id="experiments">Experiments</h2>
<ul style="color: black;">
<li>experiments were run using the LM in a phrase-based machine translation setup</li>
<li>computed BLEU scores with different LMs (word-based only and class-based with different numbers of clusters, see table (1) in the paper)</li>
<li>computed BLEU scores for Arabic-English translation with a 5-gram predictive class-based model with different inputs (see table (2))</li>
<li>computed BLEU scores for English-Arabic translation with a 5-gram predictive class-based model with different inputs (see table (3))</li>
</ul>
<h2 id="conclusion">Conclusion</h2>
<ul style="color: black;">
<li>the changes to the algorithm bring big performance improvements: the complexity is reduced and the model can be used in a distributed setting</li>
<li>the model improves the quality of state-of-the-art machine translation</li>
</ul>
<h1 id="questions-discussed-in-the-reading-group">Questions discussed in the Reading Group</h1>
<ul style="color: black;">
<li><p>Do the reported improved results use a model that includes an only word-based model?</p>
<ul style="color: black;">
<li>Yes, the class-based and word-based models are always combined (two parts in a log-linear model)</li>
<li>an improvement could be to merge the two models in the LM itself, because whether or not to use the class-based model may depend on the history</li>
</ul></li>
<li><p>How exactly does the method distribute data/computations to workers?</p>
<ul style="color: black;">
<li><p>before the first iteration</p>
<ul style="color: black;">
<li>sort the words, assign them to clusters</li>
<li>compute counts from the clustering</li>
</ul></li>
<li><p>distribute the vocabulary to the workers (1/10 each)</p>
<ul style="color: black;">
<li>map: each worker takes out 1/3 (this is an empirical choice! It may not converge when this value is &gt; 1/2) of the vocabulary and computes updates for it</li>
<li><p>there is a tradeoff between the number of iterations and the size of the data you give the workers</p>
<ul style="color: black;">
<li>more iterations: you may need to wait for workers, and there may be overhead in initializing the workers</li>
</ul></li>
<li>reduce: combine the count differences from the workers, sum them up, map again</li>
</ul></li>
</ul></li>
</ul>
<h1 id="checkpoint-questions">Checkpoint questions</h1>
<ul style="color: black;">
<li><p>What is the predictive class-based model?</p>
<ul style="color: black;">
<li>similar to the two-side class-based model, but with a word instead of a class in the history of <span class="math"><em>p</em><sub>1</sub></span>, see above</li>
</ul></li>
<li><p>Why do we only consider clusterings for which N(w, c) &gt; 0 or N(c, w) &gt; 0 in the exchange algorithm?</p>
<ul style="color: black;">
<li>because only the clusters that satisfy this condition are affected</li>
</ul></li>
<li><p>What’s better in predictive exchange clustering?</p>
<ul style="color: black;">
<li>observation: better results for some data sets</li>
<li>better complexity</li>
<li>easier to distribute</li>
</ul></li>
</ul>
  
</html>
