The list of discussed topics follows the outline of the paper:

==== Sec. 2. Related Work ====
**Differences from Carpuat 2009**
  * It is different: the decoder just gets additional features, but the decision is up to it -- Carpuat 2009 just post-edits the outputs and substitutes the most likely variant everywhere
    * Using Carpuat 2009's approach directly in the decoder would influence neighboring words through the LM, so even using it in the decoder rather than as post-editing would lead to a different outcome

**Human translators and one sense per discourse**
  * The paper suggests that modelling human translators amounts to modelling one sense per discourse -- this is suspicious
    * The authors do not state their evidence clearly.
    * One sense is not the same as one translation

==== Sec. 3. Exploratory analysis ====
**Hiero**
  * The idea would most probably work just as well in normal phrase-based SMT, but the authors use hierarchical phrase-based translation (Hiero)
    * Hiero is summarized in Fig. 1: the phrases may contain non-terminals (''X'', ''X1'' etc.), which leads to a probabilistic CFG and bottom-up parsing (see the sketch below)
  * The authors chose the ''cdec'' implementation of Hiero (the model is implemented in several systems: Moses, cdec, Joshua etc.)
    * The choice was probably arbitrary; other systems would yield similar results
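
A minimal sketch of what a rule with a non-terminal slot looks like and how it is applied bottom-up. This is our own toy illustration, not cdec code; the rule, the French example and the ''apply_rule'' helper are made up:

<code python>
# Toy illustration of a Hiero-style synchronous rule: both sides share
# numbered non-terminal slots (X1, X2, ...), which makes the rule set a
# synchronous CFG and translation a bottom-up parsing problem.
from dataclasses import dataclass

@dataclass
class Rule:
    src: tuple  # source side, e.g. ("ne", "X1", "pas")
    tgt: tuple  # target side, e.g. ("do", "not", "X1")

def apply_rule(rule, fillers):
    """Fill the X1, X2, ... slots of the target side with sub-derivations."""
    out = []
    for tok in rule.tgt:
        if tok.startswith("X") and tok[1:].isdigit():
            out.extend(fillers[int(tok[1:]) - 1])  # recurse into the slot
        else:
            out.append(tok)
    return out

# French negation "ne X1 pas" -> "do not X1"; X1 was translated bottom-up
neg = Rule(src=("ne", "X1", "pas"), tgt=("do", "not", "X1"))
print(apply_rule(neg, [["smoke"]]))  # -> ['do', 'not', 'smoke']
</code>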

**Forced decoding**
  * This means that the decoder is given the source //and// the target sentence and has to provide the rules/phrases that map the source to the target (see the sketch below)
    * The decoder might be unable to find the appropriate rules (e.g. for unseen words)
    * It is a different decoder mode, for which the decoder must be adjusted
    * Forced decoding is much more informative for Hiero translations than for "plain" phrase-based ones, since there are many different parse trees that yield the same target string, but not nearly as many phrase segmentations
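
A conceptual sketch of the constraint, not the actual cdec implementation (the derivation representation below is made up):

<code python>
# Forced decoding, conceptually: keep only derivations whose yield equals
# the given reference translation. With Hiero, several distinct derivation
# trees can produce the same string, so the surviving set is informative.

def forced_decode(derivations, reference):
    """`derivations`: iterable of (rules_used, output_tokens) pairs."""
    matching = [rules for rules, output in derivations if output == reference]
    return matching or None  # None: no rule sequence reproduces the reference

derivs = [
    (["R1", "R3"], ["he", "goes", "home"]),
    (["R2", "R5"], ["he", "goes", "home"]),
    (["R1", "R4"], ["he", "walks", "home"]),
]
print(forced_decode(derivs, ["he", "goes", "home"]))  # two derivations match
</code>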

**The choice and filtering of "cases"**
  * The "cases" in Table 1 are selected according to the //possibility// of different translations (i.e. each case has at least two translations of the source seen in the training data; the translation counts come from the test data, so it is OK that e.g. "Korea" translates as "Korea" all the time)
  * Table 1 is unfiltered -- only some of the "cases" are then considered relevant:
    * Cases that are //too similar// (fewer than 1/2 of the characters differ) are //joined together//
      * Beware, this notion of grouping is not well-defined, it does not create equivalence classes: "old hostages" = "new hostages" = "completely new hostages", but "old hostages" != "completely new hostages" (we hope this didn't actually happen; see the check below)
    * Cases where //only one translation variant prevails// are //discarded// (this is the case of "Korea")
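
A quick check of the non-transitivity. The notes only say "less than 1/2 of the characters differ", so we read "differ" as Levenshtein distance relative to the longer string -- an assumption, the paper's exact measure may be different:

<code python>
# Pairwise joining by edit distance does not define equivalence classes:
# the relation is not transitive, as the "hostages" example shows.

def levenshtein(a, b):
    """Standard dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def joined(a, b):
    return levenshtein(a, b) < max(len(a), len(b)) / 2

a, b, c = "old hostages", "new hostages", "completely new hostages"
print(joined(a, b), joined(b, c), joined(a, c))  # True True False
</code>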

==== Sec. 4. Approach ====
The actual experiments begin only now; the data used is different.

**Choice of features**
  * They define 3 features that are designed to be biased towards consistency -- or are they?
    * If e.g. two variants are each used 2 times, they will have roughly the same score
  * The BM25 function is a refined version of the [[http://en.wikipedia.org/wiki/TF-IDF|TF-IDF]] score (see the sketch below)
  * The exact parameter values are probably not tuned, just left at defaults (and maybe they don't have much influence anyway)
  * See NPFL103 for details on Information Retrieval; it's largely black magic
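
A sketch of the textbook BM25 weighting with the usual default parameters (''k1=1.2'', ''b=0.75''); the paper's exact variant and parameter values are not given in the notes, so take this as an assumption:

<code python>
import math

def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """BM25 refines TF-IDF: term frequency saturates and is normalized
    by document length, instead of growing linearly as in plain TF-IDF."""
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# Doubling the term frequency less than doubles the score:
print(bm25(tf=1, df=10, n_docs=1000, doc_len=100, avg_doc_len=120))
print(bm25(tf=2, df=10, n_docs=1000, doc_len=100, avg_doc_len=120))
</code>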

**Feature weights**
  * The usual model in MT scores the hypotheses according to the feature values (''f'') and their weights (''lambda''):
    * ''score(H) = exp( sum( lambda_i * f_i(H) ) )''
  * The feature weights are trained on a held-out data set using [[http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf|MERT]] (or, here: [[http://en.wikipedia.org/wiki/Margin_Infused_Relaxed_Algorithm|MIRA]])
  * The resulting weights are not mentioned in the paper, but if a weight came out negative, would this favor //different// translation choices? (see the sketch below)
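
A minimal sketch of this log-linear scoring; the feature values and weights below are made up, only to illustrate that a negative weight on a consistency feature would indeed push the decoder away from consistent choices:

<code python>
import math

def score(features, weights):
    """score(H) = exp( sum( lambda_i * f_i(H) ) )"""
    return math.exp(sum(l * f for l, f in zip(weights, features)))

weights = [0.8, -0.3]       # hypothetical [LM weight, consistency weight]
consistent = [1.0, 2.0]     # hypothesis reusing an earlier translation
inconsistent = [1.0, 0.0]   # hypothesis introducing a new variant
print(score(consistent, weights) < score(inconsistent, weights))  # True
</code>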

**Meaning of the individual features**
  * C1 indicates that a certain Hiero rule was used frequently
    * But rules are very similar to each other, so we also need something less fine-grained
  * C2 is a target-side feature; it just counts the target-side tokens (only the "most important" ones, in terms of TF-IDF)
    * It may be compared to Language Model features, but it is trained only on the target part of the bilingual training data.
  * C3 counts occurrences of source-target token pairs (and again uses the "most important" term pair for each rule); a rough sketch of all three counts follows below
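
This is our own paraphrase of the notes, not the paper's exact feature definitions; the input format (one ''(rule_id, src_term, tgt_term)'' triple per applied rule) is an assumption:

<code python>
from collections import Counter

def consistency_counts(first_pass):
    """Collect the three consistency statistics over one document's
    first-pass translation, given the "most important" (highest-TF-IDF)
    source and target terms of each applied rule."""
    c1 = Counter(rule for rule, _, _ in first_pass)          # C1: rule usage
    c2 = Counter(tgt for _, _, tgt in first_pass)            # C2: target term usage
    c3 = Counter((src, tgt) for _, src, tgt in first_pass)   # C3: term-pair usage
    return c1, c2, c3

doc = [("r1", "pistole", "gun"),
       ("r2", "pistole", "gun"),
       ("r3", "pistole", "pistol")]
c1, c2, c3 = consistency_counts(doc)
print(c2["gun"], c3[("pistole", "pistol")])  # 2 1
</code>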
  
**Requirements of the new features**
  * They need two passes through the data (see the sketch below)
  * You need to have document segmentation
    * Since the frequencies are trained on the training set, you can just translate one document at a time; there is no need to have full sets of documents
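
A sketch of the two-pass setup this implies; ''decode'' and ''extract_counts'' are placeholders for a real decoder and the statistics collection, not actual APIs:

<code python>
def translate_document(sentences, decode, extract_counts):
    """Translate one document with consistency features: pass 1 collects
    the statistics, pass 2 re-decodes with the extra features active."""
    first_pass = [decode(s, counts=None) for s in sentences]
    counts = extract_counts(first_pass)
    return [decode(s, counts=counts) for s in sentences]
</code>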
