[[http://aclweb.org/anthology-new/N/N12/N12-1046.pdf|PDF]]
  
===== Outline -- discussion =====
The list of discussed topics follows the outline of the paper:
==== Sec. 2. Related Work ====
**Differences from Carpuat 2009**
  * The approach is different: here the decoder just gets additional features and the final decision is left to it, whereas Carpuat 2009 post-edits the output and substitutes the most likely variant everywhere
    * Applying Carpuat 2009's substitution directly inside the decoder would influence neighboring words through the LM, so even using it in the decoder rather than as post-editing leads to a different outcome

**Human translators and one sense per discourse**
  * The paper suggests that modelling human translators amounts to modelling one sense per discourse -- this is suspicious
    * The authors do not state their evidence clearly.
    * One sense is not the same as one translation

==== Sec. 3. Exploratory analysis ====
**Hiero**
  * The idea would most probably work just as well in standard phrase-based SMT, but the authors use hierarchical phrase-based translation (Hiero)
    * Hiero is summarized in Fig. 1: the phrases may contain non-terminals (''X'', ''X1'' etc.), which leads to a probabilistic synchronous CFG and bottom-up parsing (see the toy example below)
  * The authors chose the ''cdec'' implementation of Hiero (Hiero itself is implemented in several systems: Moses, cdec, Joshua etc.)
    * The choice was probably arbitrary; other systems would likely yield similar results
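
A toy illustration of how Hiero rules with a shared non-terminal gap compose bottom-up (the rules and their representation here are made up for illustration, not taken from the paper; a real system of course scores many competing derivations):

<code python>
# Toy Hiero-style synchronous rules: the non-terminal X1 is a gap shared
# between the source and the target side of a rule.
def apply_rule(rule, filler=("", "")):
    """Substitute a sub-translation for the shared non-terminal X1 on both sides."""
    src, tgt = rule
    return src.replace("X1", filler[0]), tgt.replace("X1", filler[1])

# Bottom-up composition: first "translate" the embedded word with a lexical rule,
# then plug the result into the gap of the larger rule.
inner = apply_rule(("veux", "want"))                   # X -> <veux ||| want>
outer = apply_rule(("ne X1 pas", "do not X1"), inner)  # X -> <ne X1 pas ||| do not X1>
print(outer)                                           # ('ne veux pas', 'do not want')
</code>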

**Forced decoding**
  * This means that the decoder is given the source //and// the target sentence and has to provide the rules/phrases that map the source onto the target
    * The decoder might be unable to find appropriate rules (e.g. for unseen words)
    * It is a different mode of operation, for which the decoder must be adjusted
    * Forced decoding is much more informative for Hiero than for "plain" phrase-based translation, since there are many different parse trees that yield the same target string, but not nearly as many phrase segmentations (a minimal sketch of the idea follows below)
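
A minimal sketch of the idea behind forced decoding, simplified to a toy monotone phrase-based setting (our own simplification, not the paper's Hiero setup; the phrase table is made up): find a segmentation of the source whose phrase translations concatenate exactly to the given target.

<code python>
# Toy forced decoding for a monotone phrase-based model (a simplification for illustration).
phrase_table = {
    ("la",): ["the"],
    ("la", "maison"): ["the house"],
    ("maison",): ["house", "home"],
    ("bleue",): ["blue"],
}

def force_decode(src, tgt):
    """Return one derivation (list of phrase pairs) mapping src exactly onto tgt, or None."""
    if not src:
        return [] if not tgt else None
    for length in range(1, len(src) + 1):
        for translation in phrase_table.get(tuple(src[:length]), []):
            if tgt == translation or tgt.startswith(translation + " "):
                rest = force_decode(src[length:], tgt[len(translation):].lstrip())
                if rest is not None:
                    return [(tuple(src[:length]), translation)] + rest
    return None

print(force_decode("la maison bleue".split(), "the house blue"))  # derivation found
print(force_decode("la maison bleue".split(), "the blue house"))  # None: would need reordering
</code>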

**The choice and filtering of "cases"**
  * The "cases" in Table 1 are selected according to the //possibility// of different translations (i.e. each case has at least two translations of the source seen in the training data); the translation counts shown are from the test data, so it is fine that e.g. "Korea" translates as "Korea" all the time
  * Table 1 is unfiltered -- only some of the "cases" are then considered relevant:
    * Cases that are //too similar// (fewer than half of the characters differ) are //joined together//
      * Beware: this notion of grouping is not well-defined and does not create equivalence classes -- "old hostages" = "new hostages" = "completely new hostages", but "old hostages" != "completely new hostages" (we hope this didn't actually happen; see the sketch below)
    * Cases where //only one translation variant prevails// are //discarded// (this is the case of "Korea")
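
A quick check that a pairwise "fewer than half of the characters differ" heuristic is not transitive, so joining similar cases does not yield well-defined equivalence classes (the exact distance measure and threshold are our guess at what the paper means; Levenshtein distance is assumed here):

<code python>
# Sketch: pairwise similarity by "fewer than half of the characters differ" is not transitive.
def levenshtein(a, b):
    """Plain dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def similar(a, b):
    return levenshtein(a, b) < max(len(a), len(b)) / 2.0

print(similar("old hostages", "new hostages"))             # True
print(similar("new hostages", "completely new hostages"))  # True
print(similar("old hostages", "completely new hostages"))  # False -> not an equivalence relation
</code>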

==== Sec. 4. Approach ====
The actual experiments begin only here; the data used is different from the exploratory analysis.

**Choice of features**
  * The authors define 3 features that are designed to be biased towards consistency -- but are they really?
    * If e.g. two variants are each used 2 times, they will get roughly the same score
  * The BM25 function is a refined version of the [[http://en.wikipedia.org/wiki/TF-IDF|TF-IDF]] score (see the sketch below)
  * The exact parameter values are probably not tuned but left at defaults (and maybe they don't have much influence anyway)
  * See NPFL103 for details on Information Retrieval; it's largely black magic
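
For reference, a sketch of the standard Okapi BM25 term weighting that refines plain TF-IDF (generic formula with the usual default parameters; not necessarily the exact variant or parameter values used in the paper):

<code python>
import math

def bm25(tf, df, doc_len, avg_doc_len, n_docs, k1=1.2, b=0.75):
    """Okapi BM25 term weight: TF-IDF with term-frequency saturation (k1) and
    document-length normalization (b); k1 and b are left at common default values."""
    idf = math.log(1 + (n_docs - df + 0.5) / (df + 0.5))
    tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf_part

# A term seen 3x in a 100-token document, occurring in 10 out of 1000 documents:
print(bm25(tf=3, df=10, doc_len=100, avg_doc_len=120, n_docs=1000))
</code>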

**Feature weights**
  * The usual model in MT scores the hypotheses according to the feature values (''f'') and their weights (''lambda''):
    * ''score(H) = exp( sum_i( lambda_i * f_i(H) ) )'' (see the sketch below)
  * The feature weights are trained on a held-out data set using [[http://acl.ldc.upenn.edu/acl2003/main/pdfs/Och.pdf|MERT]] (or, here, [[http://en.wikipedia.org/wiki/Margin_Infused_Relaxed_Algorithm|MIRA]])
  * The resulting weights are not mentioned -- if a weight came out < 0, would the feature actually favor //different// translation choices?
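
The log-linear scoring mentioned above as a two-line sketch (the feature values and weights are hypothetical):

<code python>
import math

def loglinear_score(feature_values, weights):
    """score(H) = exp(sum_i lambda_i * f_i(H)); decoders usually compare the log scores."""
    return math.exp(sum(w * f for w, f in zip(weights, feature_values)))

# Hypothetical hypothesis with a LM, a translation-model and a consistency feature:
print(loglinear_score([-12.3, -4.7, 2.0], [0.5, 0.3, 0.1]))
</code>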

**Meaning of the individual features**
  * C1 indicates that a certain Hiero rule was used frequently
    * but many rules are very similar, so something less fine-grained is needed as well
  * C2 is a target-side feature that just counts target-side tokens (only the "most important" ones, in terms of TF-IDF)
    * It may be compared to Language Model features, but it is trained only on the target part of the bilingual tuning data.
  * C3 counts occurrences of source-target token pairs (again using only the "most important" term pair for each rule); a rough counting sketch follows below
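
A rough sketch of the counting behind the three features as we understood them (the data structures, the selection of the "most important" terms and any normalization are our guesses, not the paper's exact definitions):

<code python>
from collections import Counter, namedtuple

# Hypothetical, simplified view of one Hiero rule used in the first-pass output.
Rule = namedtuple("Rule", "id best_source_term best_target_term")

def collect_counts(first_pass_derivations):
    """C1 ~ rule usage counts, C2 ~ counts of the "most important" target term of a rule,
    C3 ~ counts of the "most important" source-target term pair of a rule."""
    c1, c2, c3 = Counter(), Counter(), Counter()
    for derivation in first_pass_derivations:   # one list of used rules per sentence
        for rule in derivation:
            c1[rule.id] += 1
            c2[rule.best_target_term] += 1
            c3[(rule.best_source_term, rule.best_target_term)] += 1
    return c1, c2, c3

# Toy document: two sentences, both translated using the same made-up rule.
doc = [[Rule(1, "garantie", "guarantee")], [Rule(1, "garantie", "guarantee")]]
print(collect_counts(doc))
</code>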

**Requirements of the new features**
  * They need two passes through the data (see the sketch below)
  * You need to have document segmentation
    * Since the frequencies are obtained on the tuning set (see Sec. 5), you can just translate one document at a time; there is no need to have the full set of documents
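
A toy illustration of the two-pass requirement (everything here is made up just to show the mechanism: a fake word-level "decoder" that, in the second pass, adds a small bonus to translation choices already used in the first pass of the same document):

<code python>
from collections import Counter

# Hypothetical baseline scores of translation candidates for one source word.
candidates = {"garantie": {"guarantee": -1.0, "warranty": -1.05}}

def decode(sentence, counts=None, consistency_weight=0.1):
    """Pick the best candidate per word; with counts, add a document-consistency bonus."""
    counts = counts or Counter()
    output = []
    for w in sentence:
        options = candidates.get(w, {w: 0.0})
        output.append(max(options, key=lambda t: options[t] + consistency_weight * counts[t]))
    return output

document = [["garantie"], ["garantie"]]
first_pass = [decode(s) for s in document]                    # pass 1: baseline features only
counts = Counter(tok for sent in first_pass for tok in sent)  # document-level counts
second_pass = [decode(s, counts) for s in document]           # pass 2: + consistency features
print(first_pass, second_pass)
</code>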

==== Sec. 5. Evaluation and Discussion ====
**Choice of baseline**
  * The baselines are quite strong and competitive, so we believe this really is an improvement
  * MIRA is very much cutting-edge

**Tuning the feature weights**
  * For the 1st phase, "heuristically" probably means they just used some reasonable values, e.g. from earlier experiments
    * This is done to speed up the experiments; they don't want to wait for MIRA twice.

**The usage of two variants of the BLEU evaluation metric**
  * The two BLEU variants do not differ that much, only in the brevity penalty with multiple references
    * IBM BLEU uses the reference that is closest to the MT output in length, while NIST BLEU uses the shortest one (see the sketch below)
  * This was probably done for technical reasons, e.g. their optimization software was designed for one metric and not the other
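
The difference between the two brevity-penalty conventions in a few lines (only the choice of the reference length differs; the rest of BLEU is identical):

<code python>
import math

def brevity_penalty(hyp_len, ref_lens, variant="ibm"):
    """BP = min(1, exp(1 - r/c)): IBM BLEU picks the reference length r closest to the
    hypothesis length c, NIST's mteval picks the shortest reference length."""
    if variant == "ibm":
        r = min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    else:  # "nist"
        r = min(ref_lens)
    return min(1.0, math.exp(1.0 - r / hyp_len))

# Hypothesis of 24 tokens, references of 19 and 25 tokens: the variants disagree.
print(brevity_penalty(24, [19, 25], "ibm"))   # ~0.96 (closest reference length is 25)
print(brevity_penalty(24, [19, 25], "nist"))  # 1.0  (shortest reference length is 19)
</code>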

**Performance**
  * Adding any single feature improves the performance, and the combination of all of them is even better
    * "Most works in MT just achieve a 1 BLEU point improvement, and then they're ready to be published" :-)
  * There are no significance tests!
    * Even a 1.0 BLEU improvement need not be significant if there are a lot of changes (see the sketch below)
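
Significance would be easy to check with paired bootstrap resampling (Koehn 2004); a rough sketch, parametrized by any corpus-level metric function of the form ''metric(hypotheses, references)'' (the dummy metric in the usage example is just for illustration):

<code python>
import random

def paired_bootstrap(metric, sys_a, sys_b, refs, n_samples=1000):
    """Resample the test sentences with replacement and count how often system A
    beats system B on the resampled test sets (Koehn 2004)."""
    n = len(refs)
    wins = 0
    for _ in range(n_samples):
        idx = [random.randrange(n) for _ in range(n)]
        score_a = metric([sys_a[i] for i in idx], [refs[i] for i in idx])
        score_b = metric([sys_b[i] for i in idx], [refs[i] for i in idx])
        wins += score_a > score_b
    return wins / n_samples   # e.g. > 0.95 would mean significance at the 5% level

# Toy usage with a dummy exact-match "metric", just to show the interface:
exact = lambda hyps, refs: sum(h == r for h, r in zip(hyps, refs)) / len(refs)
print(paired_bootstrap(exact, ["a", "b", "c"], ["a", "x", "c"], ["a", "b", "c"], n_samples=200))
</code>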

**Selection of the sentences for the analysis**
  * Due to the filtering, they only select cases with bigger differences -- this leads to a skewed selection of sentences where BLEU changes more
    * This can show an improvement in the "big" things (such as the choice of content words), but a worsening in the "small" things (grammatical words, inflection)

**BLEU deficiency**
  * The authors argue that some of the sentences are not actually worse, but since the changed words do not appear in the reference, BLEU scoring hurts their system
  * We believe this argument is misleading:
    * The baseline has the same problem
    * The human translators use a different expression in the reference for a reason (even if the meaning is roughly the same, there can be differences in style etc.)
    * We must be careful when we criticize BLEU -- it is all too easy to find single sentences where it failed
      * It's always better to back up such an argument with human rankings
    * Why didn't they run METEOR or another metric instead of leaving it for future work?

==== Sec. 6. Conclusions ====
**Structural variation moderation**
  * Sounds a bit sci-fi, but very interesting

**Choice of discourse context**
  * It's true that using just the document as the context works well for news articles, but not for most of the content we wish to translate
  * Domain features, topic modelling or word classes would be worth trying
