The list of discussed topics follows the outline of the paper:
==== Sec. 2. Related Work ====
**Differences from Carpuat 2009**
  * It is different: the decoder just gets additional features, but the decision is up to it -- Carpuat 2009 just post-edits the outputs and substitutes the most likely variant everywhere
  * Using Carpuat 2009's approach directly in the decoder would influence neighboring words through the LM, so even using it in the decoder instead of as post-editing leads to a different outcome

**Human translators and one sense per discourse**
  * This suggests that modelling human translators is the same as modelling one sense per discourse -- this is suspicious
  * The authors do not state their evidence clearly.
  * One sense is not the same as one translation
==== Sec. 3. Exploratory analysis ====
**Hiero**
  * The idea would most probably work the same in normal phrase-based SMT, but the authors use hierarchical phrase-based translation (Hiero)
  * Hiero is summarized in Fig. 1: the phrases may contain non-terminals, i.e. gaps that are recursively filled by other rules (see the toy sketch below)
  * The authors chose one particular Hiero implementation
    * The choice was probably arbitrary, other systems would yield similar results
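To make the rule format concrete, here is a toy sketch of Hiero-style synchronous rules; the grammar, the words and the derivation are invented for illustration and are not taken from the paper.
<code python>
# Toy illustration of Hiero-style synchronous rules: a rule rewrites the
# non-terminal X into a source/target pair that may itself contain X,
# so one rule can wrap material around (or reorder) a gap.
# The grammar and the example derivation are invented for this sketch.

# Each rule: (source side, target side); "X1" marks the shared non-terminal.
R_NEG  = ("ne X1 pas", "do not X1")   # hierarchical rule with a gap
R_WANT = ("veux", "want")             # plain lexical rule

def apply_rule(rule, filler=("", "")):
    """Substitute the (source, target) filler into the rule's X1 slots."""
    src, tgt = rule
    f_src, f_tgt = filler
    return (src.replace("X1", f_src).strip(), tgt.replace("X1", f_tgt).strip())

if __name__ == "__main__":
    # Derivation: X -> <ne X1 pas, do not X1>, then X1 -> <veux, want>
    inner = apply_rule(R_WANT)
    pair = apply_rule(R_NEG, inner)
    print(pair)   # ('ne veux pas', 'do not want')
</code>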
+ | |||
+ | **Forced decoding** | ||
+ | * This means that the decoder is given source //and// target sentence and has to provide the rules/ | ||
+ | * The decoder might be unable to find the appropriate rules (for unseen words) | ||
+ | * It is a different decoder mode, for which it must be adjusted | ||
+ | * Forced decoding is much more informative for Hiero translations than for " | ||
+ | |||
**The choice and filtering of "cases"**
  * A "case" is a source-side item that is translated more than once within the same document
  * Table 1 is unfiltered -- only some of the "cases" survive the filtering below
  * Cases that are //too similar// (fewer than 1/2 of the characters differ) are //joined together// (see the sketch after this list)
    * Beware, this notion of grouping is not well-defined
  * Cases where //only one translation variant prevails// are //discarded//
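A rough sketch of the kind of filtering described above; the similarity measure (difflib's ratio as a stand-in for "less than 1/2 of the characters differ") and the 80% dominance threshold are our own assumptions, not the paper's exact procedure.
<code python>
# Rough sketch of the case filtering described above. The similarity measure
# and the dominance threshold are assumptions, not the paper's exact rules.
from collections import Counter
from difflib import SequenceMatcher

def too_similar(a, b):
    """Treat two translation variants as the same if most characters match."""
    return SequenceMatcher(None, a, b).ratio() >= 0.5

def filter_case(variant_counts):
    """variant_counts: Counter of translation variants of one source item
    within one document. Returns grouped counts, or None if discarded."""
    grouped = Counter()
    for variant, count in variant_counts.items():
        for seen in grouped:
            if too_similar(variant, seen):
                grouped[seen] += count      # join near-identical variants
                break
        else:
            grouped[variant] = count
    if len(grouped) < 2:
        return None                          # no real inconsistency left
    if max(grouped.values()) / sum(grouped.values()) > 0.8:
        return None                          # one variant clearly prevails
    return grouped

if __name__ == "__main__":
    case = Counter({"agreement": 3, "agreements": 2, "treaty": 2})
    print(filter_case(case))   # 'agreement(s)' joined, 'treaty' kept separate
</code>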
+ | |||
+ | ==== Sec. 4. Approach ==== | ||
+ | The actual experiments begin only now; the used data is different. | ||
+ | |||
+ | **Choice of features** | ||
+ | * They define 3 features that are designed to be biased towards consistency -- or are they? | ||
+ | * If e.g. two variants are used 2 times each, they will have roughly the same score | ||
+ | * The BM25 function is a refined version of the [[http:// | ||
+ | * The exact parameter values are probably not tuned, left at a default value (and maybe they don't have much influence anyway) | ||
+ | * See NPFL103 for details on Information retrieval, it's largely black magic | ||
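A minimal sketch of Okapi BM25 term weighting with the textbook default parameters (k1 = 1.2, b = 0.75); the exact BM25 variant and parameter values used in the paper may differ.
<code python>
# Minimal sketch of Okapi BM25 term weighting (a refined tf-idf).
# k1=1.2 and b=0.75 are the usual textbook defaults; the exact variant and
# parameters used in the paper may differ.
import math

def bm25_weight(term, doc, collection, k1=1.2, b=0.75):
    """Score of `term` in `doc` (a token list) w.r.t. a list of documents."""
    n_docs = len(collection)
    df = sum(1 for d in collection if term in d)            # document frequency
    idf = math.log((n_docs - df + 0.5) / (df + 0.5) + 1.0)  # smoothed idf
    tf = doc.count(term)
    avgdl = sum(len(d) for d in collection) / n_docs
    norm = k1 * (1.0 - b + b * len(doc) / avgdl)            # length normalization
    return idf * tf * (k1 + 1.0) / (tf + norm)

if __name__ == "__main__":
    docs = [["the", "new", "fund", "pays"],
            ["the", "fund", "was", "created"],
            ["the", "weather", "was", "nice"]]
    print(round(bm25_weight("fund", docs[0], docs), 3))   # rarer term -> higher weight
    print(round(bm25_weight("the", docs[0], docs), 3))    # frequent term -> lower weight
</code>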
+ | |||
+ | **Feature weights** | ||
+ | * The usual model in MT is scoring the hypotheses according to the feature values ('' | ||
+ | * '' | ||
+ | * The feature weights are trained on a heldout data set using [[http:// | ||
+ | * The resulting weights are not mentioned, but if the weight is < 0, will this favor different translation choices? | ||
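A tiny sketch of log-linear hypothesis scoring; the feature names, values and weights are invented, but it shows why the sign of the weight on a consistency feature matters.
<code python>
# Tiny sketch of log-linear hypothesis scoring: score(e) = sum_i lambda_i * h_i(e).
# Feature names, values and weights below are invented for illustration only.

def score(hypothesis_features, weights):
    """Weighted sum of feature values for one translation hypothesis."""
    return sum(weights[name] * value
               for name, value in hypothesis_features.items())

def best(hypotheses, weights):
    """Pick the highest-scoring hypothesis (what the decoder's search does)."""
    return max(hypotheses, key=lambda h: score(hypotheses[h], weights))

if __name__ == "__main__":
    hyps = {
        "consistent":   {"lm": -2.1, "tm": -1.5, "C1": 3.0},
        "inconsistent": {"lm": -2.0, "tm": -1.4, "C1": 0.0},
    }
    # With a positive weight on the consistency feature C1 the consistent
    # hypothesis wins; with a negative weight it loses.
    print(best(hyps, {"lm": 1.0, "tm": 1.0, "C1": 0.3}))   # -> consistent
    print(best(hyps, {"lm": 1.0, "tm": 1.0, "C1": -0.3}))  # -> inconsistent
</code>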
+ | |||
+ | **Meaning of the individual features** | ||
+ | * C1 indicates that a certain Hiero rule was used frequently | ||
+ | * but rules are very similar, so we also need something less fine-grained | ||
+ | * C2 is a target-side feature, just counts the target side tokens (only the "most important" | ||
+ | * It may be compared to Language Model features, but is trained only on the target part of the bilingual tuning data. | ||
+ | * C3 counts occurrences of source-target token pairs (and uses the "most important" | ||
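A rough sketch of the kind of per-document counting behind such consistency features; the exact definitions in the paper (BM25 weighting, normalization, the "most important" filtering) are not reproduced here, only the general shape is shown.
<code python>
# Rough sketch of the counting behind consistency features like C1/C2/C3:
# counts are collected from a first-pass translation of one document and then
# looked up when re-translating it. BM25 weighting, normalization and the
# "most important" filtering from the paper are intentionally omitted.
from collections import Counter

def collect_document_counts(first_pass):
    """first_pass: list of (rule_id, source_tokens, target_tokens) per applied rule."""
    rule_counts = Counter()        # C1-like: how often each rule was used
    target_counts = Counter()      # C2-like: target-side token counts
    pair_counts = Counter()        # C3-like: source-target token co-occurrences
    for rule_id, src, tgt in first_pass:
        rule_counts[rule_id] += 1
        target_counts.update(tgt)
        pair_counts.update((s, t) for s in src for t in tgt)
    return rule_counts, target_counts, pair_counts

if __name__ == "__main__":
    first_pass = [
        ("r42", ["fonds"], ["fund"]),
        ("r42", ["fonds"], ["fund"]),
        ("r77", ["caisse"], ["fund"]),
    ]
    c1, c2, c3 = collect_document_counts(first_pass)
    print(c1["r42"], c2["fund"], c3[("caisse", "fund")])   # 2 3 1
</code>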
+ | |||
+ | **Requirements of the new features** | ||
+ | * They need two passes through the data | ||
+ | * You need to have document segmentation | ||
+ | * Since the frequencies are trained on the tuning set (see Sec. 5), you can just translate one document at a time, no need to have full sets of documents | ||
+ | |||
==== Sec. 5. Evaluation and Discussion ====
**Choice of baseline**
  * Baselines are quite nice and competitive
  * MIRA is very cutting-edge
**Tuning the feature weights**
  * For the 1st pass, the weights of the baseline system are reused; only the 2nd pass is tuned with the new features
  * This is in order to speed up the experiment, they don't want to wait for MIRA twice.
**The usage of two variants of the BLEU evaluation metric**
  * The BLEU variants do not differ that much, only in the Brevity Penalty with multiple references (see the sketch below)
  * IBM BLEU uses the reference that is closest to the MT output (in terms of length), NIST BLEU uses the shortest one
  * This was probably just due to some technical reasons, e.g. they had their optimization software designed for one metric and not the other
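A small sketch of the difference: with multiple references, the brevity penalty depends on which reference length is chosen; the "closest" vs. "shortest" conventions below follow the description above, and the n-gram precision part of BLEU is omitted.
<code python>
# Sketch of the brevity-penalty difference between the two BLEU variants with
# multiple references: the IBM-style variant picks the reference length
# closest to the hypothesis, the NIST-style variant picks the shortest one.
# Only the penalty is shown; n-gram precisions are omitted.
import math

def brevity_penalty(hyp_len, ref_lens, variant="ibm"):
    if variant == "ibm":        # closest reference length (ties -> shorter)
        ref_len = min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    else:                       # "nist": shortest reference length
        ref_len = min(ref_lens)
    if hyp_len >= ref_len:
        return 1.0
    return math.exp(1.0 - ref_len / hyp_len)

if __name__ == "__main__":
    hyp_len, ref_lens = 20, [18, 21, 25]
    print(brevity_penalty(hyp_len, ref_lens, "ibm"))    # closest ref is 21 -> BP < 1
    print(brevity_penalty(hyp_len, ref_lens, "nist"))   # shortest ref is 18 -> BP = 1.0
</code>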
+ | |||
+ | **Performance** | ||
+ | * Adding any single feature improves the performance, | ||
+ | * "Most works in MT just achieve 1 BLEU point improvement, | ||
+ | * There are no significance tests !! | ||
+ | * Even 1.0 BLEU doesn' | ||
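Since significance testing is mentioned above, here is a minimal sketch of paired bootstrap resampling (Koehn, 2004); ''corpus_metric'' is a placeholder for any corpus-level metric such as BLEU and has to be supplied by the reader.
<code python>
# Minimal sketch of paired bootstrap resampling (Koehn, 2004) for checking
# whether system A really beats system B on a corpus-level metric.
# `corpus_metric` is a placeholder (e.g. corpus BLEU) and must be supplied;
# it takes parallel lists of hypotheses and references and returns a score.
import random

def paired_bootstrap(hyps_a, hyps_b, refs, corpus_metric, samples=1000):
    """Return the fraction of resampled test sets on which A outscores B."""
    n = len(refs)
    wins = 0
    for _ in range(samples):
        idx = [random.randrange(n) for _ in range(n)]   # resample with replacement
        sample_a = [hyps_a[i] for i in idx]
        sample_b = [hyps_b[i] for i in idx]
        sample_r = [refs[i] for i in idx]
        if corpus_metric(sample_a, sample_r) > corpus_metric(sample_b, sample_r):
            wins += 1
    return wins / samples   # e.g. >= 0.95 is usually reported as significant

# Usage (assuming some corpus_metric such as BLEU is available):
#   p = paired_bootstrap(new_out, baseline_out, references, corpus_metric)
#   print("new system wins on", p, "of the resampled test sets")
</code>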
+ | |||
+ | **Selection of the sentences for the analysis** | ||
+ | * They only select cases with bigger differences due to filtering -- this leads to skewed selection of sentences where BLEU changes more | ||
+ | * This can lead to an improvement in " | ||
+ | |||
+ | **BLEU deficiency** | ||
+ | * The authors argue that some of the sentences are not worsened, but since the changed words do not appear in the reference, BLEU scoring hurts their system | ||
+ | * We believe this argument is misleading: | ||
+ | * The baseline has the same problem | ||
+ | * The human translators use different expression in the reference for a reason (even if the meaning is roughly the same, there can be style differences etc.) | ||
+ | * We must be careful when we criticize BLEU -- it is all too easy to find single sentences where it failed | ||
+ | * It's always better to back up our argument by human rankings | ||
+ | * Why didn't they run METEOR or other metric and left it for future work? | ||
+ | |||
==== Sec. 6. Conclusions ====
**Structural variation moderation**
  * Sounds a bit sci-fi, but very interesting
**Choice of discourse context**
  * It's true that choosing just the document as the context works well for news articles, but not for most of the content we wish to translate
  * A domain feature, topic modelling or word classes would be worth trying
===== Our conclusion =====

Nice paper with a very good idea that probably can improve translations.
===== Martin's remarks =====
  * The approach (without modifications) does not seem to be suitable for translating into a morphologically rich language. Different forms of the same lemma would be considered different senses (if not grouped together due to 1/2 of the characters being the same), so the system would produce e.g. only nominatives.
  * Also, there should be a modification for source-side words with more than one possible PoS, e.g. a word that can be both a noun and a verb should not be forced into a single consistent translation.