Encouraging Consistent Translation Choices
Ferhan Ture, Douglas W. Oard, and Philip Resnik
NAACL 2012
PDF
Outline -- discussion
The list of discussed topics follows the outline of the paper:
Sec. 2. Related Work
Differences from Carpuat 2009
- It is different: the decoder just gets additional features and the final decision is up to it, whereas Carpuat 2009 post-edits the outputs and substitutes the most likely variant everywhere
- Moving Carpuat 2009's substitution into the decoder would also influence neighboring words through the LM, so even using it inside the decoder rather than as post-editing leads to a different outcome
Human translators and one sense per discourse
- This suggests that modelling human translators is the same as modelling one sense per discourse – this is suspicious
- The authors do not state their evidence clearly.
- One sense is not the same as one translation
Sec. 3. Exploratory analysis
Hiero
- The idea would most probably work the same in normal phrase-based SMT, but the authors use hierarchical phrase-based translation (Hiero)
- Hiero is summarized in Fig. 1: the phrases may contain non-terminals (X, X1 etc.), which leads to a probabilistic synchronous CFG and bottom-up parsing (a toy rule application is sketched after this list)
- The authors chose the cdec implementation of Hiero (the model is implemented in several systems: Moses, cdec, Joshua etc.); the choice was probably arbitrary, and other systems would yield similar results
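A toy sketch of what a Hiero rule with a non-terminal slot looks like and how it is applied bottom-up (the rule, names and helper function are ours for illustration, not cdec code):

<code python>
# A toy Hiero-style rule with a non-terminal slot (illustrative only, not cdec code).
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Rule:
    source: List[str]   # tokens; "X1" marks a non-terminal slot
    target: List[str]

# Made-up rule: X -> < ne X1 pas , do not X1 >
rule = Rule(source=["ne", "X1", "pas"], target=["do", "not", "X1"])

def apply_rule(rule: Rule, fillers: Dict[str, List[str]]) -> List[str]:
    """Substitute already-translated sub-spans into the rule's target side."""
    out: List[str] = []
    for tok in rule.target:
        out.extend(fillers.get(tok, [tok]))
    return out

# Bottom-up parsing: a lower chart cell has already translated X1 = "mange" -> "eat"
print(apply_rule(rule, {"X1": ["eat"]}))   # ['do', 'not', 'eat']
</code>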
Forced decoding
- This means that the decoder is given source and target sentence and has to provide the rules/phrases that map from the source to the target
- The decoder might be unable to find the appropriate rules (for unseen words)
- It is a different decoder mode, for which it must be adjusted
- Forced decoding is much more informative for Hiero translations than for “plain” phrase-based ones, since many different parse trees can yield the same target string, whereas there are not nearly as many alternative phrase segmentations (a toy illustration of forced decoding follows this list)
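A toy illustration of the forced-decoding idea under a strong simplification (this is not how cdec implements it): among the derivations of the source sentence, keep only those whose yield equals the given target.

<code python>
# Toy illustration of forced decoding (our simplification, not cdec's implementation):
# keep only derivations whose output exactly matches the given target sentence.

def forced_decode(derivations, target):
    """derivations: iterable of (rules_used, output_tokens) candidates."""
    return [rules for rules, output in derivations if output == target]

candidates = [
    (["X -> <la maison, the house>"], ["the", "house"]),
    (["X -> <la maison, the home>"],  ["the", "home"]),
]
print(forced_decode(candidates, ["the", "house"]))
# -> [['X -> <la maison, the house>']]
</code>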
The choice and filtering of “cases”
- The “cases” in Table 1 are selected according to the possibility of different translations (i.e. each case has at least two translations of the source seen in the training data; the translation counts are from the test data, so it is OK that e.g. “Korea” translates as “Korea” all the time)
- Table 1 is unfiltered – only some of the “cases” are then considered relevant:
- Cases that are too similar (fewer than half of the characters differ) are joined together
- Beware: this notion of grouping is not well defined and does not create equivalence classes: “old hostages” = “new hostages” = “completely new hostages”, but “old hostages” != “completely new hostages” (we hope this didn't actually happen; a toy demonstration follows this list)
- Cases where only one translation variant prevails are discarded (this is the case of “Korea”)
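A toy demonstration of the non-transitivity problem, using a made-up character-difference measure (the paper's exact similarity criterion may differ):

<code python>
# Toy demonstration (our own, made-up measure) of why a "merge if fewer than
# half of the characters differ" rule is not transitive.

def char_diff_ratio(a, b):
    """Fraction of differing characters, via Levenshtein distance."""
    m, n = len(a), len(b)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (a[i - 1] != b[j - 1]))
    return d[m][n] / max(m, n)

def similar(a, b):
    return char_diff_ratio(a, b) < 0.5

a, b, c = "old hostages", "new hostages", "completely new hostages"
print(similar(a, b), similar(b, c), similar(a, c))  # True True False -> not transitive
</code>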
Sec. 4. Approach
The actual experiments begin only here; the data used differs from that of the exploratory analysis.
Choice of features
- They define 3 features that are designed to be biased towards consistency – or are they?
- If e.g. two variants are used 2 times each, they will have roughly the same score
- The BM25 function is a refined version of the TF-IDF score (a minimal sketch follows this list)
- The exact parameter values are probably not tuned, left at a default value (and maybe they don't have much influence anyway)
- See NPFL103 for details on Information Retrieval; it's largely black magic
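For reference, a minimal BM25 sketch in a standard textbook form (not the exact formula or parameter setting used in the paper; k1 and b below are the usual defaults):

<code python>
# Standard Okapi-style BM25 weight of one term in one document (textbook form,
# not necessarily the exact variant or parameters used in the paper).
import math

def bm25(term_freq, doc_len, avg_doc_len, n_docs, doc_freq, k1=1.2, b=0.75):
    idf = math.log((n_docs - doc_freq + 0.5) / (doc_freq + 0.5) + 1.0)
    tf = (term_freq * (k1 + 1)) / (
        term_freq + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * tf

# A rare term occurring twice in a short document scores high:
print(bm25(term_freq=2, doc_len=100, avg_doc_len=300, n_docs=1000, doc_freq=5))
</code>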
Feature weights
- The usual model in MT is scoring the hypotheses according to the feature values (f) and their weights (lambda): score(H) = exp( sum_i( lambda_i * f_i(H) ) )
- The resulting weights are not mentioned, but if a weight comes out negative, will this actually favor different (inconsistent) translation choices? (see the sketch below)
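A minimal sketch of this log-linear scoring (generic illustration; the feature names and values below are made up, not taken from the paper):

<code python>
# Generic log-linear hypothesis scoring as used in most SMT systems.
import math

def score(features, weights):
    """score(H) = exp( sum_i lambda_i * f_i(H) )"""
    return math.exp(sum(weights[name] * value for name, value in features.items()))

weights   = {"lm": 0.5, "tm": 0.3, "C1": 0.1}     # lambda_i (made-up values)
hyp_feats = {"lm": -12.4, "tm": -3.1, "C1": 2.0}  # f_i(H)   (made-up values)
print(score(hyp_feats, weights))
# With a negative weight on C1, hypotheses with high consistency counts
# would be penalized rather than rewarded.
</code>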
Meaning of the individual features
- C1 indicates that a certain Hiero rule was used frequently
- but many rules are very similar, so we also need something less fine-grained
- C2 is a target-side feature: it just counts the target-side tokens (only the “most important” ones, in terms of TF-IDF); a C2-like count is sketched after this list
- It may be compared to Language Model features, but is trained only on the target part of the bilingual tuning data.
- C3 counts occurrences of source-target token pairs (again using only the “most important” term pair for each rule)
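A rough sketch of a C2-like target-side count, as we understand the general idea (our reconstruction, not the paper's exact definition):

<code python>
# Rough sketch of a C2-like count: how often does the rule's "most important"
# (highest-TF-IDF) target term already appear in the document translated so far?
from collections import Counter

def c2_feature(rule_target_terms, doc_history, tfidf):
    if not rule_target_terms:
        return 0
    key_term = max(rule_target_terms, key=lambda t: tfidf.get(t, 0.0))
    return Counter(doc_history)[key_term]

tfidf = {"sanctions": 3.2, "the": 0.1}                        # made-up weights
history = "the council imposed sanctions and the sanctions remained".split()
print(c2_feature(["the", "sanctions"], history, tfidf))       # -> 2
</code>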
Requirements of the new features
- They need two passes through the data
- You need to have document segmentation
- Since the frequencies are trained on the tuning set (see Sec. 5), you can just translate one document at a time; there is no need to have full sets of documents (a two-pass sketch follows this list)
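A sketch of the two-pass pipeline as we read it, with hypothetical translate() and collect_counts() stand-ins:

<code python>
# Two-pass sketch: first pass produces a baseline translation of the document,
# the second pass re-decodes with consistency counts gathered from the first pass.
# translate() and collect_counts() are hypothetical placeholders, not real APIs.

def two_pass_translate(document_sentences, translate, collect_counts):
    # Pass 1: baseline translation of the whole document
    first_pass = [translate(s, counts=None) for s in document_sentences]
    # Gather document-level counts (rules used, target terms, term pairs)
    counts = collect_counts(first_pass)
    # Pass 2: re-decode, now with the consistency features active
    return [translate(s, counts=counts) for s in document_sentences]

# Trivial stand-ins so the sketch runs:
translate = lambda s, counts: s.upper()      # placeholder "decoder"
collect_counts = lambda outputs: {}          # placeholder count collector
print(two_pass_translate(["la maison", "la maison bleue"], translate, collect_counts))
</code>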
Sec. 5. Evaluation and Discussion
Choice of baseline
- Baselines are quite nice and competitive, we believe this really is an improvement
- MIRA is very cutting-edge
Tuning the feature weights
- For the 1st phase, “heuristically” probably means they just used some reasonable enough values, e.g. from earlier experiments
- This is in order to speed up the experiments; they don't want to wait for MIRA twice.
The usage of two variants of the BLEU evaluation metric
- The BLEU variants do not differ much; they differ only in the Brevity Penalty when multiple references are used (the two conventions are sketched after this list)
- IBM BLEU uses the reference that is closest to the MT output (in terms of length), NIST BLEU uses the shortest one
- This was probably just due to some technical reasons, e.g. they had their optimization software designed for one metric and not the other
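A sketch of the two brevity-penalty conventions (standard definitions, not code from the paper):

<code python>
# IBM BLEU uses the reference length closest to the hypothesis length,
# NIST BLEU uses the shortest reference length.
import math

def brevity_penalty(hyp_len, ref_lens, variant="ibm"):
    if variant == "ibm":
        ref_len = min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    else:  # "nist"
        ref_len = min(ref_lens)
    return 1.0 if hyp_len >= ref_len else math.exp(1.0 - ref_len / hyp_len)

print(brevity_penalty(24, [25, 20], "ibm"))   # closest ref is 25 -> penalized
print(brevity_penalty(24, [25, 20], "nist"))  # shortest ref is 20 -> no penalty
</code>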
Performance
- Adding any single feature improves the performance, the combination of all of them is even better
- “Most works in MT just achieve a 1 BLEU point improvement, and then it is ready to be published”
- There are no significance tests! (a common test is sketched after this list)
- Even a 1.0 BLEU gain need not be significant if there are a lot of changes
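A common remedy would be paired bootstrap resampling (Koehn 2004); a sketch with a placeholder metric function standing in for corpus BLEU:

<code python>
# Paired bootstrap resampling sketch: metric_sys/metric_base are placeholders
# for corpus BLEU computed over a resampled set of sentence indices.
import random

def paired_bootstrap(n_sents, metric_sys, metric_base, samples=1000, seed=0):
    """Fraction of resampled test sets on which the system beats the baseline."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(samples):
        idx = [rng.randrange(n_sents) for _ in range(n_sents)]
        if metric_sys(idx) > metric_base(idx):
            wins += 1
    return wins / samples

# Toy example: made-up per-sentence scores, averaged as a stand-in metric
sys_scores  = [0.32, 0.41, 0.29, 0.35, 0.38]
base_scores = [0.30, 0.40, 0.31, 0.33, 0.36]
avg = lambda scores: (lambda idx: sum(scores[i] for i in idx) / len(idx))
print(paired_bootstrap(5, avg(sys_scores), avg(base_scores)))
</code>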
Selection of the sentences for the analysis
- Because of the filtering, they only select cases with bigger differences; this leads to a skewed sample of sentences where BLEU changes more
- This can lead to an improvement in “big” things (such as the selection of content words) but a worsening in “small” things (grammatical words, inflection)
BLEU deficiency
- The authors argue that some of the sentences are not worsened, but since the changed words do not appear in the reference, BLEU scoring hurts their system
- We believe this argument is misleading:
- The baseline has the same problem
- The human translators used a different expression in the reference for a reason (even if the meaning is roughly the same, there can be style differences etc.)
- We must be careful when we criticize BLEU; it is all too easy to find single sentences where it failed
- It's always better to back up such an argument with human rankings
- Why didn't they run METEOR or another metric, instead of leaving it for future work?
Sec. 6. Conclusions
Structural variation moderation
- Sounds a bit sci-fi, but very interesting
Choice of discourse context
- It's true that choosing just the document as the discourse unit works well for news articles, but not for most of the content we wish to translate
- A domain feature, topic modelling, or word classes would be worth trying