Encouraging Consistent Translation Choices
Ferhan Ture, Douglas W. Oard, and Philip Resnik
NAACL 2012
Outline -- discussion
The list of discussed topics follows the outline of the paper:
Differences from Carpuat 2009
Human translators and one sense per discourse
Sec. 3. Exploratory analysis
Hiero
The idea would most probably work the same in normal phrase-based SMT, but the authors use hierarchical phrase-based translation (Hiero)
The authors chose the cdec implementation of Hiero (Hiero is implemented in several systems: Moses, cdec, Joshua, etc.)
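As a quick reminder of what a Hiero rule is (a toy, invented example, not from the paper): a synchronous grammar production whose nonterminal gaps can be reordered between the source and target sides.

    # Toy sketch of applying one Hiero-style synchronous rule (invented
    # example): X -> <X1 de X2, X2 's X1>, i.e. the two gaps swap order.
    def apply_rule(x1: str, x2: str) -> str:
        # source pattern: "X1 de X2"  (e.g. "la voiture de Jean")
        # target pattern: "X2 's X1"  (e.g. "Jean 's car")
        return f"{x2} 's {x1}"

    print(apply_rule("car", "Jean"))  # -> "Jean 's car"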
Forced decoding
The choice and filtering of “cases”
The “cases” in Table 1 are selected according to the possibility of different translations (i.e. each case has at least two different translations of the source seen in the training data; the translation counts shown are from the test data, so it is OK that e.g. “Korea” translates as “Korea” all the time)
Table 1 is unfiltered – only some of the “cases” are then considered relevant.
Sec. 4. Approach
The actual experiments begin only here; the data used differs from the exploratory analysis.
Choice of features
They define 3 features that are designed to be biased towards consistency – or are they?
The BM25 function is a refined version of the TF-IDF score
The exact parameter values are probably not tuned but left at default values (and maybe they don't have much influence anyway)
See NPFL103 for details on information retrieval; it's largely black magic
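A minimal sketch of the contrast between the two scores (standard textbook formulas with the usual k1 and b defaults, not necessarily the authors' exact setup; the IDF here is a simplified log(N/df), whereas Robertson's BM25 uses log((N - df + 0.5) / (df + 0.5))):

    import math

    def tf_idf(tf: int, df: int, n_docs: int) -> float:
        # Plain TF-IDF: grows linearly with term frequency.
        return tf * math.log(n_docs / df)

    def bm25(tf: int, df: int, n_docs: int, doc_len: int, avg_len: float,
             k1: float = 1.2, b: float = 0.75) -> float:
        # BM25: saturating term frequency plus document-length normalization;
        # k1 and b are usually left at defaults, as noted above.
        idf = math.log(n_docs / df)
        norm = 1 - b + b * doc_len / avg_len
        return idf * tf * (k1 + 1) / (tf + k1 * norm)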
Feature weights
The usual model in MT is scoring the hypotheses according to the feature values (fᵢ) and their weights (λᵢ):
score(e) = Σᵢ λᵢ · fᵢ(e)
The feature weights are trained on a held-out data set using MERT (or, here, MIRA)
The resulting weights are not mentioned, but if a consistency feature's weight is < 0, will this actually favor inconsistent translation choices?
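A minimal sketch of the log-linear scoring (feature names and values invented), showing that a negative λ for a consistency feature would indeed push the decoder away from consistent choices:

    # Minimal log-linear scoring sketch (hypothetical feature names/values).
    def score(features: dict[str, float], weights: dict[str, float]) -> float:
        # score(e) = sum_i lambda_i * f_i(e)
        return sum(weights[name] * value for name, value in features.items())

    weights = {"lm": 0.5, "tm": 0.3, "C1": -0.2}   # invented values
    hyp = {"lm": -10.0, "tm": -4.0, "C1": 3.0}     # C1 = a consistency count
    print(score(hyp, weights))  # negative C1 weight lowers consistent hypotheses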
Meaning of the individual features
C1 indicates how frequently a certain Hiero rule was used (within the document)
C2 is a target-side feature; it just counts the target-side tokens (only the “most important” ones, in terms of TF-IDF)
C3 counts occurrences of source-target token pairs (again using only the “most important” term pair for each rule); see the sketch below
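A rough sketch of how we understand the three counts (our paraphrase, simplified; not the authors' code – the paper computes them per rule over the rest of the document translation):

    # Rough paraphrase of the three consistency counts (hypothetical helper).
    def consistency_features(rule_id, top_tgt, top_pair,
                             doc_rules, doc_tgt_tokens, doc_pairs):
        c1 = doc_rules.count(rule_id)       # C1: how often this rule fired in the document
        c2 = doc_tgt_tokens.count(top_tgt)  # C2: the rule's most important (TF-IDF) target term
        c3 = doc_pairs.count(top_pair)      # C3: its most important source-target term pair
        return c1, c2, c3

    rules = ["r7", "r3", "r7"]              # toy document-level data
    toks = ["bank", "river", "bank"]
    pairs = [("banque", "bank"), ("rive", "river")]
    print(consistency_features("r7", "bank", ("banque", "bank"),
                               rules, toks, pairs))  # -> (2, 2, 1)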
Requirements of the new features
Sec. 5. Evaluation and Discussion
Choice of baseline
The baselines are quite nice and competitive; we believe this really is an improvement
MIRA is very cutting-edge
Tuning the feature weights
The use of two variants of the BLEU evaluation metric
The BLEU variants do not differ that much, only in the brevity penalty when there are multiple references
This was probably just due to technical reasons, e.g. their optimization software was designed for one variant and not the other
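For reference, the brevity penalty and the point where multi-reference variants can differ, namely whether the effective reference length r is the closest (IBM-style) or the shortest (NIST-style) reference length; a generic sketch of the two common conventions, not necessarily the exact implementations they used:

    import math

    def brevity_penalty(cand_len: int, ref_lens: list[int],
                        variant: str = "closest") -> float:
        # The variants differ only in how r is picked from multiple references.
        if variant == "closest":   # IBM-style: length closest to the candidate
            r = min(ref_lens, key=lambda r: (abs(r - cand_len), r))
        else:                      # "shortest": NIST-style
            r = min(ref_lens)
        return 1.0 if cand_len > r else math.exp(1 - r / cand_len)

    print(brevity_penalty(10, [11, 8]))              # r = 11 -> exp(-0.1) ≈ 0.905
    print(brevity_penalty(10, [11, 8], "shortest"))  # r = 8  -> 1.0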
Performance
Adding any single feature improves the performance, and the combination of all of them is even better
“Most works in MT just achieve 1 BLEU point improvement, then it is ready to be published”
There are no significance tests!!
Selection of the sentences for the analysis
BLEU deficiency
The authors argue that some of the sentences are not worsened, but since the changed words do not appear in the reference, BLEU scoring hurts their system
We believe this argument is misleading:
The baseline has the same problem
The human translators use a different expression in the reference for a reason (even if the meaning is roughly the same, there can be style differences etc.)
We must be careful when we criticize BLEU – it is all too easy to find single sentences where it failed
Why didn't they run METEOR or another metric instead of leaving it for future work?
Sec. 6. Conclusions
Structural variation moderation
Choice of discourse context
It's true that choosing just the document as the discourse context works well for news articles, but not for most of the content we wish to translate
A domain feature, topic modelling, or word classes would be worth trying
Our conclusion
A nice paper with a very good idea that probably can improve translations, but with several arguments that are not backed up by sufficient evidence or are clearly misleading. The initial analysis is done in a very precise and detailed way. The actual translation experiments show that adding the new features helps, but they lack some obvious steps, such as significance testing or actually proving that BLEU is wrong by using METEOR or a similar metric.
The approach (without modifications) does not seem suitable for translating into a morphologically rich language. Different forms of the same lemma would be considered different senses (unless grouped together by the heuristic that treats two word forms as the same translation when at least half of their characters match), so the system would produce e.g. only nominatives.
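A sketch of that grouping heuristic as we read it (our interpretation of “half of the characters being the same” via a common prefix; a hypothetical reconstruction, not the paper's code):

    def same_lemma_guess(w1: str, w2: str) -> bool:
        # Our reading of the heuristic: treat two word forms as one
        # translation choice if their common prefix covers at least
        # half of the longer form (hypothetical reconstruction).
        prefix = 0
        for a, b in zip(w1, w2):
            if a != b:
                break
            prefix += 1
        return prefix >= max(len(w1), len(w2)) / 2

    print(same_lemma_guess("knihou", "knihami"))  # True: Czech "kniha" in two cases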
Also, there should be a modification for source-side words with multiple possible PoS: e.g. “book” as a noun should be translated differently than as a verb, and you can easily find both in one document.