Institute of Formal and Applied Linguistics Wiki


Differences

courses:rg:2012:encouraging-consistent-translation [2012/10/17 11:59]
dusek
courses:rg:2012:encouraging-consistent-translation [2012/10/23 11:04] (current)
popel my remarks
Line 41:

 **Choice of features**
-  * They define 3 features that are designed to be biased towrds consistency -- or are they?
+  * They define 3 features that are designed to be biased towards consistency -- or are they?
     * If e.g. two variants are used 2 times each, they will have roughly the same score
   * The BM25 function is a refined version of the [[http://en.wikipedia.org/wiki/TF-IDF|TF-IDF]] score
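
To make the relation between BM25 and TF-IDF more concrete, here is a minimal Python sketch of the standard Okapi BM25 term weight. The parameter values (`k1=1.2`, `b=0.75`), the smoothed IDF form and all counts below are assumptions chosen for illustration; the paper's exact formulation of its features may differ.

```python
import math

def bm25_term_score(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 weight of one term in one document (illustrative sketch)."""
    # Smoothed, non-negative IDF variant (an assumption, not taken from the paper).
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    # Document-length normalisation: longer documents are penalised.
    length_norm = 1.0 - b + b * doc_len / avg_doc_len
    # Unlike raw TF, the term-frequency part saturates as tf grows.
    return idf * tf * (k1 + 1.0) / (tf + k1 * length_norm)

# Two competing translation variants, each used twice in the same document
# and with the same (hypothetical) document frequency, get identical scores.
score_a = bm25_term_score(tf=2, df=50, n_docs=1000, doc_len=400, avg_doc_len=400)
score_b = bm25_term_score(tf=2, df=50, n_docs=1000, doc_len=400, avg_doc_len=400)

# Doubling the count raises the score, but by less than a factor of two.
score_4 = bm25_term_score(tf=4, df=50, n_docs=1000, doc_len=400, avg_doc_len=400)
```

With equal counts the term-frequency part is identical, so any remaining difference between two variants can only come from the IDF part. This is consistent with the remark above that two variants used 2 times each end up with roughly the same score.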
Line 74:
     * This is in order to speed up the experiment; they don't want to wait for MIRA twice.

-**Different evaluation metrics**
+**The usage of two variants of the BLEU evaluation metric**
   * The BLEU variants do not differ that much, only in the brevity penalty for multiple references
     * IBM BLEU uses the reference that is closest to the MT output (in terms of length), NIST BLEU uses the shortest one
   * This was probably just due to some technical reasons, e.g. they had their optimization software designed for one metric and not the other
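
The difference between the two variants can be sketched in a few lines of Python. This is only an illustration of the two reference-length choices described above: the function names and the example lengths are hypothetical, the tie-breaking rule is an assumption, and real BLEU implementations sum the lengths over the whole test set rather than working per segment.

```python
import math

def effective_ref_length(hyp_len, ref_lens, variant):
    """Pick the reference length used in the brevity penalty."""
    if variant == "nist":
        # NIST BLEU: always the shortest reference.
        return min(ref_lens)
    if variant == "ibm":
        # IBM BLEU: the reference closest in length to the MT output.
        # Ties are broken toward the shorter reference (an assumption here).
        return min(ref_lens, key=lambda r: (abs(r - hyp_len), r))
    raise ValueError("unknown BLEU variant: " + variant)

def brevity_penalty(hyp_len, ref_lens, variant):
    """Brevity penalty: 1 if the output is long enough, exp(1 - r/c) otherwise."""
    if hyp_len == 0:
        return 0.0
    r = effective_ref_length(hyp_len, ref_lens, variant)
    if hyp_len >= r:
        return 1.0
    return math.exp(1.0 - r / hyp_len)

# Hypothetical lengths: a 10-token output and three references.
bp_nist = brevity_penalty(10, [8, 11, 15], "nist")  # compared against length 8
bp_ibm = brevity_penalty(10, [8, 11, 15], "ibm")    # compared against length 11
```

In this made-up case the NIST variant applies no penalty, because the output is longer than the shortest reference, while the IBM variant penalises the output slightly for being shorter than the closest reference.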
  
 +**Performance**
 +  * Adding any single feature improves performance; the combination of all of them is even better
 +    * "Most works in MT just achieve 1 BLEU point improvement, then it is ready to be published" :-)
 +  * There are no significance tests!
 +    * Even 1.0 BLEU doesn't have to be significant if there are a lot of changes
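
One common way to check significance in MT is paired bootstrap resampling. The sketch below is not something the paper reports. `mean_score` is a hypothetical stand-in for a corpus-level metric such as BLEU recomputed from per-sentence statistics, and the scores are made up.

```python
import random

def paired_bootstrap(stats_a, stats_b, metric, n_samples=1000, seed=0):
    """Fraction of resampled test sets on which system B scores higher than A."""
    assert len(stats_a) == len(stats_b)
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_samples):
        # Draw sentence indices with replacement; use the SAME indices for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        if metric([stats_b[i] for i in idx]) > metric([stats_a[i] for i in idx]):
            wins += 1
    return wins / n_samples

def mean_score(stats):
    """Hypothetical stand-in for a corpus-level metric (not real BLEU)."""
    return sum(stats) / len(stats)

# Made-up per-sentence scores for a baseline and a modified system.
baseline = [0.20, 0.25, 0.30, 0.22, 0.28, 0.24, 0.26, 0.21]
improved = [s + 0.05 for s in baseline]
win_rate = paired_bootstrap(baseline, improved, mean_score)
```

A win rate of at least 0.95 is usually read as significance at the 5% level. In this toy data every sentence improves, so the modified system wins on every resample. With real outputs, where many sentences change in both directions, the win rate can stay well below that threshold even for a 1.0 BLEU difference.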
 +
 +**Selection of the sentences for the analysis**
 +  * Because of the filtering, they only select cases with bigger differences -- this skews the selection toward sentences where BLEU changes more
 +    * This can lead to an improvement in "big" things (such as selection of content words), but a worsening in "small" things (grammatical words, inflection)
 +
 +**BLEU deficiency**
 +  * The authors argue that some of the sentences are not worsened, but since the changed words do not appear in the reference, BLEU scoring hurts their system
 +  * We believe this argument is misleading:
 +    * The baseline has the same problem
 +    * The human translators use a different expression in the reference for a reason (even if the meaning is roughly the same, there can be style differences etc.)
 +    * We must be careful when we criticize BLEU -- it is all too easy to find single sentences where it failed
 +      * It's always better to back up our argument by human rankings
 +    * Why didn't they run METEOR or another metric, instead of leaving it for future work?
 +
 +==== Sec. 6. Conclusions ====
 +**Structural variation moderation**
 +  * Sounds a bit sci-fi, but very interesting
 +
 +**Choice of discourse context**
 +  * It's true that choosing just the document as the context works well for news articles, but not for most of the content we wish to translate
 +  * A domain feature, topic modelling or word classes should be worth trying
 +
 +===== Our conclusion =====
 +
 +Nice paper with a very good idea that can probably improve translations, but with several arguments that are not backed up by sufficient evidence or are clearly misleading. The initial analysis is done in a very precise and detailed way. The actual translation experiments show that adding new features helps, but they lack some obvious steps, such as significance testing or actually proving that BLEU is wrong by using METEOR or a similar metric.
 +
 +===== Martin's remarks =====
 +  * The approach (without modifications) does not seem to be suitable for translating into a morphologically rich language. Different forms of the same lemma would be considered different senses (unless grouped together because half of their characters are the same), so the system would produce e.g. only nominatives.
 +  * Also, there should be a modification for source-side words with more than one possible PoS. E.g. "book" as a noun should be translated differently than as a verb, and you can easily find both in one document.
