Statistical Significance Tests for Machine Translation Evaluation

Koehn, EMNLP 2004

Questions

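1) BLEU_MT1 = 1, BLEU_MT2 = 0 (or undefined)
BLEU_MT3 = 0.2 according to the formula in the paper, which is incorrect: 0.2 equals the plain product of the four n-gram precisions, 4/5 * 3/4 * 2/3 * 1/2.
It should be their geometric mean: exp(1/4 (ln(4/5) + ln(3/4) + ln(2/3) + ln(1/2))) ≈ 0.669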
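
The arithmetic for MT3 can be checked with a few lines of Python. This is only a sketch: it takes the four n-gram precisions from the answer above as given, and it assumes the brevity penalty is 1, so it is left out. The variable names are illustrative, not from the paper.

```python
import math

# Modified n-gram precisions for MT3 as given in the notes (1-gram to 4-gram).
precisions = [4 / 5, 3 / 4, 2 / 3, 1 / 2]

# Plain product of the precisions: the value the notes call incorrect.
product = math.prod(precisions)

# Geometric mean: exp of the average log precision (uniform weights 1/4).
# Assumption: brevity penalty is 1, so it is omitted here.
geo_mean = math.exp(sum(math.log(p) for p in precisions) / len(precisions))

print(round(product, 3))   # 0.2
print(round(geo_mean, 3))  # 0.669
```

The geometric mean is the fourth root of the product, which is why the two values differ so much.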

2) We should somehow sample the corpus, e.g. take every k-th sentence or draw an entirely random sample.
This might, however, cause problems for systems that try to benefit from broader context features (features that go beyond the sentence, e.g. to promote discourse coherence), so it may be better to sample batches of consecutive sentences (e.g. 10 at a time).
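
A minimal sketch of the batch idea, assuming the corpus is addressed by sentence index and that batches are non-overlapping and drawn without replacement. The function name `sample_blocks` and its parameters are hypothetical, chosen only for this illustration.

```python
import random

def sample_blocks(num_sentences, block_size=10, num_blocks=30, seed=0):
    """Return sentence indices from randomly chosen blocks of consecutive sentences."""
    rng = random.Random(seed)
    # Starting offsets of all full, non-overlapping blocks.
    starts = list(range(0, num_sentences - block_size + 1, block_size))
    # Draw blocks without replacement and keep them in corpus order.
    chosen = sorted(rng.sample(starts, num_blocks))
    return [i for s in chosen for i in range(s, s + block_size)]

indices = sample_blocks(num_sentences=30000)
print(len(indices))  # 300
```

Each sampled sentence keeps its neighbours within the batch, so context beyond the sentence is preserved at least inside a batch, though not across batch boundaries.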

Presentation

Philipp Koehn's paper, so the MT system is probably Pharaoh (predecessor of Moses).

How to obtain multiple translation systems? Translate into English, from a number of different languages (trained on Europarl).

Initial experiment: divide 30000 translated sentences into consecutive chunks of 300 sentences (100 test sets). BLEU scores measured on the individual test sets then vary quite a bit.
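
The chunking in this experiment could be sketched as follows. The sentence list here is stand-in data (plain integers instead of real translations), and `score_fn` is a hypothetical callable that would compute corpus-level BLEU for one test set; neither comes from the paper.

```python
import statistics

def consecutive_chunks(items, chunk_size):
    """Split items into consecutive, non-overlapping chunks of chunk_size."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]

def score_spread(test_sets, score_fn):
    """Score every test set and report minimum, maximum and standard deviation."""
    scores = [score_fn(ts) for ts in test_sets]
    return min(scores), max(scores), statistics.stdev(scores)

# Stand-in data: sentence IDs instead of real (hypothesis, reference) pairs.
sentences = list(range(30000))
test_sets = consecutive_chunks(sentences, 300)
print(len(test_sets))  # 100
```

Calling `score_spread(test_sets, score_fn)` with a real BLEU implementation as `score_fn` would give a rough picture of how much the score varies from one 300-sentence test set to the next.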
