
Institute of Formal and Applied Linguistics Wiki





BLEU: a Method for Automatic Evaluation of Machine Translation

written by Kishore Papineni, Salim Roukos, Todd Ward and Wei-Jing Zhu (IBM T. J. Watson Research Center)

spoken by Jindřich Libovický

reported by Petra Galuščáková

Introduction

The presented paper is one of the fundamental papers on machine translation (MT) evaluation. It describes the BLEU score, which is probably the most widely used metric for MT evaluation today. Although the paper was written in 2001 and a lot of work has been done on MT evaluation since then, it is still frequently cited.

Notes

The BLEU score is based on comparing an automatic (candidate) translation against reference human translations. Basically, the counts of n-grams shared between the candidate translation and the reference translations are computed and divided by the total number of candidate n-grams. This n-gram precision is further modified: if a particular n-gram occurs more often in the candidate translation than in the references, its count is clipped to the maximum count of that n-gram in a single reference translation. The BLEU score is then calculated as a linear average of these modified precisions. A brevity penalty is applied to penalize translations that are shorter than the reference translations.

No, it's not a “linear average of these modified precisions”; it's an “arithmetic average of the logarithms of the modified precisions”, in other words a “geometric average of the modified precisions”. See Section 2.1.3. — Martin Popel
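The scheme described above (clipped n-gram precisions combined by a geometric average, times a brevity penalty) can be sketched roughly as follows. This is a simplified sentence-level illustration, not the paper's exact corpus-level formulation, and all function names here are ours:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """All n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def modified_precision(candidate, references, n):
    """Clipped n-gram precision: each candidate n-gram count is capped
    by its maximum count in any single reference translation."""
    cand_counts = Counter(ngrams(candidate, n))
    max_ref_counts = Counter()
    for ref in references:
        for ng, c in Counter(ngrams(ref, n)).items():
            max_ref_counts[ng] = max(max_ref_counts[ng], c)
    clipped = sum(min(c, max_ref_counts[ng]) for ng, c in cand_counts.items())
    total = sum(cand_counts.values())
    return clipped / total if total else 0.0

def bleu(candidate, references, max_n=4):
    """Geometric average of the modified 1..max_n-gram precisions,
    multiplied by the brevity penalty (no smoothing)."""
    precisions = [modified_precision(candidate, references, n)
                  for n in range(1, max_n + 1)]
    if min(precisions) == 0.0:
        return 0.0
    log_avg = sum(math.log(p) for p in precisions) / max_n
    # Brevity penalty: candidate length vs. the closest reference length.
    c = len(candidate)
    r = min((abs(len(ref) - c), len(ref)) for ref in references)[1]
    bp = 1.0 if c > r else math.exp(1.0 - r / c)
    return bp * math.exp(log_avg)
```

For instance, a candidate identical to one of its references scores 1.0, while a candidate that just repeats a common word gets its unigram counts clipped to the reference counts.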

Jindřich noticed a mistake in Section 2, where it is written that the phrase “of the party” is shared only with Reference 2; in fact, it is also shared with Reference 3.

Another problem that was discussed was found in Section 2.2.2. For example, suppose we have three reference translations of lengths 12, 15 and 17 words, and our candidate translation is 14 words long. Then, according to the article, our translation is penalized, because the closest reference length is 15, despite the fact that a shorter reference translation (12 words) also exists. This seemed a bit suspicious.
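For concreteness, the “best match length” brevity penalty applied to the example above can be sketched like this (the tie-breaking rule is our own choice; the paper does not specify one):

```python
import math

def brevity_penalty(cand_len, ref_lens):
    """Brevity penalty using the reference length closest to the
    candidate length (ties broken toward the shorter reference,
    which is an assumption of ours)."""
    r = min(ref_lens, key=lambda rl: (abs(rl - cand_len), rl))
    return 1.0 if cand_len > r else math.exp(1.0 - r / cand_len)

# The case discussed above: references of 12, 15 and 17 words,
# candidate of 14 words. The closest reference length is 15, so
# BP < 1 even though a 12-word reference exists.
print(brevity_penalty(14, [12, 15, 17]))
```

A candidate of exactly 12 words would get no penalty, since its closest reference length is 12.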

The experiments performed show a high correlation between manual and automatic ranking of translation systems. BLEU is able to distinguish between good and bad translations, and between translations created by a human and by an automatic system.
A shortcoming of the paper is that these experiments are performed only for English. As later papers show, the BLEU score works worse especially for languages with free word order and for morphologically rich languages.

Conclusion

The paper was well presented, and the discussion brought up several interesting questions on points that were not clear from the paper. The paper is very interesting and readable, and it is useful especially for a better understanding of the BLEU score.

