You can skip sections 4 and 5 in the paper.
1) Section 2.1.1 defines p_n as a fraction whose denominator is “the number of candidate n-grams in the test corpus”.
Compute this denominator for p_3 and a test corpus with three sentences with lengths 3, 4 and 5.
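As a way to check a hand count, here is a minimal sketch of counting candidate n-grams from sentence lengths alone. It assumes that a sentence of L tokens contributes L - n + 1 n-grams (or none if it is shorter than n) and that n-grams do not cross sentence boundaries. The function name candidate_ngram_count is made up for illustration, and the lengths in the example are deliberately not the ones from the exercise.

```python
def candidate_ngram_count(sentence_lengths, n):
    """Total number of n-grams over candidate sentences with the given token lengths.

    Assumption: n-grams are counted within each sentence separately,
    so a sentence of length L yields max(0, L - n + 1) n-grams.
    """
    return sum(max(0, length - n + 1) for length in sentence_lengths)


# Illustrative lengths only (not those from the exercise):
# a 2-token sentence has no trigrams, a 6-token sentence has 4.
print(candidate_ngram_count([2, 6], n=3))  # -> 4
```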
2) Do we need source-language sentences for computing BLEU?
3) Consider a source corpus with two sentences:
Die Katze ist auf der Matte
Lesegruppe ist meine Lieblingsklasse
Reference translation 1:
The cat is on the mat
Reading group is my favourite class
Reference translation 2:
There is a cat on the mat
I love RG
Machine translation:
cat is cat
Reading group is my nightmare
Compute BLEU and the brevity penalty (BP) of the machine translation against the two references.
Use the standard BLEU definition, i.e. case-insensitive matching, N=4, w_n=1/4, and log(x) denoting the natural logarithm ln(x).
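For checking a hand computation, the following is a rough sketch of corpus-level BLEU under the settings above, not a reference implementation. It assumes whitespace tokenization, lowercasing, clipped n-gram counts pooled over the whole corpus, and an effective reference length r taken as the reference length closest to each candidate's length, with ties broken toward the shorter reference. The tokenization and the tie-breaking rule are assumptions, and the names ngrams and corpus_bleu are made up for illustration. The example sentences are not the ones from the exercise.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Multiset of n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Return (BLEU, BP) for candidate strings and per-candidate lists of reference strings."""
    matched = [0] * max_n  # clipped n-gram matches, pooled over the corpus
    total = [0] * max_n    # candidate n-gram counts, pooled over the corpus
    cand_len = 0           # c: total candidate length
    ref_len = 0            # r: total effective reference length

    for cand, refs in zip(candidates, references):
        c = cand.lower().split()
        rs = [r.lower().split() for r in refs]
        cand_len += len(c)
        # Assumption: closest reference length, ties broken toward the shorter one.
        ref_len += min((len(r) for r in rs), key=lambda l: (abs(l - len(c)), l))
        for n in range(1, max_n + 1):
            cand_counts = ngrams(c, n)
            max_ref = Counter()
            for r in rs:
                max_ref |= ngrams(r, n)      # element-wise maximum over references
            clipped = cand_counts & max_ref  # element-wise minimum = clipping
            matched[n - 1] += sum(clipped.values())
            total[n - 1] += sum(cand_counts.values())

    if cand_len == 0:
        return 0.0, 0.0
    bp = 1.0 if cand_len > ref_len else math.exp(1 - ref_len / cand_len)
    if min(matched) == 0:
        # Some p_n is zero, so its logarithm is undefined and BLEU is taken as 0.
        return 0.0, bp
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    return bp * math.exp(log_precision), bp


# Illustrative data only: every candidate n-gram matches, so BLEU equals BP = exp(1 - 6/5).
bleu, bp = corpus_bleu(["the cat sat on the"], [["The cat sat on the mat"]])
print(bleu, bp)
```

Note that the precisions are pooled over all sentences before the geometric mean is taken, so the result generally differs from an average of per-sentence scores.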
4) We computed a BLEU score for a given test set with three reference translations.
Then a new reference translation became available,
so we computed a new BLEU score for the same test set with four references (three old, one new).
Can the new BLEU score be lower than the old score? Can it be higher? Why?
5) Can you think of any problems with the BLEU metric (for Czech or any other language)? Name them.