You can skip sections 4 and 5 in the paper.

1) Section 2.1.1 defines p_n as a fraction whose denominator is “*the number of candidate n-grams in the test corpus*”.

Compute this denominator for **p_3** and a test corpus of three candidate sentences with lengths **3**, **4** and **5** words.
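
As a minimal sketch of how such a denominator can be counted: a sentence of length L contains max(L - n + 1, 0) n-grams, and the corpus-level count is the sum over all candidate sentences. The function name `count_candidate_ngrams` is hypothetical, and the example deliberately uses different lengths from the exercise.

```python
def count_candidate_ngrams(sentence_lengths, n):
    """Total number of n-grams in a corpus, given each sentence's length in tokens.

    A sentence shorter than n contributes no n-grams.
    """
    return sum(max(length - n + 1, 0) for length in sentence_lengths)


# Toy corpus (not the one from the exercise): sentences of 2 and 6 tokens.
# The 2-token sentence has no trigrams; the 6-token sentence has 4.
print(count_candidate_ngrams([2, 6], 3))
```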

2) Do we need source-language sentences for computing BLEU?

3) Consider a corpus with two German source sentences:

Die Katze ist auf der Matte

Lesegruppe ist meine Lieblingsklasse

*Reference translation 1:*

The cat is on the mat

Reading group is my favourite class

*Reference translation 2:*

There is a cat on the mat

I love RG

*Machine translation:*

cat is cat

Reading group is my nightmare

Compute **BLEU** and **BP** of the machine translation compared to the two references.

Use the standard BLEU definition, i.e. *case insensitive*, *N=4*, *w_n=1/4*, *log(x)* is the natural logarithm (*ln(x)*).
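
To check a hand computation, the definition above can be sketched in a few lines of Python. This is an illustrative sketch, not a reference implementation. The function names `ngrams` and `corpus_bleu` are hypothetical, tokenization is plain whitespace splitting, and the effective reference length uses the reference closest in length to each candidate, with ties broken toward the shorter one. Returning 0 when any p_n is 0 is also an assumption, made to avoid log(0).

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def corpus_bleu(candidates, references, max_n=4):
    """Return (BLEU, BP) for a corpus.

    candidates: list of candidate sentences (strings).
    references: list of lists of reference sentences, one list per candidate.
    """
    # Case-insensitive, whitespace tokenization (an assumption of this sketch).
    cands = [c.lower().split() for c in candidates]
    refs = [[r.lower().split() for r in rs] for rs in references]

    c_len = sum(len(c) for c in cands)
    if c_len == 0:
        return 0.0, 0.0
    # Effective reference length: per sentence, the closest reference length
    # (ties go to the shorter reference).
    r_len = sum(
        min((abs(len(r) - len(c)), len(r)) for r in rs)[1]
        for c, rs in zip(cands, refs)
    )
    bp = 1.0 if c_len > r_len else math.exp(1 - r_len / c_len)

    log_sum = 0.0
    for n in range(1, max_n + 1):
        matched = total = 0
        for c, rs in zip(cands, refs):
            cand_counts = ngrams(c, n)
            max_ref = Counter()
            for r in rs:
                max_ref |= ngrams(r, n)  # element-wise maximum over references
            # Clip each candidate n-gram count by its maximum reference count.
            matched += sum(min(cnt, max_ref[g]) for g, cnt in cand_counts.items())
            total += sum(cand_counts.values())
        if matched == 0:
            return 0.0, bp  # assumption: a zero precision gives BLEU = 0
        log_sum += math.log(matched / total) / max_n  # uniform weights w_n = 1/N

    return bp * math.exp(log_sum), bp


# Toy data (not the exercise corpus): one candidate, two references.
bleu, bp = corpus_bleu(
    ["the cat sat on the mat"],
    [["the cat sat on the mat", "a cat was on the mat"]],
)
print(bleu, bp)
```

Working at corpus level, that is, summing clipped counts and lengths over all sentences before dividing, follows the paper's definition. Averaging per-sentence scores would generally give a different number.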

4) We computed a BLEU score for a given test set with three reference translations.

Then a new reference translation became available, so we computed a new BLEU score for the same test set with four references (three old, one new).

Can the new BLEU score be lower than the old score? Can it be higher? Why?

5) Can you think of any problems with the BLEU metric (for Czech or any other language)? Name them.