Statistical Significance Tests for Machine Translation Evaluation

Koehn, EMNLP 2004

Questions

1) BLEU_MT1 = 1, BLEU_MT2 = 0 (or undefined)
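BLEU_MT3 = 0.2 according to the formula in the paper, which is incorrect: 0.2 is only the plain product of the four n-gram precisions.
It should be the geometric mean: exp(1/4 (ln(4/5) + ln(3/4) + ln(2/3) + ln(1/2))) = 0.2^(1/4) ≈ 0.669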
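
The arithmetic can be checked with a few lines of Python. This is only a sketch: the function name is ours, and it assumes a brevity penalty of 1, as the calculation above implicitly does.

```python
import math

def geometric_mean_precision(precisions):
    """Geometric mean of n-gram precisions (BLEU without the brevity penalty)."""
    if any(p == 0 for p in precisions):
        return 0.0  # a zero precision makes the log undefined; score it as 0
    return math.exp(sum(math.log(p) for p in precisions) / len(precisions))

# The 1- to 4-gram precisions of MT3 from the answer above.
precisions = [4 / 5, 3 / 4, 2 / 3, 1 / 2]

product = math.prod(precisions)              # what the formula in the paper yields
bleu = geometric_mean_precision(precisions)  # the corrected value

print(f"product = {product:.3f}, geometric mean = {bleu:.3f}")
```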

2) We should somehow sample the corpus (maybe take every k-th sentence, or create entirely random samples).
This might, however, cause problems for systems which try to benefit from broader context features (features that go beyond the sentence, e.g. to promote discourse coherence), so maybe sample batches of consecutive sentences instead (e.g. 10 at a time).

Presentation

Introduction

Philipp Koehn's paper, so the MT system is probably Pharaoh (predecessor of Moses).

How to obtain multiple translation systems? Translate into English, from a number of different languages (trained on Europarl).

Section 3

Initial experiment: divide 30000 translated sentences into consecutive chunks of 300 sentences (100 test sets). BLEUs measured on individual test sets then vary quite a bit.

For comparison, broad sampling also creates 100 test sets, but by striding through the whole corpus:
1, 101, 201,… test set 1
2, 102, 202,… test set 2

BLEU scores become more stable ⇒ this procedure leads to a more representative test set.
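
The two ways of building test sets can be sketched as index lists. The function names and the toy corpus size (12 sentences, 3 test sets) are ours, chosen only for illustration; indices are zero-based.

```python
def consecutive_test_sets(n_sentences, n_sets):
    """Split sentence indices into n_sets contiguous chunks."""
    size = n_sentences // n_sets
    return [list(range(k * size, (k + 1) * size)) for k in range(n_sets)]

def broad_test_sets(n_sentences, n_sets):
    """Test set k takes every n_sets-th sentence, starting at sentence k."""
    return [list(range(k, n_sentences, n_sets)) for k in range(n_sets)]

chunks = consecutive_test_sets(12, 3)
broad = broad_test_sets(12, 3)

print(chunks)  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]
print(broad)   # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

Each broad test set touches every region of the corpus, which is why its score is less sensitive to local differences in difficulty.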

We are not sure whether this can really be generalized – we only have two very similar systems (identical systems, only trained on different data).

Section 4

Confidence intervals estimate a range in which the true system score lies. Student's t-distribution is used in place of the normal distribution, which the sentence scores are assumed to follow (that is our assumption).

We are interested in the mean of sentence scores, which (given enough samples) is normally distributed according to the Central Limit Theorem, so the assumption is OK.
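
A minimal sketch of such an interval for the mean of per-sentence scores, using only the standard library. One simplification is ours: with hundreds of sentences the t quantile is practically equal to the normal quantile, so the sketch takes the critical value from `statistics.NormalDist` instead of a t table. The function name is hypothetical.

```python
import math
import statistics

def mean_confidence_interval(scores, confidence=0.95):
    """Interval for the mean of per-sentence scores (normal approximation)."""
    n = len(scores)
    mean = statistics.fmean(scores)
    std_err = statistics.stdev(scores) / math.sqrt(n)  # standard error of the mean
    # Two-sided critical value; about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return mean - z * std_err, mean + z * std_err
```

For small test sets the normal quantile should be replaced by the t quantile with n - 1 degrees of freedom, which widens the interval.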

Section 5

We cannot use the t-distribution for BLEU because it cannot be factorised like the metric in Section 4: BLEU computed on the full test set != average of per-sentence BLEU scores.
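
A toy example makes the difference visible. As a simplified stand-in for BLEU we use plain unigram precision, and the (matches, total) counts for the two sentences are invented.

```python
# Per-sentence (matching n-grams, total n-grams); invented numbers.
sentence_stats = [(1, 2), (9, 10)]

# Corpus-level score: pool the counts first, then divide.
corpus_level = sum(m for m, _ in sentence_stats) / sum(t for _, t in sentence_stats)

# Average of per-sentence scores: divide first, then average.
sentence_average = sum(m / t for m, t in sentence_stats) / len(sentence_stats)

print(f"corpus-level = {corpus_level:.3f}, sentence average = {sentence_average:.3f}")
```

The pooled score (10/12) weights the long sentence more heavily than the plain average (0.7) does, so the two disagree.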

We can (reliably) estimate confidence intervals by translating a large number of test sets and observing the mean and variance. But we don't want to do that.

The key assumption: drawing sentences from one test set with replacement K times is as good as having K different test sets (for estimating the mean and variance of the BLEU score).

The generated test sets are of the same size as the original test set. We don't want to draw fewer sentences because that would weaken our test.

Bootstrap resampling gives us K BLEU scores. We sort them and discard the top and bottom 2.5% (so 95% of the scores remain). The minimum and maximum of the remaining scores then define the confidence interval.
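
The procedure can be sketched as follows. The names are ours: `corpus_score` stands for any corpus-level metric (BLEU in the paper), and the toy metric below pools invented (matches, total) counts instead of computing real BLEU. The fixed seed is only there to make the sketch reproducible.

```python
import random

def bootstrap_interval(sentence_stats, corpus_score, n_resamples=1000,
                       confidence=0.95, seed=0):
    """Confidence interval for a corpus-level score via bootstrap resampling."""
    rng = random.Random(seed)
    n = len(sentence_stats)
    scores = []
    for _ in range(n_resamples):
        # Draw n sentences with replacement: same size as the original test set.
        resample = [sentence_stats[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_score(resample))
    scores.sort()
    cut = round(n_resamples * (1 - confidence) / 2)  # 2.5% at each end for 95%
    kept = scores[cut:n_resamples - cut]
    return kept[0], kept[-1]

def pooled_precision(stats):
    """Toy corpus-level metric: pooled matches divided by pooled totals."""
    return sum(m for m, _ in stats) / sum(t for _, t in stats)
```

A call would look like `bootstrap_interval(stats, pooled_precision)`, where `stats` holds one (matches, total) pair per sentence.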

We do not assume that the scores are normally distributed (answering Question 3), although they will be normally distributed anyway (we don't know whether we could prove it, though). Bootstrap resampling is distribution-independent in general.

Section 6

Bootstrap resampling for pairwise comparisons. Compute the BLEU scores of both systems on each bootstrapped test set. If A beats B on 950 out of 1000 test sets, then A is better with 95% confidence.
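
A sketch of the paired variant, under the same assumptions as any toy example: the function names are ours and the metric pools invented (matches, total) counts rather than computing real BLEU. The important detail is that both systems are scored on the same resampled sentence indices.

```python
import random

def paired_bootstrap_wins(stats_a, stats_b, corpus_score,
                          n_resamples=1000, seed=0):
    """Fraction of bootstrapped test sets on which system A outscores system B."""
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(n_resamples):
        # One set of indices per resample, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = corpus_score([stats_a[i] for i in idx])
        score_b = corpus_score([stats_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / n_resamples

def pooled_precision(stats):
    """Toy corpus-level metric: pooled matches divided by pooled totals."""
    return sum(m for m, _ in stats) / sum(t for _, t in stats)
```

A return value of 0.95 corresponds to "A beats B on 950 out of 1000 test sets". Ties count against A here, which is a choice of this sketch.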

Answering Question 4: A is better with 97% confidence (and at least as good with 98% confidence).

We don't think this corresponds to confidence: it is just the proportion of times that A beat B, a kind of maximum-likelihood estimate. In fact, we could use the differences between the systems' scores as input for a paired t-test (and get a true confidence interval).
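
The test statistic of such a paired t-test is easy to compute by hand. In this sketch the function name is ours, and we assume the paired scores come from independent test sets, which the notes do not specify.

```python
import math
import statistics

def paired_t_statistic(scores_a, scores_b):
    """t statistic and degrees of freedom for paired score differences."""
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(n)  # standard error of the mean difference
    return statistics.fmean(diffs) / std_err, n - 1
```

The statistic is then compared against Student's t-distribution with n - 1 degrees of freedom, for example with a table or a library routine such as SciPy's `ttest_rel`.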

Notes on the p-value: the concepts of confidence intervals and significance testing were developed independently by different researchers and were only later merged. The p-value is often misunderstood and misused.

Question 5: let's define X as the population (all possible sentences). We have 2 systems, A and B, evaluated on a sample S.
The test statistic T is the difference between A and B, T = delta(score(A), score(B)).
The null hypothesis is “the true score of A is equal to that of B”.
The p-value is, in general, the probability of observing data at least as extreme as ours given that the null hypothesis holds. Formally:

p-value = P(T(X) >= T(S) | H_0)

We do not prove our (alternative) hypothesis – we can (with some confidence) reject the null hypothesis.

The p-value is not the probability of the null hypothesis. It only says how (im)probable the data is given H_0, i.e. P(S|H_0), not P(H_0|S).