# Questions

## Question 1

REF: John thinks he loves Mary

MT1: John thinks he loves Mary

MT2: John knows he loves Mary

MT3: John thinks he loves RG

Given a test corpus with this one sentence, what are the BLEU scores of the three systems based on formulas (1) and (2)?

## Question 2

Imagine you are designing an MT shared task and you have a parallel corpus with 1 million sentences (e.g. Europarl). Which sentences will you select for the test set?

## Question 3

Does bootstrap resampling (Section 5) assume a normal (Gaussian) distribution of the sample scores?

## Question 4

We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,-950,-900,-850 … -50,0,0,0,0,0,0,0,0,0,0,1,2,3 … 970.

Based on Section 6, which system is better - A or B?

With what significance level?

Higher score means better system.

You do not need a computer to answer this question, but you can try:

perl -E 'say join",",(map {$_*50}(-20..-1)),(0) x 10, 1..970'

Of course, such a bootstrap result is very strange, but take it for granted for the sake of this quiz.

## Question 5

In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (http://en.wikipedia.org/wiki/P-value)

Can you reformulate Section 4 using this view? What is the observed test statistic and what is the null hypothesis?

# Presentation

- We answered:
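- Question 1 - BLEU scores: MT1 = 1.0; MT2 = not defined, because it has no matching 4-gram and log 0 is undefined (0.0 or some smoothed value in practice); MT3 = 0.2 (based on the incorrect formula in the paper, which is missing the factor 1/4; with that factor the score would be 0.2^(1/4) ≈ 0.67)
- Question 2 - broad sampling: pick sentences spread far apart across the corpus → {data_1, data_101, data_201, …}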
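
The Question 1 numbers can be checked with a minimal sketch like the one below. It assumes whitespace tokenization, a single reference, n-grams up to length 4 and the standard brevity penalty (which equals 1 here, since all sentences have five tokens). The helper names and the `average` flag are illustrative assumptions: `average=False` mimics the formula without the 1/4 factor, `average=True` includes it.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count all n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def precisions(hyp, ref, max_n=4):
    """Clipped n-gram precisions p_1..p_max_n of hyp against one reference."""
    result = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        result.append(matched / sum(hyp_counts.values()))
    return result

def bleu(hyp, ref, average=True):
    """Sentence-level BLEU; returns None when some precision is 0 (log undefined)."""
    hyp_tokens, ref_tokens = hyp.split(), ref.split()
    p = precisions(hyp_tokens, ref_tokens)
    if min(p) == 0:
        return None
    brevity_penalty = min(1.0, math.exp(1 - len(ref_tokens) / len(hyp_tokens)))
    log_sum = sum(math.log(x) for x in p)
    # average=False reproduces the variant that omits the 1/4 factor
    return brevity_penalty * math.exp(log_sum / len(p) if average else log_sum)

REF = "John thinks he loves Mary"
for name, hyp in [("MT1", "John thinks he loves Mary"),
                  ("MT2", "John knows he loves Mary"),
                  ("MT3", "John thinks he loves RG")]:
    print(name, bleu(hyp, REF, average=False), bleu(hyp, REF, average=True))
```

For MT3 the four precisions are 4/5, 3/4, 2/3 and 1/2, whose product is 0.2.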

## Section 3

- motivation: we don't usually have 30k sentences for testing, so we need an approximate method to obtain reliable scores
- method: divide test set into 100 smaller test sets (300 sentences each)
- consecutive samples - the BLEU score of the individual sets varies within ±8 %
- non-consecutive samples (spread far apart) - BLEU varies much less across the sets, within ±1.5 %

- they assume that comparing the output of 2 different MT systems is no different from comparing the output of 1 MT system trained on different data
- Lukas Zilka objected to this assumption: they should have run experiments to support the claim, since nothing suggests we can generalize like that
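
The two ways of cutting a 30k-sentence test set into 100 smaller sets can be sketched as follows. This is only an illustration: the function names are made up for this example, and integers stand in for sentences.

```python
def consecutive_split(sentences, num_sets):
    """Each set is one contiguous block of the corpus."""
    size = len(sentences) // num_sets
    return [sentences[i * size:(i + 1) * size] for i in range(num_sets)]

def broad_split(sentences, num_sets):
    """Set i takes sentences i, i + num_sets, i + 2 * num_sets, ..."""
    return [sentences[i::num_sets] for i in range(num_sets)]

corpus = list(range(30000))  # integers stand in for sentences
consecutive = consecutive_split(corpus, 100)
broad = broad_split(corpus, 100)
print(len(consecutive[0]), consecutive[0][:3])
print(len(broad[0]), broad[0][:3])
```

The broad split corresponds to the {data_1, data_101, data_201, …} pattern from the answer to Question 2 (with 0-based indices here).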

## Section 4, 5

- we cannot use Student's t-distribution to estimate a confidence interval for BLEU, because BLEU is not a sum of per-sentence terms from which we could compute a mean and variance
- so to estimate confidence intervals we use randomized test set generation (= bootstrap resampling): e.g. from our small test set of 300 sentences we build 1000 new test sets of 300 sentences each, drawing sentences with replacement, so the 1000 test sets should differ from each other
- answer to Question 3 - they do not assume any particular distribution of the BLEU scores over the 1000 test sets (i.e. the method works regardless of whether the distribution is normal, uniform or anything else), although it may well be approximately normal
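
A minimal sketch of the resampling procedure described above, under several assumptions: a toy corpus-level score (matched words divided by total words) stands in for BLEU, the per-sentence statistics are randomly generated, and the interval is read off as percentiles of the sorted scores. All names are hypothetical.

```python
import random

def corpus_score(stats):
    """Toy corpus-level metric: a ratio of sums, so not a mean of sentence scores."""
    matched = sum(m for m, _ in stats)
    total = sum(t for _, t in stats)
    return matched / total

def bootstrap_interval(stats, num_resamples=1000, level=0.95, seed=0):
    """Percentile confidence interval from test sets resampled with replacement."""
    rng = random.Random(seed)
    n = len(stats)
    scores = sorted(
        corpus_score([stats[rng.randrange(n)] for _ in range(n)])
        for _ in range(num_resamples)
    )
    # Drop the same number of scores at each end, e.g. 25 + 25 out of 1000.
    drop = round(num_resamples * (1 - level) / 2)
    return scores[drop], scores[num_resamples - drop - 1]

# Made-up (matched, total) word counts for 300 sentences.
data_rng = random.Random(1)
toy_stats = []
for _ in range(300):
    total = data_rng.randint(5, 30)
    toy_stats.append((data_rng.randint(0, total), total))

low, high = bootstrap_interval(toy_stats)
print(f"score = {corpus_score(toy_stats):.3f}, 95% interval = [{low:.3f}, {high:.3f}]")
```

Nothing in this procedure refers to a normal distribution; the interval comes directly from the empirical distribution of the resampled scores.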

## Section 6

- they use bootstrap resampling to compare 2 systems: to determine whether system 1 is better than system 2, we look at the set of differences between the systems' scores (score of system 1 minus score of system 2), one difference per resampled test set
- we then count in what percentage of the resampled test sets system 1 beats system 2, and that is our confidence that system 1 is better than system 2 (e.g. 45 times out of 50 → 90% confidence)

- the rest of the paper just proves that the assumption is correct
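
The system comparison from this section can be sketched as below, assuming both systems are scored on the same resampled sentence indices (a paired comparison) and again using a toy corpus-level score instead of BLEU. Function names and data are hypothetical.

```python
import random

def corpus_score(stats):
    """Toy corpus-level metric: matched words divided by total words."""
    return sum(m for m, _ in stats) / sum(t for _, t in stats)

def paired_bootstrap(stats_1, stats_2, num_resamples=1000, seed=0):
    """Fraction of resampled test sets on which system 1 strictly beats system 2."""
    assert len(stats_1) == len(stats_2)
    rng = random.Random(seed)
    n = len(stats_1)
    wins = 0
    for _ in range(num_resamples):
        # The same sentence indices are used for both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        score_1 = corpus_score([stats_1[i] for i in idx])
        score_2 = corpus_score([stats_2[i] for i in idx])
        if score_1 > score_2:
            wins += 1
    return wins / num_resamples

# Made-up per-sentence (matched, total) counts for two systems.
data_rng = random.Random(2)
system_1, system_2 = [], []
for _ in range(300):
    total = data_rng.randint(5, 30)
    system_1.append((data_rng.randint(0, total), total))
    system_2.append((data_rng.randint(0, total), total))

print("confidence that system 1 is better:", paired_bootstrap(system_1, system_2))
```

Ties count against system 1 in this sketch; that is one possible convention, not something the notes specify.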

## Martin's explanation of p-values

- two philosophical views of the p-value - Fisher's and Pearson's - unfortunately they are mixed together in modern textbooks, which only confuses us
- we usually set the null hypothesis H0 as: the systems are the same, and the alternative hypothesis HA as: the systems differ; P(H0) + P(HA) = 1
- p-value = P(T(X) >= T(x_orig) | H0), loosely P(x | H0): *if the compared systems are the same, what is the probability that we see this data?*
- unfortunately we tend to view the p-value as P(H0 | x), which it is not; we need to apply Bayes' theorem to get that

- bootstrap resampling can be viewed as p-value = P(d(x) < 0 | H0) = P(d(x) > 2*d(x_orig) | H0), approximated by S/B, where S is the number of resampled test sets on which system 2 beats system 1 and B is the total number of resampled test sets
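
As an illustration, the S/B estimate can be computed for the 1000 differences scoreA-scoreB listed in Question 4, taking A as system 1 and B as system 2. How the ten zero differences (ties) should be counted is an assumption, so both conventions are shown.

```python
# The 1000 bootstrap differences scoreA - scoreB from Question 4.
diffs = [50 * k for k in range(-20, 0)] + [0] * 10 + list(range(1, 971))

num_samples = len(diffs)              # B in the notes
s = sum(d < 0 for d in diffs)         # S: samples where system B beats system A
ties = sum(d == 0 for d in diffs)
wins_a = sum(d > 0 for d in diffs)

print("A beats B in", wins_a / num_samples, "of the samples")
print("S/B with ties ignored:", s / num_samples)
print("S/B with ties counted against A:", (s + ties) / num_samples)
```

Under either convention the estimate is below 0.05, so by the Section 6 rule A would be judged better at the 95% level (A wins in 97% of the samples).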