REF: John thinks he loves Mary
MT1: John thinks he loves Mary
MT2: John knows he loves Mary
MT3: John thinks he loves RG
Given a test corpus with this one sentence, what are the BLEU scores of the three systems based on formulas (1) and (2)?
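As a starting point, the n-gram matches of each system against the reference can be counted mechanically. The sketch below is only an illustration: it assumes whitespace tokenization and the usual clipped (modified) n-gram matching, which may differ in detail from formulas (1) and (2). The names ngrams, clipped_matches and match_table are hypothetical helpers, not part of any library. Turning the counts into BLEU scores is left to you.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all n-grams of the token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(hyp, ref, n):
    """Return (matched, total) n-gram counts for one hypothesis.

    A hypothesis n-gram is credited at most as many times as it
    occurs in the reference (assumed clipping, as in standard BLEU).
    """
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total

# The one-sentence test corpus from the question (whitespace tokenized).
REF = "John thinks he loves Mary".split()
SYSTEMS = {
    "MT1": "John thinks he loves Mary".split(),
    "MT2": "John knows he loves Mary".split(),
    "MT3": "John thinks he loves RG".split(),
}

def match_table():
    """Map each system to its (matched, total) pairs for n = 1..4."""
    return {
        name: [clipped_matches(hyp, REF, n) for n in range(1, 5)]
        for name, hyp in SYSTEMS.items()
    }

if __name__ == "__main__":
    for name, rows in match_table().items():
        print(name, rows)
```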
Imagine you are designing an MT shared task and you have a parallel corpus with 1 million sentences (e.g. Europarl). Which sentences will you select for the test set?
Does bootstrap resampling (Section 5) assume a normal (Gaussian) distribution of the sample scores?
We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and got the following differences: -1000, -950, -900, -850, …, -50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 2, 3, …, 970.
Based on Section 6, which system is better, A or B?
With what significance level?
A higher score means a better system.
You do not need a computer to answer this question, but you can try:
perl -E 'say join",",(map {$_*50}(-20..-1)),(0) x 10, 1..970'
Of course, such a bootstrap result is very strange, but take it for granted for the sake of this quiz.
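If you prefer Python to Perl, the same list of differences can be rebuilt and inspected as follows. This is a minimal sketch: the name diffs and the helper sign_counts are hypothetical, and the code only counts how many resampled differences are negative, zero, or positive. How these counts translate into a significance level depends on the procedure in Section 6, which is not reproduced here.

```python
# Rebuild the 1000 bootstrap differences scoreA - scoreB from the quiz:
# -1000, -950, ..., -50, then ten zeros, then 1, 2, ..., 970.
diffs = [50 * i for i in range(-20, 0)] + [0] * 10 + list(range(1, 971))

def sign_counts(values):
    """Return how many values are negative, zero, and positive."""
    negative = sum(1 for v in values if v < 0)
    zero = sum(1 for v in values if v == 0)
    positive = sum(1 for v in values if v > 0)
    return negative, zero, positive

if __name__ == "__main__":
    print(len(diffs))
    print(sign_counts(diffs))
```

The list is already sorted in ascending order, which is convenient if the procedure you apply needs to read off values at particular ranks.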
In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (http://en.wikipedia.org/wiki/P-value)
Can you reformulate Section 4 using this view? What is the observed test statistic and what is the null hypothesis?