Questions
Question 1
REF: John thinks he loves Mary
MT1: John thinks he loves Mary
MT2: John knows he loves Mary
MT3: John thinks he loves RG
Given a test corpus with this one sentence, what are the BLEU scores of the three systems based on formulas (1) and (2)?
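As a starting point, the n-gram matches between each system output and the reference can be counted mechanically. The following is a minimal Python sketch, assuming that formulas (1) and (2) are built from clipped n-gram precisions as in standard BLEU; the formulas themselves are not reproduced here, and the helper names `ngrams` and `clipped_matches` are hypothetical. Combining the counts into a final score is left to the formulas.

```python
from collections import Counter

REF = "John thinks he loves Mary"
SYSTEMS = {
    "MT1": "John thinks he loves Mary",
    "MT2": "John knows he loves Mary",
    "MT3": "John thinks he loves RG",
}

def ngrams(tokens, n):
    """Count all n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def clipped_matches(hyp, ref, n):
    """Return (matched, total) n-gram counts for one hypothesis.

    Each hypothesis n-gram is credited at most as often as it
    occurs in the reference (clipping).
    """
    hyp_counts = ngrams(hyp.split(), n)
    ref_counts = ngrams(ref.split(), n)
    matched = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total

if __name__ == "__main__":
    for name, hyp in SYSTEMS.items():
        # Assumption: n-gram orders 1 to 4, as in standard BLEU.
        counts = [clipped_matches(hyp, REF, n) for n in range(1, 5)]
        print(name, counts)
```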
Question 2
Imagine you are designing an MT shared task and you have a parallel corpus with 1 million sentences (e.g. Europarl). Which sentences will you select for the test set?
Question 3
Does bootstrap resampling (Section 5) assume a normal (Gaussian) distribution of the sample scores?
Question 4
We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and got -1000,-950,-900,-850 … -50,0,0,0,0,0,0,0,0,0,0,1,2,3 … 970.
Based on Section 6, which system is better - A or B?
With what significance level?
Higher score means better system.
You do not need a computer to answer this question, but you can try:
perl -E 'say join",",(map {$_*50}(-20..-1)),(0) x 10, 1..970'
Of course, such a bootstrap result is very strange, but take it as given for the sake of this quiz.
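For those who prefer Python, the list from the Perl one-liner can be rebuilt and tallied by sign. This is only a sketch of the bookkeeping; the variable names are hypothetical, and interpreting the counts according to Section 6 is still up to you.

```python
# Rebuild the 1000 bootstrapped differences (scoreA - scoreB),
# mirroring the Perl one-liner: 20 multiples of 50, ten zeros, 1..970.
diffs = [50 * i for i in range(-20, 0)] + [0] * 10 + list(range(1, 971))

# Tally the samples by the sign of the difference.
negative = sum(d < 0 for d in diffs)   # samples where B scored higher
zero = sum(d == 0 for d in diffs)      # ties
positive = sum(d > 0 for d in diffs)   # samples where A scored higher

print(len(diffs), negative, zero, positive)  # prints: 1000 20 10 970
```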
Question 5
In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (http://en.wikipedia.org/wiki/P-value)
Can you reformulate Section 4 using this view? What is the observed test statistic and what is the null hypothesis?