Institute of Formal and Applied Linguistics Wiki


Question 1

REF: John thinks he loves Mary
MT1: John thinks he loves Mary
MT2: John knows he loves Mary
MT3: John thinks he loves RG
Given a test corpus with this one sentence, what are the BLEU scores of the three systems based on formulas (1) and (2)?
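As a starting point, the n-gram statistics that any BLEU variant is built from can be counted mechanically. The following is a minimal sketch, assuming whitespace tokenization and n-gram orders 1 to 4; the helper names (`ngrams`, `clipped_matches`, `match_table`) are illustrative and not taken from the paper. It deliberately stops at the matched/total counts and does not combine them into a score, since formulas (1) and (2) are not reproduced on this page.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_matches(hyp, ref, n):
    """Count hypothesis n-grams that also occur in the reference.

    Each n-gram is credited at most as often as it appears in the
    reference (clipping). Returns (matched, total).
    """
    hyp_counts = Counter(ngrams(hyp, n))
    ref_counts = Counter(ngrams(ref, n))
    matched = sum(min(count, ref_counts[gram]) for gram, count in hyp_counts.items())
    total = sum(hyp_counts.values())
    return matched, total

# Sentences from Question 1, tokenized on whitespace (an assumption).
REF = "John thinks he loves Mary".split()
SYSTEMS = {
    "MT1": "John thinks he loves Mary".split(),
    "MT2": "John knows he loves Mary".split(),
    "MT3": "John thinks he loves RG".split(),
}

def match_table():
    """Map each system to its (matched, total) pairs for n = 1..4."""
    return {
        name: [clipped_matches(hyp, REF, n) for n in range(1, 5)]
        for name, hyp in SYSTEMS.items()
    }

if __name__ == "__main__":
    for name, rows in match_table().items():
        print(name, " ".join(f"{m}/{t}" for m, t in rows))
```

When plugging these counts into the formulas, it is worth checking whether any system has zero matches for some n-gram order, because that affects any formula that multiplies the precisions or takes their logarithm.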

Question 2

Imagine you are designing an MT shared task and you have a parallel corpus with 1 million sentences (e.g. Europarl). Which sentences will you select for the test set?

Question 3

Does bootstrap resampling (Section 5) assume a normal (Gaussian) distribution of the sample scores?
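To make the question concrete, here is a minimal sketch of one common bootstrap variant (the percentile bootstrap). Whether it matches Section 5 in every detail should be checked against the paper. The function name `bootstrap_interval`, the toy scores, and the use of a simple mean as the metric are all assumptions for illustration; a real MT metric such as BLEU is computed over the whole resampled test set, not averaged per sentence.

```python
import random

def bootstrap_interval(sentence_scores, num_samples=1000, level=0.95, seed=0):
    """Percentile bootstrap interval for the mean of per-sentence scores.

    The mean is only a stand-in for a corpus-level metric (assumption).
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(sentence_scores)
    sample_scores = []
    for _ in range(num_samples):
        # Draw n sentences WITH replacement from the original test set.
        resample = rng.choices(sentence_scores, k=n)
        sample_scores.append(sum(resample) / n)
    sample_scores.sort()
    # Drop the same number of samples from each tail of the sorted list.
    drop = int(num_samples * (1 - level) / 2)
    return sample_scores[drop], sample_scores[num_samples - drop - 1]

# Hypothetical per-sentence scores, invented purely for illustration.
toy_scores = [0.1, 0.4, 0.35, 0.8, 0.25, 0.6, 0.5, 0.3]

if __name__ == "__main__":
    low, high = bootstrap_interval(toy_scores)
    print(f"95% interval for the mean: [{low:.3f}, {high:.3f}]")
```

In this sketch the interval bounds are read off the sorted list of resampled scores; it is a useful exercise to check which steps, if any, rely on the shape of the score distribution.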

Question 4

We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,-950,-900,-850 … -50,0,0,0,0,0,0,0,0,0,0,1,2,3 … 970.

Based on Section 6, which system is better - A or B?

With what significance level?

Higher score means better system.

You do not need a computer to answer this question, but you can try:

perl -E 'say join",",(map {$_*50}(-20..-1)),(0) x 10, 1..970'

Of course, such a bootstrap result is very strange, but take it for granted for the sake of this quiz.
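For readers who prefer Python, the same list can be generated and tallied as follows. This is a sketch under the conventions stated in the question (each value is scoreA - scoreB, and a higher score means a better system); the function names are illustrative. How the counts translate into a significance statement is what Section 6 describes, so the code stops at the tally.

```python
def bootstrap_differences():
    """Rebuild the 1000 differences (scoreA - scoreB) given in Question 4."""
    diffs = [50 * k for k in range(-20, 0)]   # -1000, -950, ..., -50
    diffs += [0] * 10                          # ten ties
    diffs += list(range(1, 971))               # 1, 2, ..., 970
    return diffs

def summarize(diffs):
    """Count the samples in which A wins, B wins, or the two tie."""
    return {
        "a_better": sum(1 for d in diffs if d > 0),
        "b_better": sum(1 for d in diffs if d < 0),
        "ties": sum(1 for d in diffs if d == 0),
        "total": len(diffs),
    }

if __name__ == "__main__":
    print(summarize(bootstrap_differences()))
```

Note that the tally ignores the magnitude of each difference and looks only at its sign.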

Question 5

In statistical hypothesis testing, the p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (http://en.wikipedia.org/wiki/P-value)

Can you reformulate Section 4 using this view? What is the observed test statistic and what is the null hypothesis?


Section 3

Section 4, 5

Section 6

Martin's explanation of p-values
