This handout includes some notes on the paper as well as a list of statistical tests for the difference of means.
Questions
(warm-up) The abbreviation “i.i.d.” is used several times throughout the text. What does it mean? It stands for “independent and identically distributed”: each sample is drawn from the same distribution, and the draws do not depend on one another.
(bootstrapping)
In footnote 2, the authors mention that counting the cases with delta < 0 is equivalent to counting the cases with delta > 2*delta(x). What are the conditions for this equivalence? The mean of delta over all sampled test sets has to be delta(x), and the distribution of delta on the sampled test sets has to be symmetric.
Later in the article, the authors reorder all tested pairs so that delta > 0. Do the assumptions from Section 2.2 and footnote 2 still hold? The reordering concerns pairs of systems: throughout the paper, the authors assume that for a given pair of systems, delta(x) > 0. The assumptions about symmetry and expected value, and the claim that checking for delta > 2*delta(x) is the same as checking for delta < 0, concern the bootstrap samples. (For a pair of systems with delta(x) < 0, the equivalent conditions on the bootstrap samples would be delta < 2*delta(x) and delta > 0.)
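A minimal sketch of this equivalence, using invented per-sentence score differences (the data, the sample size n, and the number of bootstrap samples B are all hypothetical, not taken from the paper):

```python
import random

random.seed(0)

# Hypothetical per-sentence metric gains of system A over system B
# (invented data; the paper uses real system outputs).
n = 250
gains = [random.gauss(0.08, 1.0) for _ in range(n)]
delta_x = sum(gains) / n  # observed test-set difference delta(x)

B = 5000  # number of bootstrap samples
count_below_0 = 0    # cases with delta < 0
count_above_2dx = 0  # cases with delta > 2 * delta(x)
for _ in range(B):
    sample = random.choices(gains, k=n)  # resample with replacement
    delta = sum(sample) / n              # delta on the bootstrap sample
    count_below_0 += delta < 0
    count_above_2dx += delta > 2 * delta_x

# If the bootstrap distribution of delta is centered on delta(x) and
# symmetric, the two counts estimate the same tail probability.
p_below_0 = count_below_0 / B
p_above_2dx = count_above_2dx / B
print(p_below_0, p_above_2dx)
```

With these synthetic gains the two estimated tail probabilities come out nearly equal, which is exactly the equivalence the footnote relies on.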
(a recurrent theme) Why is a small metric gain more significant between similar systems? See the notes in the PDF. Briefly: the larger the correlation between the systems' scores, the smaller the variance of the difference between the metrics (on sentences or documents). Variance appears in the denominator of the various test statistics, so a smaller variance leads to a larger t-value and a smaller p-value. More intuitively, a smaller variance means more confidence that whatever difference we observe is not due to chance.
(important) Sum up (in 3-5 sentences) what you want to remember from reading this paper.
(creative) Formulate at least 1 question that you would like to ask the authors of the paper.
They said that GIZA++ failed to produce reasonable output when trained on some of these training sets (20 training sets among 1.1M sentences). Why?