Martin's questions
1. The question aimed to find out whether we would repeatedly count the matching n-grams between the MT output and the reference. They can be pre-computed for each sentence and then aggregated, without resorting to string matching again (a sketch of this appears after the notes below).
2. p = 1.
Note that this is unrealistic: in practice, either of the two systems can still easily come out better on one input and worse on another.
3., 4.
|               | p(x>y) | p(x>=y)         |
|---------------|--------|-----------------|
| approx. rand  | 1.00   | (0+1)/(10000+1) |
| boot. Riezler | 0.26   | (0+1)/(10000+1) |
| boot. Koehn   | 0.37   | 0               |
Notes
In the following, [i, j] refers to the i-th content row, j-th content column of the above table.
[3,1] … MT_Y can never be better than MT_X, and H_0 says that MT_X < MT_Y.
[2,1]:
p( bootstrap sample does not include x_diff (the one differing output item) ) = (1 - 1/1000)^1000 ~ 0.3677
p( bootstrap sample includes x_diff exactly once ) = (1000 choose 1) * (1/1000) * (1 - 1/1000)^999 ~ 0.3681
p( bootstrap sample includes x_diff at least twice ) ~ 1 - 0.3677 - 0.3681 = 0.2642
For the p-value in bootstrap, we count the cases where the statistic is at least twice the point estimate. Since the basic dataset contains exactly one differing item, under certain assumptions we need the samples to differ in at least two items, i.e. to contain the single differing item at least twice.
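
A quick numerical check of the three probabilities above; this is our own sketch (assuming the dataset size 1000 from the scenario), not anything from the paper:

```python
import numpy as np

N = 1000           # dataset size assumed in the scenario above
TRIALS = 100_000   # number of simulated bootstrap samples

# Analytic probabilities for how often the single differing item x_diff
# appears in one bootstrap sample of size N drawn with replacement.
p_zero = (1 - 1 / N) ** N                        # x_diff not drawn at all
p_once = N * (1 / N) * (1 - 1 / N) ** (N - 1)    # x_diff drawn exactly once
p_two_plus = 1 - p_zero - p_once                 # x_diff drawn at least twice
print(f"analytic : P(0)={p_zero:.4f}  P(1)={p_once:.4f}  P(>=2)={p_two_plus:.4f}")

# Monte Carlo check: the number of times x_diff is drawn is Binomial(N, 1/N).
rng = np.random.default_rng(0)
hits = rng.binomial(N, 1 / N, size=TRIALS)
print(f"simulated: P(0)={np.mean(hits == 0):.4f}  "
      f"P(1)={np.mean(hits == 1):.4f}  P(>=2)={np.mean(hits >= 2):.4f}")
```

The third value (≈ 0.2642) is the 0.26 in the boot. Riezler cell of the table above.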
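
Back to question 1 above: a sketch of how the n-gram matching can be pre-computed per sentence and only aggregated during resampling. This is our own illustration with made-up helper names (sentence_stats, corpus_bleu) and a simplified single-reference BLEU without smoothing, not the paper's or any toolkit's implementation:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_stats(hyp, ref, max_n=4):
    """Pre-compute BLEU sufficient statistics (clipped matches, totals, lengths)
    for one sentence pair; this string matching is done exactly once."""
    stats = {"hyp_len": len(hyp), "ref_len": len(ref)}
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        stats[f"match_{n}"] = sum(min(c, r[g]) for g, c in h.items())
        stats[f"total_{n}"] = max(len(hyp) - n + 1, 0)
    return stats

def corpus_bleu(stats_list, max_n=4):
    """Aggregate pre-computed per-sentence statistics into a corpus score;
    only this cheap aggregation runs inside a resampling loop."""
    hyp_len = sum(s["hyp_len"] for s in stats_list)
    ref_len = sum(s["ref_len"] for s in stats_list)
    log_prec = 0.0
    for n in range(1, max_n + 1):
        match = sum(s[f"match_{n}"] for s in stats_list)
        total = sum(s[f"total_{n}"] for s in stats_list)
        if match == 0 or total == 0:
            return 0.0   # no smoothing in this toy version
        log_prec += math.log(match / total) / max_n
    bp = min(1.0, math.exp(1 - ref_len / hyp_len)) if hyp_len > 0 else 0.0
    return bp * math.exp(log_prec)

# Usage: pre-compute once, then aggregate any (re)sampled set of sentences.
hyps = [["the", "cat", "sat"], ["a", "dog", "barked", "loudly"]]
refs = [["the", "cat", "sat"], ["the", "dog", "barked"]]
stats = [sentence_stats(h, r) for h, r in zip(hyps, refs)]
print(corpus_bleu(stats, max_n=2))           # bigram BLEU over the full toy corpus
print(corpus_bleu([stats[0]] * 2, max_n=2))  # a resampled "corpus"
```

Inside a bootstrap or randomisation loop, only corpus_bleu over the selected per-sentence statistics is re-evaluated; no string matching is repeated.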
Result reporting in general
If your experiment involves uncertainty, and you want to report the results as reliable, do not run the experiment only once.
_
Why do they say, “this results in BLEU favoring matches in larger n-grams, corresponding to giving more credit to correct word order. NIST weighs lower n-grams more highly, thus it gives more credit to correct lexical choice than to word order”?
How is NIST defined, anyway?
…see the formula (blindly copied from the whiteboard); a reconstruction follows below.
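
For reference, our reconstruction of the standard NIST definition (Doddington 2002); we cannot vouch that it is exactly what was on the whiteboard:

```latex
\mathrm{NIST} =
\left[ \sum_{n=1}^{N}
  \frac{\sum_{\text{co-occurring } w_1 \ldots w_n} \mathrm{Info}(w_1 \ldots w_n)}
       {\sum_{w_1 \ldots w_n \,\in\, \text{sys output}} 1}
\right]
\cdot
\exp\left\{ \beta \, \log^2 \left[ \min\!\left( \frac{L_{\mathrm{sys}}}{\bar{L}_{\mathrm{ref}}},\, 1 \right) \right] \right\},
\qquad
\mathrm{Info}(w_1 \ldots w_n) = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}
```

Here N = 5, the counts are taken over the reference data, and β is chosen so that the brevity factor equals 0.5 when the system output length is 2/3 of the average reference length.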
We have not found the motivation for their claim cited above.
F1-score
Why haven't they come up with a more informative name for this metric (F-score applied to dependency parses)? It remains implicit that the score is (most probably) computed from machine-created parses. This might have given some of the SMT systems an advantage.
_
“(…) estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter.”
…what strange wording! Approximate randomisation is not to blame for the errors of the bootstrap.
_
They say that approximate randomisation is more conservative. But does this mean it is better for practical application??
…well, they aim at the assumption noted in 4.3: “Assuming equivalence of the compared system variants, these assessments would count as type-I errors.”
_
At the end of the day, we want to extrapolate the significance-testing results to all possible inputs anyway. Hence, it is reasonable to adopt the assumption of the bootstrap, namely that the sampling distribution is representative of the entire input space.
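
To make the two tests concrete, a small sketch on toy per-sentence scores (our own illustration; the numbers and the mean-difference statistic are made up, not the paper's setup):

```python
import numpy as np

def approximate_randomization(a, b, trials=10_000, seed=0):
    """Stratified shuffling: swap the paired scores of the two systems per
    sentence with probability 0.5 and count how often the shuffled mean
    difference is at least as extreme as the observed one."""
    rng = np.random.default_rng(seed)
    observed = abs(np.mean(a) - np.mean(b))
    hits = 0
    for _ in range(trials):
        flip = rng.random(len(a)) < 0.5
        a_s = np.where(flip, b, a)
        b_s = np.where(flip, a, b)
        hits += int(abs(np.mean(a_s) - np.mean(b_s)) >= observed)
    return (hits + 1) / (trials + 1)   # add-one counting, as in the table above

def paired_bootstrap(a, b, trials=10_000, seed=0):
    """Koehn-style paired bootstrap: resample sentence indices with
    replacement and count how often system A does NOT beat system B."""
    rng = np.random.default_rng(seed)
    n = len(a)
    losses = 0
    for _ in range(trials):
        idx = rng.integers(0, n, size=n)
        losses += int(np.mean(a[idx]) <= np.mean(b[idx]))
    return losses / trials

# Toy per-sentence scores for two MT systems (hypothetical numbers).
rng = np.random.default_rng(1)
sys_a = rng.normal(0.32, 0.10, size=1000)           # e.g. sentence-level scores
sys_b = sys_a - rng.normal(0.01, 0.05, size=1000)   # slightly worse on average

print("approx. randomization p ~", approximate_randomization(sys_a, sys_b))
print("paired bootstrap p      ~", paired_bootstrap(sys_a, sys_b))
```

For a corpus-level metric such as BLEU, the statistic inside both loops would be the corpus score aggregated from the pre-computed per-sentence statistics (see the sketch after the notes above) rather than a simple mean of sentence scores.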
p-value confusion
Near the top of 4.4, the p-value gets confused with the probability of a type-I error. (The p-value is computed from the observed data, whereas the type-I error rate is a property of the test procedure, fixed in advance as the significance level.)
_
Final note: the paper is perhaps too condensed and hard to grasp in places.