Martin's questions

1)
How would you implement approximate randomization for BLEU based on Figure 1,
namely the part "Shuffle variable tuples between system X and Y with probability 0.5"?
What are the variable tuples? Can you write more detailed pseudocode (or C, Java, Perl, ... code)?
How would you implement the next part "Compute pseudo-statistic |S_Xr − S_Yr | on shuffled data"?
2)
On a testset of 1000 sentences, systems X and Y have exactly the same output except for one sentence:
REF = Hello
MT_X= Hello
MT_Y= Hi
You computed an approximate randomization test (based on Figure 1, R = 10000 samples)
to check whether the improvement in BLEU is significant. What were the results (i.e. the p-value)?
3)
What would be the p-value for a bootstrap test based on a) Figure 2, b) Koehn (2004) (the last RG paper)?
This is a bit tricky. Just estimate the expected value of the p-value (i.e. 1 − level of confidence).
4)
What would be the p-value for the non-strict inequality, i.e. the hypothesis "system X is better than or equal to Y"?

1. The question aimed to find out whether we would repeatedly redo the n-gram matching between the MT output and the reference. The matching n-gram counts (together with the sentence lengths) can be pre-computed once per sentence (these per-sentence counts are the variable tuples) and then merely aggregated, without resorting to string matching again.
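
A minimal Python sketch of how we would implement the Figure 1 test under that reading; the field names and the BLEU-4 aggregation below are our own choices for this sketch, not the paper's:

import math
import random

# Per-sentence "variable tuple": the sufficient statistics that corpus BLEU is
# aggregated from, pre-computed once per sentence so that no string matching is
# needed during shuffling. Field names are ours, for this sketch only:
#   matches[i] ... clipped (i+1)-gram matches against the reference, i = 0..3
#   totals[i]  ... number of (i+1)-grams in the system output, i = 0..3
#   hyp_len    ... length of the system output
#   ref_len    ... length of the reference

def corpus_bleu(stats):
    """Corpus-level BLEU-4 from aggregated per-sentence statistics."""
    matches = [sum(s["matches"][i] for s in stats) for i in range(4)]
    totals = [sum(s["totals"][i] for s in stats) for i in range(4)]
    hyp_len = sum(s["hyp_len"] for s in stats)
    ref_len = sum(s["ref_len"] for s in stats)
    if min(matches) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / 4
    brevity = min(1.0, math.exp(1 - ref_len / hyp_len))
    return brevity * math.exp(log_precision)

def approximate_randomization(stats_x, stats_y, R=10000, rng=random):
    """Figure 1 as we read it: shuffle the per-sentence tuples between the two
    systems with probability 0.5 and count how often the pseudo-statistic
    |S_Xr - S_Yr| reaches the actual statistic |S_X - S_Y|."""
    actual = abs(corpus_bleu(stats_x) - corpus_bleu(stats_y))
    c = 0
    for _ in range(R):
        shuffled_x, shuffled_y = [], []
        for sx, sy in zip(stats_x, stats_y):   # tuples stay paired by sentence
            if rng.random() < 0.5:             # swap with probability 0.5
                sx, sy = sy, sx
            shuffled_x.append(sx)
            shuffled_y.append(sy)
        pseudo = abs(corpus_bleu(shuffled_x) - corpus_bleu(shuffled_y))
        if pseudo >= actual:
            c += 1
    return (c + 1) / (R + 1)   # significance level of the observed difference

Only the swapping step needs the sentence-level pairing; the pseudo-statistic |S_Xr − S_Yr| is then computed from the aggregated counts, which is why the per-sentence pre-computation suffices.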

2. p = 1. Since the two outputs differ in a single sentence only, every shuffle yields the same absolute difference |S_Xr − S_Yr| as the actual statistic, so c = R and p = (R + 1)/(R + 1) = 1.

Note that this setting is unrealistic: either of these two systems could still easily turn out better or worse than the other on different data.

3., 4.

                p(x > y)    p(x ≥ y)
approx. rand.   1.00        (0+1)/(10000+1)
boot. Riezler   0.26        (0+1)/(10000+1)
boot. Koehn     0.37        0

Notes

In the following, [i, j] refers to the i-th content row, j-th content column of the above table.

[3,2] … MT_Y can never be better than MT_X, while H_0 says that MT_X < MT_Y, so no bootstrap sample supports H_0 and the p-value is 0.

[2,1]:

p( bootstrap sample does not include x_diff (the different output item) )
  = (1 - 1/1000)^1000
  ~ 0.3677
p( bootstrap sample includes x_diff exactly once )
  = (1000 choose 1) * (1/1000) * (1 - 1/1000)^999
  ~ 0.3681
p( bootstrap sample includes x_diff at least twice )
  ~ 1 - 0.3677 - 0.3681
  = 0.2642
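
A quick way to check this arithmetic is to evaluate the binomial probabilities directly; a small Python sketch (the test-set size 1000 is taken from question 2):

from math import comb

n = 1000            # test-set size from question 2
p_item = 1.0 / n    # probability of drawing x_diff in a single draw

def p_copies(k):
    """Probability that a bootstrap resample of size n contains x_diff exactly k times."""
    return comb(n, k) * p_item**k * (1 - p_item)**(n - k)

p0 = p_copies(0)               # ~ 0.3677
p1 = p_copies(1)               # ~ 0.3681
p_two_or_more = 1 - p0 - p1    # ~ 0.2642

print(p0, p1, p_two_or_more)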

For the p-value in the bootstrap test, we count the cases where the pseudo-statistic is at least twice the point estimate. Since the original dataset contains exactly one differing item, this means (under the simplifying assumption that the score difference scales with how many copies of that item the sample contains) that the bootstrap sample has to contain the single differing item at least twice.
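
For completeness, a sketch of how such a bootstrap test could look under the counting rule just described; it reuses corpus_bleu and the imports from the sketch above, and we do not claim it is identical to Figure 2 or to Koehn (2004):

def bootstrap_test(stats_x, stats_y, B=10000, rng=random):
    """Resample sentences with replacement and count how often the
    pseudo-statistic is at least twice the point estimate."""
    n = len(stats_x)
    actual = abs(corpus_bleu(stats_x) - corpus_bleu(stats_y))
    c = 0
    for _ in range(B):
        idx = [rng.randrange(n) for _ in range(n)]  # one resample of sentence indices
        boot_x = [stats_x[i] for i in idx]          # same indices for both systems,
        boot_y = [stats_y[i] for i in idx]          # so the outputs stay paired
        pseudo = abs(corpus_bleu(boot_x) - corpus_bleu(boot_y))
        if pseudo >= 2 * actual:
            c += 1
    return (c + 1) / (B + 1)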

Result reporting in general

If your experiment involves uncertainty, and you want to report the results as reliable, do not run the experiment only once.

_

Why do they say, “this results in BLEU favoring matches in larger n-grams, corresponding to giving more credit to correct word order. NIST weighs lower n-grams more highly, thus it gives more credit to correct lexical choice than to word order”?

How is NIST defined, anyway?

…see the formula, blindly copied from the whiteboard.
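
For the record, our reconstruction of the NIST score as we recall it from Doddington (2002), possibly what was on the whiteboard; take the details with a grain of salt:

\mathrm{NIST} = \sum_{n=1}^{5} \frac{\sum_{\text{co-occurring } w_1 \ldots w_n} \mathrm{Info}(w_1 \ldots w_n)}{\sum_{w_1 \ldots w_n \text{ in the system output}} 1} \cdot \exp\!\left( \beta \, \log^2 \min\!\left( \frac{L_{\mathrm{sys}}}{\bar{L}_{\mathrm{ref}}},\, 1 \right) \right)

\mathrm{Info}(w_1 \ldots w_n) = \log_2 \frac{\mathrm{count}(w_1 \ldots w_{n-1})}{\mathrm{count}(w_1 \ldots w_n)}

where the n-gram counts come from the reference data and \beta is chosen so that the brevity factor equals 0.5 when the system output has 2/3 of the average reference length.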

We have not found the motivation for the claim cited above.

F1-score

Why have they not come up with a more informative name for this metric (an F-score applied to dependency parses)? It remains implicit that the score is (most probably) computed on machine-created parses, which might have favored some of the SMT systems.

_

“(…) estimated more conservatively by approximate randomization than by bootstrap tests, thus increasing the likelihood of type-I error for the latter.”

…what strange wording! Approximate randomization is not to blame for the errors of the bootstrap.

_

They say that approximate randomization is more conservative. But does that mean it is better for practical use?

…well, they rely on the assumption noted in 4.3: “Assuming equivalence of the compared system variants, these assessments would count as type-I errors.”

_

At the end of the day, we want to extrapolate the significance-testing results to all possible inputs anyway. Hence, it is reasonable to adopt the bootstrap assumption, namely that the sample is representative of the entire input space.

p-value confusion

Near the top of 4.4, the p-value gets confused with the probability of a type-I error.

_

Final note: the paper is perhaps too condensed and hard to grasp in places.