===Martin's questions===

1)
How would you implement approximate randomization for BLEU based on Figure 1,
namely the step "Shuffle variable tuples between system X and Y with probability 0.5"?
What are the variable tuples? Can you write more detailed pseudocode (or C, Java, Perl, ... code)?
How would you implement the next step, "Compute pseudo-statistic |S_Xr − S_Yr| on shuffled data"?

2)
On a test set of 1000 sentences, systems X and Y produce exactly the same output except for one sentence:
REF = Hello
MT_X = Hello
MT_Y = Hi
You ran an approximate randomization test (based on Figure 1, with R = 10000 samples)
to check whether the improvement in BLEU is significant. What was the result (i.e. the p-value)?

3)
What would the p-value be for the bootstrap test based on a) Figure 2, b) Koehn2004 (the last RG paper)?
This is a bit tricky. Just estimate the expected p-value (i.e. 1 - level_of_confidence).

4)
What would the p-value be for the non-strict inequality, i.e. the hypothesis "system X is better than or equal to Y"?

1. The question aimed to find out whether we would repeatedly count the matching n-grams between the MT output and the reference. These counts can be pre-computed once for each sentence and then aggregated, without resorting to string matching again.
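
The idea of pre-computing the counts can be sketched in Python as follows. This is only one possible reading of Figure 1, not the paper's own code. It assumes a single reference per sentence, whitespace tokenization, BLEU up to 4-grams without smoothing, and the p-value estimate (c + 1) / (R + 1). All function names (sentence_stats, aggregate, bleu, approximate_randomization) are made up for this illustration. A "variable tuple" is taken here to be the vector of per-sentence counts: clipped matches and totals for each n-gram order, plus hypothesis and reference length.

```python
import math
import random
from collections import Counter

MAX_N = 4  # assumption: standard BLEU with 1- to 4-grams


def ngrams(tokens, n):
    """Count all n-grams of order n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def sentence_stats(hyp, ref):
    """Per-sentence tuple: [match_1, total_1, ..., match_4, total_4, hyp_len, ref_len].

    This is the only place where strings are compared.
    """
    h, r = hyp.split(), ref.split()
    stats = []
    for n in range(1, MAX_N + 1):
        h_ngrams, r_ngrams = ngrams(h, n), ngrams(r, n)
        stats.append(sum((h_ngrams & r_ngrams).values()))  # clipped matches
        stats.append(sum(h_ngrams.values()))               # hypothesis n-grams
    stats.extend([len(h), len(r)])
    return stats


def aggregate(stats_list):
    """Sum the per-sentence tuples column by column."""
    return [sum(column) for column in zip(*stats_list)]


def bleu(totals):
    """Corpus-level BLEU computed from aggregated counts (no smoothing)."""
    log_precision = 0.0
    for n in range(MAX_N):
        match, total = totals[2 * n], totals[2 * n + 1]
        if match == 0 or total == 0:
            return 0.0
        log_precision += math.log(match / total)
    hyp_len, ref_len = totals[-2], totals[-1]
    brevity_penalty = min(1.0, math.exp(1.0 - ref_len / hyp_len))
    return brevity_penalty * math.exp(log_precision / MAX_N)


def approximate_randomization(stats_x, stats_y, trials=10000, seed=0):
    """Two-sided approximate randomization test on pre-computed tuples."""
    rng = random.Random(seed)
    observed = abs(bleu(aggregate(stats_x)) - bleu(aggregate(stats_y)))
    count = 0
    for _ in range(trials):
        shuffled_x, shuffled_y = [], []
        for sx, sy in zip(stats_x, stats_y):
            if rng.random() < 0.5:  # swap this sentence's tuples
                sx, sy = sy, sx
            shuffled_x.append(sx)
            shuffled_y.append(sy)
        pseudo = abs(bleu(aggregate(shuffled_x)) - bleu(aggregate(shuffled_y)))
        if pseudo >= observed:
            count += 1
    return (count + 1) / (trials + 1)  # assumed form of the estimate
```

Usage would look like stats_x = [sentence_stats(h, r) for h, r in zip(hyps_x, refs)], the same for system Y, and then approximate_randomization(stats_x, stats_y). Each trial only swaps and sums small integer vectors. A further speed-up (not shown) would keep running totals and update them only for the swapped sentences.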

