
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


courses:rg:2012:sigtest-mt [2012/11/12 16:54]
tamchyna
courses:rg:2012:sigtest-mt [2012/11/12 17:43] (current)
tamchyna
Line 40: Line 40:
 ==== Section 5 ==== ==== Section 5 ====
  
-We cannot use the t-distribution for BLEU because it cannot be factorised like the metric in Section 4: BLEU computed on the full test set != average of per-sentence BLEU scores
+We cannot use the t-distribution for BLEU because it cannot be factorised like the metric in Section 4: BLEU computed on the full test set != average of per-sentence BLEU scores.
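
The non-factorisation can be seen on a toy example. The sketch below does not compute BLEU: it assumes a much simpler stand-in metric (clipped unigram precision, with no higher-order n-grams and no brevity penalty), and the function names and example sentences are made up for illustration. The point is only that a ratio of corpus-level sums generally differs from the mean of per-sentence ratios.

```python
from collections import Counter

def clipped_matches(hyp, ref):
    """Return (clipped unigram matches, hypothesis length) for one sentence pair."""
    hyp_counts, ref_counts = Counter(hyp.split()), Counter(ref.split())
    matches = sum(min(c, ref_counts[w]) for w, c in hyp_counts.items())
    return matches, sum(hyp_counts.values())

def corpus_precision(hyps, refs):
    """Pool the counts over the whole test set, then divide once."""
    stats = [clipped_matches(h, r) for h, r in zip(hyps, refs)]
    return sum(m for m, _ in stats) / sum(n for _, n in stats)

def mean_sentence_precision(hyps, refs):
    """Divide per sentence, then average the per-sentence scores."""
    stats = [clipped_matches(h, r) for h, r in zip(hyps, refs)]
    return sum(m / n for m, n in stats) / len(stats)

# Hypothetical two-sentence test set (illustration only).
hyps = ["the cat", "a dog sat on the mat today"]
refs = ["the cat", "the dog sat on a rug"]

print(corpus_precision(hyps, refs))         # pooled counts: 7 matches / 9 words
print(mean_sentence_precision(hyps, refs))  # mean of 2/2 and 5/7
```

Long sentences weigh more in the pooled score than in the per-sentence average, which is why the two numbers disagree.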
  
 +We can (reliably) estimate confidence intervals by translating a large number of test sets and observing the mean and variance. But we don't want to do that.
  
 +The key assumption: drawing sentences from one test set with replacement K times is as good as having K different test sets (for estimating the mean and variance of the BLEU score).
 +
 +Generated test sets are of the same size as the original test set. We don't want to draw fewer sentences because that would weaken our test.
 +
 +Bootstrap resampling gives us K BLEU scores. We sort them and discard the top and bottom 2.5% (so 95% of the scores remain). The minimum and maximum of the remaining scores then define the 95% confidence interval.
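
A minimal sketch of this procedure, under some assumptions: each sentence is represented by precomputed sufficient statistics, and `toy_corpus_score` is a hypothetical stand-in for a corpus-level metric such as BLEU (here just pooled matches over pooled length). The function names, the default K = 1000 and the fixed seed are illustrative choices, not something the notes prescribe.

```python
import random

def toy_corpus_score(stats):
    """Hypothetical corpus-level metric: pooled matches / pooled length."""
    return sum(m for m, _ in stats) / sum(n for _, n in stats)

def bootstrap_confidence_interval(sentence_stats, corpus_score,
                                  k=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a corpus-level score."""
    rng = random.Random(seed)
    n = len(sentence_stats)
    scores = []
    for _ in range(k):
        # Draw n sentences with replacement: same size as the original test set.
        sample = [sentence_stats[rng.randrange(n)] for _ in range(n)]
        scores.append(corpus_score(sample))
    scores.sort()
    cut = int(round(k * alpha / 2))   # number of scores discarded at each end
    kept = scores[cut:k - cut]
    return kept[0], kept[-1]

# Hypothetical per-sentence (matches, length) statistics.
stats = [(7, 10), (3, 8), (9, 12), (5, 9), (10, 10), (2, 7)] * 20
low, high = bootstrap_confidence_interval(stats, toy_corpus_score)
print(f"95% interval: [{low:.3f}, {high:.3f}]")
```

Note that the score is recomputed on each resampled test set as a whole, so the metric does not need to decompose into per-sentence scores.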
 +
 +We do not assume a normal distribution of the scores (answering Question 3) -- but they will be approximately normally distributed anyway (we don't know if we could prove it though). Bootstrap resampling is distribution-independent in general.
 +
 +==== Section 6 ====
 +
 +Bootstrap resampling can also be used for pairwise comparisons: compute the BLEU scores for each system on all bootstrapped test sets. If A beats B on 950 out of 1000 test sets, then A is better with 95% confidence.
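
The pairwise comparison can be sketched as follows. As an assumption, both systems are scored on the same resampled sentence indices in every round, and `toy_corpus_score` is again a hypothetical stand-in for BLEU; all names and defaults are illustrative.

```python
import random

def toy_corpus_score(stats):
    """Hypothetical corpus-level metric: pooled matches / pooled length."""
    return sum(m for m, _ in stats) / sum(n for _, n in stats)

def paired_bootstrap_win_rate(stats_a, stats_b, corpus_score, k=1000, seed=0):
    """Fraction of bootstrapped test sets on which system A scores strictly higher."""
    if len(stats_a) != len(stats_b):
        raise ValueError("both systems must be evaluated on the same sentences")
    rng = random.Random(seed)
    n = len(stats_a)
    wins = 0
    for _ in range(k):
        # One set of indices per round, applied to both systems (paired design).
        idx = [rng.randrange(n) for _ in range(n)]
        score_a = corpus_score([stats_a[i] for i in idx])
        score_b = corpus_score([stats_b[i] for i in idx])
        if score_a > score_b:
            wins += 1
    return wins / k

# Hypothetical per-sentence statistics for two systems on the same test set.
system_a = [(7, 10), (4, 8), (9, 12), (5, 9)] * 25
system_b = [(6, 10), (4, 8), (8, 12), (6, 9)] * 25
print(paired_bootstrap_win_rate(system_a, system_b, toy_corpus_score))
```

A returned value of 0.95 corresponds to "A beats B on 950 out of 1000 test sets" in the wording above.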
 +
 +Answering Question 4: A is better with 97% confidence (and at least as good with 98% confidence).
 +
 +We don't think this corresponds to confidence -- it's just the proportion of times that A beat B, a kind of ML estimate. In fact, we could use the differences between systems as input for a paired t-test (and get a true confidence interval).
 +
 +Notes on the p-value: the concepts of confidence intervals and significance testing were developed independently by different researchers, then they were somehow merged. The p-value is often misunderstood and misused.
 +
 +Question 5: let's define X as the population (all possible sentences). We have 2 systems, A and B, evaluated on a sample S.
 +The test statistic T is the difference between A and B, T = delta(score(A), score(B)).
 +The null hypothesis H_0 is "the true score of A is equal to that of B".
 +The p-value is, in general, the probability of observing data at least as extreme as the data we actually observed, given that the null hypothesis holds. Formally:
 +
 +  p-value = P(T(X) >= T(S) | H_0)
 +
 +We do not prove our (alternative) hypothesis -- we can (with some confidence) reject the null hypothesis.
 +
 +The p-value is **not** the probability of the null hypothesis. It only says how (im)probable the **data** is given H_0, i.e. P(S|H_0), not P(H_0|S).
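
One way to make the definition P(T(X) >= T(S) | H_0) concrete is to simulate data under H_0. The sketch below uses approximate randomization, which is a different technique from the bootstrap discussed in these notes and is swapped in here only as an illustration: if the two systems are interchangeable, swapping their outputs on any sentence should not matter. `toy_corpus_score` is a hypothetical stand-in for BLEU, and all names and defaults are assumptions.

```python
import random

def toy_corpus_score(stats):
    """Hypothetical corpus-level metric: pooled matches / pooled length."""
    return sum(m for m, _ in stats) / sum(n for _, n in stats)

def randomization_p_value(stats_a, stats_b, corpus_score, trials=1000, seed=0):
    """One-sided estimate of P(T >= T(S) | H_0) by random per-sentence swaps."""
    rng = random.Random(seed)
    observed = corpus_score(stats_a) - corpus_score(stats_b)  # T(S)
    at_least_as_extreme = 0
    for _ in range(trials):
        shuffled_a, shuffled_b = [], []
        for a, b in zip(stats_a, stats_b):
            # Under H_0 the system labels are exchangeable, so swap with prob. 0.5.
            if rng.random() < 0.5:
                a, b = b, a
            shuffled_a.append(a)
            shuffled_b.append(b)
        if corpus_score(shuffled_a) - corpus_score(shuffled_b) >= observed:
            at_least_as_extreme += 1
    # Add-one smoothing keeps the estimate strictly above zero.
    return (at_least_as_extreme + 1) / (trials + 1)

# Hypothetical per-sentence statistics for two systems on the same test set.
system_a = [(7, 10), (4, 8), (9, 12), (5, 9)] * 25
system_b = [(6, 10), (4, 8), (8, 12), (6, 9)] * 25
print(randomization_p_value(system_a, system_b, toy_corpus_score))
```

The returned number describes how unusual the observed difference is when H_0 is assumed to hold; it says nothing directly about how probable H_0 itself is.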
