Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

courses:rg:2012:sigtest-mt [2012/11/12 16:59]
tamchyna
courses:rg:2012:sigtest-mt [2012/11/12 17:43] (current)
tamchyna
Line 47: Line 47:
  
 Generated test sets are of the same size as the original test set. We don't want to take fewer sentences because that would weaken our test.
 +
 +Bootstrap resampling gives us K BLEU scores. We sort them and discard the top and bottom 2.5% (so 95% of the scores remain). The minimum and maximum of the remaining scores then define the 95% confidence interval.
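The procedure above can be sketched as follows. This is only a minimal illustration: the function name `bootstrap_interval` and its parameters are hypothetical, and the corpus score is a plain mean of per-sentence scores used as a stand-in for BLEU (real BLEU would have to be recomputed from n-gram counts on every resampled test set).

```python
import random


def bootstrap_interval(sentence_scores, k=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a corpus-level score.

    Assumption: the corpus score is the mean of per-sentence scores,
    which is NOT how BLEU is computed; it only keeps the sketch short.
    """
    rng = random.Random(seed)
    n = len(sentence_scores)
    scores = []
    for _ in range(k):
        # Draw n sentences with replacement, so each generated test set
        # has the same size as the original one.
        sample = [sentence_scores[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(sample) / n)
    scores.sort()
    cut = int(k * alpha / 2)  # number of scores discarded at each end
    kept = scores[cut:k - cut]
    # Minimum and maximum of the remaining scores define the interval.
    return kept[0], kept[-1]
```

With `k=1000` and `alpha=0.05`, 25 scores are dropped at each end and the interval spans the remaining 950.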
 +
 +We do not assume that the scores are normally distributed (answering Question 3) -- in practice they will be approximately normal anyway, though we don't know whether we could prove it. Bootstrap resampling is distribution-independent in general.
 +
 +==== Section 6 ====
 +
 +Bootstrap resampling for pairwise comparisons: compute the BLEU scores of both systems on all bootstrapped test sets. If A beats B on 950 out of 1000 test sets, then A is declared better with 95% confidence.
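A minimal sketch of this pairwise comparison, under the same simplifying assumption as before: the score of a test set is the mean of per-sentence scores, standing in for BLEU. The name `paired_bootstrap` is hypothetical. The important detail is that both systems are scored on the same resampled sentence indices.

```python
import random


def paired_bootstrap(scores_a, scores_b, k=1000, seed=0):
    """Fraction of bootstrapped test sets on which system A beats system B.

    Assumption: per-sentence scores averaged into a test-set score,
    a stand-in for recomputing BLEU on each resampled set.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must be scored on the same sentences")
    rng = random.Random(seed)
    n = len(scores_a)
    wins = 0
    for _ in range(k):
        # One set of indices per iteration, shared by both systems.
        idx = [rng.randrange(n) for _ in range(n)]
        a = sum(scores_a[i] for i in idx) / n
        b = sum(scores_b[i] for i in idx) / n
        if a > b:  # ties do not count as a win for A
            wins += 1
    return wins / k
```

A return value of 0.95 corresponds to "A beats B on 950 out of 1000 test sets".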
 +
 +Answering Question 4: A is better with 97% confidence (and at least as good with 98% confidence).
 +
 +We don't think this corresponds to confidence -- it's just the proportion of times that A beat B, a kind of maximum-likelihood (ML) estimate. In fact, we could use the score differences between the two systems as input for a paired t-test (and get a true confidence interval).
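The test statistic for such a paired t-test could be computed roughly as below. The notes do not say which differences would be used, so the input here is simply assumed to be a list of paired score differences (A minus B) that can be treated as independent; `paired_t_statistic` is a hypothetical helper, and turning the statistic into a p-value or interval would additionally need the t distribution, which is left out.

```python
import math
import statistics


def paired_t_statistic(differences):
    """Return (t, degrees of freedom) for paired differences A - B.

    Assumption: the differences are independent observations; the
    null hypothesis is that their true mean is zero.
    """
    n = len(differences)
    if n < 2:
        raise ValueError("need at least two paired differences")
    mean_d = statistics.mean(differences)
    sd = statistics.stdev(differences)  # sample standard deviation
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    standard_error = sd / math.sqrt(n)
    return mean_d / standard_error, n - 1
```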
 +
 +Notes on p-value: the concepts of confidence intervals and significance testing were developed independently by different researchers and were only later merged. The p-value is often misunderstood and misused.
 +
 +Question 5: let's define X as the population (all possible sentences). We have 2 systems, A and B, evaluated on a sample S.
 +The test statistic T is the difference between A and B, T = delta(score(A), score(B)).
 +The null hypothesis H_0 is "the true score of A is equal to the true score of B".
 +The p-value is, in general, the probability of observing data at least as extreme as the actual sample, given that the null hypothesis holds. Formally:
 +
 +  p-value = P(T(X) >= T(S) | H_0)
 +
 +We do not prove our (alternative) hypothesis -- we can (with some confidence) reject the null hypothesis.
 +
 +The p-value is **not** the probability of the null hypothesis. It only says how (im)probable the **data** is given H_0, i.e. P(S|H_0), not P(H_0|S).
