courses:rg:2012:sigtest-mt [2012/11/12 17:43] (current) tamchyna

===== Presentation =====

==== Introduction ====
Philipp Koehn's paper on statistical significance tests for machine translation evaluation.
How do we obtain multiple translation systems? Translate into English from a number of different languages (trained on Europarl).

==== Section 3 ====

Initial experiment: divide the 30000 translated sentences into consecutive chunks of 300 sentences (100 test sets). BLEU measured on the individual test sets then varies quite a bit.
Compared with **broad sampling**, which also creates 100 test sets:

  * sentences 1, 301, 601, ... → test set 1
  * sentences 2, 302, 602, ... → test set 2

BLEU scores become more stable => this procedure leads to a more representative test set.

We are not sure whether this can really be generalized -- we only have two very similar systems (identical setups, only trained on different data).
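The effect of broad sampling can be sketched in a few lines; the drifting score stream below is an assumption standing in for topic/document effects that make consecutive sentences score similarly:

```python
import random
import statistics

random.seed(0)
N, n_sets, set_size = 30000, 100, 300

# Toy per-sentence scores with a slow drift across the corpus -- an
# assumption mimicking topic/document effects (not real BLEU statistics).
scores = [random.gauss(0.3 + 0.1 * i / N, 0.05) for i in range(N)]

# Consecutive chunks: sentences 0-299, 300-599, ...
consecutive = [scores[k * set_size:(k + 1) * set_size] for k in range(n_sets)]
# Broad sampling: test set k takes every 100th sentence starting at k.
interleaved = [scores[k::n_sets] for k in range(n_sets)]

# Spread of the per-test-set means: smaller spread = more stable scores.
spread_consecutive = statistics.pstdev([statistics.mean(c) for c in consecutive])
spread_interleaved = statistics.pstdev([statistics.mean(c) for c in interleaved])
```

With the drifting scores, the interleaved sets show a much smaller spread of means, matching the observation that broad sampling yields more representative test sets.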

==== Section 4 ====

Significance tests are used to estimate an interval in which the true system score lies. We use Student's t-distribution.

We are interested in the mean of sentence scores, which (given enough samples) is approximately normally distributed by the Central Limit Theorem, so the assumption is OK.
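A minimal sketch of such a confidence interval for the mean of per-sentence scores (the scores are made up, and the critical value is hardcoded from a t-table rather than computed):

```python
import statistics

# Toy per-sentence scores of some sentence-level metric (illustrative).
scores = [0.31, 0.28, 0.35, 0.22, 0.40, 0.33, 0.27, 0.30, 0.36, 0.25]

n = len(scores)
mean = statistics.mean(scores)
stderr = statistics.stdev(scores) / n ** 0.5

# Two-sided 95% critical value of Student's t for df = n - 1 = 9,
# taken from a t-table.
t_crit = 2.262

ci = (mean - t_crit * stderr, mean + t_crit * stderr)
```

The interval shrinks as n grows (stderr scales with 1/sqrt(n)) and t_crit approaches the normal-distribution value 1.96.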

==== Section 5 ====

We cannot use the t-distribution for BLEU because it does not factorise like the metric in Section 4: BLEU computed on the full test set != the average of per-sentence BLEU scores.
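The non-decomposability is easy to see on a simplified pooled-precision analogue of BLEU's n-gram precision (toy numbers, no clipping or brevity penalty):

```python
# Per-sentence (n-gram matches, candidate length) pairs -- toy numbers.
sentences = [(2, 2), (1, 4)]

# Corpus-level metric: pool the counts over the whole test set first,
# which is how BLEU aggregates its n-gram statistics.
corpus_level = sum(m for m, _ in sentences) / sum(l for _, l in sentences)

# Averaging the per-sentence scores gives a different number.
avg_sentence = sum(m / l for m, l in sentences) / len(sentences)

# corpus_level = 3/6 = 0.5, avg_sentence = (1.0 + 0.25)/2 = 0.625
```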

We can (reliably) estimate confidence intervals by translating a large number of test sets and observing the mean and variance. But we don't want to do that.

The key assumption: drawing sentences from one test set with replacement K times is as good as having K different test sets (for estimating the mean and variance of the BLEU score).

The generated test sets are of the same size as the original test set. We don't want to take fewer sentences because that would weaken our test.

Bootstrap resampling gives us K BLEU scores. We sort them and discard the top and bottom 2.5% (so 95% of the scores remain). The minimum and maximum of the remaining scores then define the confidence interval.

We do not assume a normal distribution of the scores (answering Question 3) -- but they will be normally distributed anyway (we don't know if we could prove it, though). Bootstrap resampling is distribution-independent in general.
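The percentile procedure above can be sketched as follows; the per-sentence statistics and the pooled stand-in metric are assumptions (real BLEU pools clipped n-gram counts for n = 1..4 and applies a brevity penalty):

```python
import random

def corpus_score(pairs):
    # Stand-in corpus-level metric: pooled matches / pooled length.
    matches = sum(m for m, _ in pairs)
    length = sum(l for _, l in pairs)
    return matches / length

random.seed(1)
# Toy per-sentence (matches, length) statistics for one 300-sentence test set.
test_set = [(random.randint(5, 20), 20) for _ in range(300)]

K = 1000
boot_scores = []
for _ in range(K):
    # Resample the SAME number of sentences, with replacement.
    sample = [random.choice(test_set) for _ in range(len(test_set))]
    boot_scores.append(corpus_score(sample))

boot_scores.sort()
cut = int(0.025 * K)  # drop the bottom and top 2.5%
lower, upper = boot_scores[cut], boot_scores[K - cut - 1]
```

[lower, upper] is then the 95% confidence interval for the corpus score.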

==== Section 6 ====

Bootstrap resampling for pairwise comparisons. Compute the BLEU scores for each system on all bootstrapped test sets. If A beats B in 950 out of 1000 test sets, then A is better with 95% confidence.

Answering Question 4: A is better with 97% confidence (and at least as good with 98% confidence).

We don't think this corresponds to confidence -- it's just the proportion of times that A beat B, a kind of maximum-likelihood estimate. In fact, we could use the differences between the systems as input for a paired t-test (and get a true confidence interval).
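Both variants (bootstrap win-counting and the paired t-test suggested above) can be sketched on toy sentence-level scores; the systems, scores, and numbers are all illustrative assumptions:

```python
import random
import statistics

random.seed(2)
n = 300

# Toy per-sentence scores of systems A and B on the SAME test set
# (paired by sentence); B is constructed to be slightly worse than A.
a = [random.gauss(0.32, 0.10) for _ in range(n)]
b = [ai - random.gauss(0.02, 0.05) for ai in a]

# Paired bootstrap: resample sentence indices, score both systems on
# each resampled test set, and count how often A beats B.
K = 1000
wins = 0
for _ in range(K):
    idx = [random.randrange(n) for _ in range(n)]
    if statistics.mean(a[i] for i in idx) > statistics.mean(b[i] for i in idx):
        wins += 1
win_rate = wins / K  # the paper's "confidence" is this proportion

# The alternative from the note: a paired t-test on per-sentence differences.
d = [ai - bi for ai, bi in zip(a, b)]
t_stat = statistics.mean(d) / (statistics.stdev(d) / n ** 0.5)
```

The t statistic can then be compared against the Student's t critical value for df = n - 1, which yields a proper significance level instead of a win proportion.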

Notes on the p-value: concepts of confidence intervals/...

Question 5: let's define X as the population (all possible sentences); we have two systems evaluated on a sample S.
The test statistic T is the difference between A and B, T = delta(score(A), score(B)).
The null hypothesis is "the true score of A is equal to the true score of B".
The p-value is, in general, the probability of observing data at least as extreme as ours, given that the null hypothesis holds; formally:

p-value = P(T(X) >= T(S) | H_0)

We do not prove our (alternative) hypothesis -- we can only (with some confidence) reject the null hypothesis.

The p-value is **not** the probability of the null hypothesis. It only says how (im)probable the **data** is given H_0, i.e. P(S|H_0), not P(H_0|S).