Does the bootstrap resampling (Section 5) assume normal (Gaussian) distribution of the scores of samples?

===== Question 4 =====

We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,
Based on Section 6, which system is better - A or B?
====== Presentation ======

  * We answered:
    * Question 1 - BLEU scores are: 1 - 1.0, 2 - not defined (0.0 or some smoothed value in practice), 3 - 0.2 (based on the incorrect formula in the paper, which is missing the 1/4); see the BLEU sketch after this list
    * Question 2 - broad sampling, samples distributed far apart -> {data_1, data_101, data_201, ...}
  * non-consecutive samples (far apart) - for each of the sets BLEU varies much less - +-1.5 %
  * they make an assumption and claim that there is no difference between comparing the output of 2 different MT systems and the output of 1 MT system that is just trained on different data
  * Lukas Zilka complained about this assumption - they should have conducted some experiments to support their claim, as there is nothing that suggests it holds
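
An illustrative BLEU sketch for Question 1 (not the paper's implementation; the function names and the crude ''smooth'' flag are made up): it computes the up-to-4-gram precisions, the brevity penalty and the geometric mean with the 1/4 weight, and shows why a hypothesis with no matching higher-order n-gram gets an undefined score (zero or smoothed in practice).

<code python>
# Minimal sentence-level BLEU sketch (uniform weights, up to 4-grams, single reference).
# Illustrative only - not the implementation used in the paper.
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i+n]) for i in range(len(tokens) - n + 1))

def bleu(hyp, ref, max_n=4, smooth=False):
    hyp, ref = hyp.split(), ref.split()
    log_prec_sum = 0.0
    for n in range(1, max_n + 1):
        h, r = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, r[g]) for g, c in h.items())   # clipped n-gram matches
        total = max(sum(h.values()), 1)
        if matches == 0:
            if not smooth:
                return 0.0      # strictly speaking log(0) makes BLEU undefined
            matches = 1e-9      # crude smoothing, just for illustration
        log_prec_sum += math.log(matches / total)
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))        # brevity penalty
    # 1/max_n is the geometric-mean weight (the 1/4 missing from the paper's formula)
    return bp * math.exp(log_prec_sum / max_n)

print(bleu("a b c d", "a b c d"))   # identical sentences -> 1.0
</code>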
===== Section 4, 5 =====

  * we cannot use Student's t-test
  * so for estimating the confidence intervals we will use randomized test set generation - e.g. we build 1000 new test sets of 300 sentences each out of our small test set of 300 sentences (i.e. we draw samples with replacement from the small test set, so we should get 1000 different test sets); see the sketch after this list
  * answer to Question 3 - they do not assume there is any particular distribution in the set of BLEU scores of the 1000 test sets (i.e. their method would work regardless of whether the distribution is normal, uniform or any other), but it is perhaps normally distributed
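
A minimal sketch of this randomized test set generation (the helper ''corpus_score'' is a made-up placeholder for a real corpus-level BLEU implementation): resample the test set with replacement 1000 times and read an empirical confidence interval off the sorted scores.

<code python>
# Bootstrap resampling sketch: estimate a confidence interval for a corpus-level
# metric by drawing sentences of the test set with replacement.
import random

def bootstrap_interval(sent_pairs, corpus_score, n_sets=1000, conf=0.95):
    """sent_pairs: list of (hypothesis, reference) pairs; corpus_score: placeholder metric."""
    size = len(sent_pairs)
    scores = []
    for _ in range(n_sets):
        resample = [random.choice(sent_pairs) for _ in range(size)]
        scores.append(corpus_score(resample))
    scores.sort()
    drop = int((1 - conf) / 2 * n_sets)      # e.g. drop 25 scores from each tail
    return scores[drop], scores[n_sets - 1 - drop]

# e.g. for a 300-sentence test set (hypothetical names):
# low, high = bootstrap_interval(test_pairs, my_corpus_bleu)
</code>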
===== Section 6 =====

  * they use bootstrap resampling to compare 2 systems; we want to determine whether system 1 is better than system 2, and we determine that from the set of differences of the systems' scores on the bootstrapped test sets
  * so we determine in what percent of cases system 1 beats system 2, and that's our final confidence that system 1 is better than system 2 (e.g. 45 times out of 50 -> 90% confidence); see the sketch after this list
  * the rest of the paper just proves that the assumption is correct
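
A sketch of this paired comparison (again with a made-up ''corpus_score'' placeholder): both systems are scored on the same resampled test sets and the fraction of sets on which system 1 wins is reported as the confidence.

<code python>
# Paired bootstrap comparison sketch: on each resampled test set, score both systems
# and count how often system 1 beats system 2; the fraction of wins is the confidence
# that system 1 is better (e.g. 45 wins out of 50 -> 90%).
import random

def paired_bootstrap(hyps1, hyps2, refs, corpus_score, n_sets=1000):
    indices = range(len(refs))
    wins1 = 0
    for _ in range(n_sets):
        sample = [random.choice(indices) for _ in indices]   # same sentences for both systems
        score1 = corpus_score([(hyps1[i], refs[i]) for i in sample])
        score2 = corpus_score([(hyps2[i], refs[i]) for i in sample])
        if score1 > score2:
            wins1 += 1
    return wins1 / n_sets   # confidence that system 1 is better than system 2
</code>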
===== Martin's comments =====

  * two philosophical views of the p-value - Fisher's and Neyman-Pearson's
  * we usually set a null hypothesis H0: the systems are the same, and an alternative hypothesis HA: there is a difference between the systems; P(H0) + P(HA) = 1
  * p-value = P(T(X) > T(x)|H0)
  * P(T(X) > T(x)|H0) is the probability, assuming H0 holds, that the test statistic T on a random sample X is more extreme than its value on the observed data x
  * unfortunately we tend to view the p-value as P(H0|x), which it is not, and we would need to apply Bayes' theorem to get it
  * bootstrap resampling can be viewed as p-value = P(d(x) < 0|H0) = P(d(x) > 2*d(x_orig)|H0), where d(x) is the score difference on a bootstrapped test set and d(x_orig) is the difference on the original test set; a small sketch follows below
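
A small sketch of this p-value view (own illustration, not code from the paper; it assumes d(x_orig) > 0 and a roughly symmetric bootstrap distribution): the bootstrap differences are shifted so that they are centred at zero to approximate H0, and the resulting one-sided p-value coincides with the fraction of sign flips.

<code python>
# Sketch of the p-value interpretation of bootstrap resampling (illustration only).
# d_orig: scoreA - scoreB on the original test set (assumed > 0 here);
# d_boot: list of scoreA - scoreB values on the bootstrapped test sets.
def bootstrap_p_value(d_orig, d_boot):
    n = len(d_boot)
    # approximate the null distribution by centring the bootstrap differences at 0
    p_shifted = sum(1 for d in d_boot if d - d_orig > d_orig) / n   # = P(d > 2*d_orig | H0)
    # for a roughly symmetric bootstrap distribution this matches the sign-flip fraction
    p_signflip = sum(1 for d in d_boot if d < 0) / n                # = P(d < 0 | H0)
    return p_shifted, p_signflip
</code>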
- | |||
- | |||
- | * **Section 3** describes the data |