Differences

This shows you the differences between two versions of the page.

--- courses:rg:2012:sigtest-mt-zilka [2012/11/14 18:02]
zilka
+++ courses:rg:2012:sigtest-mt-zilka [2013/12/02 22:18] (current)
popel
@@ Line 13: / Line 13: @@
 Does the bootstrap resampling (Section 5) assume normal (Gaussian) distribution of the scores of samples?
 ===== Question 4 =====
-We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,-950,-900,-850 ... -5,0,0,0,0,0,0,0,0,0,0,1,2,3 ... 970.
+We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,-950,-900,-850 ... -50,0,0,0,0,0,0,0,0,0,0,1,2,3 ... 970.
 Based on Section 6, which system is better - A or B?
@@ Line 34: / Line 34: @@
 ====== Presentation ======
   * We answered:
-    * Question 1 - BLEU scores are: 1 - 1.0, 2 - 0.0 (or some smoothed value), 3 - 0.2
+    * Question 1 - BLEU scores are: 1 - 1.0, 2 - not defined (0.0 or some smoothed value in practice), 3 - 0.2 (based on the incorrect formula in the paper which is missing 1/4)
     * Question 2 - broad sampling, samples far apart distributed -> {data_1, data_101, data_201, ...}
@@ Line 43: / Line 43: @@
     * non-consecutive samples (broad apart) - for each of the sets BLEU varies much less - +-1.5 %
   * they make an assumption and claim that there is no difference between comparing the output of 2 different MT systems and the output of 1 MT system that is just trained with different data
-    * Lukas Zilka complained about this assumption - they should have conducted some experiments to support their claim, as there is nothing that suggest we can generalize like that
+    * Lukas Zilka complained about this assumption - they should have conducted some experiments to support their claim, as there is nothing that suggests we can generalize like that
 ===== Section 4, 5 =====
@@ Line 56: / Line 56: @@
 ===== Martin's explanation of p-values =====
-  * two philosophical views of p-value - Fisher's and Person's - unfortunately their are mixed in modern textbooks which only confuses us
+  * two philosophical views of p-value - Fisher's and Pearson's - unfortunately they are mixed in modern textbooks, which only confuses us
-  * we always set a null hypothesis H0 as: systems are the same, and alternative hypothesis HA: there is difference in the systems; P(H0) + P(HA) = 1
+  * we usually set a null hypothesis H0 as: systems are the same, and an alternative hypothesis HA: there is a difference between the systems; P(H0) + P(HA) = 1
   * p-value =
     * P(T(X)>=T(x_orig)|H0) = P(x|H0) = //if the compared systems are the same, what's the probability that we see this data//
-    * unfortunately we tend to view the p-value as P(H0|x) which it is not and we need to apply the Bayes's theorem to get it
+    * unfortunately we tend to view the p-value as P(H0|x), which it is not; we need to apply Bayes' theorem to get it
-  * bootstrap resampling can be viewed as p-value=P(d(x) > d(x_orig)|H0), and is approximated by S/B; where S is number of system 1 beating system 2 and B is number of measurements
+  * bootstrap resampling can be viewed as p-value = P(d(x) < 0|H0) = P(d(x) > 2*d(x_orig)|H0), and is approximated by S/B, where S is the number of resampled test sets on which system 2 beats system 1 and B is the total number of resampled test sets
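
A minimal Python sketch of the S/B approximation from the last bullet. It rests on assumptions that are not in the notes: the metric is a plain sum of per-sentence scores (corpus-level BLEU would have to be recomputed on every resampled test set instead), d is the score of system 1 minus the score of system 2, and the function name `paired_bootstrap` with its parameters is hypothetical.

```python
import random


def paired_bootstrap(scores_1, scores_2, num_resamples=1000, seed=0):
    """Approximate the bootstrap p-value as S / B.

    scores_1, scores_2: per-sentence scores of system 1 and system 2 on the
    same test sentences (assumption: the metric is additive over sentences).
    Returns (d_orig, S, S / B), where S counts the resampled test sets on
    which system 2 beats system 1 and B is num_resamples.
    """
    if len(scores_1) != len(scores_2):
        raise ValueError("both systems must be scored on the same sentences")
    n = len(scores_1)
    rng = random.Random(seed)

    # Difference on the original test set.
    d_orig = sum(scores_1) - sum(scores_2)

    s = 0
    for _ in range(num_resamples):
        # Draw n sentence indices with replacement; both systems are
        # evaluated on the same resampled sentences (paired test).
        indices = [rng.randrange(n) for _ in range(n)]
        d = sum(scores_1[i] - scores_2[i] for i in indices)
        if d < 0:  # system 2 beats system 1 on this resampled test set
            s += 1

    return d_orig, s, s / num_resamples


if __name__ == "__main__":
    # Made-up per-sentence scores, for illustration only.
    system_1 = [3, 5, 2, 4, 4, 1, 5, 3]
    system_2 = [2, 5, 3, 3, 4, 1, 4, 2]
    d_orig, s, p = paired_bootstrap(system_1, system_2)
    print(f"d(x_orig) = {d_orig}, S = {s}, S/B = {p}")
```

A small S/B means that system 2 rarely beats system 1 on the resampled test sets; ties (d = 0) are not counted towards S in this sketch.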


Institute of Formal and Applied Linguistics Wiki
