
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.


courses:rg:2012:sigtest-mt-zilka [2012/11/14 18:02] zilka
courses:rg:2012:sigtest-mt-zilka [2013/12/02 22:18] (current) popel
Line 13:
 Does the bootstrap resampling (Section 5) assume normal (Gaussian) distribution of the scores of samples?
 ===== Question 4 =====
-We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,-950,-900,-850 ... -5,0,0,0,0,0,0,0,0,0,0,1,2,3 ... 970.
+We bootstrapped 1000 test sets, computed scoreA-scoreB on each, and we got -1000,-950,-900,-850 ... -50,0,0,0,0,0,0,0,0,0,0,1,2,3 ... 970.
 
 Based on Section 6, which system is better - A or B?
Line 34:
 ====== Presentation ======
   * We answered:
-    * Question 1 - BLEU scores are: 1 - 1.0, 2 - 0.0 (or some smoothed value), 3 - 0.2
+    * Question 1 - BLEU scores are: 1 - 1.0, 2 - not defined (0.0 or some smoothed value in practice), 3 - 0.2 (based on the incorrect formula in the paper which is missing 1/4)
     * Question 2 - broad sampling, samples spread far apart -> {data_1, data_101, data_201, ...}
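The Question 1 answers depend on how BLEU combines its n-gram precisions. Below is a minimal Python sketch of corpus-level BLEU, assuming whitespace tokenization and a single reference per sentence (both assumptions; the function name `bleu` is illustrative). The four clipped n-gram precisions are combined by a geometric mean, i.e. each log precision is weighted by 1/4, and multiplied by the brevity penalty. Without smoothing, a zero precision makes the logarithm undefined; this sketch returns 0.0 in that case, which is one common convention.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Counter of all n-grams (as tuples) in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypotheses, references, max_n=4):
    """Corpus BLEU with one reference per sentence and uniform weights 1/max_n.

    Returns 0.0 when some n-gram precision is zero (log(0) is undefined);
    real toolkits often apply smoothing instead.
    """
    matches = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp, ref = hyp.split(), ref.split()
        hyp_len += len(hyp)
        ref_len += len(ref)
        for n in range(1, max_n + 1):
            hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
            # Clipped ("modified") matches: an n-gram counts at most as often as in the reference.
            matches[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            totals[n - 1] += max(len(hyp) - n + 1, 0)
    if min(matches) == 0:
        return 0.0  # geometric mean undefined without smoothing
    # Geometric mean of the precisions: the 1/max_n (= 1/4) weight on each log precision.
    log_precision = sum(math.log(m / t) for m, t in zip(matches, totals)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


print(bleu(["the cat sat on the mat"], ["the cat sat on the mat"]))  # 1.0
print(bleu(["the cat sat"], ["the cat sat"]))  # 0.0: no 4-grams at all
```

The second call shows why short sentences are a problem: a three-word hypothesis has no 4-grams, so even an exact match gets 0.0 here unless smoothing is added.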
  
Line 43:
     * non-consecutive samples (far apart) - for each of the sets BLEU varies much less - +-1.5 %
   * they make an assumption and claim that there is no difference between comparing the output of 2 different MT systems and the output of 1 MT system that is just trained on different data
-    * Lukas Zilka complained about this assumption - they should have conducted some experiments to support their claim, as there is nothing that suggest we can generalize like that
+    * Lukas Zilka complained about this assumption - they should have conducted some experiments to support their claim, as there is nothing that suggests we can generalize like that
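The difference between consecutive and broad (non-consecutive) test sets can be made concrete with a small Python sketch. Sentence IDs stand in for sentences; the helper names and the choice of k = 100 sets (matching the {data_1, data_101, data_201, ...} example) are illustrative assumptions.

```python
def consecutive_sets(data, k):
    """Split data into k blocks of adjacent items; assumes len(data) is divisible by k."""
    size = len(data) // k
    return [data[i * size:(i + 1) * size] for i in range(k)]


def broad_sets(data, k):
    """Broad sampling: set i takes items i, i+k, i+2k, ... so each set spans the whole corpus."""
    return [data[i::k] for i in range(k)]


sentence_ids = list(range(1, 301))  # hypothetical corpus of 300 sentences
print(consecutive_sets(sentence_ids, 100)[0])  # [1, 2, 3]
print(broad_sets(sentence_ids, 100)[0])        # [1, 101, 201]
```

Each broad set draws sentences from every part of the corpus and so mixes documents and topics, while a consecutive block may come from a single document; this is one plausible reason BLEU varies more across consecutive blocks.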
  
 ===== Section 4, 5 =====
Line 56:
 
 ===== Martin's explanation of p-values =====
-  * two philosophical views of p-value - Fisher's and Person's - unfortunately they are mixed in modern textbooks, which only confuses us
+  * two philosophical views of p-value - Fisher's and Pearson's - unfortunately they are mixed in modern textbooks, which only confuses us
-  * we always set a null hypothesis H0 as: systems are the same, and alternative hypothesis HA: there is a difference between the systems; P(H0) + P(HA) = 1
+  * we usually set a null hypothesis H0 as: systems are the same, and alternative hypothesis HA: there is a difference between the systems; P(H0) + P(HA) = 1
   * p-value =
     * P(T(X)>=T(x_orig)|H0) = P(x|H0) = //if the compared systems are the same, what's the probability that we see this data or data at least as extreme//
-    * unfortunately we tend to view the p-value as P(H0|x), which it is not, and we need to apply Bayes'theorem to get it
+    * unfortunately we tend to view the p-value as P(H0|x), which it is not, and we need to apply Bayes' theorem to get it
-  * bootstrap resampling can be viewed as p-value=P(d(x) > d(x_orig)|H0), which is approximated by S/B, where S is the number of bootstrap samples on which the system that lost on the original test set wins and B is the total number of bootstrap samples
+  * bootstrap resampling can be viewed as p-value = P(d(x) < 0|H0) = P(d(x) > 2*d(x_orig)|H0), which is approximated by S/B, where S is the number of bootstrap samples on which the system that lost on the original test set wins and B is the total number of bootstrap samples
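The S/B estimate can be sketched as paired bootstrap resampling in Python. This is a simplification under stated assumptions, not the paper's exact procedure: it assumes an additive per-sentence score (the test-set score is the mean of the sentence scores), whereas BLEU has to be recomputed from n-gram counts on every resampled set. The function name and its parameters are hypothetical. The difference d is oriented so that d(x_orig) > 0, and S counts resampled test sets with d < 0.

```python
import random


def paired_bootstrap_pvalue(scores_a, scores_b, num_samples=1000, seed=0):
    """Estimate p = S/B by paired bootstrap resampling (hypothetical helper).

    scores_a, scores_b: per-sentence scores of systems A and B on the same
    test set, assumed additive so a test-set score is the mean sentence score.
    S = number of resampled test sets on which the original loser wins;
    B = num_samples.
    """
    n = len(scores_a)
    assert n == len(scores_b) and n > 0
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    d_orig = sum(diffs) / n
    sign = 1 if d_orig >= 0 else -1  # orient d so that the original winner has d > 0
    rng = random.Random(seed)
    s = 0
    for _ in range(num_samples):
        # Paired: one set of resampled sentence indices is used for both systems.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sign * sum(sample) / n < 0:
            s += 1
    return s / num_samples


a = [0.6, 0.4, 0.7, 0.5, 0.55]  # made-up sentence scores for illustration
b = [0.5, 0.45, 0.6, 0.5, 0.5]
print(paired_bootstrap_pvalue(a, b))
```

Counting samples with d > 2*d(x_orig) instead gives the other form in the notes; the two estimates agree when the bootstrap distribution of d is roughly symmetric around d(x_orig).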
  
