Differences
This shows you the differences between two versions of the page.
Next revision | Previous revision | ||
courses:rg:2013:false-positive-psychology [2013/10/10 00:19] popel created |
courses:rg:2013:false-positive-psychology [2013/10/14 15:47] (current) popel answers |
||
---|---|---|---|
Line 4: | Line 4: | ||
===== Questions ===== | ===== Questions ===== | ||
- | - Are the described issues (researcher degrees of freedom etc.) relevant also for NLP research (papers)? Can you name some similarities and differences (concerning the described issues) between NLP and psychological research, ideally with concrete examples? | + | - Are the described issues (researcher degrees of freedom etc.) relevant also for NLP research (papers)? Can you name some similarities and differences (concerning the described issues) between NLP and psychological research? |
- What does it mean when "p ≤ 0.05"? I.e., what is the definition of p-value? | - What does it mean when "p ≤ 0.05"? I.e., what is the definition of p-value? | ||
- | - What does " | + | - What does " |
+ | ===== Answers ===== | ||
+ | |||
+ | ==== 1. ==== | ||
+ | Most of the described issues are relevant for NLP as well (cf. hyper-parameters, | ||
+ | When human evaluation is involved in NLP, it shares many methodological properties/ | ||
+ | However, in many NLP tasks we have (only) automatic evaluation (based on human-annotated gold data). | ||
+ | |||
+ | Psy: "find evidence that an effect exists" | ||
+ | NLP: "our method for solving xy is better" | ||
+ | |||
+ | NLP: easier to replicate experiments ("code and Makefiles" | ||
+ | PSY: replication costs money and time, and you need different participants (so that they are not influenced by the earlier run) | ||
+ | |||
+ | The last point is quite important. One could suggest that the same person listen to both the Beatles and Kalimba (with some time in between), but there is a risk that a long-term effect of the first experiment influences (skews) the second one. | ||
+ | |||
+ | ==== 2. ==== | ||
+ | The p-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true (see Wikipedia [[http:// | ||
+ | |||
+ | Using the formula for a one-tailed test: p-value = P(X ≥ x | H0) | ||
+ | where X is the statistic we are measuring (e.g. the difference between the average age of Kalimba-listeners and the average age of Beatles-listeners), | ||
+ | x is the value of X we have actually measured (e.g. 1.4 years), | ||
+ | and H0 is the null hypothesis (no effect, no difference between the two groups, i.e. the expected value of X is 0: the difference has a normal distribution with mean 0). | ||
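The one-tailed formula above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: X is taken to be normally distributed with mean 0 under H0, and the standard error of 0.8 years is a made-up value (the page gives only the observed difference of 1.4 years). The function name `one_tailed_p_value` is hypothetical.

```python
import math


def one_tailed_p_value(x, sigma):
    """Return P(X >= x | H0), assuming X ~ Normal(0, sigma^2) under H0."""
    z = x / sigma
    # Upper tail of the standard normal distribution: 1 - Phi(z).
    return 0.5 * math.erfc(z / math.sqrt(2))


# Observed difference from the text: 1.4 years.
# The standard error of 0.8 years is an assumption made for this example.
p = one_tailed_p_value(1.4, 0.8)
print(f"p-value = {p:.4f}")
```

With a larger assumed standard error the same observed difference would give a larger p-value, which is why the p-value cannot be read off the raw difference alone.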
+ | | ||
+ | | ||
+ | If you set the significance level to the traditional 0.05, a false positive occurs when p < 0.05 although the null hypothesis holds. | ||
+ | false-positive-rate | ||
+ | = P(p-value < 0.05 & H0) | ||
+ | = P(P(X ≥ x | H0) < 0.05 & H0) ≠ p-value | ||
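The distinction between a single p-value and the false-positive rate can be checked with a small simulation. The sketch below is an assumption-laden illustration, not part of the original answer: both groups are drawn from the same unit-variance normal distribution (so H0 is true in every simulated experiment), the variance is treated as known, and the helper names and sample sizes are hypothetical. Because the simulation conditions on H0, it estimates P(p-value < 0.05 | H0); the joint probability written above would additionally be multiplied by how often H0 holds.

```python
import math
import random


def one_tailed_p_value(x, sigma):
    """Return P(X >= x | H0), assuming X ~ Normal(0, sigma^2) under H0."""
    return 0.5 * math.erfc(x / (sigma * math.sqrt(2)))


def simulate_false_positive_rate(n_experiments=20000, n_per_group=20,
                                 alpha=0.05, seed=0):
    """Fraction of experiments with p < alpha when H0 is true in all of them."""
    rng = random.Random(seed)
    # Standard deviation of the difference of two sample means
    # (unit variance in both groups, assumed known).
    sigma_diff = math.sqrt(2.0 / n_per_group)
    false_positives = 0
    for _ in range(n_experiments):
        mean_a = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        mean_b = sum(rng.gauss(0.0, 1.0) for _ in range(n_per_group)) / n_per_group
        # Each experiment yields its own p-value; only some fall below alpha.
        if one_tailed_p_value(mean_a - mean_b, sigma_diff) < alpha:
            false_positives += 1
    return false_positives / n_experiments


rate = simulate_false_positive_rate()
print(f"estimated false-positive rate under H0: {rate:.3f}")
```

The p-value changes from experiment to experiment, while the estimated rate is a property of the whole procedure and should come out close to the chosen threshold of 0.05.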
+ | |||
+ | ==== 3. ==== | ||
+ | Alpha was originally defined by Neyman & Pearson as Type I error rate, but this is incompatible with p-value and Fisher' | ||
+ | Alpha is also used as a name for the " | ||
+ | When multiple hypotheses are tested (multiple comparisons), this threshold should be lowered (but how?). | ||
+ | See [[http:// |
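One common (and conservative) way to lower the threshold is the Bonferroni correction, which divides alpha by the number of tests. The page itself does not say which method was intended, so the sketch below is only one possible answer to "but how?". The function names are hypothetical, and the family-wise error formula assumes independent tests with all null hypotheses true.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under the Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def family_wise_error_rate(per_test_alpha, n_tests):
    """P(at least one false positive) for n independent tests, all nulls true."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


n_tests = 10
uncorrected = family_wise_error_rate(0.05, n_tests)
corrected = family_wise_error_rate(bonferroni_threshold(0.05, n_tests), n_tests)
print(f"uncorrected: {uncorrected:.3f}, Bonferroni-corrected: {corrected:.3f}")
```

Without a correction, ten independent tests at 0.05 give roughly a 40% chance of at least one false positive; with the corrected per-test threshold the chance stays just below 0.05.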