===== Answers =====

==== 1. ====
Most of the described issues are relevant for NLP as well (cf. hyper-parameters, ...).
When human evaluation is involved in NLP, it shares many methodological properties/problems with psychology experiments.
However, in many NLP tasks we have (only) automatic evaluation (based on human-annotated gold data).

Psy: "find evidence that an effect exists"
NLP: "our method for solving xy is better"

NLP: easier to replicate experiments ("code and Makefiles")
Psy: replication costs money and time, and you need different people (so they are not influenced)

The last point is quite important. You may suggest that the same person should listen to both the Beatles and Kalimba (after some time), but there is a risk that a long-term effect of the first experiment influences (skews) the second one.

| + | ==== 2. ==== | ||
| + | P-value is the probability of obtaining a test statistic at least as extreme as the one that was actually observed, assuming that the null hypothesis is true. (see Wikipedia [[http:// | ||
| + | |||
Using the formula for a one-tailed test: p-value = P( X≥x | H0 ),
where X is the statistic we are measuring (e.g. the difference between the average age of Kalimba listeners and the average age of Beatles listeners),
x is the value of X we have actually measured (e.g. 1.4 years),
and H0 is the null hypothesis (no effect: the true difference between the two groups is 0, so X has a normal distribution with mean 0).

If you set the traditional significance level to 0.05, you get a false positive when p < 0.05 but the null hypothesis actually holds.
  false-positive rate = P( p-value < 0.05 | H0 )
                      = P( P( X≥x | H0 ) < 0.05 | H0 )  != p-value

| + | ==== 3. ==== | ||
Alpha was originally defined by Neyman & Pearson as the Type I error rate, but this is incompatible with the p-value and Fisher's approach.
Alpha is also used as a name for the "significance level", i.e. the threshold below which a p-value is considered significant.
When multiple experiments are tested, we should decrease this threshold (but how?).
See [[http://...
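
One common (though conservative) answer to the "but how?" question is the Bonferroni correction, sketched below with hypothetical p-values: each individual p-value is compared against alpha divided by the number of tests, which keeps the family-wise false-positive rate at or below alpha. (This is a standard remedy, not necessarily the one proposed in the linked article.)

<code python>
# Minimal sketch of the Bonferroni correction. The p-values are hypothetical;
# the point is only the mechanics of dividing the significance level by the
# number of tests performed.
alpha = 0.05
p_values = [0.003, 0.020, 0.041, 0.300]   # p-values from m related tests

m = len(p_values)
threshold = alpha / m                     # Bonferroni-corrected significance level

for i, p in enumerate(p_values, start=1):
    verdict = "reject H0" if p < threshold else "keep H0"
    print(f"test {i}: p = {p:.3f} (threshold {threshold:.4f}) -> {verdict}")
</code>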
