====== Michael Collins, Nigel Duffy: Convolution kernels for natural language ======

[[http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.69.8830&rep=rep1&type=pdf|Paper link]]

==== Questions ====

  - What is a generative model, what is a discriminative model, and what is their main difference?
  - What are the "fairly strong independence assumptions" in PCFGs? Come up with an example tree that cannot be modelled by a PCFG.
  - Derive and explain the formula for h(T1) * h(T2) at the bottom of page 3.
  - What is a convolution? Why are "convolution" kernels called that?
  - Find an error in one of the formulae in the paper.

==== Answers ====

  - Generative and discriminative models differ mainly in what they model:
    * **Generative models** use a two-step setup: they learn the class-conditional likelihood P(x|y) and the prior P(y), and apply Bayes' rule to obtain the posterior.
    * Equivalently, they model the joint distribution P(x,y) = P(x|y) P(y); the posterior is obtained by conditioning, P(y|x) = P(x,y) / P(x), where P(x) comes from marginalizing over y.
    * They learn more than is strictly needed, but they cope better with partially missing input data.
    * They can also "generate" synthetic inputs, although this ability is rarely used.
    * Examples: Naive Bayes, mixtures of Gaussians, HMMs, Bayesian networks, Markov random fields.
    * **Discriminative models** do everything in one step -- they learn the posterior P(y|x) directly as a function of some features of x.
    * They are simpler and can use many more features, but they handle missing inputs poorly.
    * Examples: SVMs, logistic regression, neural networks, k-NN, conditional random fields.
  - Each CFG rule expands only one level of the derivation tree, and in a PCFG each expansion is chosen independently given the parent nonterminal. Therefore, using "standard" nonterminals, a PCFG cannot model e.g. the agreement between the subject and the reflexive pronoun in this sentence (see also the toy example at the end of this page):
    * ''(S (NP (PRP He)) (VP (VBD saw) (NP (PRP himself))))''
    * It could be modelled with an augmentation of the nonterminal labels.
    * CFGs also cannot generate non-projective sentences.
    * These can, however, be modelled using traces.
  - The derivation is actually quite simple (a small Python sketch of the resulting computation is given at the end of this page):
    - h(T_a) \cdot h(T_b) = \sum_i h_i(T_a) \cdot h_i(T_b) (definition of the dot product)
    - = \sum_i \left(\sum_{n_a \in N_a} I_i(n_a)\right) \left(\sum_{n_b \in N_b} I_i(n_b)\right) (from the definition of I_i in the paragraph above the formula)
    - = \sum_i \sum_{n_a \in N_a} \sum_{n_b \in N_b} I_i(n_a) \cdot I_i(n_b) (distributivity: (a+b)(c+d) = ac+ad+bc+bd)
    - = \sum_{n_a \in N_a} \sum_{n_b \in N_b} \sum_i I_i(n_a) \cdot I_i(n_b) (change of summation order)
    - = \sum_{n_a \in N_a} \sum_{n_b \in N_b} C(n_a, n_b) (definition of C)
  - A convolution is defined as (f*g)_k = \sum_i f_i g_{k-i}, i.e. it sums over pairs of positions that //complement// each other (their indices add up to k). Here we instead sum up a measure of how //similar// two substructures are, so strictly speaking it is something different. The main idea is the same, though -- a kernel over complex structures is built by combining kernels over their smaller parts (see the short sketch at the end of this page).
  - There is a (tiny) error in the last formula of Section 3: parse trees themselves cannot be multiplied by the weight vector, so the left-hand side should read \bar{w}^{*} \cdot h(\mathbf{x}) = \dots

==== Report ====

Most of the time we discussed the answers to the questions. Other issues raised in the discussion were:

  * **Usability** -- the approach is only usable for //reranking// the output of some other parser.
  * **Scalability** -- only 800 sentences with 20 parse candidates per sentence are used for training. We believe that for large data sets (millions of examples) the approach would become too expensive.
  * **Evaluation** -- it looks as if a non-standard evaluation metric was used to obtain "better" results; the standard metric here would be the F1 score.
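
The following toy sketch illustrates the PCFG independence assumption from answer 2. The grammar, the rule probabilities and the ''tree_prob'' helper are invented for this example (they are not taken from the paper): because each rule is applied independently given its parent, the probability of a tree is just the product of its rule probabilities, so the choice of the reflexive pronoun cannot depend on the subject.

<code python>
# Toy PCFG: rule -> probability (all numbers invented for illustration).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("PRP",)): 1.0,
    ("VP", ("VBD", "NP")): 1.0,
    ("PRP", ("He",)): 0.4,
    ("PRP", ("She",)): 0.3,
    ("PRP", ("himself",)): 0.2,
    ("PRP", ("herself",)): 0.1,
    ("VBD", ("saw",)): 1.0,
}

def tree_prob(tree):
    # tree = (label, [subtrees or word strings]); the tree probability is the
    # product of the probabilities of the rules used, one per internal node.
    label, children = tree
    rhs = tuple(c if isinstance(c, str) else c[0] for c in children)
    p = RULE_PROB[(label, rhs)]
    for c in children:
        if not isinstance(c, str):
            p *= tree_prob(c)
    return p

he_himself = ("S", [("NP", [("PRP", ["He"])]),
                    ("VP", [("VBD", ["saw"]),
                            ("NP", [("PRP", ["himself"])])])])
he_herself = ("S", [("NP", [("PRP", ["He"])]),
                    ("VP", [("VBD", ["saw"]),
                            ("NP", [("PRP", ["herself"])])])])

# The ratio of the two probabilities is fixed by P(himself) vs. P(herself)
# alone -- the PCFG cannot prefer the agreeing reflexive given the subject.
print(tree_prob(he_himself), tree_prob(he_herself))
</code>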
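
To make the last line of the derivation in answer 3 concrete, here is a minimal Python sketch of the tree kernel computed via the node-pair recursion C(n_a, n_b) described in Section 3 of the paper. The ''Tree'' class and the helper names are our own assumptions, and refinements discussed in the paper (e.g. downweighting larger subtrees) are omitted.

<code python>
class Tree:
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

    def production(self):
        # The CFG production expanding this node, e.g. ('S', ('NP', 'VP')).
        return (self.label, tuple(c.label for c in self.children))

    def nodes(self):
        # All internal (non-leaf) nodes of the tree.
        if self.children:
            yield self
            for c in self.children:
                yield from c.nodes()

def is_preterminal(node):
    # A pre-terminal dominates a single leaf (a word).
    return len(node.children) == 1 and not node.children[0].children

def C(n1, n2):
    # Number of common subtrees rooted at n1 and n2.
    if n1.production() != n2.production():
        return 0
    if is_preterminal(n1) and is_preterminal(n2):
        return 1
    prod = 1
    for c1, c2 in zip(n1.children, n2.children):
        prod *= 1 + C(c1, c2)
    return prod

def tree_kernel(t1, t2):
    # h(T_a) . h(T_b) = sum over all node pairs of C(n_a, n_b),
    # i.e. the last line of the derivation in answer 3.
    return sum(C(n1, n2) for n1 in t1.nodes() for n2 in t2.nodes())

if __name__ == "__main__":
    # (S (NP (PRP He)) (VP (VBD saw) (NP (PRP himself))))
    t1 = Tree("S", [Tree("NP", [Tree("PRP", [Tree("He")])]),
                    Tree("VP", [Tree("VBD", [Tree("saw")]),
                                Tree("NP", [Tree("PRP", [Tree("himself")])])])])
    # (S (NP (PRP She)) (VP (VBD saw) (NP (PRP him))))
    t2 = Tree("S", [Tree("NP", [Tree("PRP", [Tree("She")])]),
                    Tree("VP", [Tree("VBD", [Tree("saw")]),
                                Tree("NP", [Tree("PRP", [Tree("him")])])])])
    print(tree_kernel(t1, t2))
</code>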
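
Finally, the discrete convolution from answer 4, written out in a few lines of Python; note that the indices i and k-i complement each other (they sum to k), which is exactly what the answer contrasts with the similarity-based tree kernel.

<code python>
def convolve(f, g):
    # (f * g)_k = sum_i f_i * g_{k-i}, for finite sequences indexed from 0.
    out = [0] * (len(f) + len(g) - 1)
    for k in range(len(out)):
        out[k] = sum(f[i] * g[k - i]
                     for i in range(len(f))
                     if 0 <= k - i < len(g))
    return out

print(convolve([1, 2, 3], [1, 1]))   # [1, 3, 5, 3]
</code>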