Table of Contents

Michael Collins, Nigel Duffy: Convolution kernels for natural language

Paper link

Questions

  1. What is a generative model, what is a discriminative model and what is their main difference?
  2. What are the “fairly strong independence assumptions” in PCFG? Come up with an example tree that can't be modelled by a PCFG.
  3. Derive and explain the formula for h(T1)*h(T2) on page 3 at the bottom.
  4. What is a convolution? Why are “convolution” kernels called like this?
  5. Find an error in one of the formulae in the paper.

Answers

    • Generative models use a two-step setup. They learn class-conditional (likelihood) <latex>P(x|y)</latex>, prior <latex>P(y)</latex> and use the Bayes rule to obtain the posterior.
      • they learn the joint distributions: marginalize P(y), condition P(y|x) = P(x,y) / P(x)
      • They learn more than is actually needed, but are not prone to partially missing input data.
      • They are able to “generate” fake inputs, but this feat is not used very often.
      • Examples: Naive Bayes, Mixtures of Gaussians, HMM, Bayesian Networks, Markov Random Fields
    • Discriminative models do everything in one-step – they learn the posterior <latex>P(y|x)</latex> as a function of some features of <latex> x</latex>.
      • They are simpler and can use many more features, but are prone to missing inputs.
      • Examples: SVM, Logistic Regression, Neural network, k-NN, Conditional Random Fields
  1. Each CFG rule generates just one level of the derivation tree. Therefore, using “standard” nonterminals, it is not possible to generate e.g. this sentence:
    • (S (NP (PRP He)) (VP (VBD saw)(NP (PRP himself))))
      • It could be modelled with an augmentation of the nonterminal labels.
    • CFGs can't generate non-projective sentences.
      • But they can be modelled using traces.
  2. The derivation is actually quite simple:
    1. <latex>h(T_a)\cdot h(T_b) = \sum_i h_i(T_a) \cdot h_i(T_b)</latex> – (definition of the dot product)
    2. <latex>= \sum_i \left(\sum_{n_a \in N_a} I_i(n_a)\right) \left(\sum_{n_b \in N_b} I_i(n_b)\right)</latex> (from the definition of <latex>I</latex> in the paragraph above the formula)
    3. <latex>= \sum_i\sum_{n_a \in N_a}\sum_{n_b \in N_b} I_i(n_b)\cdot I_i(n_a)</latex> (since <latex>(a+b)(c+d) = ac+ad+bc+bd</latex>)
    4. <latex>= \sum_{n_a \in N_a}\sum_{n_b \in N_b}\sum_i I_i(n_b)\cdot I_i(n_a)</latex> (change summation order)
    5. <latex>= \sum_{n_a \in N_a}\sum_{n_b \in N_b}C(n_a, n_b)</latex> (definition of <latex> C </latex>)
  3. Convolution is defined like this: <latex>(f*g)_k = \sum_i f_i g_{k-i}</latex>, so it measures the presence of structures that complement each other. Here, we have a measure of structures that are similar. So it is something different. But the main idea is the same – we can combine smaller structures (kernels) into more complex ones.
  4. There is a (tiny) error in the last formula of Section 3. You cannot actually multiply tree parses, so it should read: <latex>\bar{w}^{*} \cdot h(\mathbf{x}) = \dots</latex>

Report

We discussed the answers to the questions most of the time. Other issues raised in the discussion were: