[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki

[ Back to the navigation ]

Table of Contents

Michael Collins, Nigel Duffy: Convolution kernels for natural language

Paper link


  1. What is a generative model, what is a discriminative model and what is their main difference?
  2. What are the “fairly strong independence assumptions” in PCFG? Come up with an example tree that can't be modelled by a PCFG.
  3. Derive and explain the formula for h(T1)*h(T2) on page 3 at the bottom.
  4. What is a convolution? Why are “convolution” kernels called like this?
  5. Find an error in one of the formulae in the paper.


    • Generative models use a two-step setup. They learn class-conditional (likelihood) Graph, prior Graph and use the Bayes rule to obtain the posterior.
      • they learn the joint distributions: marginalize P(y), condition P(y|x) = P(x,y) / P(x)
      • They learn more than is actually needed, but are not prone to partially missing input data.
      • They are able to “generate” fake inputs, but this feat is not used very often.
      • Examples: Naive Bayes, Mixtures of Gaussians, HMM, Bayesian Networks, Markov Random Fields
    • Discriminative models do everything in one-step – they learn the posterior Graph as a function of some features of Graph.
      • They are simpler and can use many more features, but are prone to missing inputs.
      • Examples: SVM, Logistic Regression, Neural network, k-NN, Conditional Random Fields
  1. Each CFG rule generates just one level of the derivation tree. Therefore, using “standard” nonterminals, it is not possible to generate e.g. this sentence:
    • (S (NP (PRP He)) (VP (VBD saw)(NP (PRP himself))))
      • It could be modelled with an augmentation of the nonterminal labels.
    • CFGs can't generate non-projective sentences.
      • But they can be modelled using traces.
  2. The derivation is actually quite simple:
    1. Graph – (definition of the dot product)
    2. Graph (from the definition of in the paragraph above the formula)
    3. Graph (since Graph)
    4. Graph (change summation order)
    5. Graph (definition of Graph)
  3. Convolution is defined like this: Graph, so it measures the presence of structures that complement each other. Here, we have a measure of structures that are similar. So it is something different. But the main idea is the same – we can combine smaller structures (kernels) into more complex ones.
  4. There is a (tiny) error in the last formula of Section 3. You cannot actually multiply tree parses, so it should read: Graph


We discussed the answers to the questions most of the time. Other issues raised in the discussion were:

[ Back to the navigation ] [ Back to the content ]