### Table of Contents

# Michael Collins, Nigel Duffy: Convolution kernels for natural language

### Questions

- What is a generative model, what is a discriminative model and what is their main difference?
- What are the “fairly strong independence assumptions” in PCFG? Come up with an example tree that can't be modelled by a PCFG.
- Derive and explain the formula for h(T1)*h(T2) on page 3 at the bottom.
- What is a convolution? Why are “convolution” kernels called like this?
- Find an error in one of the formulae in the paper.

### Answers

**Generative models**use a two-step setup. They learn class-conditional (likelihood) , prior and use the Bayes rule to obtain the posterior.- they learn the joint distributions: marginalize P(y), condition P(y|x) = P(x,y) / P(x)
- They learn more than is actually needed, but are not prone to partially missing input data.
- They are able to “generate” fake inputs, but this feat is not used very often.
- Examples: Naive Bayes, Mixtures of Gaussians, HMM, Bayesian Networks, Markov Random Fields

**Discriminative models**do everything in one-step – they learn the posterior as a function of some features of .- They are simpler and can use many more features, but are prone to missing inputs.
- Examples: SVM, Logistic Regression, Neural network, k-NN, Conditional Random Fields

- Each CFG rule generates just one level of the derivation tree. Therefore, using “standard” nonterminals, it is not possible to generate e.g. this sentence:
`(S (NP (PRP He)) (VP (VBD saw)(NP (PRP himself))))`

- It could be modelled with an augmentation of the nonterminal labels.

- CFGs can't generate non-projective sentences.
- But they can be modelled using traces.

- The derivation is actually quite simple:
- – (definition of the dot product)
- (from the definition of in the paragraph above the formula)
- (since )
- (change summation order)
- (definition of )

- Convolution is defined like this: , so it measures the presence of structures that
*complement*each other. Here, we have a measure of structures that are*similar*. So it is something different. But the main idea is the same – we can combine smaller structures (kernels) into more complex ones. - There is a (tiny) error in the last formula of Section 3. You cannot actually multiply tree parses, so it should read:

### Report

We discussed the answers to the questions most of the time. Other issues raised in the discussion were:

**Usability**– the approach is only usable for*reranking*the output of some other parser.**Scalability**– they only use 800 sentences and 20 candidates per sentence for training. We believe that for large data (milions of examples) this will become too complex.**Evaluation**– it looks as if they used a non-standard evaluation metric to get “better” results. The standard here would be F1-score.