Michael Collins: Discriminative Training Methods for Hidden Markov Models: Theory and Experiments with Perceptron Algorithms, EMNLP 2002.
1. Suppose you have a tagset consisting of two tags, N (noun) and X (not noun), and a training sentence:
Luke/N I/X am/X your/X father/N
During training, the following best tag sequence is found for this sentence:
N N X N X
How would this result alter the values of α_{X,X,X} and α_{N,father}?
Assuming that the best tag sequence does not change, what would your answer be if “father/N” were replaced by “Luke/X”?
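For reference, here is a minimal sketch of the additive perceptron update this question is about, α_s ← α_s + Φ_s(x, y) − Φ_s(x, z), where y is the gold tag sequence and z the decoded one. It assumes tag-trigram and tag/word feature templates in the spirit of the paper's tagging example; the function names and feature labels below are illustrative, not the paper's notation.

```python
from collections import Counter

def local_features(words, tags):
    # Assumed templates: tag-trigram features and tag/word features
    # (labels such as "TRIGRAM" and "TAG_WORD" are illustrative only).
    feats = Counter()
    padded = ["*", "*"] + list(tags)          # two start symbols for the trigrams
    for i, (word, tag) in enumerate(zip(words, tags)):
        feats[("TRIGRAM", padded[i], padded[i + 1], tag)] += 1
        feats[("TAG_WORD", tag, word)] += 1
    return feats

def perceptron_update(alpha, words, gold, predicted):
    # One perceptron step: alpha_s += Phi_s(x, gold) - Phi_s(x, predicted).
    gold_feats = local_features(words, gold)
    pred_feats = local_features(words, predicted)
    for f in set(gold_feats) | set(pred_feats):
        alpha[f] += gold_feats[f] - pred_feats[f]

words = "Luke I am your father".split()
gold = ["N", "X", "X", "X", "N"]        # Luke/N I/X am/X your/X father/N
pred = ["N", "N", "X", "N", "X"]        # the decoded sequence from the question
alpha = Counter()
perceptron_update(alpha, words, gold, pred)
print(alpha[("TRIGRAM", "X", "X", "X")])   # net change to the alpha_{X,X,X} weight
print(alpha[("TAG_WORD", "N", "father")])  # net change to the alpha_{N,father} weight
```

Re-running the snippet with the last word/tag swapped gives a way to check the second part of the question.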
2. Suppose this tagged sentence is the only entry in your training data:
a/DT boy/NN saw/VBD a/DT girl/NN with/IN a/DT nice/JJ hat/NN
How many features will the tagger from section 2.4 have if its training is identical to the one described in section 2.1?
(For some reason, you want to use all 36 tags from the Penn Treebank tagset.)
Which tag sequence z[1:n_1] will be selected for this sentence in the first iteration of the algorithm?
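One observation relevant to the decoding part of this question, sketched below: the algorithm starts with the weight vector at zero, so before the first update every candidate tag sequence receives the same global score and the decoded z[1:n_1] is determined entirely by how the decoder breaks ties. The tiny tagset, the feature templates, and the brute-force argmax in this sketch are illustrative assumptions, not the paper's Viterbi implementation.

```python
from itertools import product

words = "a boy saw a girl with a nice hat".split()
demo_tags = ["DT", "NN", "VBD"]   # small subset of the 36 tags, kept small for speed

def global_score(words, tags, alpha):
    # Sum of the weights of the fired features; the templates (tag trigram,
    # tag/word) are an assumption carried over from the sketch above.
    padded = ["*", "*"] + list(tags)
    total = 0.0
    for i, (word, tag) in enumerate(zip(words, tags)):
        total += alpha.get(("TRIGRAM", padded[i], padded[i + 1], tag), 0.0)
        total += alpha.get(("TAG_WORD", tag, word), 0.0)
    return total

alpha = {}  # all weights are zero in the first iteration, before any update
distinct = {global_score(words, tags, alpha)
            for tags in product(demo_tags, repeat=len(words))}
print(distinct)  # {0.0}: every candidate sequence ties before the first update
```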
3. What is the difference between the maximum-entropy and perceptron training methods in the paper's experiments?
4. Do you think this training task can be parallelized?
How do you think the performance of the tagger presented in the paper would change if you introduced parallelism?
5. Can Theorem 2 be generalized to give an upper bound on the number of mistakes the algorithm makes over the whole training run, not just the first pass?