====== N-best Reranking by Multitask Learning ======
Kevin Duh, Katsuhito Sudoh, Hajime Tsukada, Hideki Isozaki, Masaaki Nagata
http://www.aclweb.org/anthology/W/W10/W10-1757.pdf
[[http://www.kecl.ntt.co.jp/icl/lirg/members/kevinduh/papers/duh10multitask-slides.pdf|Kevin's slides]]
ACL 5th Workshop on Statistical Machine Translation (WMT) 2010
===== Suggestions for the presenter =====
It would be great to have an illustrative but simple example of N-best list and also examples of features and examples of labels (to specify the terminology).
===== Comments =====
* L_p norm of a vector \vec{x}=(x_1, x_2,...,x_n) is defined as ||\vec{x}||_p = (\sum_i |x_i|^p )^{1/p} , so e.g. L_1 norm is simply a sum of absolute values. L_p norm is sometimes called also \fract{l}_p norm or just p-norm.
* Every feature is only fired (i.e. has non-zero value) at the sentence where the conditions are met.
* Example: 500 sentences, every sentence has one N-best list which is considered as a one task for multi-task learning. That means 500 weight vectors must be trained.
* We wondered how RandomHashing (Weinberger et al., 2009; Ganchev and Dredze, 2008) actually //without suffering losses in fidelity//.
* Section 3.2 states: //"ExtractCommonFeature(W) then returns the feature id’s that receive nonzero weight in any of W."// This implies that the number of extracted common features is fixed given the set of weights W. In contrary, Sections 4.2 states: //"In particular, the most important is the number of common features to extract, which we pick from {250, 500, 1000}."//
* Karel Vandas suppose that it (i.e. 250,500,1000) is just number of input features, if they were really used is not clear.
* Martin Popel suppose that the authors actually use ExtractCommonFeature(W,z) which extracts z features with the highest weights (by computing 2-norm on columns of W).
> Answer by Kevin Duh:
> //I think some of the unclearness stems from the fact that I tried to present various multitask learning algorithms under the same framework, but in practice the details differ for each algorithm. The number of input (hashed) features is 4000. For Shared Subspace and Unsupervised Select, we picked from {250,500,1000} features, but for Joint Regularization we do ExtractCommonFeature(W).//
* According to Table 2, "Unsupervised FeatureSelect" resulted in 500 distinct features, "Feature threshold x > 10" resulted in 60 000 distinct features and "Unsupervised FeatureSelect + Feature threshold x > 10" resulted in 60 500 distinct features. This implies there was no overlap, in other words, all the features selected by "Unsupervised FeatureSelect" were rare, i.e. occurring 10 times or less in the training data. Similarly for "Joint Regularization + (b)" with 60 250 features and "Shared Subspace + (b)" with 61 000 features.
* This seems rather strange and in contrary to the findings in Section 4.2 that the multitask learning algorithms extract //widely applicable// (i.e. not rare) features, such as general non-lexicalized features or features involving function words.
* One solution is that "61k" does not mean "exactly 61 000", but some smaller number, let's say 60 777. In my point of view, the most interesting question is "What are those 777 features which are rare but useful?" (They are useful, because they cause the improvement of 29.6 - 29.0 = 0.6 BLEU.) The last paragraph of Section 4.2 describes only frequent features which could be expected to be useful, but I would like to see a similar description of rare useful features.
> Answer by Kevin Duh:
> //The three multitask methods work on hashed representation of features, and the thresholding is on the original features, so the number of features for "Unsupervised FeatureSelect + Feature Threshold x>10" is really 60,500 distinct features. I counted them separately (though in practice it is possible that some hashed features have close analogs to original features). We didn't do an quantitative analysis that tries to map selected hash features to original features, but instead did another experiment on smaller dataset that directly trains on the original features (see footnote 10): the result was that "rare" features were those involving conjunctions including function words and special characters.//
* Zdeněk Žabokrtský pointed out the similarity of the regularizers used in the paper and Bayes priors.
* Zdeněk's original email in Czech: Obycejna l2 regularizace na jedne uloze stahuje vektor vah smerem ke spicce hyperkuzelu, podobne jako by je do sveho stredu stahoval gausovsky prior. U te kombinovane l1/l2 pro multitask to zas stahuje jednotlive vahove vektory k sobe tak, ze ten regularizator ma tvar udoli s osou v diagonalnim smeru, ktere se navic svazuje do pocatku souradnic, kterym clovek reprezentuje apriorni znalost, ze ty vektory by mely byt pokud mozno podobne.
* This view is also supported by [[http://email.seznam.cz/redir?hashId=4138474683&to=http%3a%2f%2fen%2ewikipedia%2eorg%2fwiki%2fRegularization%5f%2528mathematics%2529|a citation]] //"A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters."//
* Page 380: //"The feature threshold alone achieves nice BLEU results (29.0 for x > 10), but the combination outperforms it by statistically significant margins (29.3 - 29.6)."// According to Table 2, I think (29.3 - 29.6) should be actually (29.6 - 29.0).
===== What do we dislike about the paper =====
* The authors selected 4 different baselines and the improvement (29.6 - 28.5 BLEU) over these baselines looks quite good. However, the very simple method of feature pruning using a threshold (e.g. x>10) is rather the method we would call //baseline//. The improvement against this method (29.6 - 29.0 BLEU) is not so impressive. Nevertheless, it is said to be still statistically significant (cf. the previous comment).
===== What do we like about the paper =====
* We have not known about multitask learning before. It seems we can apply multitask learning on many problems we are dealing with -- namely, it seems suitable for all NLP tasks where the training data consists of texts from several different domains (e.g. web, Europarl, news).
* We like the novel idea of using multitask learning for reranking.
* Although not mentioned explicitly in the paper, multitask learning is there used as a method of clever feature pruning. Feature engineering is becoming more and more important across all machine learning approaches.
Written by Karel Vandas and Martin Popel