
Institute of Formal and Applied Linguistics Wiki



  * The <latex>L_p</latex> norm of a vector <latex>\vec{x}=(x_1, x_2,...,x_n)</latex> is defined as <latex>||\vec{x}||_p = (\sum_i |x_i|^p )^{1/p}</latex>, so e.g. the <latex>L_1</latex> norm is simply the sum of absolute values. The <latex>L_p</latex> norm is also sometimes called the <latex>\ell_p</latex> norm or just the p-norm.
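The definition above is easy to check numerically; a tiny Python illustration (ours, not from the paper):

```python
def lp_norm(x, p):
    """L_p norm of a vector x: (sum_i |x_i|^p)^(1/p)."""
    return sum(abs(xi) ** p for xi in x) ** (1.0 / p)

x = [3.0, -4.0]
print(lp_norm(x, 1))  # L1 norm = |3| + |-4| = 7.0
print(lp_norm(x, 2))  # L2 (Euclidean) norm = 5.0
```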
  
  * Every feature only fires (i.e. has a non-zero value) at the sentences where its conditions are met.
  * Example: 500 sentences, each with one N-best list, which is considered as one task for multi-task learning. That means 500 weight vectors must be trained.
  
  * We wondered how RandomHashing (Weinberger et al., 2009; Ganchev and Dredze, 2008) actually works //without suffering losses in fidelity//.
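A rough sketch of the feature-hashing idea (our illustration, with made-up feature names; the cited papers give the actual algorithms and the collision analysis): arbitrarily many sparse feature ids are folded into a fixed number of buckets, and with enough buckets collisions rarely hurt.

```python
import hashlib

def hashed_index(feature_id, num_buckets=2 ** 18):
    # Deterministically map an arbitrary string feature id into a fixed index space.
    digest = hashlib.md5(feature_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_buckets

def hash_features(sparse_features, num_buckets=2 ** 18):
    """Fold a sparse {feature_id: value} dict into hashed buckets.
    Colliding features simply add up in the same bucket."""
    vec = {}
    for fid, value in sparse_features.items():
        idx = hashed_index(fid, num_buckets)
        vec[idx] = vec.get(idx, 0.0) + value
    return vec
```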
  
  * Section 3.2 states: //"ExtractCommonFeature(W) then returns the feature id’s that receive nonzero weight in any of W."// This implies that the number of extracted common features is fixed given the set of weights W. In contrast, Section 4.2 states: //"In particular, the most important is the number of common features to extract, which we pick from {250, 500, 1000}."//
      * Karel Vandas supposes that it (i.e. 250, 500, 1000) is just the number of input features; whether they were really used is not clear.
      * Martin Popel supposes that the authors actually use ExtractCommonFeature(W, z), which extracts the z features with the highest weights (by computing the 2-norm on the columns of W).
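Martin's reading could look like this in code (hypothetical — the paper does not spell the procedure out):

```python
def extract_common_features(W, z):
    # W: per-task weight vectors (num_tasks x num_features).
    # Hypothetical ExtractCommonFeature(W, z): return the ids of the z
    # features whose weight columns have the largest 2-norm across tasks.
    num_features = len(W[0])
    norms = [sum(w[j] ** 2 for w in W) ** 0.5 for j in range(num_features)]
    return sorted(range(num_features), key=lambda j: norms[j], reverse=True)[:z]

W = [[0.0, 2.0, 0.1],
     [0.0, 1.0, 0.3]]
print(extract_common_features(W, 2))  # features 1 and 2 have the largest columns
```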
  
  * According to Table 2, "Unsupervised FeatureSelect" resulted in 500 distinct features, "Feature threshold x > 10" resulted in 60 000 distinct features, and "Unsupervised FeatureSelect + Feature threshold x > 10" resulted in 60 500 distinct features. This implies there was no overlap; in other words, all the features selected by "Unsupervised FeatureSelect" were rare, i.e. occurring 10 times or less in the training data. Similarly for "Joint Regularization + (b)" with 60 250 features and "Shared Subspace + (b)" with 61 000 features.
     * This seems rather strange and contrary to the findings in Section 4.2 that the multitask learning algorithms extract //widely applicable// (i.e. not rare) features, such as general non-lexicalized features or features involving function words.
     * One explanation is that "61k" does not mean "exactly 61 000" but some smaller number, let's say 60 777. From my point of view, the most interesting question is "What are those 777 features which are rare but useful?" (They are useful because they cause the improvement of 29.6 - 29.0 = 0.6 BLEU.) The last paragraph of Section 4.2 describes only frequent features, which could be expected to be useful, but I would like to see a similar description of the rare but useful features.
  
  * Zdeněk Žabokrtský pointed out the similarity between the regularizers used in the paper and Bayesian priors.
      * Zdeněk's original email (translated from Czech): <code>Plain l2 regularization on a single task pulls the weight vector towards the apex of a hypercone, much as a Gaussian prior would pull it towards its mean. The combined l1/l2 regularization for multitask learning pulls the individual weight vectors towards each other, so that the regularizer has the shape of a valley whose axis runs in the diagonal direction and is moreover tied to the origin of coordinates, which is how one represents the prior knowledge that the vectors should be as similar as possible.</code>
      * This view is also supported by [[http://en.wikipedia.org/wiki/Regularization_(mathematics)|a citation]]: //"A theoretical justification for regularization is that it attempts to impose Occam's razor on the solution. From a Bayesian point of view, many regularization techniques correspond to imposing certain prior distributions on model parameters."//
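The two penalties Zdeněk contrasts can be written out explicitly (our sketch, with hypothetical function names, not the paper's notation):

```python
def l2_penalty(w):
    # Squared-L2 penalty on a single task's weights; corresponds to a
    # zero-mean Gaussian prior on each individual weight.
    return sum(wi ** 2 for wi in w)

def mixed_l1_l2_penalty(W):
    # Multitask mixed-norm penalty: a 2-norm over tasks for each feature,
    # then a plain sum (L1) over features. Whole feature columns are driven
    # to zero together, which gives the feature-pruning effect, while the
    # surviving weight vectors are encouraged to be similar across tasks.
    num_features = len(W[0])
    return sum(sum(w[j] ** 2 for w in W) ** 0.5 for j in range(num_features))
```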
  
  * Page 380: //"The feature threshold alone achieves nice BLEU results (29.0 for x > 10), but the combination outperforms it by statistically significant margins (29.3 - 29.6)."// According to Table 2, I think (29.3 - 29.6) should actually be (29.6 - 29.0).
  
===== What do we dislike about the paper =====
  
  * The authors selected 4 different baselines, and the improvement (29.6 - 28.5 BLEU) over these baselines looks quite good. However, the very simple method of feature pruning using a threshold (e.g. x > 10) is rather what we would call a //baseline//. The improvement over this method (29.6 - 29.0 BLEU) is not so impressive. Nevertheless, it is said to be still statistically significant (cf. the previous comment).
  
===== What do we like about the paper =====

  * We had not heard about multitask learning before. It seems we can apply multitask learning to many problems we are dealing with -- namely, it seems suitable for all NLP tasks where the training data consists of texts from several different domains (e.g. web, Europarl, news).
  * We like the novel idea of using multitask learning for reranking.
  * Although not mentioned explicitly in the paper, multitask learning is used there as a method of clever feature pruning. Feature engineering is becoming more and more important across all machine learning approaches.

Written by Karel Vandas and Martin Popel
