
Institute of Formal and Applied Linguistics Wiki



courses:rg:reranking-by-multitask-learning [2010/10/26 02:02] popel: added comments + tidy up
courses:rg:reranking-by-multitask-learning [2011/09/20 11:32] (current) majlis: typos corrected
====== N-best Reranking by Multitask Learning ======
Kevin Duh, Katsuhito Sudoh, Hajime Tsukada, Hideki Isozaki, Masaaki Nagata
[[http://www.aclweb.org/anthology/W/W10/W10-1757.pdf]]
[[http://www.kecl.ntt.co.jp/icl/lirg/members/kevinduh/papers/duh10multitask-slides.pdf|Kevin's slides]]
ACL 5th Workshop on Statistical Machine Translation (WMT) 2010
      * Karel Vandas supposes that it (i.e. 250, 500, 1000) is just the number of input features; whether they were really used is not clear.
      * Martin Popel supposes that the authors actually use ExtractCommonFeature(W,z), which extracts the z features with the highest weights (by computing the 2-norm on the columns of W).

> Answer by Kevin Duh:
> //I think some of the unclearness stems from the fact that I tried to present various multitask learning algorithms under the same framework, but in practice the details differ for each algorithm. The number of input (hashed) features is 4000. For Shared Subspace and Unsupervised Select, we picked from {250, 500, 1000} features, but for Joint Regularization we do ExtractCommonFeature(W).//
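Martin's guess about ExtractCommonFeature(W,z) can be sketched in a few lines of numpy. This is an illustrative reconstruction, not the authors' code: the function name follows the paper, but the matrix layout (rows = tasks, columns = features) and the toy weights are assumptions.

```python
import numpy as np

def extract_common_features(W, z):
    """Return indices of the z features whose weight columns have the
    largest 2-norm across tasks. W has shape (n_tasks, n_features)."""
    col_norms = np.linalg.norm(W, axis=0)      # 2-norm of each column
    return np.argsort(col_norms)[::-1][:z]     # top-z columns by norm

# toy weight matrix: 3 tasks, 5 features
W = np.array([[0.0, 2.0, 0.1, 0.0, 1.0],
              [0.0, 1.5, 0.0, 0.2, 1.0],
              [0.1, 2.5, 0.0, 0.0, 1.0]])
print(extract_common_features(W, 2))  # [1 4]: the two columns with largest norms
```

A feature that is useful for several tasks at once gets a large column norm, which is why this selects "common" features rather than task-specific ones.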
  
  * According to Table 2, "Unsupervised FeatureSelect" resulted in 500 distinct features, "Feature threshold x > 10" in 60 000 distinct features, and "Unsupervised FeatureSelect + Feature threshold x > 10" in 60 500 distinct features. This implies there was no overlap; in other words, all the features selected by "Unsupervised FeatureSelect" were rare, i.e. occurring 10 times or less in the training data. Similarly for "Joint Regularization + (b)" with 60 250 features and "Shared Subspace + (b)" with 61 000 features.
     * This seems rather strange and contrary to the findings in Section 4.2 that the multitask learning algorithms extract //widely applicable// (i.e. not rare) features, such as general non-lexicalized features or features involving function words.
     * One explanation is that "61k" does not mean "exactly 61 000" but some smaller number, say 60 777. In my view, the most interesting question is "What are those 777 features which are rare but useful?" (They are useful, because they cause the improvement of 29.6 - 29.0 = 0.6 BLEU.) The last paragraph of Section 4.2 describes only frequent features which could be expected to be useful, but I would like to see a similar description of rare useful features.

> Answer by Kevin Duh:
> //The three multitask methods work on a hashed representation of the features, and the thresholding is on the original features, so the number of features for "Unsupervised FeatureSelect + Feature Threshold x>10" really is 60,500 distinct features. I counted them separately (though in practice it is possible that some hashed features have close analogs to original features). We didn't do a quantitative analysis that tries to map selected hash features to original features, but instead did another experiment on a smaller dataset that directly trains on the original features (see footnote 10): the result was that "rare" features were those involving conjunctions including function words and special characters.//
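The counting argument in the bullet above is just inclusion-exclusion on two feature sets. A toy sketch (the sets are stand-ins, not the paper's actual features) of why a union of exactly 60 500 forces the two sets to be disjoint:

```python
# If "Unsupervised FeatureSelect" picks set A (500 features) and the
# frequency threshold picks set B (60 000 features), then
# |A union B| = |A| + |B| - |A intersect B|, so a union of 60 500
# is only possible when A and B share nothing.
A = set(range(500))           # toy stand-in for the 500 selected features
B = set(range(500, 60500))    # toy stand-in for the 60 000 thresholded ones
assert len(A & B) == 0        # disjoint by construction here
print(len(A | B))  # 60500
```

Kevin's answer resolves the puzzle differently: the two counts live in different feature spaces (hashed vs. original), so disjointness holds by construction rather than implying the selected features were rare.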
  
  * Zdeněk Žabokrtský pointed out the similarity between the regularizers used in the paper and Bayesian priors.
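Zdeněk's observation is the standard correspondence between L2 regularization and a zero-mean Gaussian prior: the ridge solution with lambda = sigma^2/tau^2 equals the MAP estimate under a N(0, tau^2 I) prior. A minimal numpy sketch on random toy data (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
y = rng.normal(size=50)
sigma2, tau2 = 1.0, 0.5   # noise variance, prior variance

# Ridge regression: minimize ||y - Xw||^2 + lam * ||w||^2
lam = sigma2 / tau2
w_ridge = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)

# MAP estimate under likelihood N(Xw, sigma2*I) and prior N(0, tau2*I):
# maximizing the log-posterior gives the same normal equations.
w_map = np.linalg.solve(X.T @ X / sigma2 + np.eye(5) / tau2, X.T @ y / sigma2)

print(np.allclose(w_ridge, w_map))  # True
```

The same reading extends to the paper's structured regularizers, which correspond to structured (e.g. shared-subspace) priors over the task weight matrix.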
