Kevin Duh, Katsuhito Sudoh, Hajime Tsukada, Hideki Isozaki, Masaaki Nagata
http://www.aclweb.org/anthology/W/W10/W10-1757.pdf
Kevin's slides
ACL 5th Workshop on Statistical Machine Translation (WMT) 2010
It would be great to have a simple illustrative example of an N-best list, as well as examples of features and labels (to clarify the terminology).
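Since no such example appears in the thread, here is a minimal hypothetical sketch of the setup (the hypotheses, feature names, and BLEU scores are invented for illustration, not taken from the paper): each source sentence has an N-best list of candidate translations, each candidate carries a sparse feature vector, and a label can be derived from a sentence-level quality metric.

    # Hypothetical N-best list for one source sentence; all values invented.
    # Each entry: (hypothesis string, sparse feature dict, sentence-level BLEU)
    nbest = [
        ("the cat sat on the mat", {"lm_score": -12.3, "tm_score": -4.1, "length": 6}, 0.82),
        ("a cat sat on the mat",   {"lm_score": -13.0, "tm_score": -4.5, "length": 6}, 0.74),
        ("the cat sat at the mat", {"lm_score": -14.2, "tm_score": -3.9, "length": 6}, 0.61),
    ]

    # One common labeling scheme for reranking: the metric-best hypothesis
    # gets label +1 and all others get -1.
    best_bleu = max(bleu for _, _, bleu in nbest)
    labeled = [(feats, +1 if bleu == best_bleu else -1) for _, feats, bleu in nbest]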
Answer by Kevin Duh:
I think some of the confusion stems from the fact that I tried to present various multitask learning algorithms within the same framework, while in practice the details differ for each algorithm. The number of input (hashed) features is 4000. For Shared Subspace and Unsupervised Select, we picked from {250, 500, 1000} features, whereas for Joint Regularization we do ExtractCommonFeature(W).
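For readers unfamiliar with feature hashing, here is a generic sketch of mapping the original sparse features into a fixed 4000-dimensional hashed space (the choice of hash function and the accumulation of collisions are illustrative assumptions, not the paper's exact implementation):

    import hashlib

    NUM_BUCKETS = 4000  # size of the hashed feature space mentioned above

    def hash_features(features):
        # Map a sparse {name: value} dict onto NUM_BUCKETS dimensions via
        # the hashing trick; colliding features simply add up in one bucket.
        vec = [0.0] * NUM_BUCKETS
        for name, value in features.items():
            bucket = int(hashlib.md5(name.encode()).hexdigest(), 16) % NUM_BUCKETS
            vec[bucket] += value
        return vec

    hashed = hash_features({"lm_score": -12.3, "w=the^cat": 1.0})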
Answer by Kevin Duh:
The three multitask methods work on a hashed representation of the features, while the thresholding is on the original features, so the number of features for "Unsupervised FeatureSelect + Feature Threshold x>10" is really 60,500 distinct features. I counted them separately (though in practice it is possible that some hashed features have close analogs among the original features). We didn't do a quantitative analysis that tries to map selected hashed features back to original features; instead we did another experiment on a smaller dataset that trains directly on the original features (see footnote 10): the result was that the "rare" features were those involving conjunctions including function words and special characters.
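"Feature Threshold x>10" can be read as keeping only original features that occur more than 10 times in the training data; a minimal sketch of that kind of counting (the function name and exact cutoff semantics are assumptions):

    from collections import Counter

    def threshold_features(examples, min_count=10):
        # `examples` is an iterable of sparse {feature_name: value} dicts;
        # keep feature names that occur in more than `min_count` examples.
        counts = Counter()
        for feats in examples:
            counts.update(feats.keys())
        return {name for name, c in counts.items() if c > min_count}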
Ordinary l2 regularization on a single task pulls the weight vector toward the tip of a hypercone, much as a Gaussian prior would pull it toward its center. The combined l1/l2 regularization for multitask learning instead pulls the individual weight vectors toward each other: the regularizer has the shape of a valley whose axis runs in the diagonal direction and which moreover slopes down toward the origin of the coordinate system, representing the prior knowledge that the vectors should be as similar as possible.
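For concreteness, a standard mixed l1/l2 (group-lasso) regularizer of this kind, written for a weight matrix W whose columns are the T per-task weight vectors and whose rows W_{j.} collect the weights of feature j across tasks (notation assumed here, not copied from the paper):

    \Omega(W) = \lambda \sum_{j=1}^{d} \|W_{j\cdot}\|_2
              = \lambda \sum_{j=1}^{d} \sqrt{\sum_{t=1}^{T} W_{jt}^2}

The inner l2 norm couples the tasks' weights for each feature, while the outer l1 sum pushes entire feature rows to zero at once, which matches the shared-sparsity prior described above.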
Written by Karel Vandas and Martin Popel