===== Searn in Practice =====
paper by: Hal Daumé III, John Langford and Daniel Marcu
presented by: Martin Popel
report by: Petra Galuščáková

===== Comments =====

  * Searn (stands for search-learn) is a novel algorithm for solving hard structured prediction problems. A structured prediction problem D is a cost-sensitive classification problem where Y has structure: elements y ∈ Y decompose into variable-length vectors (y<sub>1</sub>; y<sub>2</sub>; : : : ; y<sub>T</sub> ). D is a distribution over inputs x ∈ X and cost vectors c, where |c| is a variable in 2<sup>T</sup>. The example of such problem could be parsing, in which a structured output domain is the set of all possible parse trees.
  * At first we construct the search space and then we learn a classifier that will walk through the search space.
  * The classifier will be trained based on the data that it will actually expect to see - otherwise it often leads to the label-bias problem. Nice explanation of the label-bias problem could be found here: www.cs.stanford.edu/~nmramesh/crf.ppt
  * The algorithm starts with an optimal policy and in the each step it suggests a new policy (according to the losses given by the difference of the current policy and the ground truth) as a "interpolation" of the new policy and the old policy. Algorithm thus works with the whole history of the testing data.
  * Line search could be used for the calculation of β. The line search is (according to Wikipedia) an iterative approaches to finding a local minimum of an objective function. The line search approach first finds a descent direction along which the objective function will be reduced and then computes a step size that decides how far should move along that direction. The descent direction can be computed by various methods, such as gradient descent, Newton's method and Quasi-Newton method. The step size can be determined either exactly or inexactly.
  * The article mainly describes possibility of the application of the algorithms on the range of NLP problems:
       * Tagging
       * Parsing
       * Machine translation
  * The comparison with other algorithms aimed on the structured prediction problems (max arg classifier, independent classifier and maximum entropy Markov models, perceptron-style algorithm and global prediction algorithm such as conditional random field) is given in http://www.umiacs.umd.edu/~hal/docs/daume06searn.pdf

===== Questions =====
  *  It seems to me that we need ground true outputs for the testing data to run the algorithm (otherwise we cannot get the optimal policy), what makes no sense to me.
     * No, we need "ground true" labels during training for the initial (optimal) policy. Each iteration interpolates two policies. However, this does not mean interpolating weight vectors (or the trained models in general). It means we have stored both of the policies and during decoding (testing) we make a coin flip and with probability beta we choose one of the policies. In the Nth iteration we are effectively interpolating N policies (the first is the optimal one). During testing we use just the last policy (so no need for "ground true" during testing).
  * Similarly and probably based on something that I missed: why do we want to move away from optimal policy completely. Maybe it is because at the end of the algorithm we return the current policy without the optimal policy. But what does "without" mean in this case?
  * I do not see why the first condition in formula (4) is y<sub>t</sub> ∈ {begin X, out}. Shouldn't it be y<sub>t</sub> ∈ {out}?
    * No. We count the percentage of words which are correctly tagged as named entities, not the percentage of named entities.
  * Are there any real world applications using Searn?
  * Are there any kind of problems, in which could Searn especially help?

===== What do we like about the paper =====

  * The paper presents an approach that seems to have a wide range of possibilities to apply. The solution to some of them is already sketched. It is also proclaimed to be fast and according to the results given in http://www.umiacs.umd.edu/~hal/docs/daume06searn.pdf it outperforms another techniques used for structured prediction in most of the cases.
  * There is an example code available as well as the software for running Searn, only "code for doing search and computing features" needs to be provided.