====== Distributed Training Strategies for the Structured Perceptron - RG report - UNDER CONSTRUCTION ======

===== Presentation =====

==== 3 Structured Perceptron ====

  * In the unstructured perceptron, you are trying to separate two sets of points with a hyperplane. See Question 1 for the algorithm. In the training phase, you iterate over your training data and adjust the hyperplane every time you make a mistake. [[http://www.youtube.com/watch?v=vGwemZhPlsA|Youtube Example]]
  * The structured (or multiclass) perceptron is a generalization of the unstructured perceptron. See Figure 1 in the paper for the algorithm.
    * You can use any structured input x (not just a vector; a sentence, for example) and any structured output y (not just a binary value; a parse tree, for example).
    * You need a function f(x,y) which returns the feature representation of a candidate input-output pair.
  * Using Theorem 1, you can bound the number of mistakes made during training.
    * The computational time is therefore also bounded.
    * This holds only for linearly separable sets.
  * Other remarks and discussed issues
    * The perceptron training algorithm does not always return the same weights (unlike the maximal-margin approach); the result depends on the order of the training data.
    * How is inference done in difficult tasks like parsing? By iterating over all possible y? By approximation?

==== 4 Distributed Structured Perceptron ====

  * Motivation: there is no straightforward way to make the standard perceptron algorithm parallel.

=== 4.1 Parameter Mixing ===

  * First attempt at a map-reduce algorithm:
    * Divide the training data into shards.
    * In the MAP step, train a perceptron on each shard.
    * In the REDUCE step, average the trained weight vectors.

=== 4.2 Iterative Parameter Mixing ===

==== 5 Experiments ====

===== Questions =====

==== Question 1 ====

What are the weights w after running the standard perceptron training on the data below? The standard perceptron does only binary classification, i.e. y \in {0,1} (unlike the multi-class perceptron in Fig.
1) and its algorithm can be defined as follows:

<code>
StandardPerc(N, T = {(x_1, y_1), ..., (x_|T|, y_|T|)})
  w(0) = 0; k = 0
  for n: 1..N
    for t: 1..|T|
      y' = {if w(k) \dot x_t > 0.5: 1; else: 0}            // 1st difference to Figure 1
      if (y_t != y')
        w(k+1) = w(k) + (y_t - y') * learning_rate * x_t   // 2nd difference to Figure 1
        k++
  return w(k)
</code>

Let us set:

<code>
learning_rate α = 0.3
X = [(1, 0), (0, 1)]   // data
Y = [0, 1]             // classes
T = {(x_t, y_t): \forall t: x_t = X[t], y_t = Y[t]}
N = 5
w = [?, ?]
</code>

**Answer:**

^ x_1 ^ x_2 ^ y ^ w_1 ^ w_2 ^ x \dot w ^ y' ^ e = y - y' ^ Δw_1 = α * e * x_1 ^ Δw_2 = α * e * x_2 ^
| 1 | 0 | 0 | 0 | 0   | 0   | 0 | 0 | 0 | 0   |
| 0 | 1 | 1 | 0 | 0   | 0   | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.3 | 0   | 0 | 0 | 0 | 0   |
| 0 | 1 | 1 | 0 | 0.3 | 0.3 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.6 | 0   | 0 | 0 | 0 | 0   |
| 0 | 1 | 1 | 0 | 0.6 | 0.6 | 1 | 0 | 0 | 0   |

w = [0, 0.6]

==== Question 2 ====

Imagine you want to solve the learning problem above with the multi-class perceptron algorithm (Figure 1). First you will need to figure out what the function f looks like; there are many good possibilities. Will you get a different w? Can you think of an f that will yield the same w, given that all the other variables are the same?

f = ?
w = [?, ?]

**Answer 1:** f(**x**, y) = (y == 0) ? (**x**, 0, 0, ..., 0) : (0, 0, ..., 0, **x**)

**Answer 2:** According to the English [[http://en.wikipedia.org/wiki/Perceptron#Multiclass_perceptron|Wikipedia]], the multiclass formulation reduces to the original perceptron when **x** is a real-valued vector, y is chosen from {0,1}, and f(**x**, y) = y * **x**. However, this holds only for an activation threshold of 0, so this f cannot be used to reproduce the computation from Question 1.

==== Question 3 ====

In Figure 4, why do you think the F-measure for the Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than that of the Parallel (Iterative Parameter Mix) algorithm?

**Answer:**

  * Iterative Parameter Mixing is a form of parameter averaging, which has the same effect as the averaged perceptron.
  * The F-measures for Serial (All Data) and Parallel (Iterative Parameter Mix) are very similar in the second column, because both methods there are already averaged.
  * There may also be a bagging-like effect.

==== Question 4 ====

Imagine you are an engineer at Google (lots of zeros on your bank account, unlimited food, etc. ;)) and want to learn your "w" by the parallel perceptron (with iterative parameter mixing). You are fed up with your work, though, so you want it finished in approximately 1 day so that you can read all your unread Twitter posts. But Google has just launched an ecological campaign, and you need to justify the number of machines you are going to power on for your learning experiment to the eco-Manager, who makes sure that Google is running the optimal number of machines (N). Your eco-Manager is a hobby mathematician, and he will not give power to your machines unless he sees a nice formula that theoretically justifies the optimality of N with respect to the energy wasted by running them (and with respect to the sunshine wasted to produce the electricity to be wasted by them). Fortunately, you don't need to justify your 1-day requirement (he understands ;)). Therefore, he would like you to include variables like: T - the size of your training data, F - the number of your features, N - the number of machines, and other relevant factors, like lags, bandwidth, ... What variables will you need to include? Can you come up with a sketch of such a formula that would get your machines the "juice" from the manager? ;) (Hint: imagine T is around 100 thousand and F is 500 thousand.)

N = argmax_N f(N, T, F, ...)
f = ?

**Answer:** We did not settle on a particular formula.

  * It also depends on the convergence criteria.
  * With no time limit, the serial algorithm would have the lowest energy consumption.
  * With a time limit, we should use as few shards as possible while still meeting the time limit.
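The binary perceptron of Question 1 and the iterative parameter mixing loop of Section 4.2 can be sketched in runnable Python as follows. This is a toy illustration, not the paper's implementation: the function names, the shard layout, and the uniform (1/N) mixing weights in the REDUCE step are our assumptions.

```python
def standard_perc(data, n_epochs=5, alpha=0.3, w=None):
    """Binary perceptron with activation threshold 0.5, as in Question 1.

    data: list of (x, y) pairs, x a tuple of features, y in {0, 1}.
    """
    if w is None:
        w = [0.0] * len(data[0][0])
    for _ in range(n_epochs):
        for x, y in data:
            # 1st difference to Figure 1: thresholded binary prediction
            y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) > 0.5 else 0
            if y_pred != y:
                # 2nd difference to Figure 1: additive update scaled by alpha
                w = [wi + alpha * (y - y_pred) * xi for wi, xi in zip(w, x)]
    return w


def iterative_parameter_mixing(shards, n_iterations=3, n_epochs=1):
    """Each iteration: train one perceptron per shard from the current
    mixed weights (MAP step), then average the weight vectors (REDUCE step),
    here with uniform 1/N mixing coefficients."""
    dim = len(shards[0][0][0])
    w = [0.0] * dim
    for _ in range(n_iterations):
        shard_weights = [standard_perc(s, n_epochs=n_epochs, w=list(w))
                         for s in shards]
        w = [sum(ws[i] for ws in shard_weights) / len(shards)
             for i in range(dim)]
    return w


data = [((1, 0), 0), ((0, 1), 1)]      # the toy data from Question 1
print(standard_perc(data))              # -> [0.0, 0.6], as in the table above
```

Running the mixed version on two copies of the same shard, `iterative_parameter_mixing([data, data])`, converges to the same w = [0.0, 0.6], since averaging identical shard weights is a no-op here.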