====== Distributed Training Strategies for the Structured Perceptron - RG report ======

===== Presentation =====

==== 3 Structured Perceptron ====

  * In the unstructured perceptron, you are trying to separate two sets of points with a hyperplane. See Question 1 for the algorithm. In the training phase, you iterate over your training data and adjust the hyperplane every time you make a mistake. [[http://
  * The structured (or multiclass) perceptron is a generalization of the unstructured perceptron. See Figure 1 in the paper for the algorithm (a short sketch follows this list).
    * You can use any structured input x (not just a vector; a sentence, for example) and any structured output y (not just a binary value; a parse tree, for example).
    * You need a function f(x,y) which returns a feature representation of a candidate input-output pair.
    * Using Theorem 1, you can bound the number of mistakes made during training.
      * The computational time is therefore also bounded.
      * This holds only for linearly separable sets.
  * Other remarks and discussed issues:
    * The perceptron training algorithm does not always return the same weights (unlike maximal margin). It depends on the order of the training data.
    * How is inference done in difficult tasks like parsing? By iterating over all possible y? By approximation?

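A minimal Python sketch of the multiclass algorithm from Figure 1, assuming the candidate outputs for each input can be enumerated; the names ''structured_perceptron'', ''candidates'' and ''f'' are ours, not from the paper:

<code python>
import numpy as np

def structured_perceptron(data, candidates, f, dim, epochs=10):
    """data: list of (x, y_gold) pairs; candidates(x): iterable of possible outputs for x;
    f(x, y): feature vector (length dim) of a candidate input-output pair."""
    w = np.zeros(dim)
    for _ in range(epochs):
        for x, y_gold in data:
            # inference: pick the highest-scoring candidate under the current weights
            y_pred = max(candidates(x), key=lambda y: w @ f(x, y))
            if y_pred != y_gold:
                # move towards the gold features and away from the predicted ones
                w += f(x, y_gold) - f(x, y_pred)
    return w
</code>
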
==== 4 Distributed Structured Perceptron ====

  * Motivation: there is no straightforward way to make the standard perceptron algorithm parallel.

=== 4.1 Parameter Mixing ===

  * First attempt at a map-reduce style algorithm (a sketch follows this list):
    * Divide the training data into shards.
    * In the MAP step, train a perceptron on each shard.
    * In the REDUCE step, average the trained weight vectors.

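A minimal sketch of this idea, reusing the hypothetical ''structured_perceptron'' trainer from the sketch above and treating each shard as an in-memory list rather than an actual MapReduce job:

<code python>
import numpy as np

def parameter_mixing(shards, candidates, f, dim, epochs=10):
    """shards: list of training-data shards; returns the uniformly mixed weight vector."""
    # MAP: train one perceptron per shard (in a real setting these run in parallel)
    per_shard_weights = [structured_perceptron(shard, candidates, f, dim, epochs)
                         for shard in shards]
    # REDUCE: average the trained weight vectors
    return np.mean(per_shard_weights, axis=0)
</code>
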
=== 4.2 Iterative Parameter Mixing ===

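The notes do not summarize this subsection, so here is a minimal sketch of the iterative variant as we understand it from the paper (each shard trains for one epoch from the current mixed weights, then the weights are mixed and redistributed); ''one_epoch'' and the other names are ours:

<code python>
import numpy as np

def one_epoch(data, candidates, f, w):
    """One perceptron pass over a single shard, starting from the given weights."""
    for x, y_gold in data:
        y_pred = max(candidates(x), key=lambda y: w @ f(x, y))
        if y_pred != y_gold:
            w += f(x, y_gold) - f(x, y_pred)
    return w

def iterative_parameter_mixing(shards, candidates, f, dim, epochs=10):
    w = np.zeros(dim)
    for _ in range(epochs):
        # MAP: each shard runs one epoch from the current mixed weights
        per_shard_weights = [one_epoch(shard, candidates, f, w.copy()) for shard in shards]
        # REDUCE: mix (here uniformly; the paper also allows non-uniform mixing coefficients)
        w = np.mean(per_shard_weights, axis=0)
    return w
</code>
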
==== 5 Experiments ====

===== Questions =====

==== Question 1 ====

What are the weights w after running the standard perceptron training on the data below? The standard perceptron does only binary classification (unlike the multi-class one in Fig. 1) and its algorithm can be defined as follows:

  StandardPerc(N, T):
    w(0) = 0; k = 0
    for n: 1..N
      for t: 1..|T|
        y' = {if w(k) \dot x_t > 0.5: 1; else: 0}            // 1st difference to Figure 1
        if (y_t != y')
          w(k+1) = w(k) + (y_t - y') * learning_rate * x_t   // 2nd difference to Figure 1
          k++
    return w(k)

Let us set learning_rate = 0.3, ...

  X = [(1, 0), (0, 1)]  // data
  Y = [0, 1]            // classes

  w = [?, ?]

**Answer:**

^ x_1 ^ x_2 ^ y ^ w_1 ^ w_2 ^ x \dot w ^ y' ^ e = y - y' ^ Δw_1 = α * e * x_1 ^ Δw_2 = α * e * x_2 ^
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.3 | 0.3 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.6 | 0.6 | 1 | 0 | 0 | 0 |

w = [0, 0.6]

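A small self-contained check of this computation (the variable names are ours, not from the paper); running it for three epochs reproduces the table above:

<code python>
import numpy as np

def standard_perc(X, Y, epochs, learning_rate=0.3, threshold=0.5):
    """Binary perceptron with the 0.5 activation threshold used in this question."""
    w = np.zeros(len(X[0]))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            x = np.asarray(x, dtype=float)
            y_pred = 1 if w @ x > threshold else 0
            if y != y_pred:
                w += (y - y_pred) * learning_rate * x
    return w

print(standard_perc([(1, 0), (0, 1)], [0, 1], epochs=3))  # -> [0.  0.6]
</code>
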
==== Question 2 ====

Imagine you want to solve the learning problem above by the multi-class perceptron algorithm (Figure 1). First you will need to figure out what the function f is like; there are many good possibilities. Will you get a different w? Can you think of one that will yield the same w, given all the other variables are the same?

  f = ?
  w = [?, ?]

**Answer 1:**

f(**x**, y) = (y == 0) ? (**x**, 0, 0, ..., 0) : (0, 0, ..., 0, **x**)

**Answer 2:**

According to the English [[http://

However, I would say that this holds only for an activation threshold of 0. Therefore, this formula cannot be used to compute the example from Question 1.

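A minimal sketch of the block feature map from Answer 1, plugged into a Figure 1 style multiclass update on the Question 1 data; all names are ours. Note that, per Answer 2, the result need not match the w from Question 1, because the binary algorithm there uses a non-zero activation threshold:

<code python>
import numpy as np

def f(x, y):
    """Block feature map: copy x into the block belonging to class y, zeros elsewhere."""
    x = np.asarray(x, dtype=float)
    zero = np.zeros_like(x)
    return np.concatenate([x, zero]) if y == 0 else np.concatenate([zero, x])

# Multiclass perceptron on the Question 1 data with candidate classes {0, 1}
w = np.zeros(4)
X, Y = [(1, 0), (0, 1)], [0, 1]
for _ in range(3):
    for x, y_gold in zip(X, Y):
        y_pred = max((0, 1), key=lambda y: w @ f(x, y))
        if y_pred != y_gold:
            w += f(x, y_gold) - f(x, y_pred)
print(w)  # differs from the w in Question 1, as discussed in Answer 2
</code>
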
==== Question 3 ====

In Figure 4, why do you think that the F-measure for the Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than for the Parallel (Iterative Parameter Mix) one?

**Answer:**

  * Iterative Parameter Mixing is a form of parameter averaging, which has the same effect as the averaged perceptron.
  * The F-measures for Serial (All Data) and Parallel (Iterative Parameter Mix) are very similar in the second column, because there both methods are already averaged.
  * There may also be a bagging-like effect.

==== Question 4 ====

Imagine you are an engineer at Google (lots of zeros on your bank account, unlimited food, etc. ;)) and want to learn your "..."
Your eco-Manager is a hobby mathematician and he will not give power to your machines unless he sees a nice formula that theoretically justifies the optimality of N with respect to energy wasted by running them (and with respect to sunshine wasted to produce the electricity to be wasted by them). Fortunately, ...

  N = argmax_N f(N, T, F, ...)
  f = ?

**Answer:**

We have not concluded on a particular formula.
  * It also depends on the convergence criteria.
  * With no time limit, the serial algorithm would have the lowest energy consumption.
  * With a time limit, we should use the fewest shards that still meet it.
