courses:rg:2012:distributed-perceptron [2012/12/16 17:43] machacek
====== Distributed Training Strategies for the Structured Perceptron - RG report ======
===== Presentation =====

==== 3 Structured Perceptron ====

==== 4 Distributed Structured Perceptron ====

=== 4.1 Parameter Mixing ===

=== 4.2 Iterative Parameter Mixing ===

==== 5 Experiments ====

===== Questions =====
==== Question 1 ====

Let us set learning_rate = 0.3.

  X = [(1, 0), (0, 1)] // data
  Y = [0, 1] // classes
  w = [?, ?]

**Answer** (the trace assumes a nonzero activation threshold, e.g. 0.5; the question statement is truncated):

| x_1 | x_2 | y | w_1 | w_2 | x · w | y' | e = y - y' | Δw_1 = α · e · x_1 | Δw_2 = α · e · x_2 |
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.3 | 0.3 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.6 | 0.6 | 1 | 0 | 0 | 0 |

w = [0, 0.6]
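The trace above can be checked with a short simulation. This is only a sketch: the question statement is truncated, so the 0.5 activation threshold is an assumption chosen to be consistent with the table (any threshold in (0.3, 0.6] would reproduce it).

```python
# Simulation of the perceptron trace above.
learning_rate = 0.3
threshold = 0.5          # assumed, not stated in the (truncated) question
X = [(1, 0), (0, 1)]     # data
Y = [0, 1]               # classes

w = [0.0, 0.0]
for epoch in range(3):   # three passes over the data, as in the table
    for x, y in zip(X, Y):
        activation = sum(wi * xi for wi, xi in zip(w, x))
        y_pred = 1 if activation >= threshold else 0
        error = y - y_pred
        w = [wi + learning_rate * error * xi for wi, xi in zip(w, x)]

print(w)  # -> [0.0, 0.6]
```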
==== Question 2 ====
  f = ?
  w = [?, ?]

**Answer 1:**

f(**x**, y) = (y == 0) ? (**x**, 0, 0, ..., 0) : (0, 0, ..., 0, **x**)

**Answer 2:**

According to English [[http:// |

However, I would say that this holds only for an activation threshold of 0. Therefore, this formula cannot be used to compute the example from Question 1.
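The block feature map from Answer 1 can be sketched in code. This is an illustration under assumptions: two classes and dense, fixed-length feature vectors.

```python
# Sketch of the joint feature map f(x, y) from Answer 1 for two classes.
def f(x, y):
    """Place x in the block belonging to class y, zeros elsewhere."""
    zeros = [0.0] * len(x)
    return list(x) + zeros if y == 0 else zeros + list(x)

def predict(w, x):
    """Structured-perceptron decoding: argmax over y of w . f(x, y)."""
    scores = {y: sum(wi * fi for wi, fi in zip(w, f(x, y))) for y in (0, 1)}
    return max(scores, key=scores.get)

# With the Question 1 weights placed in the block for class 1,
# the second training example is classified correctly.
w = [0.0, 0.0, 0.0, 0.6]
print(predict(w, (0, 1)))  # -> 1
```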

==== Question 3 ====

In Figure 4, why do you think that the F-measure for the Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than for the Parallel (Iterative Parameter Mix) one?

**Answer:**

  * Iterative Parameter Mixing is a form of parameter averaging, which has the same effect as the averaged perceptron.
  * The F-measures for Serial (All Data) and Parallel (Iterative Parameter Mix) are very similar in the second column, because there both methods are already averaged.
  * There may also be a bagging-like effect.
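Iterative parameter mixing can be sketched as follows: each shard runs one perceptron epoch starting from the mixed weights, and the per-shard weights are then averaged. A uniform mixture is an assumption here (the paper also allows non-uniform mixture coefficients), and the thresholded perceptron with threshold 0.5 is carried over from Question 1's setting.

```python
# Hedged sketch of iterative parameter mixing over data shards.
def perceptron_epoch(w, shard, lr=0.3, threshold=0.5):
    """One pass of the thresholded perceptron over one shard."""
    w = list(w)
    for x, y in shard:
        y_pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) >= threshold else 0
        error = y - y_pred
        w = [wi + lr * error * xi for wi, xi in zip(w, x)]
    return w

def iterative_parameter_mixing(shards, epochs, dim=2):
    w = [0.0] * dim
    for _ in range(epochs):
        per_shard = [perceptron_epoch(w, shard) for shard in shards]
        # Uniform mixture: average the per-shard weights after every epoch.
        w = [sum(ws) / len(per_shard) for ws in zip(*per_shard)]
    return w
```

Averaging after every epoch is what distinguishes this from plain parameter mixing, where each shard trains to completion and the weights are mixed only once at the end.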
==== Question 4 ====
  N = argmax_N f(N, T, F, ...)
  f = ?

**Answer:**

We did not agree on a particular formula.

  * It also depends on the convergence criteria.
  * With no time limit, the serial algorithm would have the lowest energy consumption.
  * With a time limit, we should use as few shards as possible while still meeting the limit.
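The last point can be phrased as a toy heuristic. This is purely illustrative, since no formula was agreed on; `epoch_time` and `epochs_needed` are hypothetical cost models, not from the paper (`epochs_needed` grows with N because iterative parameter mixing tends to need more epochs as the data is split further).

```python
# Illustrative heuristic: pick the smallest shard count N whose estimated
# wall-clock training time fits the limit T. Fewer shards means less total
# work, hence lower energy consumption, so we stop at the first N that fits.
def choose_shards(time_limit, epoch_time, epochs_needed, max_shards=64):
    for n in range(1, max_shards + 1):
        if epoch_time(n) * epochs_needed(n) <= time_limit:
            return n
    return max_shards
```

For example, with `epoch_time(n) = 100 / n` and `epochs_needed(n) = 10 + n` (hypothetical), a limit of 400 is first met at n = 4.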