====== Distributed Training Strategies for the Structured Perceptron - RG report ======

===== Presentation =====

==== 3 Structured Perceptron ====

  * In the unstructured perceptron, you are trying to separate two sets of points with a hyperplane. See Question 1 for the algorithm. In the training phase, you iterate over your training data and adjust the hyperplane every time you make a mistake. [[http://www.youtube.com/watch?v=vGwemZhPlsA|Youtube Example]]

  * The structured (or multiclass) perceptron is a generalization of the unstructured perceptron. See Figure 1 in the paper for the algorithm (and the sketch after this list).
  * You can use any structured input x (not just a vector; a sentence, for example) and any structured output y (not just a binary value; a parse tree, for example).
  * You need a function f(x,y) which returns a feature representation of a candidate input-output pair.
  * Using Theorem 1, you can bound the number of mistakes made during training.
    * The computational time is therefore also bounded.
    * This holds only for linearly separable sets.
  * Other remarks and discussed issues
    * The perceptron training algorithm does not always return the same weights (unlike maximal margin methods); the result depends on the order of the training data.
    * How is inference done in difficult tasks like parsing? By iterating over all possible y? By approximation?

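A minimal sketch of the algorithm from Figure 1. The data layout here (a finite candidate set per example so the argmax can be taken by brute force, and sparse ''{index: value}'' feature dicts) is an assumption made for illustration, not the paper's notation; real structured tasks like parsing replace the brute-force argmax with specialized inference.

<code python>
# Structured perceptron (Figure 1 in the paper), sketched.
# Assumption: each example carries its own finite candidate set.

def dot(w, feats):
    """Score of a candidate: dot product of weights and sparse features."""
    return sum(w[i] * v for i, v in feats.items())

def train(data, f, n_features, epochs=10):
    """data: list of (x, y_gold, candidates) triples; f(x, y) -> {index: value}."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y_gold, candidates in data:
            # Inference step: pick the highest-scoring candidate output.
            y_pred = max(candidates, key=lambda y: dot(w, f(x, y)))
            if y_pred != y_gold:
                # Perceptron update: towards the gold features,
                # away from the (wrongly) predicted ones.
                for i, v in f(x, y_gold).items():
                    w[i] += v
                for i, v in f(x, y_pred).items():
                    w[i] -= v
    return w
</code>
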
==== 4 Distributed Structured Perceptron ====

  * Motivation: There is no straightforward way to make the standard perceptron algorithm parallel, because each update depends on the weights produced by all previous updates.

=== 4.1 Parameter Mixing ===
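
A minimal sketch of parameter mixing (the paper's Figure 2), reusing the ''train'' sketch from Section 3 above: train one perceptron per shard independently (each call would run on its own machine, e.g. as a map task, in the real setting) and return a mixture of the learned weight vectors. Uniform mixture weights μ_i = 1/S are assumed here.

<code python>
# Parameter mixing (Figure 2), sketched: each shard trains a perceptron
# independently, and the final model is an average of the shard weights.
# Uniform mixture weights mu_i = 1/S are an assumption for illustration.

def parameter_mixing(shards, f, n_features, epochs=10):
    shard_weights = [train(shard, f, n_features, epochs) for shard in shards]
    S = len(shards)
    return [sum(ws[i] for ws in shard_weights) / S
            for i in range(n_features)]
</code>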

=== 4.2 Iterative Parameter Mixing ===
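
Iterative parameter mixing (the paper's Figure 3) differs in that the shard weights are mixed after every epoch, and each shard resumes training from the mixed weights. A sketch under the same assumptions as above (uniform mixture weights, ''dot'' and the data layout from the Section 3 sketch):

<code python>
# Iterative parameter mixing (Figure 3), sketched: after every epoch
# the shard weights are averaged, and every shard restarts the next
# epoch from the averaged weights.

def train_epoch(data, f, w):
    """One perceptron epoch over data, starting from weights w."""
    for x, y_gold, candidates in data:
        y_pred = max(candidates, key=lambda y: dot(w, f(x, y)))
        if y_pred != y_gold:
            for i, v in f(x, y_gold).items():
                w[i] += v
            for i, v in f(x, y_pred).items():
                w[i] -= v
    return w

def iterative_parameter_mixing(shards, f, n_features, rounds=10):
    S = len(shards)
    w = [0.0] * n_features
    for _ in range(rounds):
        # One epoch per shard (in parallel in the real setting), each
        # starting from a copy of the current mixed weights.
        shard_weights = [train_epoch(shard, f, list(w)) for shard in shards]
        w = [sum(ws[i] for ws in shard_weights) / S
             for i in range(n_features)]
    return w
</code>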

==== 5 Experiments ====

  
===== Questions =====
  
  
==== Question 1 ====

Let us set learning_rate α = 0.3,

  X = [(1, 0), (0, 1)]  // data
  Y = [0, 1]  // classes

  w = [?, ?]

**Answer:**

^ x_1 ^ x_2 ^ y ^ w_1 ^ w_2 ^ x · w ^ y' ^ e = y - y' ^ Δw_1 = α * e * x_1 ^ Δw_2 = α * e * x_2 ^
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.3 | 0.3 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.6 | 0.6 | 1 | 0 | 0 | 0 |

w = [0, 0.6]

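The computation is small enough to check mechanically. A minimal script reproducing the table; the activation threshold is not visible in this excerpt, but the table implies a threshold of 0.5 (x · w = 0.3 still gives y' = 0, while 0.6 gives y' = 1), so that value is assumed below.

<code python>
# Reproduces the table above. THRESHOLD = 0.5 is an assumption inferred
# from the table (x.w = 0.3 -> y' = 0, but x.w = 0.6 -> y' = 1).

ALPHA = 0.3       # learning rate
THRESHOLD = 0.5   # assumed activation threshold

X = [(1, 0), (0, 1)]  # data
Y = [0, 1]            # classes
w = [0.0, 0.0]        # initial weights

for epoch in range(3):
    for x, y in zip(X, Y):
        dot = sum(wi * xi for wi, xi in zip(w, x))
        y_pred = 1 if dot > THRESHOLD else 0
        error = y - y_pred
        w = [wi + ALPHA * error * xi for wi, xi in zip(w, x)]

print(w)  # -> [0.0, 0.6]
</code>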
  
==== Question 2 ====

  f = ?
  w = [?, ?]

**Answer 1:**

f(**x**,y) = (y == 0) ? (**x**, 0,0,...,0) : (0,0,...,0,**x**)

That is, the input vector **x** is copied into the block of feature positions belonging to class y, and all other positions are zero (see the sketch below).

**Answer 2:**

According to the English [[http://en.wikipedia.org/wiki/Perceptron#Multiclass_perceptron|Wikipedia]]: This multiclass formulation reduces to the original perceptron when **x** is a real-valued vector, y is chosen from {0,1}, and f(**x**,y) = y * **x**.

However, I would say that this holds only for an activation threshold of 0. Therefore, this formula cannot be used to compute the example from Question 1.

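A small sketch of Answer 1's feature map and the multiclass decision rule built on it; the two-class, two-feature setup is purely illustrative.

<code python>
# Answer 1's joint feature map: f(x, y) copies x into the block of
# positions reserved for class y and leaves all other positions zero.

def f(x, y, n_classes=2):
    vec = [0.0] * (len(x) * n_classes)
    vec[y * len(x):(y + 1) * len(x)] = x
    return vec

def predict(w, x, n_classes=2):
    """Pick the class whose feature block scores highest under w."""
    return max(range(n_classes),
               key=lambda y: sum(wi * fi
                                 for wi, fi in zip(w, f(x, y, n_classes))))

print(f([1.0, 0.5], 0))  # -> [1.0, 0.5, 0.0, 0.0]
print(f([1.0, 0.5], 1))  # -> [0.0, 0.0, 1.0, 0.5]
</code>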
  
==== Question 3 ====
In Figure 4, why do you think that the F-measure for the Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than the Parallel (Iterative Parameter Mix) one?

**Answer:**

  * Iterative Parameter Mixing is just a form of parameter averaging, which has the same effect as the averaged perceptron.
    * The F-measures for Serial (All Data) and Parallel (Iterative Parameter Mix) are very similar in the second column, because there both methods are already averaged.
  * There may also be a bagging-like effect: each shard trains on a different subsample of the data.
  
==== Question 4 ====

  N = argmax_N f(N, T, F, ...)
  f = ?

**Answer:**

We did not arrive at a particular formula.
  * It also depends on the convergence criteria.
  * With no time limitation, the serial algorithm would have the lowest energy consumption.
  * With a time limitation, we should use as few shards as possible while still meeting the time limit.
