[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
courses:rg:2012:distributed-perceptron [2012/12/16 15:56]
machacek
courses:rg:2012:distributed-perceptron [2012/12/16 23:44] (current)
machacek
Line 1: Line 1:
-====== Distributed Training Strategies for the Structured Perceptron - RG report ======+====== Distributed Training Strategies for the Structured Perceptron - RG report - UNDER CONSTRUCTION ====== 
 + 
 +===== Presentation ===== 
 + 
 +==== 3 Structured Perceptron ==== 
 + 
 +  * In unstructured perceptron, you are trying to separate two sets of with hyperplane. See Question 1 for the algorithm. In training phase, you iterate your training data and adjust the hyperplane every time you make a mistake. [[http://www.youtube.com/watch?v=vGwemZhPlsA|Youtube Example]] 
 + 
 +  * Structured (or multiclass) perceptron is generalization of the unstructured perceptron. See figure 1 in the paper for the algorithm. 
 +  * You can use any structured input x (not just vector, sentence for example) and any structured output y (not just binary value, parse tree for example) 
 +  * You need to have fuction f(x,y) which returns feature representation of candidate input-output pair 
 +  * Using the Theorem 1, you can bound the number of mistakes made during the training 
 +    * The computational time is therefore also bounded. 
 +    * This holds only for linearly separable sets. 
 +  * Other remarks and discussed issues 
 +    * The perceptron training algorithm does not always return the same weights (unlike maximal margin). It depends on order of training data. 
 +    * How the inference is done in the difficult tasks like parsing? Iterating all possible y? Approximation? 
 + 
 +==== 4 Distributed Structured Perceptron ==== 
 + 
 +  * Motivation: There is no straightforward way to make the standard perceptron algorithm parallel.  
 + 
 +=== 4.1 Parameter Mixing === 
 + 
 +  * First attempt to map-reduce algorithm 
 +  * Divide the training data into shards 
 +  * In MAP step, train a perceptron on each shard 
 +  * In REDUCE step, average the trained weight vectors 
 + 
 +=== 4.2 Iterative Parameter Mixing === 
 + 
 +==== 5 Experiments ==== 
 + 
  
 ===== Questions ===== ===== Questions =====
Line 38: Line 71:
  
 w = [0, 0.6] w = [0, 0.6]
 +
 +  
  
 ==== Question 2 ==== ==== Question 2 ====
Line 44: Line 79:
 f = ? f = ?
 w = [?, ?] w = [?, ?]
 +
 +**Answer 1:**
 +
 +f(**x**,y) = (y == 0) ? (**x**, 0,0,...,0) : (0,0,...,0,**x**)
 +
 +**Answer 2:**
 +
 +Acording to English [[http://en.wikipedia.org/wiki/Perceptron#Multiclass_perceptron|Wikipedia]]: This multiclass formulation reduces to the original perceptron when **x** is a real-valued vector, y is chosen from {0,1}, and f(**x**,y) = y * **x**.
 +
 +However, I would say, that this holds only for activation treshold = 0. Therefore, this formula cannot be used to compute example from Question 1.
 +
  
 ==== Question 3 ==== ==== Question 3 ====
 In figure 4, why do you think that the F-measure for Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than the Parallel (Iterative Parametere Mix)? In figure 4, why do you think that the F-measure for Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than the Parallel (Iterative Parametere Mix)?
 +
 +
 +**Answer:**
 +
 +  * Iterative Parameter Mixing is just a form of parameter averaging, which has the same effect as the averaged perceptron.
 +    * F-measures for seral (All Data) and Paralel (Iterative Parameter Mix) are very similar in the second column. It is because the both methods are already averaged.
 +  * Bagging like effect
  
 ==== Question 4 ==== ==== Question 4 ====
Line 61: Line 114:
 N = argmax_N f(N, T, F, ...) N = argmax_N f(N, T, F, ...)
 f = ? f = ?
 +
 +**Answer:**
 +
 +We have not concluded on a particular formula.
 +  * It also depends on convergence criteria. 
 +  * With no time limitation, the serial algorithm would have the least energy consumption. 
 +  * With time limitation, we should use as least shards to meet the time limitation. 

[ Back to the navigation ] [ Back to the content ]