====== Distributed Training Strategies for the Structured Perceptron - RG report ======

===== Presentation =====

==== 3 Structured Perceptron ====

  * In the unstructured perceptron, you are trying to separate two sets of points with a hyperplane. See Question 1 for the algorithm. In the training phase, you iterate over your training data and adjust the hyperplane every time you make a mistake. [[http://www.youtube.com/watch?v=vGwemZhPlsA|Youtube Example]]

  * The structured (or multiclass) perceptron is a generalization of the unstructured perceptron. See Figure 1 in the paper for the algorithm (and the sketch after this list).
  * You can use any structured input x (not just a vector; a sentence, for example) and any structured output y (not just a binary value; a parse tree, for example).
  * You need a function f(x,y) which returns a feature representation of a candidate input-output pair.
  * Using Theorem 1, you can bound the number of mistakes made during training.
    * The computational time is therefore also bounded.
    * This holds only for linearly separable sets.
  * Other remarks and discussed issues
    * The perceptron training algorithm does not always return the same weights (unlike maximal margin methods); the result depends on the order of the training data.
    * How is inference done in difficult tasks like parsing? By iterating over all possible y? By approximation?

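A minimal sketch of the algorithm from Figure 1. The data layout here (a finite candidate set per example so the argmax can be taken by brute force, and sparse ''{index: value}'' feature dicts) is an assumption made for illustration, not the paper's notation; real structured tasks like parsing replace the brute-force argmax with specialized inference.

<code python>
# Structured perceptron (Figure 1 in the paper), sketched.
# Assumption: each example carries its own finite candidate set.

def dot(w, feats):
    """Score of a candidate: dot product of weights and sparse features."""
    return sum(w[i] * v for i, v in feats.items())

def train(data, f, n_features, epochs=10):
    """data: list of (x, y_gold, candidates) triples; f(x, y) -> {index: value}."""
    w = [0.0] * n_features
    for _ in range(epochs):
        for x, y_gold, candidates in data:
            # Inference step: pick the highest-scoring candidate output.
            y_pred = max(candidates, key=lambda y: dot(w, f(x, y)))
            if y_pred != y_gold:
                # Perceptron update: towards the gold features,
                # away from the (wrongly) predicted ones.
                for i, v in f(x, y_gold).items():
                    w[i] += v
                for i, v in f(x, y_pred).items():
                    w[i] -= v
    return w
</code>
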
==== 4 Distributed Structured Perceptron ====

  * Motivation: There is no straightforward way to make the standard perceptron algorithm parallel, because each update depends on the weights produced by all previous updates.

=== 4.1 Parameter Mixing ===
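
A minimal sketch of parameter mixing (the paper's Figure 2), reusing the ''train'' sketch from Section 3 above: train one perceptron per shard independently (each call would run on its own machine, e.g. as a map task, in the real setting) and return a mixture of the learned weight vectors. Uniform mixture weights μ_i = 1/S are assumed here.

<code python>
# Parameter mixing (Figure 2), sketched: each shard trains a perceptron
# independently, and the final model is an average of the shard weights.
# Uniform mixture weights mu_i = 1/S are an assumption for illustration.

def parameter_mixing(shards, f, n_features, epochs=10):
    shard_weights = [train(shard, f, n_features, epochs) for shard in shards]
    S = len(shards)
    return [sum(ws[i] for ws in shard_weights) / S
            for i in range(n_features)]
</code>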

=== 4.2 Iterative Parameter Mixing ===
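
Iterative parameter mixing (the paper's Figure 3) differs in that the shard weights are mixed after every epoch, and each shard resumes training from the mixed weights. A sketch under the same assumptions as above (uniform mixture weights, ''dot'' and the data layout from the Section 3 sketch):

<code python>
# Iterative parameter mixing (Figure 3), sketched: after every epoch
# the shard weights are averaged, and every shard restarts the next
# epoch from the averaged weights.

def train_epoch(data, f, w):
    """One perceptron epoch over data, starting from weights w."""
    for x, y_gold, candidates in data:
        y_pred = max(candidates, key=lambda y: dot(w, f(x, y)))
        if y_pred != y_gold:
            for i, v in f(x, y_gold).items():
                w[i] += v
            for i, v in f(x, y_pred).items():
                w[i] -= v
    return w

def iterative_parameter_mixing(shards, f, n_features, rounds=10):
    S = len(shards)
    w = [0.0] * n_features
    for _ in range(rounds):
        # One epoch per shard (in parallel in the real setting), each
        # starting from a copy of the current mixed weights.
        shard_weights = [train_epoch(shard, f, list(w)) for shard in shards]
        w = [sum(ws[i] for ws in shard_weights) / S
             for i in range(n_features)]
    return w
</code>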

==== 5 Experiments ====

  
===== Questions =====
  
  
==== Question 1 ====

Let us set learning_rate α = 0.3,

  X = [(1, 0), (0, 1)]  // data
  Y = [0, 1]  // classes

  w = [?, ?]

**Answer:**

^ x_1 ^ x_2 ^ y ^ w_1 ^ w_2 ^ x · w ^ y' ^ e = y - y' ^ Δw_1 = α * e * x_1 ^ Δw_2 = α * e * x_2 ^
| 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.3 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.3 | 0.3 | 0 | 1 | 0 | 0.3 |
| 1 | 0 | 0 | 0 | 0.6 | 0 | 0 | 0 | 0 | 0 |
| 0 | 1 | 1 | 0 | 0.6 | 0.6 | 1 | 0 | 0 | 0 |

w = [0, 0.6]

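The computation is small enough to check mechanically. A minimal script reproducing the table; the activation threshold is not visible in this excerpt, but the table implies a threshold of 0.5 (x · w = 0.3 still gives y' = 0, while 0.6 gives y' = 1), so that value is assumed below.

<code python>
# Reproduces the table above. THRESHOLD = 0.5 is an assumption inferred
# from the table (x.w = 0.3 -> y' = 0, but x.w = 0.6 -> y' = 1).

ALPHA = 0.3       # learning rate
THRESHOLD = 0.5   # assumed activation threshold

X = [(1, 0), (0, 1)]  # data
Y = [0, 1]            # classes
w = [0.0, 0.0]        # initial weights

for epoch in range(3):
    for x, y in zip(X, Y):
        dot = sum(wi * xi for wi, xi in zip(w, x))
        y_pred = 1 if dot > THRESHOLD else 0
        error = y - y_pred
        w = [wi + ALPHA * error * xi for wi, xi in zip(w, x)]

print(w)  # -> [0.0, 0.6]
</code>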
  
==== Question 2 ====

  f = ?
  w = [?, ?]

**Answer 1:**

f(**x**,y) = (y == 0) ? (**x**, 0,0,...,0) : (0,0,...,0,**x**)

That is, the input vector **x** is copied into the block of feature positions belonging to class y, and all other positions are zero (see the sketch below).

**Answer 2:**

According to the English [[http://en.wikipedia.org/wiki/Perceptron#Multiclass_perceptron|Wikipedia]]: This multiclass formulation reduces to the original perceptron when **x** is a real-valued vector, y is chosen from {0,1}, and f(**x**,y) = y * **x**.

However, I would say that this holds only for an activation threshold of 0. Therefore, this formula cannot be used to compute the example from Question 1.

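A small sketch of Answer 1's feature map and the multiclass decision rule built on it; the two-class, two-feature setup is purely illustrative.

<code python>
# Answer 1's joint feature map: f(x, y) copies x into the block of
# positions reserved for class y and leaves all other positions zero.

def f(x, y, n_classes=2):
    vec = [0.0] * (len(x) * n_classes)
    vec[y * len(x):(y + 1) * len(x)] = x
    return vec

def predict(w, x, n_classes=2):
    """Pick the class whose feature block scores highest under w."""
    return max(range(n_classes),
               key=lambda y: sum(wi * fi
                                 for wi, fi in zip(w, f(x, y, n_classes))))

print(f([1.0, 0.5], 0))  # -> [1.0, 0.5, 0.0, 0.0]
print(f([1.0, 0.5], 1))  # -> [0.0, 0.0, 1.0, 0.5]
</code>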
  
==== Question 3 ====
In Figure 4, why do you think that the F-measure for the Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than the Parallel (Iterative Parameter Mix) one?

**Answer:**

  * Iterative Parameter Mixing is just a form of parameter averaging, which has the same effect as the averaged perceptron.
    * The F-measures for Serial (All Data) and Parallel (Iterative Parameter Mix) are very similar in the second column, because there both methods are already averaged.
  * There may also be a bagging-like effect: each shard trains on a different subsample of the data.
  
==== Question 4 ====

  N = argmax_N f(N, T, F, ...)
  f = ?

**Answer:**

We did not arrive at a particular formula.
  * It also depends on the convergence criteria.
  * With no time limitation, the serial algorithm would have the lowest energy consumption.
  * With a time limitation, we should use as few shards as possible while still meeting the time limit.
