Differences
This shows you the differences between two versions of the page.
Both sides previous revision Previous revision Next revision | Previous revision | ||
courses:rg:2012:distributed-perceptron [2012/12/16 21:59] machacek |
courses:rg:2012:distributed-perceptron [2012/12/16 23:44] (current) machacek |
||
---|---|---|---|
Line 1: | Line 1: | ||
- | ====== Distributed Training Strategies for the Structured Perceptron - RG report ====== | + | ====== Distributed Training Strategies for the Structured Perceptron - RG report |
===== Presentation ===== | ===== Presentation ===== | ||
Line 5: | Line 5: | ||
==== 3 Structured Perceptron ==== | ==== 3 Structured Perceptron ==== | ||
- | * In unstructured perceptron, you are trying to separate two sets with hyperplane. See Question 1 for the algorithm. In training phase, you iterate your training data and adjust the hyperplane every time you make a mistake. [[http:// | + | * In unstructured perceptron, you are trying to separate two sets of with hyperplane. See Question 1 for the algorithm. In training phase, you iterate your training data and adjust the hyperplane every time you make a mistake. [[http:// |
* Structured (or multiclass) perceptron is generalization of the unstructured perceptron. See figure 1 in the paper for the algorithm. | * Structured (or multiclass) perceptron is generalization of the unstructured perceptron. See figure 1 in the paper for the algorithm. | ||
+ | * You can use any structured input x (not just vector, sentence for example) and any structured output y (not just binary value, parse tree for example) | ||
+ | * You need to have fuction f(x,y) which returns feature representation of candidate input-output pair | ||
+ | * Using the Theorem 1, you can bound the number of mistakes made during the training | ||
+ | * The computational time is therefore also bounded. | ||
+ | * This holds only for linearly separable sets. | ||
+ | * Other remarks and discussed issues | ||
+ | * The perceptron training algorithm does not always return the same weights (unlike maximal margin). It depends on order of training data. | ||
+ | * How the inference is done in the difficult tasks like parsing? Iterating all possible y? Approximation? | ||
==== 4 Distributed Structured Perceptron ==== | ==== 4 Distributed Structured Perceptron ==== | ||
Line 14: | Line 22: | ||
=== 4.1 Parameter Mixing === | === 4.1 Parameter Mixing === | ||
+ | |||
+ | * First attempt to map-reduce algorithm | ||
+ | * Divide the training data into shards | ||
+ | * In MAP step, train a perceptron on each shard | ||
+ | * In REDUCE step, average the trained weight vectors | ||
=== 4.2 Iterative Parameter Mixing === | === 4.2 Iterative Parameter Mixing === |