# Distributed Training Strategies for the Structured Perceptron - RG report - UNDER CONSTRUCTION

## Presentation

### 3 Structured Perceptron

• In unstructured perceptron, you are trying to separate two sets of with hyperplane. See Question 1 for the algorithm. In training phase, you iterate your training data and adjust the hyperplane every time you make a mistake. Youtube Example
• Structured (or multiclass) perceptron is generalization of the unstructured perceptron. See figure 1 in the paper for the algorithm.
• You can use any structured input x (not just vector, sentence for example) and any structured output y (not just binary value, parse tree for example)
• You need to have fuction f(x,y) which returns feature representation of candidate input-output pair
• Using the Theorem 1, you can bound the number of mistakes made during the training
• The computational time is therefore also bounded.
• This holds only for linearly separable sets.
• Other remarks and discussed issues
• The perceptron training algorithm does not always return the same weights (unlike maximal margin). It depends on order of training data.
• How the inference is done in the difficult tasks like parsing? Iterating all possible y? Approximation?

### 4 Distributed Structured Perceptron

• Motivation: There is no straightforward way to make the standard perceptron algorithm parallel.

#### 4.1 Parameter Mixing

• First attempt to map-reduce algorithm
• Divide the training data into shards
• In MAP step, train a perceptron on each shard
• In REDUCE step, average the trained weight vectors

## Questions

### Question 1

What are the weights w after running the stand perceptron training on the data bellow? The standard perceptron does only binary classification, i.e. y \in {0,1}
(unlike multi-class in Fig. 1) and its algorithm can be defined as follows:

StandardPerc(N, T={(x_1, y_1), ..., (x_|T|, y_|T|)})
w(0) = 0; k=0
for n:1..N
for t:1..|T|
y' = {if w(k) \dot x_t > 0.5: 1; else: 0}    // 1st difference to Figure1
if (y_t != y')
w(k+1) = w(k) + (y_t - y')*learning_rate*x_t    // 2nd difference to Figure1
k++
return w(k)

Let us set learning_rate α = 0.3,
X = [(1, 0), (0, 1)] data
Y = [0, 1]
classes
T = {(x_t, y_t): \forall t: x_t = X[t], y_t = Y[t] }
N = 5

w = [?, ?]

 x_1 x_2 y w_1 w_2 x \dot w y' e = y - y' Δw_1 = α * e * x_1 Δw_2 = α * e * x_2 1 0 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0.3 1 0 0 0 0.3 0 0 0 0 0 0 1 1 0 0.3 0.3 0 1 0 0.3 1 0 0 0 0.6 0 0 0 0 0 0 1 1 0 0.6 0.6 1 0 0 0

w = [0, 0.6]

### Question 2

Imagine you want to solve the learning problem above by the multi-class perceptron algorithm (Figure 1). First you will need to figure out what the function f is like, there are many good possibilities. Will you get different w? Can you think of one that will yield the same w given all the other variables are the same?

f = ?
w = [?, ?]

f(x,y) = (y == 0) ? (x, 0,0,…,0) : (0,0,…,0,x)

Acording to English Wikipedia: This multiclass formulation reduces to the original perceptron when x is a real-valued vector, y is chosen from {0,1}, and f(x,y) = y * x.

However, I would say, that this holds only for activation treshold = 0. Therefore, this formula cannot be used to compute example from Question 1.

### Question 3

In figure 4, why do you think that the F-measure for Regular Perceptron (first column) learned by the Serial (All Data) algorithm is worse than the Parallel (Iterative Parametere Mix)?

• Iterative Parameter Mixing is just a form of parameter averaging, which has the same effect as the averaged perceptron.
• F-measures for seral (All Data) and Paralel (Iterative Parameter Mix) are very similar in the second column. It is because the both methods are already averaged.
• Bagging like effect

### Question 4

Your eco-Manager is a hobby mathematician and he will not give power to your machines unless he sees a nice formula that theoretically justifies the optimality of N with respect to energy wasted by running them (and with respect to sunshine wasted to produce the electricity to be wasted by them). Fortunately, you don't need to justify your 4 days requirement (he understands;)). Therefore, he would like you to include variables like:
T - size of your training data,
F - number of your features,
N - number of machines,
and other relevant factors, like lags, bandwidth, …

What variables will you need to include? Can you come up with a sketch of such formula that would get your machines the “juice” from the manager? ;) (hint: imagine T is like 100k, and f is 500 thousand)

N = argmax_N f(N, T, F, …)
f = ? 