[ Skip to the content ]

Institute of Formal and Applied Linguistics Wiki


[ Back to the navigation ]

Differences

This shows you the differences between two versions of the page.

Link to this comparison view

Both sides previous revision Previous revision
Next revision
Previous revision
courses:rg:2013:memm [2013/03/27 16:56]
vandemos
courses:rg:2013:memm [2014/10/12 15:04] (current)
popel
Line 1: Line 1:
-===== Maximum Entropy Markov Models - Questions & Answers =====+===== Maximum Entropy Markov Models - Questions =====
  
-**1. Explain (roughly) how the new formula for α_t+1(s) is derived (i.e. formula 1 in the paper).** +1. Explain (roughly) how the new formula for α_t+1(s) is derived (i.e. formula 1 in the paper).
-The formula is used for calculating the foward probabilities at each time step t, meaning it needs to use the forward probability calculated so far, the α_t(s'). And instead of the seperate transition probability P(s|s') and observation probability P(o_t+1|s) for HMMs, these are replaced with the single probability P_s'(s|o_t+1) for MEMMs. All of this is simply based on the changes introduced by the paper+
  
-It is important to note that these two formulas are not equalThe original formula belongs to a generative joint modelwhereas the new model belongs to a conditional model+2Section 2.1 states "we will split P(s|s',o) into |S| separately trained transition functions"What are the advantages and disadvantages of this approach?
  
-**2. Section 2.1 states "we will split P(s|s',o) into |S| separately trained transition functions". What are the advantages and disadvantages of this approach?** +3. Let S= {V,N} (verb and non-verb)
- +
-This was the bonus question since the authors of the paper do not clearly state their reasons for splitting the functions. The only advantage we came up with during the discussion was the possibility for parallelization. One main disadvantage of this splitting is mentioned in section 2.6 of the paper: the O(|S|^2) transition parameters and data sparsity. +
- +
-Clearly we are not the only ones unsure about this splitting, as shown by the analysis of this very same paper by Hal and Piyush: [[http://www.cs.utah.edu/~suresh/mediawiki/index.php/MLRG/spring10|Link]] +
- +
-**3. Let S= {V,N} (verb and non-verb)+
 Training data = he/N can/V can/V a/N can/N Training data = he/N can/V can/V a/N can/N
 //Observation features// are: //Observation features// are:
Line 19: Line 12:
 b3 = current word is “a” and next word is “can” b3 = current word is “a” and next word is “can”
 When implementing MEMM you need to define s_0, i.e. the previous state before the first token. It may be a special NULL, but for simplicity let’s define it as N. When implementing MEMM you need to define s_0, i.e. the previous state before the first token. It may be a special NULL, but for simplicity let’s define it as N.
-a) What are the states (s) and observations (o) for this training data?** +a) What are the states (s) and observations (o) for this training data?
-States (s): {N,V} +
-Observations (o): {he,can,a} +
- +
-**b) Equation (2) defines features f_a based on //observation features// b. How many such f_a features do we have?** +
-There are three binary features and two types of states, so 6 f_a features in total. +
- +
-**c) Equation (3) defines constraints. How many such constraints do we have?** +
-There are six f_a features and two types of (previous) states, so 12 constraints in total. +
- +
-**d) List all the constraints involving feature b2, i.e. substitute (whenever possible) concrete numbers into Equation (3).** +
-For s'=N and a = <b2, N>  +
-LHS:  +
-(1/3)*[f_<b2,N>(he,N) + f_<b2,N>(can,V) + f_<b2,N>(can,N)]  +
-= (1/3)[0+0+1] +
-RHS:  +
-(1/3)*[P_N(N|he)*f_<b2,N>(he,N) + P_N(N|can)*f_<b2,N>(can,V) + P_N(N|can)*f_<b2,N>(can,N)]  +
-= (1/3)[P_N(N|can)] +
- +
-For s'=N and a = <b2, V>  +
-LHS:  +
-(1/3)*[f_<b2,V>(he,N) + f_<b2,V>(can,V) + f_<b2,V>(can,N)]  +
-= (1/3)[0+1+0] +
-RHS:  +
-(1/3)*[P_N(V|he)*f_<b2,V>(he,N) + P_N(V|can)*f_<b2,V>(can,V) + P_N(V|can)*f_<b2,V>(can,N)]  +
-= (1/3)[P_N(V|can)] +
- +
-For s'=V and a = <b2, N>  +
-LHS:  +
-(1/2)*[f_<b2,N>(can,V) + f_<b2,N>(a,N)]  +
-= (1/2)[0+0] +
-RHS:  +
-(1/2)*[P_V(N|can)*f_<b2,N>(can,V) + P_V(N|a)*f_<b2,N>(a,N)]  +
-= 0 +
- +
-For s'=V and a = <b2, V>  +
-LHS:  +
-(1/2)*[f_<b2,V>(can,V) + f_<b2,V>(a,N)]  +
-= (1/2)[1+0] +
-RHS:  +
-(1/2)*[P_V(V|can)*f_<b2,V>(can,V) + P_V(V|a)*f_<b2,V>(a,N)]  +
-= (1/2)[P_V(V|can)]+
  
-**eIn step 3 of the GIS algorithm you need to compute <latex>P_{s’}^{(j)}(s|o)</latex>. Compute <latex>P_N^{(0)}(N|can)</latex> and <latex>P_N^{(0)}(V|can)</latex>.** +bEquation (2defines features f_a based on //observation features// bHow many such f_a features do we have?
-Since we start in iteration 0, the lambdas are all equal to 1.+
  
-P_N^{(0)}(N|can)  +cEquation (3defines constraints. How many such constraints do we have?
-= 1/norm * exp(f_<b1,N>(can,N) + f_<b1,V>(can,N) + f_<b2,N>(can,N) + f_<b2,V>(can,N) + f_<b3,N>(can,N) + f_<b3,V>(can,N)) +
-= 1/norm * exp(0 + 0 + 1 + 0 + 0 + 0)+
  
-P_N^{(0)}(V|can)  +dList all the constraints involving feature b2, i.e. substitute (whenever possibleconcrete numbers into Equation (3).
-= 1/norm * exp(f_<b1,N>(can,V) + f_<b1,V>(can,V) + f_<b2,N>(can,V) + f_<b2,V>(can,V) + f_<b3,N>(can,V) + f_<b3,V>(can,V)) +
-= 1/norm * exp(0 + 0 + 0 + 1 + 0 + 0)+
  
-norm = 2e+e) In step 3 of the GIS algorithm you need to compute <latex>P_{s’}^{(j)}(s|o)</latex>. Compute <latex>P_N^{(0)}(N|can)</latex> and <latex>P_N^{(0)}(V|can)</latex>.
  
 **Hint** : You might be confused about the m_s' variable (and  t_1, …, <latex>t_{m_{s'}}</latex>) in Equation (3). **Hint** : You might be confused about the m_s' variable (and  t_1, …, <latex>t_{m_{s'}}</latex>) in Equation (3).

[ Back to the navigation ] [ Back to the content ]