==== Pre-sent Exercises ====
The following exercises were sent to the mailing list a few days in advance and were meant to make the readers think about the semantic stack representation used in the paper. They were not answered in much detail during the session; we only went through them briefly to make sure we understood the basic concepts. Thus, the following solutions are mostly my own interpretation and are not guaranteed to be 100 percent correct.
  
**Ex. 1)** Try to think of a semantic stack representation and a dialogue act for the following sentence:
"For Chinese food, The Golden Palace restaurant is the best one in the centre of the city. It is located on the side of the river near East Road."
The solution could look something like this (inspired by Table 1):
^ surface form | for Chinese food | The Golden Palace | restaurant | is the best one | in the | centre of the city | It is located | on the side of the river | near | East Road |
^ sem. stack   | Chinese | The Golden Palace | restaurant | | | centre | | riverside | | East Road |
^              | food    | name              | type       | | area | area | area | area | near | name |
^              | inform  | inform            | inform     | inform | inform | inform | inform | inform | inform | inform |
  
Note: The semantic stack for a given surface phrase is always the whole column (but **excluding** the "surface form" row).
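
To make the column-as-stack reading concrete, below is a minimal sketch of how the Ex. 1 annotation could be held in code. The list-of-pairs layout and the joined string symbols are my own choices for illustration, not anything prescribed by the paper.

<code python>
# A rough sketch (my own data layout, not from the paper) of the Ex. 1 annotation:
# each surface phrase is paired with its semantic stack, written bottom-up,
# i.e. the dialogue act type first, the most concrete value last.
annotation = [
    ("for Chinese food",         ["inform", "food", "Chinese"]),
    ("The Golden Palace",        ["inform", "name", "The Golden Palace"]),
    ("restaurant",               ["inform", "type", "restaurant"]),
    ("is the best one",          ["inform"]),
    ("in the",                   ["inform", "area"]),
    ("centre of the city",       ["inform", "area", "centre"]),
    ("It is located",            ["inform", "area"]),
    ("on the side of the river", ["inform", "area", "riverside"]),
    ("near",                     ["inform", "near"]),
    ("East Road",                ["inform", "name", "East Road"]),
]

# The paper treats each stack as one atomic unit, so a stack can simply be
# represented as a single joined symbol:
stack_symbols = ["/".join(stack) for _, stack in annotation]
print(stack_symbols[:3])  # ['inform/food/Chinese', 'inform/name/The Golden Palace', 'inform/type/restaurant']
</code>
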
**Ex. 2)** Try to think of a surface realization for the following dialogue act:
reject(type=placetoeat,eattype=restaurant,pricerange=cheap,area=citycentre,food=Indian)

Solution (one of many possible):
"I am sorry but I have no information about any cheap Indian restaurant in the centre of the city."
  
Note: Ondřej admitted that the syntax of the dialogue act in Ex. 2 is not exactly the same as the one used throughout the paper; it was taken directly from the corpus used by the authors (which is publicly available).
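
Just to make the corpus-style syntax concrete, here is a small parsing sketch. The function is my own illustration (it assumes the simple slot=value format above with no nested commas); it is not part of the paper or the released corpus tools.

<code python>
# Hedged sketch (my own helper): split a corpus-style dialogue act string
# into its act type and the list of slot-value pairs.
def parse_dialogue_act(act):
    act_type, _, rest = act.partition("(")
    slots = [tuple(pair.split("=", 1)) for pair in rest.rstrip(")").split(",")]
    return act_type, slots

act = "reject(type=placetoeat,eattype=restaurant,pricerange=cheap,area=citycentre,food=Indian)"
print(parse_dialogue_act(act))
# ('reject', [('type', 'placetoeat'), ('eattype', 'restaurant'),
#             ('pricerange', 'cheap'), ('area', 'citycentre'), ('food', 'Indian')])
</code>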
  
==== Short Introduction to NLG ====
  * The input usually differs a lot depending on the system; it is usually already structured.
  * Common scheme of an NLG (natural language generation) system (simplified):
    * content planning
    * content structuring
    * surface realization
    (This paper is concerned mainly with the second and the third step.)
  * Usual placement of the NLG component in a dialogue system (sketched below):
    * input -> SLU (spoken language understanding) -> Dialogue Manager -> **NLG** -> TTS (text to speech) -> output
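
A toy sketch of that pipeline as plain function composition; all function names and return values are made-up placeholders, shown only to indicate where the NLG component sits.

<code python>
# Hypothetical placeholders for the dialogue-system pipeline above; not a real toolkit API.
def slu(user_input):        return "inform(food=Chinese)"                         # spoken language understanding
def dialogue_manager(act):  return "inform(name=The Golden Palace,food=Chinese)"  # picks the system dialogue act
def nlg(system_act):        return "The Golden Palace serves Chinese food."       # the topic of this paper
def tts(text):              return "<synthesised speech for: %s>" % text          # text to speech

def dialogue_turn(user_input):
    return tts(nlg(dialogue_manager(slu(user_input))))

print(dialogue_turn("I'd like some Chinese food"))
</code>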
  
==== Section 1 ====
The BAGEL system tries to employ statistics earlier in the NLG process than previous systems, namely already in the generation phase itself. Previous systems used statistics either only for reranking the generated outputs, or for choosing among generation rules (or templates), which were themselves handcrafted.
  
==== Section 2 - Semantic Representation ====
  * Input: a set of mandatory stacks.
  * Task: creating the surface form, i.e. adding the optional intermediary stacks between the mandatory stacks and ordering everything (see the sketch after this list).
  * A semantic stack (corresponding to a //phrase// in the surface form) is considered to be an atomic unit.
  * Note about the annotation form: It is inspired by a notation common in dialogue systems and it is quite bound to the domain of information dialogues (finding a restaurant, booking a flight etc.). Lots of other language phenomena would be very difficult to capture this way, e.g. quantifiers, ranking... (Even the word "best" in the exercise is very simplified, used just as a part of a phrase, not exactly bearing its expected meaning.)
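
To illustrate the input and output of this task on the Ex. 1 example, a small sketch follows. The framing is my own; in particular, which stacks count as mandatory is my guess (those carrying values from the dialogue act), not a format taken from the paper.

<code python>
# Rough sketch of the generation task on the Ex. 1 example (my own framing).
# Input: the mandatory stacks, i.e. the content that must be conveyed.
mandatory_stacks = [
    "inform/food/Chinese", "inform/name/The Golden Palace", "inform/type/restaurant",
    "inform/area/centre", "inform/area/riverside", "inform/name/East Road",
]

# What the first stage has to produce: an ordered sequence that also contains
# the optional intermediary stacks (here "inform", "inform/area", "inform/near"),
# each of which the second stage then realises as a surface phrase.
full_stack_sequence = [
    "inform/food/Chinese", "inform/name/The Golden Palace", "inform/type/restaurant",
    "inform", "inform/area", "inform/area/centre",
    "inform/area", "inform/area/riverside", "inform/near", "inform/name/East Road",
]
</code>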
  
==== Section 3 - DBN ====
  * Bayesian Network - a special type of graphical model
  * Dynamic Bayesian Network - a BN with a temporal interpretation (in a way similar to an HMM)
  * frame (see Fig. 1 and 2) - "a copy of the BN at a particular time"; corresponds to one stack

  * Note: above Eq. 2, the assumption can be reformulated as: "The optimal sequence of realizations can be obtained from the optimal sequence of stacks."
    * -> 2 models: the first one looking for the optimal sequence of stacks, the second one looking for the realization of each phrase
  * Note about the graphical model approach: It is nice and can be quite illustrative, but since there are so many approximations (under the given assumptions), a much simpler approach would probably be enough (even the authors admit this). Our guess is that the authors had already been using this approach and its tools for some other experiment(s) (graphical models are commonly used e.g. for dialogue manager modelling), so it was convenient for them to try it here as well.
  * Smoothing (3.1) - dividing the stack into head | tail
    * backoff smoothing is used (in contrast to interpolated smoothing), but probably with a certain weighting of the backoff layers
    * order of variable dropping - based on a combination of their importance and the (expected) observed frequency of their values in the data
  * Concept abstraction (3.2) - see the sketch after this list
    * non-enumerable semantic concepts (e.g. names) have sparse values -> they are replaced with 'X' (even when appearing as context)
    * //cardinality// - which is which? - (a name of) a pub has a greater cardinality, because there are lots of (names of) pubs
    * Note: Regarding the data size of their experiments, this might not have been necessary in this case.
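
A minimal sketch of the concept abstraction step (3.2), assuming the stacks are represented as lists as in the earlier examples; the helper function and the set of sparse concepts are my own illustration, not the authors' code.

<code python>
# Hedged sketch of concept abstraction: values of sparse, non-enumerable
# concepts such as names are replaced by a placeholder 'X', both in the
# predicted stack and when the stack appears as context.
NON_ENUMERABLE = {"name"}   # assumed set of sparse concepts; the real list is domain-specific

def abstract_stack(stack):
    """Replace the value sitting on top of a sparse concept by 'X'."""
    out = list(stack)
    for i, concept in enumerate(out[:-1]):
        if concept in NON_ENUMERABLE:
            out[i + 1] = "X"
    return out

print(abstract_stack(["inform", "name", "The Golden Palace"]))  # ['inform', 'name', 'X']
print(abstract_stack(["inform", "area", "centre"]))             # unchanged
</code>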
  
==== Section 4 - Active Learning ====
  * Idea: Given only a little training data initially, the system itself decides which data it would like to have annotated next, in order to train its "weaker" areas (a loop of this kind is sketched after this list).
  * parameter //k// - how many sentences the system asks to have annotated during one iteration
    * in the end, //k// = 1 is considered probably the best...
  * The active learning is only simulated in the experiments, just to get a hint whether it would be meaningful.
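
A small sketch of such a loop; the model interface and the oracle are hypothetical placeholders, and this is the general scheme of pool-based active learning rather than the authors' implementation.

<code python>
# Hedged sketch of the (simulated) active learning loop described above.
# 'model' and 'oracle' are hypothetical stand-ins: oracle(x) returns the gold
# annotation (a human annotator, or simply the held-out corpus when the loop
# is only simulated, as in the paper's experiments).
def active_learning(model, labelled, unlabelled, oracle, k=1, iterations=10):
    for _ in range(iterations):
        model.train(labelled)
        # ask for annotation of the k inputs the current model is least confident about
        unlabelled.sort(key=lambda x: model.confidence(x))
        queries, unlabelled = unlabelled[:k], unlabelled[k:]
        labelled = labelled + [(x, oracle(x)) for x in queries]
    return model
</code>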
  
==== Section 5 - Methodology ====
  * just 2 types of dialogue acts - //inform// and //reject//
  * Amazon's Mechanical Turk used as the annotation platform
  * evaluation - both BLEU and human judgements (5-point scale); good that they do not rely only on BLEU
  * notes about the active learning results:
    * it helps "only" in the range of 40-100 training sentences
  * the results are quite successful
  
==== Overall Notes and Summary ====
Nice ideas, namely:
  * Dynamic Bayesian Networks
  * active learning
  * evaluation not only by BLEU, but also by humans
  
One idea to think about (concerning NLG in general):
Would it be meaningful to put NLG and SLU into a cycle for the sake of (automatic) training?
