====== Phrase-based Statistical Language Generation using Graphical Models and Active Learning ======

// François Mairesse, Milica Gašić, Filip Jurčíček, Simon Keizer, Blaise Thomson, Kai Yu, Steve Young //

ACL 2010 [[http://aclweb.org/anthology-new/P/P10/P10-1157.pdf]]

Presented by Ondřej Dušek, report by Honza Václ

==== Pre-sent Exercises ====

The following exercises were sent to the mailing list a few days in advance and were meant to make the readers think about the semantic stack representation used in the paper. They were not answered in much detail during the lecture; we only went through them to make sure the basic concepts were clear. The following solutions are therefore mostly my own interpretation and are not guaranteed to be fully correct.

**Ex. 1)** Try to think of a semantic stack representation and a dialogue act for the following sentence:

"For Chinese food, The Golden Palace restaurant is the best one in the centre of the city. It is located on the side of the river near East Road."

The solution could look something like this (inspired by Table 1):

^ surface form | for Chinese food | The Golden Palace | restaurant | is the best one | in the | centre of the city | It is located | on the side of the river | near | East Road |
^ sem. stack | Chinese | The Golden Palace | restaurant | | | centre | | riverside | | East Road |
^ | food | name | type | | area | area | area | area | near | name |
^ | inform | inform | inform | inform | inform | inform | inform | inform | inform | inform |

Note: The semantic stack for a given surface phrase is always the whole column (but **excluding** the "surface form" row).

**Ex. 2)** Try to think of a surface realization for the following dialogue act:

reject(type=placetoeat,eattype=restaurant,pricerange=cheap,area=citycentre,food=Indian)

Solution (one of many possible): "I am sorry, but I have no information about any cheap Indian restaurant in the centre of the city."
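To make the alignment in Ex. 1 concrete, here is a minimal Python sketch of the annotation as phrase/stack pairs. The data layout and the helper name are my own illustration, not the paper's format; each stack is read bottom-up (dialogue act type first).

```python
# Hypothetical encoding of the Ex. 1 annotation: each surface phrase is
# paired with its semantic stack, listed from the bottom of the stack up
# (so the dialogue act type "inform" comes first in every stack).
annotation = [
    ("for Chinese food",          ["inform", "food", "Chinese"]),
    ("The Golden Palace",         ["inform", "name", "The Golden Palace"]),
    ("restaurant",                ["inform", "type", "restaurant"]),
    ("is the best one",           ["inform"]),
    ("in the",                    ["inform", "area"]),
    ("centre of the city",        ["inform", "area", "centre"]),
    ("It is located",             ["inform", "area"]),
    ("on the side of the river",  ["inform", "area", "riverside"]),
    ("near",                      ["inform", "near"]),
    ("East Road",                 ["inform", "name", "East Road"]),
]

def surface(annotation):
    """Concatenate the aligned phrases back into the surface realization."""
    return " ".join(phrase for phrase, _ in annotation)
```

Note how the empty cells of the table simply become shorter stacks (e.g. "is the best one" carries only the act type).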
Note: Ondřej admitted that the syntax of the dialogue act in Ex. 2 is not exactly the one used throughout the paper; it is taken directly from the (publicly available) corpus used by the authors.

==== Short Introduction to NLG ====

* The input differs a lot from system to system, but it is usually already structured.
* Common scheme of an NLG (natural language generation) system (simplified):
  * content planning
  * content structuring
  * surface realization

(This paper is concerned mainly with the second and the third step.)

* Usual place of NLG in a dialogue system:
  * input -> SLU (spoken language understanding) -> dialogue manager -> **NLG** -> TTS (text to speech) -> output

==== Section 1 ====

The BAGEL system employs statistics earlier in the NLG pipeline than any of the previous systems, namely already in the generation phase itself. Previous systems used statistics either only for reranking the generated outputs, or for choosing among generation rules (or templates), which were themselves handcrafted.

==== Section 2 - Semantic Representation ====

* "Input": a set of mandatory stacks.
* Task: creating the surface form, i.e. inserting the optional intermediary stacks between the mandatory stacks and ordering everything.
* A semantic stack (corresponding to a //phrase// in the surface form) is considered an atomic unit.
* Note about the annotation form: It is inspired by a notation common in dialogue systems and is quite bound to the domain of informational dialogues (finding a restaurant, booking a flight, etc.). Many other language phenomena would be very difficult to capture this way, e.g. quantifiers, ranking... (Even the word "best" in the exercise is greatly simplified: it is used just as part of a phrase and does not really carry its expected meaning.)
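Since the input is a set of mandatory stacks derived from a dialogue act, a small sketch may help show how an act string like the one in Ex. 2 maps to such stacks. The function below is a hypothetical helper of my own (not the authors' code), assuming one mandatory stack per slot-value pair with the act type at the bottom:

```python
import re

def parse_dialogue_act(act):
    """Split an act like 'reject(type=placetoeat,food=Indian)' into its
    act type and one mandatory semantic stack per slot-value pair
    (act type at the bottom, then slot name, then value)."""
    m = re.match(r"(\w+)\((.*)\)$", act)
    act_type, body = m.group(1), m.group(2)
    slots = dict(pair.split("=", 1) for pair in body.split(",")) if body else {}
    # one mandatory semantic stack per slot-value pair
    stacks = [[act_type, slot, value] for slot, value in slots.items()]
    return act_type, stacks
```

The generator's task would then be to order these mandatory stacks and interleave the optional intermediary ones before realizing each stack as a phrase.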
==== Section 3 - DBN ====

* Bayesian network (BN) - a special type of graphical model
* Dynamic Bayesian network (DBN) - a BN unrolled over time (in a way similar to an HMM)
* Frame (see Fig. 1 and 2) - "a copy of the BN at a particular time step"; corresponds to one stack
* Note: above Eq. 2, the assumption can be reformulated as: "The optimal sequence of realizations can be obtained from the optimal sequence of stacks."
  * -> 2 models: the first searches for the optimal sequence of stacks, the second for the realization of each phrase
* Note about the graphical model approach: It is nice and can be quite illustrative, but given how many approximations are made (under the stated assumptions), a much simpler approach would probably suffice (even the authors admit this). Our guess is that the authors had already been using this approach and its tools for some other experiment(s) (graphical models are, for example, common in dialogue manager modeling), so it was convenient for them to try it here as well.
* Smoothing (3.1) - splitting the stack into head | tail
  * backoff smoothing is used (as opposed to interpolated smoothing), but probably with certain weights for the backoff layers
  * the order in which variables are dropped is based on a combination of their importance and the (expected) observed frequency of their values in the data
* Concept abstraction (3.2)
  * non-enumerable semantic concepts (e.g. names) have sparse values -> they are replaced with 'X' (even when appearing as context)
  * //cardinality// - which is which? - (the name of) a pub has a greater cardinality, because there are many (names of) pubs
  * Note: Given the data size of their experiments, this might not have been necessary in this case.

==== Section 4 - Active Learning ====

* Idea: Starting with little training data, the system itself decides which data it would like to have annotated next in order to train its "weaker" areas.
* Parameter //k// - how many sentences the system asks to have annotated during one iteration
  * in the end, //k// = 1 is considered probably the best...
* The active learning is only simulated in the experiments, just to get a hint of whether it would be worthwhile.

==== Section 5 - Methodology ====

* just 2 types of dialogue acts - //inform// and //reject//
* Amazon's Mechanical Turk as the annotation platform
* evaluation - both BLEU and human judgments (5-point scale) - credit to the authors for not relying on BLEU alone
* notes on the active learning results:
  * it helps "only" in the range of 40-100 sentences
  * the results are quite successful

==== Overall Notes and Summary ====

Nice ideas, namely:
* dynamic Bayesian networks
* active learning
* evaluation not only by BLEU, but also by humans

One idea to think about (concerning NLG in general): Would it be meaningful to cycle NLG with SLU for the sake of (automatic) training?
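Returning to the active learning of Section 4: the simulated loop can be sketched as pool-based selection with certainty-based sampling. Everything below is a toy illustration under my own assumptions (a numeric "confidence" score stands in for the generator's probability of an unseen sentence); it is not the authors' implementation.

```python
def simulate_active_learning(pool, initial, score, train, k=1, rounds=3):
    """Pool-based active learning: in each round, train on the labelled set,
    score the unlabelled pool, and move the k lowest-confidence examples into
    the labelled set (simulating a request for human annotation)."""
    labelled = list(initial)
    unlabelled = [x for x in pool if x not in labelled]
    for _ in range(rounds):
        if not unlabelled:
            break
        model = train(labelled)
        # ask for annotation of the k examples the model is least sure about
        unlabelled.sort(key=lambda x: score(model, x))
        labelled.extend(unlabelled[:k])
        unlabelled = unlabelled[k:]
    return labelled

# Toy demo: the "model" is just the mean of the labelled numbers, and
# "confidence" is closeness to that mean, so the loop keeps requesting
# the most atypical remaining example first.
train = lambda data: sum(data) / len(data)
score = lambda mean, x: -abs(x - mean)
result = simulate_active_learning(pool=[1, 2, 3, 10, 11, 50],
                                  initial=[1, 2],
                                  score=score, train=train, k=1, rounds=3)
```

With //k// = 1 the loop requests a single sentence per iteration, which matches the setting the report singles out as probably the best.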