===== Automatic Domain Adaptation for Parsing =====

David McClosky, Eugene Charniak, Mark Johnson (NAACL-HLT 2010)

Presented by: Nathan Green

Report by: Katerina Topilova

===== Comments =====

==== Summary ====

**Idea** – when parsing large amounts of text from diverse domains, it is useful for a parser to be able to generalize to a variety of target domains. The paper builds a system that proposes linear combinations of parsing models trained on the available source corpora.

  * The system uses regression to predict the f-score of a candidate source-domain mixture on the target text
  * Features: CosineTop50, UnkWords and Entropy
  * Training data for the regressor consists of example source-domain mixtures paired with their actual f-scores on target texts

**Evaluation** – two scenarios: out-of-domain evaluation and in-domain evaluation

**Baselines** – Uniform, Self-Trained Uniform, Fixed Set: WSJ, Best Single Corpus, Best Seen, Best Overall

**Feature selection** – round-robin tuning scenario

**Results**:

  * Adding self-trained corpora is beneficial
  * The proposed model is the best non-oracle system in both scenarios
  * Only 0.3% worse than Best Seen in the out-of-domain scenario
  * Within 0.6% of Best Seen in the in-domain scenario
  * 0.7% better than the Best Overall model

==== What we dislike about the paper ====

  * The CosineTop50 feature is not explained in enough detail and allows several interpretations (a sketch of the first one is given at the end of this report):
    - take the top 50 words of the first corpus together with their frequencies, look up the frequencies of the same words in the second corpus, and compare the two frequency vectors
    - take the top 50 words of each corpus and compare the frequencies only where the two lists overlap
    - find the top 50 words that the two corpora have in common and compare their frequencies
  * The Entropy feature could also have been explained in more detail
    - at the reading group we agreed that it is the entropy of the distribution of source-domain mixture weights, with mixtures whose entropy is closer to the maximum considered better (see the sketch at the end of this report)
  * Only GENIA and SWBD are really different domains; the other target corpora are already parsed well by WSJ-trained models, so evaluating on more (and more varied) domains would be better
  * Overall, it would be very hard to reimplement this system based solely on the article

==== What we like about the paper ====

  * The results are good
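
==== Sketches of how we understood the system ====

To make the main idea concrete, here is a minimal sketch of the regression-based mixture selection as we understood it from the summary above. Everything in it – the toy feature function, the linear regressor, the grid of candidate mixtures – is our assumption, not the authors' actual setup.

<code python>
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def candidate_mixtures(n_sources, step=0.25):
    """Enumerate source-domain weight vectors that sum to 1 (a coarse grid)."""
    ticks = np.arange(0.0, 1.0 + step, step)
    for combo in itertools.product(ticks, repeat=n_sources):
        if abs(sum(combo) - 1.0) < 1e-9:
            yield np.array(combo)

def mixture_features(weights, similarity_to_target):
    """Toy stand-in for the paper's features (CosineTop50, UnkWords, Entropy):
    entropy of the mixture weights plus a mixture-weighted similarity score."""
    nonzero = weights[weights > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    similarity = float(weights @ similarity_to_target)
    return np.array([entropy, similarity])

n_sources = 3
rng = np.random.default_rng(0)
# Per-source similarity of each source corpus to the target text
# (in the paper this role is played by features such as CosineTop50).
similarity_to_target = rng.random(n_sources)

# Training data for the regressor: example source-domain mixtures paired with
# their actual f-scores on target texts (random placeholders here).
mixtures = list(candidate_mixtures(n_sources))
X_train = np.array([mixture_features(w, similarity_to_target) for w in mixtures])
y_train = rng.random(len(mixtures))

regressor = LinearRegression().fit(X_train, y_train)

# Prediction step: score every candidate mixture (here on the same toy
# features) and pick the one with the highest predicted f-score.
predicted = regressor.predict(X_train)
best = mixtures[int(np.argmax(predicted))]
print("chosen source-domain mixture:", best)
</code>

In the paper the regressor is trained on mixtures whose f-scores were actually measured on held-out target texts; the random placeholders above only stand in for those measurements.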
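
For CosineTop50, the first interpretation listed in the dislikes (top 50 words of one corpus, their frequencies looked up in the other) could look roughly like this; it is only our reading of the feature, not the paper's definition:

<code python>
from collections import Counter
import math

def cosine_top50(corpus_a, corpus_b, k=50):
    """One possible reading of CosineTop50: take the k most frequent words of
    corpus_a, look up their relative frequencies in corpus_b, and return the
    cosine similarity of the two frequency vectors."""
    freq_a = Counter(corpus_a)
    freq_b = Counter(corpus_b)
    top_words = [w for w, _ in freq_a.most_common(k)]
    vec_a = [freq_a[w] / len(corpus_a) for w in top_words]
    vec_b = [freq_b[w] / len(corpus_b) for w in top_words]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy usage: corpora as flat lists of tokens.
print(cosine_top50("the cat sat on the mat".split(),
                   "the dog sat on the log".split(), k=5))
</code>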
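
And the Entropy feature, as we agreed at the reading group (entropy of the source-domain mixture weights, with values near the maximum being better), reduces to a few lines – again our assumption rather than the paper's code:

<code python>
import math

def mixture_entropy(weights):
    """Entropy (in nats) of a source-domain mixture: maximal for the uniform
    mixture, zero when all weight sits on a single source corpus."""
    return -sum(w * math.log(w) for w in weights if w > 0)

print(mixture_entropy([0.5, 0.25, 0.25]))   # mixed sources: higher entropy
print(mixture_entropy([1.0, 0.0, 0.0]))     # single corpus: 0.0
</code>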