===== Automatic Domain Adaptation for Parsing =====

David McClosky, Eugene Charniak, Mark Johnson (NAACL-HLT 2010)

Presented by: Nathan Green

Report by: Katerina Topilova

===== Comments =====

==== Summary ====

**Idea** – when parsing large amounts of text from diverse domains, it is useful for a parser to be able to generalize to a variety of target domains. The paper builds a system that proposes linear combinations of parsing models trained on the available source corpora.

  * The system uses regression to predict the f-score of a candidate source-domain mixture on the target text
  * Features: CosineTop50, UnkWords and Entropy
  * Training data for the regressor consists of example source-domain mixtures paired with their actual f-scores on target texts

**Evaluation** – two scenarios: out-of-domain evaluation and in-domain evaluation

**Baselines** – Uniform, Self-Trained Uniform, Fixed Set: WSJ, Best Single Corpus, Best Seen, Best Overall

**Feature selection** – round-robin tuning scenario

**Results**:

  * Adding self-trained corpora is beneficial
  * The proposed model is the best non-oracle system in both scenarios
  * Only 0.3% worse than Best Seen in the out-of-domain scenario
  * Within 0.6% of Best Seen in the in-domain scenario
  * 0.7% better than the Best Overall model

==== What we dislike about the paper ====

  * The CosineTop50 feature is not explained in enough detail and allows several interpretations (a sketch of the first one is given at the end of this report):
    - take the top 50 words of the first corpus together with their frequencies, look up the frequencies of the same words in the second corpus, and compare the two frequency vectors
    - take the top 50 words of each corpus and compare the frequencies only where the two lists overlap
    - find the top 50 words that the two corpora have in common and compare their frequencies
  * The Entropy feature could also have been explained in more detail
    - at the reading group we agreed that it is the entropy of the distribution of source-domain mixture weights, with mixtures whose entropy is closer to the maximum considered better (see the sketch at the end of this report)
  * Only GENIA and SWBD are really different domains; the other target corpora are already parsed well by WSJ-trained models, so evaluating on more (and more varied) domains would be better
  * Overall, it would be very hard to reimplement this system based solely on the article

==== What we like about the paper ====

  * The results are good
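
==== Sketches of how we understood the system ====

To make the main idea concrete, here is a minimal sketch of the regression-based mixture selection as we understood it from the summary above. Everything in it – the toy feature function, the linear regressor, the grid of candidate mixtures – is our assumption, not the authors' actual setup.

<code python>
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def candidate_mixtures(n_sources, step=0.25):
    """Enumerate source-domain weight vectors that sum to 1 (a coarse grid)."""
    ticks = np.arange(0.0, 1.0 + step, step)
    for combo in itertools.product(ticks, repeat=n_sources):
        if abs(sum(combo) - 1.0) < 1e-9:
            yield np.array(combo)

def mixture_features(weights, similarity_to_target):
    """Toy stand-in for the paper's features (CosineTop50, UnkWords, Entropy):
    entropy of the mixture weights plus a mixture-weighted similarity score."""
    nonzero = weights[weights > 0]
    entropy = -np.sum(nonzero * np.log(nonzero))
    similarity = float(weights @ similarity_to_target)
    return np.array([entropy, similarity])

n_sources = 3
rng = np.random.default_rng(0)
# Per-source similarity of each source corpus to the target text
# (in the paper this role is played by features such as CosineTop50).
similarity_to_target = rng.random(n_sources)

# Training data for the regressor: example source-domain mixtures paired with
# their actual f-scores on target texts (random placeholders here).
mixtures = list(candidate_mixtures(n_sources))
X_train = np.array([mixture_features(w, similarity_to_target) for w in mixtures])
y_train = rng.random(len(mixtures))

regressor = LinearRegression().fit(X_train, y_train)

# Prediction step: score every candidate mixture (here on the same toy
# features) and pick the one with the highest predicted f-score.
predicted = regressor.predict(X_train)
best = mixtures[int(np.argmax(predicted))]
print("chosen source-domain mixture:", best)
</code>

In the paper the regressor is trained on mixtures whose f-scores were actually measured on held-out target texts; the random placeholders above only stand in for those measurements.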
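
For CosineTop50, the first interpretation listed in the dislikes (top 50 words of one corpus, their frequencies looked up in the other) could look roughly like this; it is only our reading of the feature, not the paper's definition:

<code python>
from collections import Counter
import math

def cosine_top50(corpus_a, corpus_b, k=50):
    """One possible reading of CosineTop50: take the k most frequent words of
    corpus_a, look up their relative frequencies in corpus_b, and return the
    cosine similarity of the two frequency vectors."""
    freq_a = Counter(corpus_a)
    freq_b = Counter(corpus_b)
    top_words = [w for w, _ in freq_a.most_common(k)]
    vec_a = [freq_a[w] / len(corpus_a) for w in top_words]
    vec_b = [freq_b[w] / len(corpus_b) for w in top_words]
    dot = sum(a * b for a, b in zip(vec_a, vec_b))
    norm_a = math.sqrt(sum(a * a for a in vec_a))
    norm_b = math.sqrt(sum(b * b for b in vec_b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

# Toy usage: corpora as flat lists of tokens.
print(cosine_top50("the cat sat on the mat".split(),
                   "the dog sat on the log".split(), k=5))
</code>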
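
And the Entropy feature, as we agreed at the reading group (entropy of the source-domain mixture weights, with values near the maximum being better), reduces to a few lines – again our assumption rather than the paper's code:

<code python>
import math

def mixture_entropy(weights):
    """Entropy (in nats) of a source-domain mixture: maximal for the uniform
    mixture, zero when all weight sits on a single source corpus."""
    return -sum(w * math.log(w) for w in weights if w > 0)

print(mixture_entropy([0.5, 0.25, 0.25]))   # mixed sources: higher entropy
print(mixture_entropy([1.0, 0.0, 0.0]))     # single corpus: 0.0
</code>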