Automatic Domain Adaptation for Parsing
David McClosky, Eugene Charniak, Mark Johnson (NAACL HLT 2010)
Presented by: Nathan Green
Report by: Katerina Topilova
Comments
Summary:
Idea – when parsing large amounts of text from diverse domains, it is useful for a parser to be able to generalize to a variety of domains.
The result is a system that proposes linear combinations (mixtures) of parsing models trained on the source corpora.
- Uses regression to predict the f-score a given mixture would achieve on the target text
- Features – CosineTop50, UnkWords, and Entropy
- Training data for the regressor consists of examples of source-domain mixtures and their actual f-scores on target texts (a sketch of the pipeline follows this list)
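As a concrete reading of the pipeline, here is a minimal sketch: enumerate candidate mixtures of the source domains, featurize each one, fit a linear regressor on mixtures whose f-scores were actually measured, and pick the mixture with the highest predicted f-score. All names, the candidate-weight grid, the per-domain statistics, and the use of scikit-learn are our assumptions, not the authors' code.

```python
import itertools
import numpy as np
from sklearn.linear_model import LinearRegression

def candidate_mixtures(n_domains, step=0.25):
    """Enumerate mixture weight vectors over the source domains that sum to 1.
    (The grid granularity is our guess; the paper does not specify one.)"""
    ticks = np.arange(0.0, 1.0 + step, step)
    for combo in itertools.product(ticks, repeat=n_domains):
        if abs(sum(combo) - 1.0) < 1e-9:
            yield np.array(combo)

def featurize(weights, domain_stats):
    """Map a mixture to the three regression features.
    domain_stats is a hypothetical list of precomputed per-domain statistics:
    'cos_top50' = CosineTop50 of that domain vs. the target text,
    'unk_rate'  = fraction of target-text words unseen in that domain."""
    cos = sum(w * s["cos_top50"] for w, s in zip(weights, domain_stats))
    unk = sum(w * s["unk_rate"] for w, s in zip(weights, domain_stats))
    nz = weights[weights > 0]
    ent = -np.sum(nz * np.log(nz))  # entropy of the mixture weights
    return np.array([cos, unk, ent])

def fit_fscore_regressor(X, y):
    """X: feature rows for mixtures evaluated on held-out target texts;
    y: the f-scores those mixtures actually obtained."""
    return LinearRegression().fit(X, y)

def best_mixture(regressor, mixtures, domain_stats):
    """Choose the mixture whose predicted f-score on the target is highest."""
    feats = np.array([featurize(w, domain_stats) for w in mixtures])
    return mixtures[int(np.argmax(regressor.predict(feats)))]
```

The chosen weight vector would then be used to combine the parsing models trained on the individual source corpora.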
Evaluation – two scenarios, out-of-domain evaluation and in-domain evaluation
Baselines – Uniform, Self-Trained Uniform, Fixed Set: WSJ, Best Single Corpus, Best Seen, Best Overall
Feature selection – round-robin tuning scenario
Results:
- Self-trained corpora are beneficial
- This model is the best non-oracle system in both scenarios
- Only 0.3% worse than the Best Seen baseline for out-of-domain
- Within 0.6% of the Best Seen baseline for in-domain
- 0.7% better than the Best Overall model
What we dislike about the paper:
- The CosineTop50 feature is not sufficiently explained and admits several readings (see the sketch after this list):
  - we could take the top 50 words of the first corpus, look up their frequencies in the second corpus, and compare the two frequency vectors
  - we could take the top 50 words of each corpus and compare frequencies only where the two lists overlap
  - we could find the top 50 words that the two corpora have in common and then compare their frequencies
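To show how much these readings differ, here is a sketch of all three (our code, not the paper's; freqs1 and freqs2 are hypothetical word-frequency dicts for the two corpora, and the ranking of "common" words in the third reading is itself a guess — we rank by summed frequency):

```python
import math
from collections import Counter

def cosine(f1, f2, keys):
    """Cosine similarity of two frequency dicts restricted to the given words."""
    dot = sum(f1.get(k, 0) * f2.get(k, 0) for k in keys)
    n1 = math.sqrt(sum(f1.get(k, 0) ** 2 for k in keys))
    n2 = math.sqrt(sum(f2.get(k, 0) ** 2 for k in keys))
    return dot / (n1 * n2) if n1 and n2 else 0.0

def top50(freqs):
    return [w for w, _ in Counter(freqs).most_common(50)]

def reading_1(freqs1, freqs2):
    # Top 50 words of corpus 1; compare their frequencies in both corpora.
    return cosine(freqs1, freqs2, top50(freqs1))

def reading_2(freqs1, freqs2):
    # Top 50 of each corpus; compare only where the two lists overlap.
    return cosine(freqs1, freqs2, set(top50(freqs1)) & set(top50(freqs2)))

def reading_3(freqs1, freqs2):
    # Top 50 words the corpora share, ranked here by summed frequency.
    common = {w: freqs1[w] + freqs2[w] for w in set(freqs1) & set(freqs2)}
    return cosine(freqs1, freqs2, top50(common))
```

The three functions can return noticeably different similarities for the same pair of corpora, which is why the ambiguity matters.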
- The Entropy feature could also have been explained more fully. The reading group agreed that it is the entropy of the distribution of mixture weights over the source domains, with mixtures whose entropy is closer to the maximum considered better.
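Under this reading the feature is just the Shannon entropy of the weight vector; a tiny worked example of our interpretation (not a definition taken from the paper):

```python
import math

def mixture_entropy(weights):
    """Shannon entropy of a mixture distribution over source domains."""
    return -sum(w * math.log(w) for w in weights if w > 0)  # 0*log(0) taken as 0

print(mixture_entropy([0.25, 0.25, 0.25, 0.25]))  # ~1.386 = ln(4), the maximum
print(mixture_entropy([1.0, 0.0, 0.0, 0.0]))      # 0.0, a single-domain mixture
```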
- Only GENIA and SWBD are really different from WSJ; the other corpora perform well on WSJ anyway, so more domains would be better
- Overall, it would be very hard to implement this system based solely on this article
What we like about the paper:
- The results are good: the system beats all non-oracle baselines in both evaluation scenarios