user:zeman:parser-adaptation (last revision 2008/07/09 17:12 by zeman; earlier revision 2008/03/14 imported from the UMIACS NLP Wiki)
Yet older notes exist at [[https://
+ | |||
+ | < | ||
+ | cd $PADAPT</ | ||
+ | |||
+ | This thing is not (yet) under version control. | ||
+ | |||
+ | There is a '' | ||
+ | |||
+ | ===== To do ===== | ||
+ | |||
  * Classifier combination. Merge Charniak n-best lists from gloss and delex (and acquis-gloss and acquis-delex). Either let them vote, or let the reranker select from the merged list.
  * Find a way to estimate the trustworthiness of parses in self-training. Charniak’s rank in the n-best list? Voting of three or more parsers? Some sort of sentence classifier that estimates how easy a sentence is for Charniak to parse? Maybe start with short sentences only? If we can reliably say which of Charniak’s parses can be trusted, we can restrict self-training to those and see what happens.
    * This immediately follows the bootstrapping experiment we discussed earlier. It failed, but maybe it would not have, had we been able to distinguish good from bad examples.
  * Reverse the gloss experiment. “Translate” (gloss) Danish training data to Swedish, train a parser on it, test the parser on Swedish.
  * Refine glossing. Translate the N (100?) most frequent words. Of the rest, translate only M (4? 5?)-letter suffixes. Suffixation is another way of delexicalization, and it would also be interesting to see how it affects the accuracy of monolingual parsing.
  * Explore the learning curve of the adapted parsers, as one of the reviewers suggested. Gradually increase the amount of Danish training data and monitor the changes in Swedish parsing accuracy.
  * Test other languages with different degrees of relatedness. No new research, thus low priority; on the other hand, this could answer the doubts one of the reviewers had.
  * Test a dependency parser in a similar setting. After all, the treebanks we work with are of dependency origin.
  * As the reviewers suggested, do more error analysis. What is the nature of the most frequent errors that the adapted parser makes and the Danish parser does not? Are they caused by lexical divergences?
  * Use Swedish training data instead of Acquis. Strip the structure from it but keep the gold-standard POS tags. This experiment could show the impact of tagging errors. It provides a different domain, too.
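The classifier-combination item could start with naive exact-match voting over the systems' 1-best outputs. A minimal sketch, assuming hypothetical sentence-aligned files (one bracketed parse per line per system); real n-best voting would need more bookkeeping:

```shell
# vote3: exact-match majority vote over three sentence-aligned 1-best files.
# Emits the parse proposed by at least two systems; when all three disagree,
# falls back to the first system. File names below are hypothetical.
vote3() {
  paste "$1" "$2" "$3" |
  while IFS="$(printf '\t')" read -r a b c; do
    if [ "$b" = "$c" ]; then
      printf '%s\n' "$b"   # b and c agree (also covers three-way agreement)
    else
      printf '%s\n' "$a"   # a=b, a=c, or no agreement: a wins either way
    fi
  done
}
# e.g.: vote3 parse.gloss.txt parse.delex.txt parse.acquis.txt > parse.voted.txt
```

Exact string match on whole parses is a crude agreement criterion, but it is enough to see whether the systems' errors are complementary.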
+ | |||
+ | ====== Notes copied from the UMIACS wiki ====== | ||
+ | |||
+ | This page describes an experiment conducted by [[User: | ||
+ | |||
+ | =====Paths===== | ||
+ | |||
+ | Note: I am going to move around some stuff, especially that in my home folder. | ||
+ | |||
+ | * '' | ||
+ | * ''/ | ||
+ | * ''/ | ||
+ | * ''/ | ||
+ | |||
+ | =====Paper plans===== | ||
+ | |||
+ | [[http:// | ||
+ | |||
+ | Note: [[http:// | ||
+ | |||
+ | * Motivation | ||
+ | * Using related languages | ||
+ | * Similarities and differences | ||
+ | * Mapping POS tag set | ||
+ | * Evaluation | ||
+ | * Experiments | ||
+ | * Baseline 1: train Danish treebank, test Swedish treebank | ||
+ | * Experiment 1: Danish glosses | ||
+ | * Reranking | ||
+ | * Delexicalized | ||
+ | * Classifier combo | ||
+ | * Bootstrapping | ||
+ | * Error analysis | ||
+ | * How useful is it for a new pair of languages? | ||
+ | |||
+ | |||
+ | =====Agenda===== | ||
+ | |||
+ | ====Done==== | ||
+ | |||
+ | ===No tricks, no Acquis=== | ||
+ | |||
+ | * Get [[Danish treebank]]. | ||
+ | * Normalize it (structural annotation guidelines). | ||
+ | * Coordinations | ||
+ | * Prepositional phrases | ||
+ | * Subordinated clauses | ||
+ | * Punctuation (final, quotes, brackets...) | ||
+ | * Numbers, dates, personal names | ||
+ | * Auxiliaries | ||
+ | * Convert it to the Penn Treebank format. | ||
+ | * Split training, development and evaluation data. | ||
+ | * Train Danish Charniak and test it on dev data. | ||
+ | * Train Danish Brown and test it on dev data. | ||
+ | |||
+ | * Get [[Swedish treebank]]. | ||
+ | * Normalize it (structural annotation guidelines). | ||
+ | * Convert it to the Penn Treebank format. | ||
+ | * Split training, development and evaluation data. | ||
+ | * Train Swedish Charniak and test it on dev data. | ||
+ | * Train Swedish Brown and test it on dev data. | ||
+ | * Get Swedish Charniak learning curve. Train on 50, 100, 200, 500, 1000, 2000, 5000, ALL Swedish sentences. Test on Swedish dev data. | ||
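The learning-curve runs can be driven by a small loop; a sketch assuming the Swedish training trees sit one per line in a single file (the file name and the training/evaluation commands are placeholders, not the actual wrapper used here):

```shell
# make_curve_subsets: cut growing prefixes of the training data for the
# learning curve. $1 = training file with one tree per line (assumed name).
make_curve_subsets() {
  for n in 50 100 200 500 1000 2000 5000; do
    head -n "$n" "$1" > "$1.$n"
    # here: train Charniak on "$1.$n", test on Swedish dev data,
    # and record the F-score against $n
  done
}
# e.g.: make_curve_subsets train.sv.penn   # writes train.sv.penn.50 etc.
```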
+ | |||
+ | * Test Danish Charniak on Swedish dev data. | ||
+ | * Test Danish Brown on Swedish dev data. | ||
+ | |||
+ | ===Acquis=== | ||
+ | |||
+ | * Get big Swedish data (JRC [[Acquis]] corpus). | ||
+ | * Extract plain text from the TEI XML. | ||
+ | * Tokenize it. | ||
+ | * Find sentence boundaries. | ||
+ | * Tag it morphologically. | ||
+ | * The [[Hajič tagger]] requires that the input be converted to CSTS format and ISO 8859-1 encoding. | ||
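The encoding half of that conversion is a one-liner; a sketch using iconv, where //TRANSLIT (a GNU iconv feature) makes characters missing from Latin-1 degrade to approximations instead of aborting the conversion. The CSTS markup itself needs a separate converter:

```shell
# to_latin1: re-encode UTF-8 stdin as ISO 8859-1 for the tagger.
# //TRANSLIT approximates characters that Latin-1 cannot represent.
to_latin1() {
  iconv -f UTF-8 -t 'ISO-8859-1//TRANSLIT'
}
# e.g.: to_latin1 < acquis.sv.utf8.txt > acquis.sv.latin1.txt
```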
+ | |||
+ | * Parse Swedish Acquis using Danish Brown. | ||
+ | * Train Charniak on parsed Swedish Acquis. | ||
+ | * Test it on Swedish dev data. | ||
+ | * Modify: train on parsed Swedish Acquis + Danish Treebank. | ||
+ | * Modify: train Charniak as above, combine with reranker trained on Danish Treebank. | ||
+ | |||
==== To do ====

  * Unify the tag sets of both treebanks (da, sv) and of the output of the Hajič tagger.
  * Repeat the whole process above with various methods of making the Swedish data look like Danish:
    * Regular expressions for changing typical spelling patterns.
    * Use alignment to find similar (minimal edit distance) words in both languages, then replace the words.
    * Use morphology. Replace terminals by preterminals,
    * Use morphology but do not replace terminals by preterminals. Instead, only try to impose your tags on the parser. (The Stanford parser should be able to take them. I am not sure about the Charniak parser.)
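The spelling-pattern idea could begin with the few systematic letter correspondences between the two orthographies; the rules below are illustrative guesses (ä→æ and ö→ø are regular correspondences, ck→kk is rougher), not a vetted list:

```shell
# sv2da_spelling: rewrite a few typical Swedish spellings as Danish ones.
# The rule set is illustrative; a real list would be tuned on aligned data.
sv2da_spelling() {
  sed -e 's/ä/æ/g' -e 's/Ä/Æ/g' \
      -e 's/ö/ø/g' -e 's/Ö/Ø/g' \
      -e 's/ck/kk/g'
}
# e.g.: sv2da_spelling < dtest.sv.txt > dtest.sv2da.txt
```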
+ | |||
+ | =====Baselines without Acquis===== | ||
+ | |||
+ | See [[Danish Dependency Treebank]] for baseline Danish results. Summary: Charniak achieves F = < | ||
+ | |||
+ | See [[Talbanken05]] for baseline Swedish results. Summary: Charniak achieves <span style=' | ||
+ | |||
+ | ====Train Danish, parse Swedish==== | ||
+ | |||
+ | //Note:// This means that most words are unknown. | ||
+ | |||
+ | ===Charniak=== | ||
+ | |||
+ | < | ||
+ | $PARSINGROOT/ | ||
+ | $PARSINGROOT/ | ||
+ | |||
P = 37.03 %, R = 35.11 %, F = 36.04 %

=== Brown ===

<code>
$PARSINGROOT/
</code>

P = 38.23 %, R = 35.37 %, F = 36.74 %

==== Glosses ====

One way of making unknown words known is to translate Swedish to Danish, parse the translated text, and “restuff” the trees with the original Swedish words. It is not easy to produce a good translation, and if the translated text does not have the same number of words as the source (which is normal), we will not know how to retranslate the parse back to Swedish. However, Danish and Swedish are closely related, so we can derive a glossary from aligned corpora, perform a word-by-word translation,
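The word-by-word step can be sketched with awk, assuming a hypothetical tab-separated glossary (Swedish word, Danish gloss) distilled from the aligned corpora. Unknown words pass through unchanged, so the token count, and hence the mapping back to Swedish, is preserved:

```shell
# gloss_sv2da: replace each Swedish token with its Danish gloss when the
# glossary knows it; keep it unchanged otherwise. $1 = glossary file with
# lines of the form "swedish<TAB>danish" (hypothetical format).
gloss_sv2da() {
  awk -F'\t' '
    NR == FNR { gloss[$1] = $2; next }        # first file: load the glossary
    {
      n = split($0, w, " ")
      if (n == 0) { print ""; next }          # keep blank lines
      for (i = 1; i <= n; i++)
        printf "%s%s", (w[i] in gloss ? gloss[w[i]] : w[i]), (i < n ? " " : "\n")
    }' "$1" -
}
# e.g.: gloss_sv2da glossary.sv-da.tsv < dtest.sv.txt > dtest.dagloss.txt
```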
+ | |||
+ | The following commands: | ||
+ | - translate Swedish test data word-by-word to Danish | ||
+ | - parse translated test data | ||
+ | - restuff parses with original Swedish words | ||
+ | - evaluate restuffed parses | ||
+ | |||
=== Charniak ===

<code>
  > dtest.dagloss.txt
$PARSINGROOT/
  -o dtest.dagloss.daec.penn
$PARSINGROOT/
  > dtest.dagloss.daec.sv.penn
$PARSINGROOT/
  | tee result.dagloss.daec.txt
</code>
+ | |||
+ | P = 50.03 %, R = 55.19 %, F = < | ||
+ | |||
=== Brown ===

<code>
  -o dtest.dagloss.daecmj.penn
$PARSINGROOT/
  > dtest.dagloss.daecmj.sv.penn
$PARSINGROOT/
  | tee result.dagloss.daecmj.txt
</code>
+ | |||
+ | P = 51.01 %, R = 55.25 %, F = < | ||
+ | |||
==== Delexicalized data ====

Another way of getting rid of unknown words is to replace words with their tags. We use Danish/
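Delexicalization itself is mechanical once the text is tagged; a sketch assuming a hypothetical word/TAG token format (the real corpora may encode tags differently):

```shell
# delex: replace every token by its POS tag, i.e. turn "word/TAG" into "TAG".
# Assumes the tag follows the last slash of each token (hypothetical format).
delex() {
  awk '{
    n = split($0, t, " ")
    if (n == 0) { print ""; next }            # keep blank lines
    for (i = 1; i <= n; i++) {
      sub(/^.*\//, "", t[i])                  # keep only the part after "/"
      printf "%s%s", t[i], (i < n ? " " : "\n")
    }
  }'
}
# e.g.: delex < dtest.tagged.txt > dtest.delex.txt
```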
+ | |||
+ | See [[Danish Dependency Treebank]] and [[Talbanken05]], | ||
+ | |||
+ | Again, we perform the following steps: | ||
+ | |||
+ | | | < | ||
+ | | 1. | Delexicalize Danish training data. | Delexicalize Danish training data. | | ||
+ | | 2. | Delexicalize Swedish test data. | Delexicalize Swedish Acquis. | | ||
+ | | 3. | Train Charniak on delexicalized Danish training data. | Train Charniak on delexicalized Danish training data. | | ||
+ | | 4. | Use resulting parser to parse delexicalized Swedish test data. | Use resulting parser to parse delexicalized Swedish Acquis. | | ||
+ | | 5. | Restuff trees with Swedish words. | Restuff Acquis trees with Swedish words. | | ||
+ | | 6. | Evaluate | Train Charniak on restuffed Acquis trees. | | ||
+ | | 7. | | Use resulting parser to parse (undelexicalized) Swedish test data. | | ||
+ | | 8. | | Evaluate. | | ||
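The restuffing step can be sketched in awk: walk the leaves of each delexicalized tree left to right and swap in the original words. It assumes one bracketed tree per line on stdin and the matching sentences, one per line, in a separate file (hypothetical names):

```shell
# restuff: put the original words back into delexicalized Penn-style trees.
# $1 = original sentences, one space-separated sentence per line;
# stdin = parsed trees, one bracketed tree per line, leaves equal to tags.
restuff() {
  awk '
    NR == FNR { sent[FNR] = $0; next }        # first file: original sentences
    {
      split(sent[FNR], w, " "); k = 0
      for (i = 1; i <= NF; i++)
        if ($i ~ /^[^()]+\)+$/) {             # leaf token: word + close parens
          leaf = $i; sub(/\)+$/, "", leaf)
          parens = substr($i, length(leaf) + 1)
          $i = w[++k] parens                  # swap in the next original word
        }
      print
    }' "$1" -
}
# e.g.: restuff dtest.sv.words.txt < dtest.delex.ec.penn > dtest.delex.ec.sv.penn
```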
+ | |||
+ | |||
+ | We have to replace Hajič tags in tagged Swedish Acquis by tags compatible with <a href=' | ||
+ | |||
+ | < | ||
+ | -g ../ | ||
+ | < dtest.delex.txt \ | ||
+ | | ~/ | ||
+ | > dtest.delex.ec.penn | ||
+ | $PARSINGROOT/ | ||
+ | |||
+ | P = 50.85 %, R = 55.25 %, F = < | ||