Differences
This shows you the differences between two versions of the page.
user:zeman:parser-adaptation [2008/03/14 14:18] zeman |
user:zeman:parser-adaptation [2008/07/09 17:12] (current) zeman |
||
---|---|---|---|
Line 9: | Line 9: | ||
This thing is not (yet) under version control. | This thing is not (yet) under version control. | ||
+ | |||
+ | There is a '' | ||
===== To do ===== | ===== To do ===== | ||
Line 22: | Line 24: | ||
* As reviewers suggested, do more error analysis. What is the nature of the most frequent errors that the adapted parser makes and the Danish parser does not? Are they caused by lexical divergences? | * As reviewers suggested, do more error analysis. What is the nature of the most frequent errors that the adapted parser makes and the Danish parser does not? Are they caused by lexical divergences? | ||
* Use Swedish training data instead of Acquis. Strip the structure from them but keep the gold-standard POS tags. This experiment could show the impact of tagging errors. It provides a different domain, too. | * Use Swedish training data instead of Acquis. Strip the structure from them but keep the gold-standard POS tags. This experiment could show the impact of tagging errors. It provides a different domain, too. | ||
+ | |||
+ | ====== Notes copied from the UMIACS wiki ====== | ||
+ | |||
+ | This page describes an experiment conducted by [[User: | ||
+ | |||
+ | =====Paths===== | ||
+ | |||
+ | Note: I am going to move some things around, especially those in my home folder. | ||
+ | |||
+ | * '' | ||
+ | * ''/ | ||
+ | * ''/ | ||
+ | * ''/ | ||
+ | |||
+ | =====Paper plans===== | ||
+ | |||
+ | [[http:// | ||
+ | |||
+ | Note: [[http:// | ||
+ | |||
+ | * Motivation | ||
+ | * Using related languages | ||
+ | * Similarities and differences | ||
+ | * Mapping POS tag set | ||
+ | * Evaluation | ||
+ | * Experiments | ||
+ | * Baseline 1: train on the Danish treebank, test on the Swedish treebank | ||
+ | * Experiment 1: Danish glosses | ||
+ | * Reranking | ||
+ | * Delexicalized | ||
+ | * Classifier combo | ||
+ | * Bootstrapping | ||
+ | * Error analysis | ||
+ | * How useful is it for a new pair of languages? | ||
+ | |||
+ | |||
+ | =====Agenda===== | ||
+ | |||
+ | ====Done==== | ||
+ | |||
+ | ===No tricks, no Acquis=== | ||
+ | |||
+ | * Get [[Danish treebank]]. | ||
+ | * Normalize it (structural annotation guidelines). | ||
+ | * Coordinations | ||
+ | * Prepositional phrases | ||
+ | * Subordinated clauses | ||
+ | * Punctuation (final, quotes, brackets...) | ||
+ | * Numbers, dates, personal names | ||
+ | * Auxiliaries | ||
+ | * Convert it to the Penn Treebank format. | ||
+ | * Split training, development and evaluation data. | ||
+ | * Train Danish Charniak and test it on dev data. | ||
+ | * Train Danish Brown and test it on dev data. | ||
+ | |||
+ | * Get [[Swedish treebank]]. | ||
+ | * Normalize it (structural annotation guidelines). | ||
+ | * Convert it to the Penn Treebank format. | ||
+ | * Split training, development and evaluation data. | ||
+ | * Train Swedish Charniak and test it on dev data. | ||
+ | * Train Swedish Brown and test it on dev data. | ||
+ | * Get Swedish Charniak learning curve. Train on 50, 100, 200, 500, 1000, 2000, 5000, ALL Swedish sentences. Test on Swedish dev data. | ||
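A minimal sketch of how the training subsets for the learning curve could be cut, assuming the treebank is held as a list with one sentence per item. The function name and the toy data are hypothetical; the sizes are the ones listed above. Nested prefixes are used so that each larger subset contains the smaller ones, which is one common choice and not necessarily what was done here.

```python
def learning_curve_subsets(sentences, sizes=(50, 100, 200, 500, 1000, 2000, 5000)):
    """Yield (label, subset) pairs of nested training prefixes, ending with ALL."""
    for size in sizes:
        # Skip sizes that would not be smaller than the full data set.
        if size < len(sentences):
            yield str(size), sentences[:size]
    yield "ALL", sentences


if __name__ == "__main__":
    # Hypothetical stand-in for the treebank: one bracketed string per sentence.
    toy_treebank = [f"(S (X w{i}))" for i in range(1200)]
    for label, subset in learning_curve_subsets(toy_treebank):
        print(label, len(subset))
```

Each subset would then be used to train one parser, and every parser would be tested on the same Swedish dev data.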
+ | |||
+ | * Test Danish Charniak on Swedish dev data. | ||
+ | * Test Danish Brown on Swedish dev data. | ||
+ | |||
+ | ===Acquis=== | ||
+ | |||
+ | * Get big Swedish data (JRC [[Acquis]] corpus). | ||
+ | * Extract plain text from the TEI XML. | ||
+ | * Tokenize it. | ||
+ | * Find sentence boundaries. | ||
+ | * Tag it morphologically. | ||
+ | * The [[Hajič tagger]] requires that the input be converted to CSTS format and ISO 8859-1 encoding. | ||
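The first and last preprocessing steps can be sketched as follows. This is only an illustration: it assumes that the running text sits in ''p'' elements of the TEI files, and it covers the ISO 8859-1 re-encoding but not the conversion to CSTS. The function names are hypothetical.

```python
import xml.etree.ElementTree as ET


def tei_paragraphs(xml_text):
    """Return the text of every <p> element, ignoring XML namespaces."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for elem in root.iter():
        # Strip a namespace prefix such as '{http://www.tei-c.org/ns/1.0}p'.
        tag = elem.tag.rsplit("}", 1)[-1]
        if tag == "p":
            # Join nested text and normalize whitespace.
            text = " ".join("".join(elem.itertext()).split())
            if text:
                paragraphs.append(text)
    return paragraphs


def to_latin1(text):
    """Encode for a tool expecting ISO 8859-1; unmappable characters become '?'."""
    return text.encode("iso-8859-1", errors="replace")
```

Swedish letters such as å, ä and ö survive the re-encoding, but typographic characters outside ISO 8859-1 (dashes, curly quotes) are replaced, so it may be worth normalizing them before this step.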
+ | |||
+ | * Parse Swedish Acquis using Danish Brown. | ||
+ | * Train Charniak on parsed Swedish Acquis. | ||
+ | * Test it on Swedish dev data. | ||
+ | * Modify: train on parsed Swedish Acquis + Danish Treebank. | ||
+ | * Modify: train Charniak as above, combine with reranker trained on Danish Treebank. | ||
+ | |||
+ | ====To do==== | ||
+ | |||
+ | <div style=' | ||
+ | * Unify the tag sets of both treebanks (da, sv) and of the output of the Hajič tagger. | ||
+ | * Repeat the whole process above with various methods of making the Swedish data look like Danish: | ||
+ | * Regular expressions for changing typical spelling patterns. | ||
+ | * Use alignment to find similar (minimal edit distance) words in both languages, then replace the words. | ||
+ | * Use morphology. Replace terminals by preterminals, | ||
+ | * Use morphology but do not replace terminals by preterminals. Instead, only try to impose your tags on the parser. (Stanford parser should be able to take them. I am not sure about the Charniak parser.) | ||
+ | </ | ||
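The edit-distance idea from the list above could look roughly like this. It is a sketch under simplifying assumptions: it searches a plain Danish vocabulary and does not use any alignment, and the threshold ''max_distance'' is a hypothetical parameter.

```python
def edit_distance(a, b):
    """Levenshtein distance with unit costs for insert, delete and substitute."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,                       # delete char_a
                current[j - 1] + 1,                    # insert char_b
                previous[j - 1] + (char_a != char_b),  # substitute or match
            ))
        previous = current
    return previous[-1]


def closest_word(word, vocabulary, max_distance=2):
    """Return the nearest vocabulary word, or `word` itself if none is close enough."""
    if not vocabulary:
        return word
    # Ties are broken alphabetically so that the result is deterministic.
    best = min(vocabulary, key=lambda cand: (edit_distance(word, cand), cand))
    return best if edit_distance(word, best) <= max_distance else word
```

For example, ''closest_word("och", ["og", "er", "ikke"])'' picks ''og''. Restricting the candidates to words that co-occur in aligned sentences would cut down on false friends.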
+ | |||
+ | =====Baselines without Acquis===== | ||
+ | |||
+ | See [[Danish Dependency Treebank]] for baseline Danish results. Summary: Charniak achieves F = < | ||
+ | |||
+ | See [[Talbanken05]] for baseline Swedish results. Summary: Charniak achieves <span style=' | ||
+ | |||
+ | ====Train Danish, parse Swedish==== | ||
+ | |||
+ | //Note:// The parser is trained on Danish and tested on Swedish, so most words in the test data are unknown to it. | ||
+ | |||
+ | ===Charniak=== | ||
+ | |||
+ | < | ||
+ | $PARSINGROOT/ | ||
+ | $PARSINGROOT/ | ||
+ | |||
+ | P = 37.03 %, R = 35.11 %, F = < | ||
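For reference, P, R and F of this kind are usually labeled-bracket (PARSEVAL) scores. The sketch below shows that computation on Penn-style bracketings. It is not the evaluation script used here, and it makes no special provision for punctuation or the root label, which standard evaluation tools typically handle through their own settings. All names are hypothetical.

```python
from collections import Counter


def labeled_brackets(tree):
    """Return a Counter of (label, start, end) spans for the phrasal nodes of a tree."""
    tokens = tree.replace("(", " ( ").replace(")", " ) ").split()
    spans = Counter()
    stack = []      # open nodes as [label, start position, is_preterminal]
    position = 0    # number of words consumed so far
    i = 0
    while i < len(tokens):
        if tokens[i] == "(":
            # A node may lack a label, as in "( (S ...))".
            has_label = tokens[i + 1] not in ("(", ")")
            stack.append([tokens[i + 1] if has_label else "", position, False])
            i += 2 if has_label else 1
        elif tokens[i] == ")":
            label, start, is_preterminal = stack.pop()
            if not is_preterminal:
                spans[(label, start, position)] += 1
            i += 1
        else:
            # A word: its parent is a preterminal (POS tag) and is not scored.
            stack[-1][2] = True
            position += 1
            i += 1
    return spans


def parseval(gold_trees, test_trees):
    """Return (precision, recall, F) over all sentence pairs, as fractions."""
    matched = gold_total = test_total = 0
    for gold, test in zip(gold_trees, test_trees):
        g, t = labeled_brackets(gold), labeled_brackets(test)
        matched += sum((g & t).values())
        gold_total += sum(g.values())
        test_total += sum(t.values())
    precision = matched / test_total if test_total else 0.0
    recall = matched / gold_total if gold_total else 0.0
    f = 2 * precision * recall / (precision + recall) if matched else 0.0
    return precision, recall, f
```

F is the harmonic mean of P and R, computed from the bracket counts over the whole test set rather than averaged per sentence.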
+ | |||
+ | ===Brown=== | ||
+ | |||
+ | < | ||
+ | $PARSINGROOT/ | ||
+ | |||
+ | P = 38.23 %, R = 35.37 %, F = < | ||
+ | |||
+ | ====Glosses==== | ||
+ | |||
+ | One way of making unknown words known is to translate Swedish to Danish, parse the translated text and “restuff” the trees with the original Swedish words. Good translation is hard, and if the translated text does not have the same number of words as the source (which is normal), we cannot tell how to map the parse back onto the Swedish words. However, Danish and Swedish are closely related, so we can derive a glossary from aligned corpora, perform a word-by-word translation, | ||
+ | |||
+ | The following commands: | ||
+ | - translate Swedish test data word-by-word to Danish | ||
+ | - parse translated test data | ||
+ | - restuff parses with original Swedish words | ||
+ | - evaluate restuffed parses | ||
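The translate and restuff steps can be sketched as follows. This is an illustration only, assuming the glossary is a plain word-to-word dictionary and the parser output is a one-line Penn-style bracketing. The function names, the glossary entries and the tree labels in the usage example are hypothetical.

```python
import re

# Matches a preterminal "(TAG word)" in a Penn-style bracketing.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def gloss(words, glossary):
    """Translate word by word; words missing from the glossary stay unchanged."""
    return [glossary.get(word, word) for word in words]


def restuff(tree, original_words):
    """Put the original words back into the leaves of a parse, left to right."""
    if len(LEAF.findall(tree)) != len(original_words):
        raise ValueError("tree and sentence have different numbers of words")
    words = iter(original_words)
    return LEAF.sub(lambda m: f"({m.group(1)} {next(words)})", tree)


if __name__ == "__main__":
    swedish = ["jag", "sover", "inte"]
    danish = gloss(swedish, {"jag": "jeg", "inte": "ikke"})
    # Stand-in for the parser output on the glossed sentence.
    parsed = "(S (NP (PRON jeg)) (VP (V sover) (ADV ikke)))"
    print(danish)
    print(restuff(parsed, swedish))
```

Because the gloss is strictly one word for one word, the leaf count of the parse always matches the Swedish sentence, which is what makes restuffing possible.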
+ | |||
+ | ===Charniak=== | ||
+ | |||
+ | < | ||
+ | > dtest.dagloss.txt | ||
+ | $PARSINGROOT/ | ||
+ | -o dtest.dagloss.daec.penn | ||
+ | $PARSINGROOT/ | ||
+ | > dtest.dagloss.daec.sv.penn | ||
+ | $PARSINGROOT/ | ||
+ | | tee result.dagloss.daec.txt</ | ||
+ | |||
+ | P = 50.03 %, R = 55.19 %, F = < | ||
+ | |||
+ | ===Brown=== | ||
+ | |||
+ | < | ||
+ | -o dtest.dagloss.daecmj.penn | ||
+ | $PARSINGROOT/ | ||
+ | > dtest.dagloss.daecmj.sv.penn | ||
+ | $PARSINGROOT/ | ||
+ | | tee result.dagloss.daecmj.txt</ | ||
+ | |||
+ | P = 51.01 %, R = 55.25 %, F = < | ||
+ | |||
+ | ====Delexicalized data==== | ||
+ | |||
+ | Another way of getting rid of unknown words is to replace words by their tags. We use Danish/ | ||
+ | |||
+ | See [[Danish Dependency Treebank]] and [[Talbanken05]], | ||
+ | |||
+ | Again, we perform the following steps: | ||
+ | |||
+ | | | < | ||
+ | | 1. | Delexicalize Danish training data. | Delexicalize Danish training data. | | ||
+ | | 2. | Delexicalize Swedish test data. | Delexicalize Swedish Acquis. | | ||
+ | | 3. | Train Charniak on delexicalized Danish training data. | Train Charniak on delexicalized Danish training data. | | ||
+ | | 4. | Use resulting parser to parse delexicalized Swedish test data. | Use resulting parser to parse delexicalized Swedish Acquis. | | ||
+ | | 5. | Restuff trees with Swedish words. | Restuff Acquis trees with Swedish words. | | ||
+ | | 6. | Evaluate. | Train Charniak on restuffed Acquis trees. | | ||
+ | | 7. | | Use resulting parser to parse (undelexicalized) Swedish test data. | | ||
+ | | 8. | | Evaluate. | | ||
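The delexicalization in steps 1 and 2 can be sketched as below. It assumes Penn-style bracketings and that a delexicalized leaf simply repeats its tag, i.e. ''(TAG word)'' becomes ''(TAG TAG)''. The exact output format of the scripts used here may differ, and the function names and tree labels are hypothetical.

```python
import re

# Matches a preterminal "(TAG word)" in a Penn-style bracketing.
LEAF = re.compile(r"\(([^\s()]+) ([^\s()]+)\)")


def delexicalize(tree):
    """Replace every word by its preterminal tag: (TAG word) becomes (TAG TAG)."""
    return LEAF.sub(r"(\1 \1)", tree)


def delexicalized_sentence(tree):
    """Return the tag sequence that the parser sees instead of the words."""
    return [tag for tag, _word in LEAF.findall(tree)]


if __name__ == "__main__":
    tree = "(S (NP (PRON jag)) (VP (V sover)))"
    print(delexicalize(tree))
    print(" ".join(delexicalized_sentence(tree)))
```

Training trees go through ''delexicalize''. For test data and Acquis, only the tag sequence is needed as parser input, and the words are put back into the resulting trees afterwards.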
+ | |||
+ | |||
+ | We have to replace Hajič tags in tagged Swedish Acquis by tags compatible with <a href=' | ||
+ | |||
+ | < | ||
+ | -g ../ | ||
+ | < dtest.delex.txt \ | ||
+ | | ~/ | ||
+ | > dtest.delex.ec.penn | ||
+ | $PARSINGROOT/ | ||
+ | |||
+ | P = 50.85 %, R = 55.25 %, F = < | ||