Differences

This shows you the differences between two versions of the page.

--- user:zeman:parser-adaptation [2008/03/14 13:53]
zeman created
+++ user:zeman:parser-adaptation [2008/07/09 17:12] (current)
zeman
@@ Line 2: / Line 2: @@
 I have many older notes in [[file://C:\Documents and Settings\Dan\Dokumenty\Lingvistika\Projekty\zaznam.mht|Word]].
+Even older notes are on the [[https://wiki.cs.umd.edu/nlpwiki/index.php?title=Parser_Adaptation|UMIACS NLP Wiki]].
+<code>setenv PADAPT /net/work/people/zeman/padapt
+cd $PADAPT</code>
+This thing is not (yet) under version control.
+There is a ''Makefile'' in ''$PADAPT''. ''make all'' should take care of all numbers in all experiments. However, some of the procedures (especially reranker training) are better run separately on the LRC cluster.
+===== To do =====
+  * Classifier combination. Merge Charniak n-best lists from gloss and delex (and acquis-gloss and acquis-delex). Either let them vote, or let the reranker select from the merged list.
+  * Find a way to estimate trustworthiness of parses in self-training. Charniak’s rank in n-best list? Voting of three or more parsers? Some sort of sentence classifier that would estimate how easy it is for Charniak to parse? Maybe start with short sentences only? If we can reliably say which Charniak parses can be trusted, we can restrict self-training to those and see what happens.
+  * This follows directly from the bootstrapping experiment we discussed earlier. That experiment failed, but it might have succeeded had we been able to distinguish good examples from bad ones.
+  * Reverse the gloss experiment. “Translate” (gloss) Danish training data to Swedish, train a parser on it, test the parser on Swedish.
+  * Refine glossing. Translate the N (100?) most frequent words. Of the rest, translate only M (4? 5?)-letter suffixes. Suffixation is another way of delexicalization and it would also be interesting to see how it affects accuracy of monolingual parsing.
+  * Explore the learning curve of the adapted parsers, as one of the reviewers suggested. Gradually increase the amount of Danish training data, monitor the changes in Swedish parsing accuracy.
+  * Test other languages with different degrees of relatedness. No new research, thus low priority; on the other hand, this could answer the doubts one of the reviewers had.
+  * Test a dependency parser in a similar setting. After all, the treebanks we work with are of dependency origin.
+  * More error analysis, as the reviewers suggested. What is the nature of the most frequent errors that the adapted parser makes and the Danish parser does not? Are they caused by lexical divergences? Morpho-syntactic ones? Domain mismatch?
+  * Use Swedish training data instead of Acquis. Strip the structure from them but keep the gold-standard POS tags. This experiment could show the impact of tagging errors. It provides a different domain, too.
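The "refine glossing" idea from the list above could be prototyped roughly as follows. This is only a sketch under assumptions: the glossary is taken to be an in-memory dictionary, the function names are hypothetical, and words outside the N most frequent are simply reduced to their last M letters (translating the suffixes themselves is left out).

```python
from collections import Counter


def build_frequent_set(tokens, n=100):
    """Return the n most frequent word forms of a tokenized corpus."""
    return {word for word, _ in Counter(tokens).most_common(n)}


def refine_gloss(sentence, glossary, frequent, m=4):
    """Gloss frequent words via the glossary; reduce all others to m-letter suffixes."""
    out = []
    for word in sentence:
        if word in frequent and word in glossary:
            out.append(glossary[word])
        else:
            # Assumption: the suffix is kept as is; words shorter than m stay whole.
            out.append(word[-m:])
    return out
```

Values of N and M would have to be tuned on development data; the defaults only mirror the numbers guessed in the list.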
+====== Notes copied from the UMIACS wiki ======
+This page describes an experiment conducted by [[User:Zeman|Dan Zeman]] in November and December 2006.
+=====Paths=====
+Note: I am going to move around some stuff, especially that in my home folder.
+  * ''$PARSINGROOT'' - working copy of the parsers and related scripts. See [[Parsing]] on how to create your own.
+  * ''/nfshomes/zeman/nastroje/morfologie/tagset'' - tagset drivers for tagset mapping
+  * ''/nfshomes/zeman/nastroje/taggery/hajic-sv/2006-11-08'' - Swedish morphological tagger by Jan Hajič
+  * ''/fs/clip-corpora/Europarl/acquis'' - Danish and Swedish Acquis corpus, including everything I made of it
+=====Paper plans=====
+[[http://cs.jhu.edu/EMNLP-CoNLL-2007/|EMNLP 2007]] (Prague) could be a good forum. The submission deadline is March 26. Roughly speaking, we could spend February on further experiments. The three weeks in March would be devoted to writing the paper and possibly to minor experiments whose necessity arises during writing.
+Note: [[http://nextens.uvt.nl/depparse-wiki/SharedTaskWebsite|CoNLL shared task 2007]] has a **domain** adaptation track (English) that is also quite related to what we do. However, it is domain adaptation (as opposed to language adaptation; the language here is English) and it is dependency parsing, while we have conducted all our experiments so far with constituent trees. Anyway, if we have time, we can try this as well. Dan has the training data and the test phase will start right after the deadline for EMNLP.
+  * Motivation
+  * Using related languages
+    * Similarities and differences
+    * Mapping POS tag set
+      * Evaluation
+  * Experiments
+    * Baseline 1: train Danish treebank, test Swedish treebank
+    * Experiment 1: Danish glosses
+    * Reranking
+    * Delexicalized
+    * Classifier combo
+    * Bootstrapping
+  * Error analysis
+  * How useful is it for a new pair of languages?
+=====Agenda=====
+====Done====
+===No tricks, no Acquis===
+  * Get [[Danish treebank]].
+  * Normalize it (structural annotation guidelines).
+    * Coordinations
+    * Prepositional phrases
+    * Subordinated clauses
+    * Punctuation (final, quotes, brackets...)
+    * Numbers, dates, personal names
+    * Auxiliaries
+  * Convert it to the Penn Treebank format.
+  * Split training, development and evaluation data.
+  * Train Danish Charniak and test it on dev data.
+  * Train Danish Brown and test it on dev data.
+  * Get [[Swedish treebank]].
+  * Normalize it (structural annotation guidelines).
+  * Convert it to the Penn Treebank format.
+  * Split training, development and evaluation data.
+  * Train Swedish Charniak and test it on dev data.
+  * Train Swedish Brown and test it on dev data.
+  * Get Swedish Charniak learning curve. Train on 50, 100, 200, 500, 1000, 2000, 5000, ALL Swedish sentences. Test on Swedish dev data.
+  * Test Danish Charniak on Swedish dev data.
+  * Test Danish Brown on Swedish dev data.
+===Acquis===
+  * Get big Swedish data (JRC [[Acquis]] corpus).
+    * Extract plain text from the TEI XML.
+    * Tokenize it.
+    * Find sentence boundaries.
+    * Tag it morphologically.
+      * The [[Hajič tagger]] requires that the input be converted to CSTS format and ISO 8859-1 encoding.
+  * Parse Swedish Acquis using Danish Brown.
+  * Train Charniak on parsed Swedish Acquis.
+  * Test it on Swedish dev data.
+  * Modify: train on parsed Swedish Acquis + Danish Treebank.
+  * Modify: train Charniak as above, combine with reranker trained on Danish Treebank.
+====To do====
+<div style='background:yellow'>
+  * Unify the tag sets of both treebanks (da, sv) and of the output of the Hajič tagger.
+  * Repeat the whole process above with various methods of making the Swedish data look like Danish:
+    * Regular expressions for changing typical spelling patterns.
+    * Use alignment to find similar (minimal edit distance) words in both languages, then replace the words.
+    * Use morphology. Replace terminals by preterminals, then train and test the parsers and rerankers.
+    * Use morphology but do not replace terminals by preterminals. Instead, only try to impose your tags on the parser. (Stanford parser should be able to take them. I am not sure about the Charniak parser.)
+</div>
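The edit-distance idea from the to-do list above could look roughly like this. It is a sketch, not the planned implementation: the candidate Danish words are assumed to come from the alignment step as a plain list, and ''closest_danish'' and its ''max_dist'' threshold are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance between strings a and b (unit costs)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                 # delete ca
                            curr[j - 1] + 1,             # insert cb
                            prev[j - 1] + (ca != cb)))   # substitute or match
        prev = curr
    return prev[-1]


def closest_danish(swedish_word, danish_candidates, max_dist=2):
    """Return the nearest Danish candidate, or None if none is within max_dist.

    danish_candidates must be non-empty; ties are broken alphabetically.
    """
    best = min(danish_candidates,
               key=lambda w: (edit_distance(swedish_word, w), w))
    return best if edit_distance(swedish_word, best) <= max_dist else None
```

For example, ''closest_danish("och", ["og", "hus", "ikke"])'' picks ''og'' at distance 2.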
+=====Baselines without Acquis=====
+See [[Danish Dependency Treebank]] for baseline Danish results. Summary: Charniak achieves F = <html><span style='background:yellow'>74.99</span></html> %, Brown 75.81 %.
+See [[Talbanken05]] for baseline Swedish results. Summary: Charniak achieves <html><span style='background:yellow'>73.02</span></html> %, Brown 72.90 %.
+====Train Danish, parse Swedish====
+//Note:// This means that most words are unknown.
+===Charniak===
+<code>cd /fs/clip-corpora/conll/swedish
+$PARSINGROOT/charniak-parser/scripts/parse.pl -g ../danish/train.ecdata.tgz < dtest.txt > dtest.daec.penn
+$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.daec.penn</code>
+P = 37.03 %, R = 35.11 %, F = <html><span style='background:yellow'>36.04</span></html> %, T = 20.62 % (T is tagging accuracy). Evaluated 317 sentences.
+===Brown===
+<code>$PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g ../danish/train.ecmj.tgz < dtest.txt > dtest.daecmj.penn
+$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.daecmj.penn</code>
+P = 38.23 %, R = 35.37 %, F = <html><span style='background:yellow'>36.74</span></html> %, T = 20.96 % (T is tagging accuracy). Evaluated 317 sentences.
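The F values reported above are consistent with the harmonic mean of P and R. A small check, with ''f_score'' being a hypothetical helper rather than part of evalb:

```python
def f_score(precision, recall):
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Charniak and Brown, trained on Danish and tested on Swedish (numbers from above).
print(round(f_score(37.03, 35.11), 2))
print(round(f_score(38.23, 35.37), 2))
```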
+====Glosses====
+One way of making unknown words known is to translate Swedish to Danish, parse the translated text and “restuff” the trees with the original Swedish words. It is not easy to do a good translation and if the translated text does not have the same number of words as the source (which is normal), we will not know how to retranslate the parse back to Swedish. However, Danish and Swedish are closely related, so we can derive a glossary from aligned corpora, perform a word-by-word translation, and be confident enough that the resulting sequence of words will keep the morphosyntactic properties of the original. [[Acquis]] is a parallel corpus that ships with an automatic paragraph-level alignment. Running [[Hiero]] steps 1 and 2 is all we need to get the sv-da glossary.
+The following commands:
+  - translate Swedish test data word-by-word to Danish
+  - parse translated test data
+  - restuff parses with original Swedish words
+  - evaluate restuffed parses
+===Charniak===
+<code>$PARSINGROOT/tools/wbwtranslate.pl -g /fs/clip-corpora/Europarl/acquis/glossary-sv-da.txt < dtest.txt \
+    > dtest.dagloss.txt
+$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -n 1 -g ../danish/train.ecdata.tgz < dtest.dagloss.txt \
+    -o dtest.dagloss.daec.penn
+$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.dagloss.daec.penn \
+    > dtest.dagloss.daec.sv.penn
+$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.dagloss.daec.sv.penn \
+    | tee result.dagloss.daec.txt</code>
+P = 50.03 %, R = 55.19 %, F = <html><span style='background:yellow'>52.48</span></html> %, T = 30.26 %.
+===Brown===
+<code>$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -n 1 -g ../danish/train.ecmj.tgz < dtest.dagloss.txt \
+    -o dtest.dagloss.daecmj.penn
+$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.dagloss.daecmj.penn \
+    > dtest.dagloss.daecmj.sv.penn
+$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.dagloss.daecmj.sv.penn \
+    | tee result.dagloss.daecmj.txt</code>
+P = 51.01 %, R = 55.25 %, F = <html><span style='background:yellow'>53.05</span></html> %, T = 29.99 %.
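The word-by-word translation and restuffing steps used in this section can be illustrated as follows. This is a sketch under assumptions, not the code of ''wbwtranslate.pl'' or ''restuff.pl'': the glossary is assumed to be a dictionary, unknown words are assumed to pass through unchanged, and trees are assumed to be one-line Penn bracketings.

```python
import re

# A leaf of a Penn-style bracketing: "(TAG word)" without nested brackets.
LEAF = re.compile(r"\(([^()\s]+) ([^()\s]+)\)")


def translate_word_by_word(tokens, glossary):
    """Replace each token by its glossary entry; unknown tokens are kept."""
    return [glossary.get(tok, tok) for tok in tokens]


def restuff(penn_tree, original_tokens):
    """Put the original tokens back into the leaves of a parsed tree."""
    n_leaves = len(LEAF.findall(penn_tree))
    if n_leaves != len(original_tokens):
        raise ValueError("tree has %d leaves but %d tokens were given"
                         % (n_leaves, len(original_tokens)))
    words = iter(original_tokens)
    # Keep the tag of each leaf, swap in the next original word.
    return LEAF.sub(lambda m: "(%s %s)" % (m.group(1), next(words)), penn_tree)
```

Restuffing works only because the translation is strictly one word to one word, so the i-th leaf of the parse corresponds to the i-th Swedish token.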
+====Delexicalized data====
+Another way of getting rid of unknown words is to replace words by their tags. We use Danish/Swedish tags that bear more information than Penn tags. The Scandinavian tags are now treated as words, while the pre-terminal level is still occupied by the less descriptive Penn tags.
+See [[Danish Dependency Treebank]] and [[Talbanken05]], respectively, for details on delexicalization and delexicalized parsing over each of the treebanks. See [[Acquis#Delexicalized_Swedish_Acquis|here]] on how the Swedish Acquis was delexicalized.
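As a rough illustration of delexicalization as described in this section, the sketch below turns a tagged sentence into parser input. The input format (triples of word, rich tag and Penn tag) and the tag values in the usage note are assumptions made for the example, not the actual treebank formats.

```python
def delexicalize(sentence):
    """Delexicalize a sentence given as (word, rich_tag, penn_tag) triples.

    Returns the token sequence fed to the parser (rich tags acting as words),
    the corresponding Penn-style leaves, and the original words for restuffing.
    """
    tokens = [rich for _, rich, _ in sentence]
    leaves = ["(%s %s)" % (penn, rich) for _, rich, penn in sentence]
    words = [word for word, _, _ in sentence]
    return tokens, leaves, words
```

With a made-up input such as ''[("jag", "PRON-nom", "PRP"), ("sover", "VERB-pres", "VBP")]'' the parser would see ''PRON-nom VERB-pres'', while the original words are kept aside for restuffing.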
+Again, we perform the following steps:
+|  | **Charniak without Acquis** | **Charniak with Acquis** |
+| 1. | Delexicalize Danish training data. | Delexicalize Danish training data. |
+| 2. | Delexicalize Swedish test data. | Delexicalize Swedish Acquis. |
+| 3. | Train Charniak on delexicalized Danish training data. | Train Charniak on delexicalized Danish training data. |
+| 4. | Use resulting parser to parse delexicalized Swedish test data. | Use resulting parser to parse delexicalized Swedish Acquis. |
+| 5. | Restuff trees with Swedish words. | Restuff Acquis trees with Swedish words. |
+| 6. | Evaluate. | Train Charniak on restuffed Acquis trees. |
+| 7. |  | Use resulting parser to parse (undelexicalized) Swedish test data. |
+| 8. |  | Evaluate. |
+We have to replace Hajič tags in tagged Swedish Acquis by tags compatible with [[DDT]].
+<code>$PARSINGROOT/charniak-parser/scripts/parse.pl \
+    -g ../danish/train.delex.ec.tgz \
+    < dtest.delex.txt \
+    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
+    > dtest.delex.ec.penn
+$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.penn</code>
+P = 50.85 %, R = 55.25 %, F = <html><span style='background:yellow'>52.96</span></html> %, T = 31.70 %. Evaluated 317 sentences.

Institute of Formal and Applied Linguistics Wiki