
Parser Adaptation

Many of my older notes are in Word.

Yet older notes exist at the UMIACS NLP Wiki.

setenv PADAPT /net/work/people/zeman/padapt
cd $PADAPT

This thing is not (yet) under version control.

There is a Makefile in $PADAPT. Running make all should regenerate the numbers for all experiments. However, some of the procedures (especially reranker training) are better run separately on the LRC cluster.

To do

Notes copied from the UMIACS wiki

This page describes an experiment. I am going to move some things around, especially those in my home folder.

  * $PARSINGROOT - working copy of the parsers and related scripts. See the Parsing page for how to create your own.

Paper plans

EMNLP 2007 (Prague) could be a good forum. The submission deadline is March 26. Roughly speaking, we could spend February with further experiments. The three weeks in March would be devoted to writing the paper and possibly minor experiments whose necessity arises during writing.

Note: The CoNLL 2007 shared task has a domain adaptation track (English) that is also quite related to what we do. However, it is domain adaptation (as opposed to language adaptation; the language there is English), and it is dependency parsing, whereas all our experiments so far have used constituent trees. Anyway, if we have time, we can try this as well. Dan has the training data, and the test phase will start right after the EMNLP deadline.

Agenda

Done

No tricks, no Acquis

Acquis

To do


Baselines without Acquis

See Danish Dependency Treebank for baseline Danish results. Summary: Charniak achieves F = 73.02 %, Brown 72.90 %.

Train Danish, parse Swedish

Note: This means that most words are unknown.

Charniak

cd /fs/clip-corpora/conll/swedish
$PARSINGROOT/charniak-parser/scripts/parse.pl -g ../danish/train.ecdata.tgz < dtest.txt > dtest.daec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.daec.penn

P = 37.03 %, R = 35.11 %, F = 36.04 %, T = 20.62 % (T is tagging accuracy). Evaluated 317 sentences.

Brown

$PARSINGROOT/brown-reranking-parser/scripts/parse.pl -g ../danish/train.ecmj.tgz < dtest.txt > dtest.daecmj.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.daecmj.penn

P = 38.23 %, R = 35.37 %, F = 36.74 %, T = 20.96 % (T is tagging accuracy). Evaluated 317 sentences.

Glosses

One way of making unknown words known is to translate Swedish to Danish, parse the translated text, and “restuff” the trees with the original Swedish words. Producing a good translation is not easy, and if the translated text does not have the same number of words as the source (which is normal), we would not know how to map the parse back to Swedish. However, Danish and Swedish are closely related, so we can derive a glossary from aligned corpora, perform a word-by-word translation, and be reasonably confident that the resulting sequence of words keeps the morphosyntactic properties of the original. Acquis is a parallel corpus that ships with an automatic paragraph-level alignment. Running Hiero steps 1 and 2 is all we need to get the sv-da glossary.
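The glossing step itself is conceptually simple. The sketch below shows the idea behind a word-by-word translator such as wbwtranslate.pl; it is only an illustration under assumed formats (a tab-separated sv-da glossary, tokenized input), not the actual script. Every token with a glossary entry is replaced by its Danish gloss and everything else is passed through unchanged, so token count and word order are preserved.

#!/usr/bin/env perl
# Sketch of word-by-word glossing (illustration only, not the real wbwtranslate.pl).
# Assumes a glossary with one "swedish<TAB>danish" pair per line and tokenized
# input text, one sentence per line, on STDIN.
use strict;
use warnings;

my $glossfile = shift(@ARGV) or die "Usage: $0 glossary.txt < input.txt\n";
open(my $g, '<', $glossfile) or die "Cannot read $glossfile: $!\n";
my %gloss;
while (<$g>)
{
    chomp;
    my ($sv, $da) = split(/\t/);
    # Keep only the first translation listed for each Swedish word.
    $gloss{$sv} = $da if defined($da) && !exists($gloss{$sv});
}
close($g);

while (<STDIN>)
{
    chomp;
    my @tokens = split(/\s+/);
    # Replace known tokens by their gloss, keep unknown ones, so the number
    # of tokens (and thus the later restuffing) stays aligned with the source.
    print(join(' ', map { exists($gloss{$_}) ? $gloss{$_} : $_ } @tokens), "\n");
}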

The following commands:

  1. translate Swedish test data word-by-word to Danish
  2. parse translated test data
  3. restuff parses with the original Swedish words (see the sketch after this list)
  4. evaluate restuffed parses
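The restuffing in step 3 only has to put the i-th original Swedish token back in place of the i-th terminal of each output tree. The sketch below illustrates that; it is not the actual restuff.pl, and it assumes one bracketed Penn-style tree per line, the same tokenization in both files, and no unescaped parentheses in the words.

#!/usr/bin/env perl
# Sketch of restuffing (illustration only, not the real restuff.pl).
# STDIN: parses of the glossed text, one bracketed tree per line.
# -s FILE: the original Swedish sentences, tokenized, one per line.
use strict;
use warnings;
use Getopt::Long;

my $srcfile;
GetOptions('s=s' => \$srcfile);
die "Usage: $0 -s original.txt < parses\n" if !defined($srcfile);
open(my $src, '<', $srcfile) or die "Cannot read $srcfile: $!\n";

while (my $tree = <STDIN>)
{
    my $sentence = <$src>;
    die "More trees than source sentences\n" if !defined($sentence);
    chomp($sentence);
    my @words = split(/\s+/, $sentence);
    my $i = 0;
    # Terminals appear as "(TAG token)"; keep the tag, swap in the original word.
    $tree =~ s{\(([^\s()]+) ([^\s()]+)\)}
              {'(' . $1 . ' ' . (defined($words[$i]) ? $words[$i++] : $2) . ')'}ge;
    print($tree);
}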

Charniak

$PARSINGROOT/tools/wbwtranslate.pl -g /fs/clip-corpora/Europarl/acquis/glossary-sv-da.txt < dtest.txt \
    > dtest.dagloss.txt
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -n 1 -g ../danish/train.ecdata.tgz < dtest.dagloss.txt \
    -o dtest.dagloss.daec.penn
$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.dagloss.daec.penn \
    > dtest.dagloss.daec.sv.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.dagloss.daec.sv.penn \
    | tee result.dagloss.daec.txt

P = 50.03 %, R = 55.19 %, F = 52.48 %, T = 30.26 %.

Brown

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -n 1 -g ../danish/train.ecmj.tgz < dtest.dagloss.txt \
    -o dtest.dagloss.daecmj.penn
$PARSINGROOT/tools/restuff.pl -s dtest.txt < dtest.dagloss.daecmj.penn \
    > dtest.dagloss.daecmj.sv.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.dagloss.daecmj.sv.penn \
    | tee result.dagloss.daecmj.txt

P = 51.01 %, R = 55.25 %, F = 53.05 %, T = 29.99 %.

Delexicalized data

Another way of getting rid of unknown words is to replace words by their tags. We use Danish/Swedish tags that bear more information than Penn tags. The Scandinavian tags are now treated as words, while the pre-terminal level is still occupied by the less descriptive Penn tags.
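Delexicalization itself is just a substitution at the terminal level. The sketch below shows the test-data variant under assumed formats (a tokenized text file plus a parallel file with one fine-grained tag per token); it is an illustration only, not the actual delexicalization script. For the training trees the same substitution is applied to the terminals of the bracketed trees, while the Penn pre-terminals stay in place.

#!/usr/bin/env perl
# Sketch of delexicalization (illustration only; file formats are assumptions).
# words.txt: tokenized text, one sentence per line.
# tags.txt:  the corresponding fine-grained Danish/Swedish tags, one sentence
#            per line, same number of tokens.
# Output: the tag sequences, i.e. the tags used as if they were words.
use strict;
use warnings;

my ($wordfile, $tagfile) = @ARGV;
die "Usage: $0 words.txt tags.txt > delex.txt\n" if !defined($tagfile);
open(my $w, '<', $wordfile) or die "Cannot read $wordfile: $!\n";
open(my $t, '<', $tagfile)  or die "Cannot read $tagfile: $!\n";

while (my $wline = <$w>)
{
    my $tline = <$t>;
    die "$tagfile has fewer lines than $wordfile\n" if !defined($tline);
    my @words = split(/\s+/, $wline);
    my @tags  = split(/\s+/, $tline);
    die "Token count mismatch at line $.\n" if scalar(@words) != scalar(@tags);
    # The parser now sees the tags as surface forms, so no word is unknown.
    print(join(' ', @tags), "\n");
}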

See Danish Dependency Treebank and Talbanken05, respectively, for details on delexicalization and delexicalized parsing over each of the treebanks. See Delexicalized Swedish Acquis for how the Swedish Acquis was delexicalized.

Again, we perform the following steps:

Charniak without Acquis:
  1. Delexicalize Danish training data.
  2. Delexicalize Swedish test data.
  3. Train Charniak on delexicalized Danish training data.
  4. Use resulting parser to parse delexicalized Swedish test data.
  5. Restuff trees with Swedish words.
  6. Evaluate.

Charniak with Acquis:
  1. Delexicalize Danish training data.
  2. Delexicalize Swedish Acquis.
  3. Train Charniak on delexicalized Danish training data.
  4. Use resulting parser to parse delexicalized Swedish Acquis.
  5. Restuff Acquis trees with Swedish words.
  6. Train Charniak on restuffed Acquis trees.
  7. Use resulting parser to parse (undelexicalized) Swedish test data.
  8. Evaluate.

We have to replace the Hajič tags in the tagged Swedish Acquis with tags compatible with the DDT.
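A sketch of what that replacement could look like, purely for illustration: the mapping table below is a placeholder, not the real Hajič-to-DDT conversion, and the word/TAG input format is an assumption.

#!/usr/bin/env perl
# Sketch of retagging (illustration only; the mapping entries are placeholders,
# not the real Hajic-to-DDT table). Assumed input: one sentence per line,
# tokens in word/TAG form.
use strict;
use warnings;

my %hajic2ddt = (
    # purely illustrative placeholders; the real table must cover the whole tagset
    'SOURCE_NOUN_TAG' => 'TARGET_NOUN_TAG',
    'SOURCE_VERB_TAG' => 'TARGET_VERB_TAG',
);

while (<STDIN>)
{
    chomp;
    my @out;
    foreach my $token (split(/\s+/))
    {
        my ($word, $tag) = $token =~ m{^(.+)/([^/]+)$}
            or die "Cannot parse token '$token' on line $.\n";
        # Fail loudly on tags missing from the table so gaps are noticed.
        die "No mapping for tag '$tag' on line $.\n" if !exists($hajic2ddt{$tag});
        push(@out, "$word/$hajic2ddt{$tag}");
    }
    print(join(' ', @out), "\n");
}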

$PARSINGROOT/charniak-parser/scripts/parse.pl \
    -g ../danish/train.delex.ec.tgz \
    < dtest.delex.txt \
    | ~/projekty/stanford/tools/restuff.pl -s dtest.txt \
    > dtest.delex.ec.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm dtest.penn dtest.delex.ec.penn

P = 50.85 %, R = 55.25 %, F = 52.96 %, T = 31.70 %. Evaluated 317 sentences.

