This page describes an experiment conducted by Dan Zeman in November and December 2006.
I am trying to replicate the self-training experiment of David McClosky, Eugene Charniak, and Mark Johnson (NAACL 2006, New York). The idea: train a parser on small data, run it over big data, retrain it on its own output for the big data, and get a better-performing parser. The folks at Brown University used Charniak's reranking parser, i.e. a parser-reranker pipeline. The big data was parsed by the whole reranking parser, but only the first-stage parser was retrained on it; the reranker only ever saw the small data.
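In script form, one round of this self-training loop looks roughly as follows, using the train.pl and cluster-parse.pl scripts that appear later on this page (small.penn, big.txt and the model file names are placeholders, not files from the experiment):

# Step 0: train the first-stage parser on the small (gold) treebank.
$PARSINGROOT/charniak-parser/scripts/train.pl small.penn > model0.tgz
# Step 1: parse the big raw corpus with the full reranking parser.
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g wsj.tgz < big.txt -o big.auto.penn
# Step 2: retrain the first-stage parser on gold plus automatically parsed trees;
# the reranker itself is not retrained.
$PARSINGROOT/charniak-parser/scripts/train.pl small.penn big.auto.penn > model1.tgz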
Once the original self-training experiment works as expected, we are going to use a similar scheme for parser adaptation to a new language.
Note: I am going to move some files around, especially those in my home folder.
$PARSINGROOT
- working copy of the parsers and related scripts. See Parsing on how to create your own.
/fs/clip-corpora/north_american_news
- North American News Text Corpus, including everything I made of it.

We have the parser and the reranker (both together = the reranking parser), both trained (probably) on Penn Treebank Wall Street Journal sections 02-21 (the training was already done at Brown).
Plain text is tokenized with the Hiero tokenizer:
$HIEROROOT/preprocess/tokenizeE.pl - - < plainText > tokenizedText
We tested the pretrained reranking parser (PR0) on sections 22 (development) and 23 (final evaluation test). All evaluations take into account only sentences of 40 or fewer words.
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown
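evalb prints a long per-sentence listing before its summary; assuming the standard evalb summary labels, a filter like this keeps just the headline numbers:

$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown \
    | grep -E 'Bracketing (Recall|Precision|FMeasure)|Tagging accuracy|Average crossing'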
| Metric | Section 22, Charniak | Section 22, Brown | Section 23, Charniak | Section 23, Brown |
| Precision | 90.54 | 92.81 | 90.43 | 92.35 |
| Recall | 90.43 | 91.92 | 90.21 | 91.61 |
| F-score | 90.48 | 92.36 | 90.32 | 91.98 |
| Tagging accuracy | 96.15 | 92.41 | 96.78 | 92.33 |
| Avg. crossing brackets | 0.66 | 0.49 | 0.72 | 0.59 |
See North American News Text Corpus for more information on the data and its preparation.
See the Brown Reranking Parser page for more information on that parser. We parsed the LATWP part of NANTC on the C cluster using the following commands:
cd /fs/clip-corpora/north_american_news
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -l en -g wsj.tgz < latwp.04.clean.txt \
    -o latwp.05a.brown.penn -w workdir05 -k
Parsing takes about 80 CPU-hours.
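The parsed file should contain exactly one tree per input sentence (the inputs here are one sentence per line), so a quick line count verifies that the cluster did not silently drop anything:

# Both counts should agree
wc -l latwp.04.clean.txt latwp.05a.brown.penn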
The following command trains the Charniak parser on 5 copies of sections 02-21 of the Penn Treebank Wall Street Journal, plus 1 copy of the parsed part of NANTC (3,143,433 sentences).
$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.05a.brown.penn > ptb+latwp3000.tgz
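The 5-copies file is nothing special; if it ever has to be recreated, something like the following works, assuming a hypothetical single-copy file ptbwsj02-21.penn with one tree per line:

# Concatenate five copies of the gold trees
for i in 1 2 3 4 5; do cat ptbwsj02-21.penn; done > ptbwsj02-21.5times.penn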
The new non-reranking parser will be called P1. The reranking parser P1+R will be called PR1.
$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj22.txt \
    > ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj23.txt \
    > ptbwsj23.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp3000.penn
| Metric | Section 22 | Section 23 |
| Precision | 87.74 | 88.26 |
| Recall | 88.65 | 88.54 |
| F-score | 88.19 | 88.40 |
| Tagging accuracy | 92.67 | 92.84 |
| Avg. crossing brackets | 0.80 | 0.91 |
First we combine the new parser with the old reranker.
$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp3000.tgz wsj.tgz \
    > ptb+latwp3000.brown.tgz
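The model files are ordinary gzipped tarballs, so listing the combined archive is a cheap sanity check that both components made it in (the member names depend on the model layout and are not documented here):

tar tzf ptb+latwp3000.brown.tgz | head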
Then we use the combined model to parse the test data.
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp3000.penn
| Metric | Section 22 | Section 23 |
| Precision | 90.39 | 90.68 |
| Recall | 90.30 | 90.24 |
| F-score | 90.34 | 90.46 |
| Tagging accuracy | 93.43 | 93.65 |
| Avg. crossing brackets | 0.61 | 0.71 |
McClosky et al. do not use all 3 million sentences. They found that the best results on their development data (section 22) were obtained by mixing 5 copies of Penn Treebank Wall Street Journal sections 02-21 with (the first?) 1,750,000 sentences from NANTC LATWP.
head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn
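A line count confirms the cut:

# Should print 1750000
wc -l latwp.1750k.brown.penn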
Train Charniak on this mix. It takes more than 4 hours.
$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.1750k.brown.penn > ptb+latwp1750.tgz
Create new Brown model: combine the new parser with the old reranker.
$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp1750.tgz wsj.tgz \
    > ptb+latwp1750.brown.tgz
Parse the test sections and evaluate the results.
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp1750.penn
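The four parse-and-evaluate pairs above follow one pattern, so the same work can be written as a loop (equivalent commands, just less repetition):

for sec in 22 23; do
    # parser alone (ec) and reranking parser (br)
    $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
        < $PTB/ptbwsj$sec.txt -o ptbwsj$sec.ec.ptb+latwp1750.penn
    $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
        < $PTB/ptbwsj$sec.txt -o ptbwsj$sec.br.ptb+latwp1750.penn
    for kind in ec br; do
        $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm \
            $PTB/ptbwsj$sec.penn ptbwsj$sec.$kind.ptb+latwp1750.penn
    done
done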
Train Charniak on 5 copies of PTB WSJ 02-21, without any trees from NANTC.
$PARSINGROOT/charniak-parser/scripts/train.pl < ptbwsj02-21.5times.penn > 5ptb.tgz
Parse sections 22 and 23 using the newly trained model.
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.5ptb.penn
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.5ptb.penn
The following table combines tables from previous sections and from the McClosky et al. (2006) paper. I am not sure why I do not get the same baseline numbers as McClosky et al. Possibly they evaluate all sentences rather than just those of 40 or fewer words.
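One way to check this hypothesis without retraining anything: evalb's summary normally contains both an all-sentences block and a block for sentences up to CUTOFF_LEN (presumably 40 in charniak.prm), so the two blocks of the baseline report can be compared directly:

$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown \
    | sed -n '/=== Summary ===/,$p'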
Remember that "Charniak parser" means without the reranker, while "Brown parser" means with the reranker. PTB WSJ (or WSJ) in training means sections 02-21; 50k NANTC means 50,000 sentences of NANTC LATWP. All figures are f-scores.
| NANTC parser | trained on | Test parser | trained on | 22: McClosky | 22: Zeman | 23: McClosky | 23: Zeman |
| | | Stanford | PTB WSJ | 86.5 | | | |
| | | Charniak | PTB WSJ | 90.3 | 90.5 | 89.7 | 90.3 |
| Brown | PTB WSJ | Charniak | WSJ + 50k NANTC | 90.7 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 250k NANTC | 90.7 | 91.0 | 90.9 | |
| Brown | PTB WSJ | Charniak | WSJ + 500k NANTC | 90.9 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 750k NANTC | 91.0 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 1000k NANTC | 90.8 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 1500k NANTC | 90.8 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 2000k NANTC | 91.0 | | | |
| Brown | PTB WSJ | Charniak | 5 × WSJ | | 84.7 | | |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 1750k NANTC | | 87.6 | 91.0 | 87.9 |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 3143k NANTC | | 88.2 | | 88.4 |
| | | Brown | PTB WSJ | | 92.4 | 91.3 | 92.0 |
| Brown | PTB WSJ | Brown | WSJ + 50k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 250k NANTC | 92.3 | 92.2 | 92.3 | |
| Brown | PTB WSJ | Brown | WSJ + 500k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 750k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 1000k NANTC | 92.2 | | | |
| Brown | PTB WSJ | Brown | WSJ + 1500k NANTC | 92.1 | | | |
| Brown | PTB WSJ | Brown | WSJ + 2000k NANTC | 92.0 | | | |
| Brown | PTB WSJ | Brown | 5 × WSJ + 1750k NANTC | | 89.9 | 92.1 | 90.0 |
| Brown | PTB WSJ | Brown | 5 × WSJ + 3143k NANTC | | 90.3 | | 90.5 |