This page describes an experiment conducted by [[User:Zeman:start|Dan Zeman]] in November and December 2006. I am trying to repeat the self-training experiment of David McClosky, Eugene Charniak, and Mark Johnson ([[http://www.cog.brown.edu/~mj/papers/naacl06-self-train.pdf|NAACL 2006, New York]]): train a parser on small data, run it over big data, retrain it on its own output for the big data, and end up with a better-performing parser. The folks at Brown University used Charniak's reranking parser, i.e. a parser-reranker sequence. The big data was parsed by the whole reranking parser, but only the first-stage parser was retrained on it. The reranker only ever saw the small data.

Once the original self-training experiment works as expected, we are going to use a similar scheme for [[Parser Adaptation|parser adaptation]] to a new language.

=====Paths=====

Note: I am going to move some of this around, especially the data in my home folder.

  * ''$PARSINGROOT'' - working copy of the parsers and related scripts. See [[:parsery|Parsing]] on how to create your own.
  * ''/fs/clip-corpora/ptb/processed'' - [[Penn Treebank]] (referred to as ''$PTB'')
  * ''/fs/clip-corpora/north_american_news'' - [[North American News Text Corpus]], including everything I made of it

=====What do we need?=====

  * Small data - a treebank.
  * Big data - raw text. (For the language adaptation task we must also be able to tag it using the same tagset as the small data.)
  * A parser able to do N-best parsing: Charniak's or Stanford. We must be able to retrain the parser.
  * A reranker (Johnson's). We do not need to retrain it as long as our small data is the Penn Treebank.

=====Terminology=====

We have the parser and the reranker (both together = the reranking parser), both trained (probably) on Penn Treebank Wall Street Journal sections 02-21 (the training was already done at Brown).

  * **Charniak parser,** parser, first-stage parser or P (possibly with an index) denote the Charniak parser without the reranker. It can return the N best parses for various N, including N=1.
  * Reranker, second stage or R (typically without an index) denote the reranker. It can return a reordered N-best list, but usually we only want the single best parse.
  * **Brown parser,** reranking parser or PR (possibly with an index) denote the sequence (P, R). It returns the single best parse.
  * The models pretrained at Brown and delivered with the reranking parser are indexed with 0: P0, R0 (or R) and PR0.

=====Agenda=====

  * Install the parser-reranker suite.
  * Build it for the right architecture, using the right optimizations.
  * Investigate parallelization possibilities.
  * Write invocation scripts.
  * Get the baseline: test PR0 on PTB WSJ 22. We will use section 22 for development and section 23 for "final testing". The distinction is not very important since we are only repeating someone else's experiment and are not going to publish, but just in case we need to look more closely at the data, we will follow the usual procedure.
  * Get the big raw corpus (North American News Text, NANT).
  * Separate the Los Angeles Times part (from now on, this is what we mean by NANT).
  * Convert SGML to plain text.
  * Tokenize it (''$HIEROROOT/preprocess/tokenizeE.pl - - < plainText > tokenizedText'').
  * Find sentence boundaries.
  * Clean the data, discard bad and overly long sentences. (A sketch of this preprocessing pipeline follows the agenda.)
  * Parse NANT using the pretrained reranking parser (PR0).
  * Retrain the first-stage parser on WSJ + the parsed NANT => get P1.
  * The new reranking parser PR1 consists of the new first-stage parser P1 and the old second-stage reranker R.
  * Test the new reranking parser on WSJ.
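
The NANT preprocessing steps are only listed abstractly above. The following is a minimal sketch of how such a pipeline could look, not the exact commands used; ''strip_sgml.pl'' and ''split_sentences.pl'' are hypothetical placeholders for whatever SGML stripping and sentence splitting tools are actually applied, the file names and the length threshold are just examples, and only the tokenization line is taken verbatim from the agenda.

  cd /fs/clip-corpora/north_american_news
  # SGML -> plain text (strip_sgml.pl is a hypothetical placeholder, not a real script here)
  ./strip_sgml.pl < latwp.sgml > latwp.plain.txt
  # tokenization (this is the command quoted in the agenda)
  $HIEROROOT/preprocess/tokenizeE.pl - - < latwp.plain.txt > latwp.tok.txt
  # sentence splitting, one sentence per line (split_sentences.pl is again a hypothetical placeholder)
  ./split_sentences.pl < latwp.tok.txt > latwp.sent.txt
  # cleaning: drop empty sentences and sentences longer than 100 tokens (the threshold is a guess)
  awk 'NF > 0 && NF <= 100' latwp.sent.txt > latwp.clean.txt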

=====Baseline=====

We [[Parsing Evaluation|tested]] the pretrained reranking parser (PR0) on sections 22 (development) and 23 (final evaluation test). All evaluations take into account only sentences of 40 or fewer words.

  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.charniak
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.charniak
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown

| Section | 22 || 23 ||
| Parser | Charniak | Brown | Charniak | Brown |
| Precision | 90.54 | 92.81 | 90.43 | 92.35 |
| Recall | 90.43 | 91.92 | 90.21 | 91.61 |
| F-score | 90.48 | 92.36 | 90.32 | 91.98 |
| Tagging | 96.15 | 92.41 | 96.78 | 92.33 |
| Crossing | 0.66 | 0.49 | 0.72 | 0.59 |

=====Large corpus=====

See [[North American News Text Corpus]] for more information on the data and its preparation.

=====Parsing NANTC using PR0=====

See [[:Parsery|here]] for more information on the Brown reranking parser. We parsed the LATWP part of NANTC on the C cluster using the following command:

  cd /fs/clip-corpora/north_american_news
  $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -l en -g wsj.tgz < latwp.04.clean.txt \
    -o latwp.05a.brown.penn -w workdir05 -k

Parsing takes about 80 CPU-hours.

=====Retraining the first-stage parser=====

The following command trains the Charniak parser on 5 copies of sections 02-21 of the Penn Treebank Wall Street Journal and 1 copy of the parsed part of NANTC (3,143,433 sentences).

  $PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.05a.brown.penn > ptb+latwp3000.tgz

The new non-reranking parser will be called P1. The reranking parser P1+R will be called PR1.
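
The file ''ptbwsj02-21.5times.penn'' is used above but never constructed on this page. Presumably it is simply five concatenated copies of the WSJ 02-21 training trees; a minimal sketch under that assumption (''ptbwsj02-21.penn'' is a hypothetical name for a single copy):

  # concatenate five copies of the WSJ 02-21 trees
  # (assumption: this is how the 5-times training file was produced)
  for i in 1 2 3 4 5; do
    cat ptbwsj02-21.penn
  done > ptbwsj02-21.5times.penn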

=====Parsing Penn Treebank using Charniak P1=====

  $PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj22.txt \
    > ptbwsj22.ec.ptb+latwp3000.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp3000.penn
  $PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj23.txt \
    > ptbwsj23.ec.ptb+latwp3000.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp3000.penn

| Section | 22 | 23 |
| Precision | 87.74 | 88.26 |
| Recall | 88.65 | 88.54 |
| F-score | 88.19 | 88.40 |
| Tagging | 92.67 | 92.84 |
| Crossing | 0.80 | 0.91 |

=====Parsing Penn Treebank using Brown PR1=====

First we combine the new parser with the old reranker.

  $PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp3000.tgz wsj.tgz \
    > ptb+latwp3000.brown.tgz

Then we use the combined model to parse the test data.

  $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp3000.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp3000.penn
  $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp3000.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp3000.penn

| Section | 22 | 23 |
| Precision | 90.39 | 90.68 |
| Recall | 90.30 | 90.24 |
| F-score | 90.34 | 90.46 |
| Tagging | 93.43 | 93.65 |
| Crossing | 0.61 | 0.71 |

=====5 × PTB WSJ 02-21 + 1,750,000 sentences from LATWP=====

McClosky et al. do not use all 3 million sentences. They found that the best results over their development data (section 22) were obtained by mixing 5 copies of the Penn Treebank Wall Street Journal sections 02-21 and the (first?) 1,750,000 sentences from NANTC LATWP.

  head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn

Train Charniak on this mix. It takes more than 4 hours.

  $PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.1750k.brown.penn > ptb+latwp1750.tgz

Create the new Brown model: combine the new parser with the old reranker.

  $PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp1750.tgz wsj.tgz \
    > ptb+latwp1750.brown.tgz

Parse the test sections and evaluate the results.

  $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.ptb+latwp1750.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp1750.penn
  $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.ptb+latwp1750.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp1750.penn
  $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp1750.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp1750.penn
  $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp1750.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp1750.penn

=====5 × PTB WSJ 02-21=====

Train Charniak on 5 copies of PTB WSJ 02-21, without any trees from NANTC.

  $PARSINGROOT/charniak-parser/scripts/train.pl < ptbwsj02-21.5times.penn > 5ptb.tgz

Parse sections 22 and 23 using the just trained model.

  $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.5ptb.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.5ptb.penn
  $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.5ptb.penn
  $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.5ptb.penn
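
Each experiment above ends in one or more evalb calls, and the relevant numbers have to be copied by hand into the summary table below. A small helper like the following can collect them; this is only a sketch, assuming each evalb run was redirected into a ''*.evalb'' file (the naming is an example) and that evalb prints its usual summary labels, with the overall block first and the length-restricted (40 words or fewer) block second.

  # print the file name and the <=40-words bracketing F-measure of every saved evalb output
  for f in *.evalb; do
    printf '%s\t' "$f"
    grep 'Bracketing FMeasure' "$f" | tail -n 1 | awk '{print $NF}'
  done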

=====Summary=====

The following table combines tables from previous sections and from the McClosky et al. (2006) paper; all numbers are f-scores. I am not sure why I do not get the same baseline numbers as McClosky et al. Possibly they evaluate all sentences rather than just those of 40 or fewer words. Remember that Charniak parser means without the reranker and Brown parser means with the reranker. PTB WSJ (or WSJ) in training means sections 02-21. 50k NANTC means 50,000 sentences of NANTC LATWP.

| Parsing NANTC using || Parsing test using || Section ||||
| || || 22 || 23 ||
| parser | trained on | parser | trained on | McClosky | Zeman | McClosky | Zeman |
| | | Stanford | PTB WSJ | | | | 86.5 |
| | | Charniak | PTB WSJ | 90.3 | 90.5 | 89.7 | 90.3 |
| Brown | PTB WSJ | Charniak | WSJ + 50k NANTC | 90.7 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 250k NANTC | 90.7 | 91.0 | | 90.9 |
| Brown | PTB WSJ | Charniak | WSJ + 500k NANTC | 90.9 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 750k NANTC | 91.0 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 1000k NANTC | 90.8 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 1500k NANTC | 90.8 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 2000k NANTC | 91.0 | | | |
| Brown | PTB WSJ | Charniak | 5 × WSJ | 84.7 | | | |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 1750k NANTC | | 87.6 | 91.0 | 87.9 |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 3143k NANTC | | 88.2 | | 88.4 |
| | | Brown | PTB WSJ | | 92.4 | 91.3 | 92.0 |
| Brown | PTB WSJ | Brown | WSJ + 50k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 250k NANTC | 92.3 | 92.2 | | 92.3 |
| Brown | PTB WSJ | Brown | WSJ + 500k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 750k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 1000k NANTC | 92.2 | | | |
| Brown | PTB WSJ | Brown | WSJ + 1500k NANTC | 92.1 | | | |
| Brown | PTB WSJ | Brown | WSJ + 2000k NANTC | 92.0 | | | |
| Brown | PTB WSJ | Brown | 5 × WSJ + 1750k NANTC | | 89.9 | 92.1 | 90.0 |
| Brown | PTB WSJ | Brown | 5 × WSJ + 3143k NANTC | | 90.3 | | 90.5 |
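
One cheap way to test the hypothesis above about the differing baselines: stock evalb prints two summary blocks, one over all sentences and one restricted to sentences of at most 40 words, so comparing the two F-measures of the same run shows how much the length cut-off alone changes the score. A hedged sketch, assuming that block order and that the baseline evalb output was saved to ''ptbwsj23.brown.evalb'' (a hypothetical file name):

  # first FMeasure line = all sentences, second = the length-restricted block
  grep 'Bracketing FMeasure' ptbwsj23.brown.evalb \
    | awk 'NR==1 {print "all sentences: " $NF} NR==2 {print "<=40 words:    " $NF}'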