This page describes an experiment conducted by Dan Zeman in November and December 2006.
I am trying to replicate the self-training experiment of David McClosky, Eugene Charniak, and Mark Johnson (NAACL 2006, New York). The idea: train a parser on small data, run it over big data, retrain it on its own output for the big data, and get a better-performing parser. The folks at Brown University used Charniak's reranking parser, i.e. a parser-reranker pipeline. The big data was parsed by the whole reranking parser, but only the first-stage parser was retrained on it; the reranker only ever saw the small data.
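In script form, one round of this self-training loop looks roughly as follows, using the train.pl and cluster-parse.pl scripts that appear later on this page (small.penn, big.txt and the model file names are placeholders, not files from the experiment):

# Step 0: train the first-stage parser on the small (gold) treebank.
$PARSINGROOT/charniak-parser/scripts/train.pl small.penn > model0.tgz
# Step 1: parse the big raw corpus with the full reranking parser.
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g wsj.tgz < big.txt -o big.auto.penn
# Step 2: retrain the first-stage parser on gold plus automatically parsed trees;
# the reranker itself is not retrained.
$PARSINGROOT/charniak-parser/scripts/train.pl small.penn big.auto.penn > model1.tgz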
Once the original self-training experiment works as expected, we are going to use a similar scheme for parser adaptation to a new language.
Note: I am going to move some files around, especially those in my home folder.
$PARSINGROOT
- working copy of the parsers and related scripts. See Parsing on how to create your own.
/fs/clip-corpora/north_american_news
- North American News Text Corpus, including everything I made of it.

We have the parser and the reranker (both together = the reranking parser), both trained (probably) on Penn Treebank Wall Street Journal sections 02-21 (the training was already done at Brown).
Plain text is tokenized with the Hiero tokenizer:
$HIEROROOT/preprocess/tokenizeE.pl - - < plainText > tokenizedText
We tested the pretrained reranking parser (PR0) on sections 22 (development) and 23 (final evaluation test). All evaluations take into account only sentences of 40 or fewer words.
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown
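evalb prints a long per-sentence listing before its summary; assuming the standard evalb summary labels, a filter like this keeps just the headline numbers:

$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown \
    | grep -E 'Bracketing (Recall|Precision|FMeasure)|Tagging accuracy|Average crossing'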
| Metric | Section 22, Charniak | Section 22, Brown | Section 23, Charniak | Section 23, Brown |
| Precision | 90.54 | 92.81 | 90.43 | 92.35 |
| Recall | 90.43 | 91.92 | 90.21 | 91.61 |
| F-score | 90.48 | 92.36 | 90.32 | 91.98 |
| Tagging accuracy | 96.15 | 92.41 | 96.78 | 92.33 |
| Avg. crossing brackets | 0.66 | 0.49 | 0.72 | 0.59 |
See North American News Text Corpus for more information on the data and its preparation.
See the Brown Reranking Parser page for more information on that parser. We parsed the LATWP part of NANTC on the C cluster using the following commands:
cd /fs/clip-corpora/north_american_news
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -l en -g wsj.tgz < latwp.04.clean.txt \
    -o latwp.05a.brown.penn -w workdir05 -k
Parsing takes about 80 CPU-hours.
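The parsed file should contain exactly one tree per input sentence (the inputs here are one sentence per line), so a quick line count verifies that the cluster did not silently drop anything:

# Both counts should agree
wc -l latwp.04.clean.txt latwp.05a.brown.penn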
The following command trains the Charniak parser on 5 copies of sections 02-21 of the Penn Treebank Wall Street Journal, plus 1 copy of the parsed part of NANTC (3,143,433 sentences).
$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.05a.brown.penn > ptb+latwp3000.tgz
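The 5-copies file is nothing special; if it ever has to be recreated, something like the following works, assuming a hypothetical single-copy file ptbwsj02-21.penn with one tree per line:

# Concatenate five copies of the gold trees
for i in 1 2 3 4 5; do cat ptbwsj02-21.penn; done > ptbwsj02-21.5times.penn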
The new non-reranking parser will be called P1. The reranking parser P1+R will be called PR1.
$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj22.txt \
    > ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj23.txt \
    > ptbwsj23.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp3000.penn
| Metric | Section 22 | Section 23 |
| Precision | 87.74 | 88.26 |
| Recall | 88.65 | 88.54 |
| F-score | 88.19 | 88.40 |
| Tagging accuracy | 92.67 | 92.84 |
| Avg. crossing brackets | 0.80 | 0.91 |
First we combine the new parser with the old reranker.
$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp3000.tgz wsj.tgz \
    > ptb+latwp3000.brown.tgz
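The model files are ordinary gzipped tarballs, so listing the combined archive is a cheap sanity check that both components made it in (the member names depend on the model layout and are not documented here):

tar tzf ptb+latwp3000.brown.tgz | head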
Then we use the combined model to parse the test data.
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp3000.penn
| Metric | Section 22 | Section 23 |
| Precision | 90.39 | 90.68 |
| Recall | 90.30 | 90.24 |
| F-score | 90.34 | 90.46 |
| Tagging accuracy | 93.43 | 93.65 |
| Avg. crossing brackets | 0.61 | 0.71 |
McClosky et al. do not use all 3 million sentences. They found that the best results on their development data (section 22) were obtained by mixing 5 copies of Penn Treebank Wall Street Journal sections 02-21 with (the first?) 1,750,000 sentences from NANTC LATWP.
head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn
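A line count confirms the cut:

# Should print 1750000
wc -l latwp.1750k.brown.penn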
Train Charniak on this mix. It takes more than 4 hours.
$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.1750k.brown.penn > ptb+latwp1750.tgz
Create new Brown model: combine the new parser with the old reranker.
$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp1750.tgz wsj.tgz \
    > ptb+latwp1750.brown.tgz
Parse the test sections and evaluate the results.
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp1750.penn
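The four parse-and-evaluate pairs above follow one pattern, so the same work can be written as a loop (equivalent commands, just less repetition):

for sec in 22 23; do
    # parser alone (ec) and reranking parser (br)
    $PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
        < $PTB/ptbwsj$sec.txt -o ptbwsj$sec.ec.ptb+latwp1750.penn
    $PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
        < $PTB/ptbwsj$sec.txt -o ptbwsj$sec.br.ptb+latwp1750.penn
    for kind in ec br; do
        $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm \
            $PTB/ptbwsj$sec.penn ptbwsj$sec.$kind.ptb+latwp1750.penn
    done
done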
Train Charniak on 5 copies of PTB WSJ 02-21, without any trees from NANTC.
$PARSINGROOT/charniak-parser/scripts/train.pl < ptbwsj02-21.5times.penn > 5ptb.tgz
Parse sections 22 and 23 using the newly trained model.
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.5ptb.penn
$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.5ptb.penn
The following table combines tables from previous sections and from the McClosky et al. (2006) paper. I am not sure why I do not get the same baseline numbers as McClosky et al. Possibly they evaluate all sentences rather than just those of 40 or fewer words.
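One way to check this hypothesis without retraining anything: evalb's summary normally contains both an all-sentences block and a block for sentences up to CUTOFF_LEN (presumably 40 in charniak.prm), so the two blocks of the baseline report can be compared directly:

$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown \
    | sed -n '/=== Summary ===/,$p'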
Remember that "Charniak parser" means without the reranker, while "Brown parser" means with the reranker. PTB WSJ (or WSJ) in training means sections 02-21; 50k NANTC means 50,000 sentences of NANTC LATWP. All figures are f-scores.
| NANTC parser | trained on | Test parser | trained on | 22: McClosky | 22: Zeman | 23: McClosky | 23: Zeman |
| | | Stanford | PTB WSJ | 86.5 | | | |
| | | Charniak | PTB WSJ | 90.3 | 90.5 | 89.7 | 90.3 |
| Brown | PTB WSJ | Charniak | WSJ + 50k NANTC | 90.7 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 250k NANTC | 90.7 | 91.0 | 90.9 | |
| Brown | PTB WSJ | Charniak | WSJ + 500k NANTC | 90.9 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 750k NANTC | 91.0 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 1000k NANTC | 90.8 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 1500k NANTC | 90.8 | | | |
| Brown | PTB WSJ | Charniak | WSJ + 2000k NANTC | 91.0 | | | |
| Brown | PTB WSJ | Charniak | 5 × WSJ | | 84.7 | | |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 1750k NANTC | | 87.6 | 91.0 | 87.9 |
| Brown | PTB WSJ | Charniak | 5 × WSJ + 3143k NANTC | | 88.2 | | 88.4 |
| | | Brown | PTB WSJ | | 92.4 | 91.3 | 92.0 |
| Brown | PTB WSJ | Brown | WSJ + 50k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 250k NANTC | 92.3 | 92.2 | 92.3 | |
| Brown | PTB WSJ | Brown | WSJ + 500k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 750k NANTC | 92.4 | | | |
| Brown | PTB WSJ | Brown | WSJ + 1000k NANTC | 92.2 | | | |
| Brown | PTB WSJ | Brown | WSJ + 1500k NANTC | 92.1 | | | |
| Brown | PTB WSJ | Brown | WSJ + 2000k NANTC | 92.0 | | | |
| Brown | PTB WSJ | Brown | 5 × WSJ + 1750k NANTC | | 89.9 | 92.1 | 90.0 |
| Brown | PTB WSJ | Brown | 5 × WSJ + 3143k NANTC | | 90.3 | | 90.5 |