This page describes an experiment conducted by Dan Zeman in November and December 2006.

I am trying to reproduce the parser self-training experiment of David McClosky, Eugene Charniak, and Mark Johnson (NAACL 2006, New York). The idea is to train a parser on a small treebank, run it over a large unannotated corpus, retrain it on its own output for that corpus, and end up with a better-performing parser. The folks at Brown University used Charniak's reranking parser, i.e. a parser-reranker sequence. The large corpus was parsed by the whole reranking parser, but only the first-stage parser was retrained on the result. The reranker only ever saw the small treebank.

Once the original self-training experiment works as expected, we are going to use a similar scheme for parser adaptation to a new language.

Paths

Note: I am going to move some files around, especially those in my home folder.

What do we need?

Terminology

We have the parser (P0) and the reranker (R); together they form the reranking parser (PR0). Both were (probably) trained on Penn Treebank Wall Street Journal sections 02-21; the training was already done at Brown.

Agenda

Baseline

We tested the pretrained reranking parser (PR0) on sections 22 (development) and 23 (final evaluation). The Charniak columns give the scores of the parser without the reranker. All evaluations take into account only sentences of 40 or fewer words.

$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown

                    Section 22            Section 23
                    Charniak   Brown      Charniak   Brown
Precision           90.54      92.81      90.43      92.35
Recall              90.43      91.92      90.21      91.61
F-score             90.48      92.36      90.32      91.98
Tagging accuracy    96.15      92.41      96.78      92.33
Average crossing    0.66       0.49       0.72       0.59
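
The F-score that evalb reports is the harmonic mean of labeled precision and recall, so the F-score row can be re-derived from the precision and recall rows as a sanity check. A minimal sketch, assuming awk is available; the helper name fscore is hypothetical and not part of evalb:

```shell
# fscore P R: print the harmonic mean of precision P and recall R (both in percent).
# Hypothetical helper for checking the table; evalb computes this value itself.
fscore() {
    LC_ALL=C awk -v p="$1" -v r="$2" 'BEGIN { printf "%.2f\n", 2 * p * r / (p + r) }'
}

fscore 90.54 90.43   # section 22, Charniak
fscore 92.81 91.92   # section 22, Brown
```

The two calls print 90.48 and 92.36, which agrees with the section 22 F-scores above.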

Large corpus

See North American News Text Corpus for more information on the data and its preparation.

Parsing NANTC using PR<sub>0</sub>

See the Brown Reranking Parser page for more information on the parser. We parsed the LATWP part of NANTC on the C cluster using the following command:

cd /fs/clip-corpora/north_american_news
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -l en -g wsj.tgz < latwp.04.clean.txt \
    -o latwp.05a.brown.penn -w workdir05 -k

Parsing takes about 80 CPU-hours.

Retraining the first-stage parser

The following command trains the Charniak parser on 5 copies of sections 02-21 of the Penn Treebank Wall Street Journal and 1 copy of the parsed part of NANTC (3,143,433 sentences).

$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.05a.brown.penn > ptb+latwp3000.tgz
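
The file ptbwsj02-21.5times.penn holds the training sections repeated five times, presumably so that the hand-annotated trees carry more weight relative to the much larger automatically parsed corpus. A minimal sketch of how such a file could be built; the helper repeat_file and the single-copy file name ptbwsj02-21.penn are assumptions, not taken from the original setup:

```shell
# repeat_file N FILE: write FILE to standard output N times.
# Hypothetical helper; not part of the parser distribution.
repeat_file() {
    rf_n=$1
    rf_file=$2
    rf_i=0
    while [ "$rf_i" -lt "$rf_n" ]; do
        cat "$rf_file" || return 1
        rf_i=$((rf_i + 1))
    done
}

# Assumed usage (ptbwsj02-21.penn is a hypothetical name for one copy of sections 02-21):
# repeat_file 5 ptbwsj02-21.penn > ptbwsj02-21.5times.penn
```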

The new non-reranking parser will be called P1. The reranking parser P1+R will be called PR1.

Parsing Penn Treebank using Charniak P<sub>1</sub>

$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj22.txt \
    > ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp3000.penn

$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj23.txt \
    > ptbwsj23.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp3000.penn

                    Section 22   Section 23
Precision           87.74        88.26
Recall              88.65        88.54
F-score             88.19        88.40
Tagging accuracy    92.67        92.84
Average crossing    0.80         0.91

Parsing Penn Treebank using Brown PR<sub>1</sub>

First we combine the new parser with the old reranker.

$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp3000.tgz wsj.tgz \
    > ptb+latwp3000.brown.tgz

Then we use the combined model to parse the test data.

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp3000.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp3000.penn

                    Section 22   Section 23
Precision           90.39        90.68
Recall              90.30        90.24
F-score             90.34        90.46
Tagging accuracy    93.43        93.65
Average crossing    0.61         0.71

5 × PTB WSJ 02-21 + 1,750,000 sentences from LATWP

McClosky et al. do not use all 3 million sentences. They found that the best results on their development data (section 22) were obtained by mixing 5 copies of Penn Treebank Wall Street Journal sections 02-21 with 1,750,000 sentences from NANTC LATWP. It is not clear to me whether these were the first 1,750,000 sentences; we take the first ones:

head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn
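
Taking the first 1,750,000 lines selects 1,750,000 sentences only if the parsed file holds exactly one tree per line. A minimal sketch of a check for that, assuming that brackets inside tokens are escaped (e.g. as -LRB- and -RRB-) so that raw parentheses occur only as tree structure; the helper name is hypothetical:

```shell
# check_one_tree_per_line FILE: succeed only if every line of FILE is non-empty
# and has as many opening as closing brackets. This is a necessary condition for
# one complete tree per line, not a full well-formedness check.
check_one_tree_per_line() {
    awk '{
        line = $0
        nopen = gsub(/\(/, "", line)    # count and strip opening brackets
        nclose = gsub(/\)/, "", line)   # count and strip closing brackets
        if (nopen == 0 || nopen != nclose) {
            print "line " NR ": not a complete tree" > "/dev/stderr"
            bad = 1
        }
    } END { exit (bad ? 1 : 0) }' "$1"
}

# Assumed usage before subsetting:
# check_one_tree_per_line latwp.05a.brown.penn && head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn
```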

Train Charniak on this mix. It takes more than 4 hours.

$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.1750k.brown.penn > ptb+latwp1750.tgz

Create new Brown model: combine the new parser with the old reranker.

$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp1750.tgz wsj.tgz \
    > ptb+latwp1750.brown.tgz

Parse the test sections and evaluate the results.

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp1750.penn

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp1750.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp1750.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp1750.penn

5 × PTB WSJ 02-21

Train Charniak on 5 copies of PTB WSJ 02-21, without any trees from NANTC.

$PARSINGROOT/charniak-parser/scripts/train.pl < ptbwsj02-21.5times.penn > 5ptb.tgz

Parse sections 22 and 23 using the newly trained model.

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.5ptb.penn

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.5ptb.penn

Summary

The following table combines tables from previous sections and from the McClosky et al. (2006) paper. I am not sure why I do not get the same baseline numbers as McClosky et al. Possibly they evaluate all sentences rather than just those of 40 or fewer words.

Remember that "Charniak" means the parser without the reranker and "Brown" means the parser with the reranker. "PTB WSJ" (or "WSJ") as training data means sections 02-21. "50k NANTC" means 50,000 sentences of NANTC LATWP.

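Table columns, left to right: the parser used to parse NANTC and its training data; the parser used to parse the test sections and its training data; F-scores on section 22 (McClosky, Zeman) and on section 23 (McClosky, Zeman). Each row lists only the cells that are filled in.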
Stanford PTB WSJ 86.5
Charniak PTB WSJ 90.3 90.5 89.7 90.3
Brown PTB WSJ Charniak WSJ + 50k NANTC 90.7
Brown PTB WSJ Charniak WSJ + 250k NANTC 90.7 91.0 90.9
Brown PTB WSJ Charniak WSJ + 500k NANTC 90.9
Brown PTB WSJ Charniak WSJ + 750k NANTC 91.0
Brown PTB WSJ Charniak WSJ + 1000k NANTC 90.8
Brown PTB WSJ Charniak WSJ + 1500k NANTC 90.8
Brown PTB WSJ Charniak WSJ + 2000k NANTC 91.0
Brown PTB WSJ Charniak 5 × WSJ 84.7
Brown PTB WSJ Charniak 5 × WSJ + 1750k NANTC 87.6 91.0 87.9
Brown PTB WSJ Charniak 5 × WSJ + 3143k NANTC 88.2 88.4
Brown PTB WSJ 92.4 91.3 92.0
Brown PTB WSJ Brown WSJ + 50k NANTC 92.4
Brown PTB WSJ Brown WSJ + 250k NANTC 92.3 92.2 92.3
Brown PTB WSJ Brown WSJ + 500k NANTC 92.4
Brown PTB WSJ Brown WSJ + 750k NANTC 92.4
Brown PTB WSJ Brown WSJ + 1000k NANTC 92.2
Brown PTB WSJ Brown WSJ + 1500k NANTC 92.1
Brown PTB WSJ Brown WSJ + 2000k NANTC 92.0
Brown PTB WSJ Brown 5 × WSJ + 1750k NANTC 89.9 92.1 90.0
Brown PTB WSJ Brown 5 × WSJ + 3143k NANTC 90.3 90.5