user:zeman:self-training [ufal wiki]

Paths
What do we need?
Terminology
Agenda
Baseline
Large corpus
Parsing NANTC using P₀
Retraining the first-stage parser
Parsing Penn Treebank using Charniak P₁
Parsing Penn Treebank using Brown PR₁
5 × PTB WSJ 02-21 + 1,750,000 sentences from LATWP
5 × PTB WSJ 02-21
Summary

This page describes an experiment conducted by Dan Zeman in November and December 2006.

I am trying to replicate the parser self-training experiment of David McClosky, Eugene Charniak, and Mark Johnson ([http://www.cog.brown.edu/~mj/papers/naacl06-self-train.pdf NAACL 2006, New York]). The idea is to train a parser on a small treebank, run it over a large raw corpus, retrain it on its own output for that corpus, and end up with a better-performing parser. The Brown University team used Charniak's reranking parser, i.e. a parser followed by a reranker. The large corpus was parsed by the whole reranking parser, but only the first-stage parser was retrained on the output. The reranker only ever saw the small data.

Once the original self-training experiment works as expected, we are going to use a similar scheme for parser adaptation to a new language.

Paths

Note: I am going to move some things around, especially those in my home folder.

$PARSINGROOT - working copy of the parsers and related scripts. See Parsing on how to create your own.
/fs/clip-corpora/ptb/processed - Penn Treebank (referred to as $PTB)
/fs/clip-corpora/north_american_news - North American News Text Corpus, including everything I made of it

What do we need?

Small data - treebank.
Big data - raw. (But for the language adaptation task we must be able to tag it using the same tagset as the small data.)
A parser able to do N-best parsing. Charniak's or Stanford. We must be able to retrain the parser.
A reranker (Johnson's). We don't need to retrain it as long as our small data is the Penn Treebank.

Terminology

We have the parser and reranker (both together = reranking parser), both trained (probably) on Penn Treebank Wall Street Journal sections 2-21 (training done already at Brown).

Charniak parser, parser, first-stage parser or P (possibly with index) denote Charniak parser without reranker. It can return N best parses for various N, including N=1.
Reranker, second stage or R (typically without index) denote the reranker. It could return reordered N-best list, but usually we only want one best parse.
Brown parser, reranking parser or PR (possibly with index) denote the sequence of (P, R). It returns one best parse.
The models pretrained at Brown and delivered with the reranking parser are indexed with 0: P₀, R₀ (or R) and PR₀.

Agenda

Install the parser-reranker suite.
- Build it for the right architecture, using the right optimizations.
- Investigate parallelization possibilities.
- Write invocation scripts.
Get the baseline. Test PR₀ on PTB WSJ 22. We will use section 22 for development and section 23 for “final testing”. The distinction is not very important, since we are only repeating someone else's experiment and are not going to publish it, but we will follow the usual procedure in case we need to look at the data more closely.
Get the big raw corpus (North American News Text, NANT).
Separate the Los Angeles Times part (from now on, this will be what we mean by NANT).
Convert SGML to plain text.
Tokenize it ($HIEROROOT/preprocess/tokenizeE.pl - - < plainText > tokenizedText).
Find sentence boundaries.
Clean the data, discard bad and long sentences.
Parse NANT using the pretrained reranking parser (PR₀).
Retrain the first-stage parser on WSJ + the parsed NANT ⇒ get P₁.
New reranking parser PR₁ consists of the new first-stage parser P₁ and the old second-stage reranker R. Test the new reranking parser on WSJ.
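
The "discard bad and long sentences" step of the agenda can be approximated by a simple length filter. This is only a sketch: the function name `filter_by_length` and the default limit of 100 tokens are assumptions made for illustration, and the criteria actually used for NANT are described on the corpus page, not here. It expects tokenized text with one sentence per line.

```shell
#!/bin/sh
# Keep only non-empty lines with at most MAX whitespace-separated tokens.
# MAX defaults to 100, an illustrative value, not the limit used in the experiment.
filter_by_length() {
    max="${1:-100}"
    awk -v max="$max" 'NF >= 1 && NF <= max'
}

# Hypothetical usage (file names are placeholders):
#   filter_by_length 100 < tokenizedText > cleanText
```

Empty lines are dropped together with over-long ones, because the parser has nothing to do with them.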

Baseline

We tested the pretrained first-stage parser (P₀) and the pretrained reranking parser (PR₀) on section 22 (development) and section 23 (final evaluation). All evaluations take into account only sentences of 40 or fewer words.

$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.charniak
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn $PTB/ptbwsj23.brown
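
When the evalb reports are saved to files, the score for sentences of 40 or fewer words can be pulled out automatically. The sketch below assumes the usual evalb summary layout, with a section headed `-- len<=40 --` that contains a line starting with `Bracketing FMeasure`. That layout is an assumption about the evalb build in use, and the helper name `evalb_f40` is made up for this page.

```shell
#!/bin/sh
# Print the bracketing F-measure found in the "len<=40" section of an
# evalb report read from standard input (assumed report layout, see above).
evalb_f40() {
    awk '
        /^-- len<=40 --/ { in40 = 1; next }   # entering the len<=40 section
        /^-- /           { in40 = 0 }         # any other section header ends it
        in40 && /^Bracketing FMeasure/ { print $NF; exit }
    '
}

# Hypothetical usage:
#   $PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm \
#       $PTB/ptbwsj22.penn $PTB/ptbwsj22.brown | evalb_f40
```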

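| Measure   | Section 22, Charniak | Section 22, Brown | Section 23, Charniak | Section 23, Brown |
|-----------|----------------------|-------------------|----------------------|-------------------|
| Precision | 90.54                | 92.81             | 90.43                | 92.35             |
| Recall    | 90.43                | 91.92             | 90.21                | 91.61             |
| F-score   | 90.48                | 92.36             | 90.32                | 91.98             |
| Tagging   | 96.15                | 92.41             | 96.78                | 92.33             |
| Crossing  | 0.66                 | 0.49              | 0.72                 | 0.59              |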
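
The F-score rows in the baseline table are the harmonic mean of precision and recall, F = 2PR / (P + R). As a quick check, the following sketch recomputes the first two result columns from their precision and recall values. The helper name `f_score` is an assumption for illustration only.

```shell
#!/bin/sh
# Harmonic mean of precision and recall (both given in percent),
# printed with two decimal places.
f_score() {
    awk -v p="$1" -v r="$2" 'BEGIN { printf "%.2f\n", 2 * p * r / (p + r) }'
}

f_score 90.54 90.43   # Charniak, first result column: prints 90.48
f_score 92.81 91.92   # Brown, second result column: prints 92.36
```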

Large corpus

See North American News Text Corpus for more information on the data and its preparation.

Parsing NANTC using P₀

See here for more information on the Brown Reranking Parser. We parsed the LATWP part of NANTC on the C cluster using the following command:

cd /fs/clip-corpora/north_american_news
$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -l en -g wsj.tgz < latwp.04.clean.txt \
    -o latwp.05a.brown.penn -w workdir05 -k

Parsing takes about 80 CPU-hours.

Retraining the first-stage parser

The following command trains the Charniak parser on 5 copies of sections 02-21 of the Penn Treebank Wall Street Journal plus 1 copy of the parsed part of NANTC (3,143,433 sentences).

$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.05a.brown.penn > ptb+latwp3000.tgz
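
The file `ptbwsj02-21.5times.penn` holds the five copies of the treebank sections. One way such a file could be produced is sketched below. The helper `repeat_file` and the single-copy input name `ptbwsj02-21.penn` are assumptions, since this page does not say how the file was actually built.

```shell
#!/bin/sh
# Write N copies of a file to standard output, one after another.
repeat_file() {
    file="$1"
    n="$2"
    i=0
    while [ "$i" -lt "$n" ]; do
        cat "$file"
        i=$((i + 1))
    done
}

# Hypothetical usage (the input file name is a placeholder):
#   repeat_file ptbwsj02-21.penn 5 > ptbwsj02-21.5times.penn
```

Repeating the treebank gives the hand-annotated trees more weight relative to the much larger set of automatically parsed sentences.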

The new non-reranking parser will be called P₁. The reranking parser P₁+R will be called PR₁.

Parsing Penn Treebank using Charniak P₁

$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj22.txt \
    > ptbwsj22.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp3000.penn

$PARSINGROOT/charniak-parser/scripts/parse.pl -g ptb+latwp3000.tgz \
    < $PTB/ptbwsj23.txt \
    > ptbwsj23.ec.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp3000.penn

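| Measure   | Section 22 | Section 23 |
|-----------|------------|------------|
| Precision | 87.74      | 88.26      |
| Recall    | 88.65      | 88.54      |
| F-score   | 88.19      | 88.40      |
| Tagging   | 92.67      | 92.84      |
| Crossing  | 0.80       | 0.91       |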

Parsing Penn Treebank using Brown PR₁

First we combine the new parser with the old reranker.

$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp3000.tgz wsj.tgz \
    > ptb+latwp3000.brown.tgz

Then we use the combined model to parse the test data.

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp3000.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp3000.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp3000.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp3000.penn

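| Measure   | Section 22 | Section 23 |
|-----------|------------|------------|
| Precision | 90.39      | 90.68      |
| Recall    | 90.30      | 90.24      |
| F-score   | 90.34      | 90.46      |
| Tagging   | 93.43      | 93.65      |
| Crossing  | 0.61       | 0.71       |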

5 × PTB WSJ 02-21 + 1,750,000 sentences from LATWP

McClosky et al. do not use all 3 million sentences. They found that the best results over their development data (section 22) were obtained by mixing 5 copies of the Penn Treebank Wall Street Journal sections 02-21 and (first?) 1,750,000 sentences from NANTC LATWP.

head -1750000 latwp.05a.brown.penn > latwp.1750k.brown.penn
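
Selecting sentences with `head` only works if every parse tree occupies exactly one line of the file. That is an assumption about the parser output format and is worth checking before trusting the count. The sketch below (the helper name `check_one_tree_per_line` is made up) reports the first line on which the numbers of opening and closing parentheses differ or on which there are no parentheses at all. It compares counts only and does not verify nesting.

```shell
#!/bin/sh
# Read bracketed trees from standard input and check that each line has
# at least one "(" and as many ")" as "(". Prints the first offending
# line number and exits with status 1, otherwise exits with status 0.
check_one_tree_per_line() {
    awk '
        {
            line = $0
            nopen = gsub(/\(/, "", line)    # count opening parentheses
            nclose = gsub(/\)/, "", line)   # count closing parentheses
            if (nopen == 0 || nopen != nclose) {
                printf "line %d: %d opening, %d closing\n", NR, nopen, nclose
                bad = 1
                exit
            }
        }
        END { exit (bad ? 1 : 0) }
    '
}

# Hypothetical usage:
#   check_one_tree_per_line < latwp.05a.brown.penn && echo "one tree per line"
```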

Train Charniak on this mix. It takes more than 4 hours.

$PARSINGROOT/charniak-parser/scripts/train.pl ptbwsj02-21.5times.penn latwp.1750k.brown.penn > ptb+latwp1750.tgz

Create new Brown model: combine the new parser with the old reranker.

$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl ptb+latwp1750.tgz wsj.tgz \
    > ptb+latwp1750.brown.tgz

Parse the test sections and evaluate the results.

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.ptb+latwp1750.penn

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g ptb+latwp1750.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.ptb+latwp1750.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.br.ptb+latwp1750.penn

$PARSINGROOT/brown-reranking-parser/scripts/cluster-parse.pl -g ptb+latwp1750.brown.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.br.ptb+latwp1750.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.br.ptb+latwp1750.penn

5 × PTB WSJ 02-21

Train Charniak on 5 copies of PTB WSJ 02-21, without any trees from NANTC.

$PARSINGROOT/charniak-parser/scripts/train.pl < ptbwsj02-21.5times.penn > 5ptb.tgz

Parse sections 22 and 23 using the newly trained model.

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj22.txt \
    -o ptbwsj22.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj22.penn ptbwsj22.ec.5ptb.penn

$PARSINGROOT/charniak-parser/scripts/cluster-parse.pl -g 5ptb.tgz \
    < $PTB/ptbwsj23.txt \
    -o ptbwsj23.ec.5ptb.penn
$PARSINGROOT/evalb/evalb -p $PARSINGROOT/evalb/charniak.prm $PTB/ptbwsj23.penn ptbwsj23.ec.5ptb.penn

Summary

The following table combines tables from previous sections and from the McClosky et al. (2006) paper. I am not sure why I do not get the same baseline numbers as McClosky et al. Possibly they evaluate all sentences rather than just those of 40 or fewer words.

Remember that Charniak parser means without reranker, Brown parser means with reranker. PTB WSJ (or WSJ) in training means sections 02-21. 50k NANTC means 50,000 sentences of NANTC LATWP.

The result columns are headed McClosky, Zeman, McClosky, Zeman. The scores in each row are given in the order in which they are listed; a hyphen marks an empty cell.

| NANTC parsed by | trained on | Test parsed by | trained on            | Scores                   |
|-----------------|------------|----------------|-----------------------|--------------------------|
| -               | -          | Stanford       | PTB WSJ               | -                        |
| -               | -          | Charniak       | PTB WSJ               | 90.3, 90.5, 89.7         |
| Brown           | PTB WSJ    | Charniak       | WSJ + 50k NANTC       | -                        |
| Brown           | PTB WSJ    | Charniak       | WSJ + 250k NANTC      | 90.7, 91.0               |
| Brown           | PTB WSJ    | Charniak       | WSJ + 500k NANTC      | -                        |
| Brown           | PTB WSJ    | Charniak       | WSJ + 750k NANTC      | -                        |
| Brown           | PTB WSJ    | Charniak       | WSJ + 1000k NANTC     | -                        |
| Brown           | PTB WSJ    | Charniak       | WSJ + 1500k NANTC     | -                        |
| Brown           | PTB WSJ    | Charniak       | WSJ + 2000k NANTC     | -                        |
| Brown           | PTB WSJ    | Charniak       | 5 × WSJ               | -                        |
| Brown           | PTB WSJ    | Charniak       | 5 × WSJ + 1750k NANTC | 87.6, 91.0               |
| Brown           | PTB WSJ    | Charniak       | 5 × WSJ + 3143k NANTC | 88.2                     |
| -               | -          | Brown          | PTB WSJ               | 92.4, 91.3               |
| Brown           | PTB WSJ    | Brown          | WSJ + 50k NANTC       | -                        |
| Brown           | PTB WSJ    | Brown          | WSJ + 250k NANTC      | 92.3, 92.2               |
| Brown           | PTB WSJ    | Brown          | WSJ + 500k NANTC      | -                        |
| Brown           | PTB WSJ    | Brown          | WSJ + 750k NANTC      | -                        |
| Brown           | PTB WSJ    | Brown          | WSJ + 1000k NANTC     | -                        |
| Brown           | PTB WSJ    | Brown          | WSJ + 1500k NANTC     | -                        |
| Brown           | PTB WSJ    | Brown          | WSJ + 2000k NANTC     | -                        |
| Brown           | PTB WSJ    | Brown          | 5 × WSJ + 1750k NANTC | 89.9, 92.1 (highlighted) |
| Brown           | PTB WSJ    | Brown          | 5 × WSJ + 3143k NANTC | 90.3                     |
