Institute of Formal and Applied Linguistics Wiki



To evaluate parsing results (constituents, not dependencies!) by comparing them against a treebank, the standard program is evalb. This program by Satoshi Sekine and Michael Collins is now part of our parsing repository and can be found in $PARSINGROOT/evalb (besides that, other copies are mentioned on the Resources page). It can be configured to use exactly the same algorithm as (Collins, 1997).

Note that evalb is a C program, so:
 1. if you forgot to run make after checking out from the parsing repository, you do not have the binary yet;
 2. if you ran make on a different architecture than the one you are working on now, you probably need to remake it (see the build sketch below).

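A minimal build check, assuming the Makefile that ships with evalb in $PARSINGROOT/evalb:

cd $PARSINGROOT/evalb
make        # compiles evalb.c into the evalb binary
./evalb     # run without arguments; it should just print a usage message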

Parsing papers often refer to “PARSEVAL” metrics, described in (Black et al., 1991). However, many preprocessing steps suggested by that paper are nowadays simplified or not done at all. The original purpose of the metric was to compare parsers based on various grammars of English, whereas we want to compare proposed and gold-standard parses that are based on the same grammar (treebank). So the standard way of evaluating parsers differs from the original PARSEVAL paper, although people still tend to call the evaluation PARSEVAL.

The main differences between (Black et al., 1991) and Collins-configured evalb are the following: the comparison is labeled (a bracket only counts as correct if its nonterminal label matches as well), punctuation is excluded based on its tag, and most of the other normalization steps proposed in the original paper are not performed. The exact settings are given in the COLLINS.prm parameter file used below.

Creating the gold standard parses

Besides the parser output, we need the gold-standard trees to compare it to. The documentation of evalb complicates things by describing only one way of obtaining the gold standard in the correct format: using the tgrep utility. Most likely the only thing that really matters is that every tree occupies exactly one line (and possibly also that the top-level bracket is labeled by a nonterminal), but let's take the longer way for now.

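For illustration, a line of such a file might look like this (a made-up sentence; what matters is that the whole tree sits on one line and the outermost bracket has a label):

(TOP (S (NP (DT The) (NN cat)) (VP (VBZ sleeps)) (. .)))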

tgrep is a treebank search utility supplied with the Penn Treebank. I don't know whether we have it, but we do have tgrep2, its younger brother. An older version with sources, some documentation and a Sun Solaris executable is in /fs/circle/resnik/ftp/pkg/tgrep2/Tgrep2. A newer version without sources and documentation, but compiled for the Linux x86 architecture, is in /fs/LSE/lse_project/sw/tgrep2.

The copy of the Penn Treebank / Wall Street Journal that we have in /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj leaves the top-level bracket of each tree unlabeled. This is indigestible for tgrep2, so the first thing we have to do is to label it:

perl -pe 's/^\( /(TOP /' /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj/23/*.mrg > ptb-wsj-23.top.mrg

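As an optional sanity check, every tree should now start with a labeled top-level bracket at the beginning of a line; for section 23 the count should come out as 2416 (see below):

grep -c '^(TOP ' ptb-wsj-23.top.mrg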

Then the corpus must be preprocessed before tgrep2 can search it. This is done by

tgrep2 -p ptb-wsj-23.top.mrg ptb-wsj-23.t2c

The authors of evalb advise calling tgrep -wn '/.*/' | tgrep_proc.prl to get the required format. The -n option is not recognized by tgrep2. tgrep_proc.prl is a very simple Perl script accompanying evalb; its sole purpose is to filter out blank lines, so it can easily be replaced by the inline Perl statement below:

tgrep2 -w '/.*/' -c ptb-wsj-23.t2c | perl -pe '$_ = "" if(m/^\s*$/)' > ptb-wsj-23.forevalb

Finally, after all this effort, we have a file that could probably have been created much more simply (one possible shortcut is sketched below). If we work with the standard Penn Treebank test data, section 23, the file has 2416 trees (and lines).

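For the record, here is a shortcut that skips tgrep2 entirely: a small Perl program (a sketch only, not tested on the whole treebank) that labels the top-level bracket and glues each multi-line tree onto a single line. It relies on the fact that in the .mrg files every tree starts with an opening bracket in column 0. Run it from a Bourne-compatible shell, or put the Perl code into a script file; csh/tcsh does not cope well with multi-line quoted strings.

perl -ne '
  chomp;
  if (/^\(/) {                  # a new tree starts at column 0
    print "$t\n" if defined $t; # flush the previous tree as a single line
    ($t = $_) =~ s/^\( /(TOP /; # label the top-level bracket
  } elsif (/\S/) {
    s/^\s+/ /;                  # squeeze the indentation to one space
    $t .= $_;
  }
  END { print "$t\n" if defined $t }
' /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj/23/*.mrg > ptb-wsj-23.forevalb

Whichever route you take, wc -l ptb-wsj-23.forevalb should report 2416 lines for section 23.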

Preparing the output of the Stanford Parser for evaluation

The Stanford Parser outputs an XML file that embeds the Penn-style tree in the <tree> element. Use $PARSINGROOT/stanford-parser/scripts/stanford2evalb.pl to extract trees in a format suitable for evalb:

$PARSINGROOT/stanford-parser/scripts/stanford2evalb.pl < file.stanford.xml > file.stanford.evalb

Running evalb on the two files

setenv EVALB $PARSINGROOT/evalb
$EVALB/evalb -p $EVALB/COLLINS.prm file.gold.evalb file.stanford.evalb > evaluation.txt

Note: If you are evaluating the output of the Charniak parser, use $EVALB/charniak.prm instead of $EVALB/COLLINS.prm. It differs in that it does not evaluate the extra top-level S1 nonterminal that the Charniak parser wraps around every tree.

Results

To make sure your working copy of the parsing tools performs at least at the level we have been able to obtain before, compare your numbers to the following results. The parsers were tested on section 23 of the English Penn Treebank. We assume that you have the gold-standard data (see above for how to get it) and that you invoke evalb with the COLLINS.prm parameters, as described above. All results mentioned here are for sentences of 40 words or less. This is in harmony with the majority of the parsing literature, although evalb computes the overall performance as well.

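If you only want to see that part of evaluation.txt, the summary for sentences of at most 40 words should appear near the end of evalb's output, after the overall summary, so something like the following is usually enough (the exact layout may differ between evalb versions):

tail -n 20 evaluation.txt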

The Stanford parser from revision 7 (equivalent to the official release 1.5.1 of 2006-06-11) with default English settings (englishPCFG.ser.gz) achieves P=87.23 %, R=85.83 %, average crossing=1.07.

The Stanford parser from revision 8 (unofficial release that Chris Manning sent us on 2006-11-01) with default English settings (specifically, englishPCFG.ser.gz grammar) achieves P=87.20 %, R=85.81 %, average crossing=1.08.

The Charniak parser with default English settings (/DATA/EN) achieves F=89.89 % (P=89.89 %, R=89.88 %, average crossing=0.77).

The Brown reranking parser with default English settings achieves F=91.98 % (P=92.36 %, R=91.61 %, average crossing=0.6).

Be careful to have the same tokenization in your gold standard parses and the parser input/output. Otherwise, evalb complains about “sentence length unmatch” or “word unmatch” and excludes the whole sentence from evaluation.
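
A rough way to locate such mismatches is to compare per-sentence token counts of the two files. The one-liners below simply count preterminal-word pairs of the form (TAG word) on each line, after discarding empty elements (-NONE- ...), which appear in the gold trees but not in parser output; treat any differences as hints rather than proof.

perl -ne 's/\(-NONE- [^()]*\)//g; print scalar(() = m/\([^()\s]+ [^()\s]+\)/g), "\n"' file.gold.evalb > gold.len
perl -ne 's/\(-NONE- [^()]*\)//g; print scalar(() = m/\([^()\s]+ [^()\s]+\)/g), "\n"' file.stanford.evalb > parsed.len
diff gold.len parsed.len    # differing lines point to suspicious sentences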

Even if your tokenization matches perfectly, you may occasionally face the length unmatch error. I observed the following instance: there was a single-quote (') token. It was tagged as '' (punctuation) in the gold standard, and as POS (possessive ending) in the parser output. Punctuation is excluded from the standard evaluation, and unfortunately evalb recognizes punctuation by the tag rather than by the token. So the gold-standard sentence had 19 words, while the parser output had 20. I believe this is a bug in evalb: the parser should be penalized in tagging accuracy, but the sentence evaluation should not fail because of a tagging error.

Testing statistical significance

Statistical significance of the differences between two parsers can be tested using Dan Bikel's Comparator.

Dan Zeman - 17 Oct 2006

