To evaluate parsing results (constituents, not dependencies!) by comparing them against a treebank, the standard program is evalb. This program by Satoshi Sekine and Michael Collins is now part of our parsing repository and can be found in $PARSINGROOT/evalb (other copies are mentioned on the Resources page). It can be configured to use exactly the same algorithm as (Collins, 1997).
Note that evalb is a C program, so:
- if you forgot to run make after checking out from the parsing repository, you don't have it;
- if you ran make on a different architecture than the one you are using now, you probably need to remake it.
Parsing papers often refer to “PARSEVAL” metrics, described in (Black et al., 1991). However, that paper suggests many preprocessing steps that are nowadays simplified or not done at all. The original purpose of PARSEVAL was to compare parsers based on various grammars of English, whereas we want to compare proposed and gold-standard parses based on the same grammar (treebank). So the standard way of evaluating parsers differs from the original PARSEVAL paper, although people still tend to call the evaluation PARSEVAL.
The main differences between (Black et al., 1991) and Collins-configured evalb are the following:
- Auxiliaries, “not”, pre-infinitival “to”, and possessive endings are not removed by evalb. Even if we wanted to remove them, some of them would be very difficult to recognize.
- Word-external punctuation should be removed according to (Black et al., 1991). Collins removed only the most frequent punctuation tags: comma (,), colon (:), opening and closing quotes (`` and ''), and sentence-final period (.). Many others, such as the exclamation mark, question mark or parentheses, are not removed. On the other hand, traces / null categories and similar elements are removed.
- Nonterminals are not removed, so labeled precision and recall are reported in parser evaluation. (A proposed constituent is considered identical to a treebank constituent if it covers the same span of the sentence and has the same label, i.e. nonterminal; a toy illustration of this computation follows the list.) However, function tags are removed (the parts of nonterminals starting with “-” or “=”, as in NP-SBJ).
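To make the labeled-constituent matching concrete, here is a toy Perl sketch (not the evalb implementation, and ignoring details such as duplicate constituents); the constituents in it are made up for illustration:

#!/usr/bin/perl
# Toy illustration of labeled precision and recall: a constituent matches
# iff both its span and its label are identical. Not the evalb algorithm.
use strict;
use warnings;

# Hypothetical constituents of one sentence as [label, start, end] triples.
my @gold = (['S', 0, 5], ['NP', 0, 2], ['VP', 2, 5], ['NP', 3, 5]);
my @test = (['S', 0, 5], ['NP', 0, 2], ['VP', 2, 5], ['PP', 3, 5]);

my %gold_set = map { (join(':', @$_), 1) } @gold;
my $matched = grep { $gold_set{join(':', @$_)} } @test;

printf("Labeled precision = %.2f %%\n", 100 * $matched / scalar(@test));
printf("Labeled recall    = %.2f %%\n", 100 * $matched / scalar(@gold));

For this example both values come out at 75 %, since three of the four proposed constituents match.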
Creating the gold standard parses
Besides the parser output, we need the gold-standard trees to compare it to. The documentation of evalb makes things complicated by describing only one way of obtaining the gold standard in the correct format: using the tgrep utility. Most likely the only thing that matters is that every tree occupies exactly one line (and possibly also that the top-level bracket of each tree is labeled with a nonterminal), but let's take the longer way for now.
tgrep is a treebank search utility supplied with the Penn Treebank. I don't know whether we have it, but we do have tgrep2, its younger brother. An older version with sources, some documentation and a Sun Solaris executable is in /fs/circle/resnik/ftp/pkg/tgrep2/Tgrep2. A newer version without sources and documentation, but compiled for the Linux x86 architecture, is in /fs/LSE/lse_project/sw/tgrep2.
The copy of the Penn Treebank / Wall Street Journal that we have in /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj has an unlabeled top-level bracket around each tree. This is indigestible for tgrep2, so the first thing we have to do is label the top brackets (turning e.g. "( (S ..." into "(TOP (S ..."):
perl -pe 's/^\( /(TOP /' /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj/23/*.mrg > ptb-wsj-23.top.mrg
(Note that the files must be passed as arguments to perl; redirecting standard input from a wildcard matching several files would not work.)
Then the corpus must be preprocessed before tgrep2 can search it. This is done by
tgrep2 -p ptb-wsj-23.top.mrg ptb-wsj-23.t2c
The authors of evalb advise calling tgrep -wn '/.*/' | tgrep_proc.prl to get the required format. The -n option is not recognized by tgrep2. tgrep_proc.prl is a very simple Perl script accompanying evalb; its sole purpose is to filter out blank lines, so it can easily be replaced by the inline Perl statement below:
tgrep2 -w '/.*/' -c ptb-wsj-23.t2c | perl -pe '$_ = "" if(m/^\s*$/)' > ptb-wsj-23.forevalb
Finally, after all this effort we have a file that could probably have been created much more simply. If we work with the standard Penn Treebank test data, Section 23, the file has 2416 trees (and lines).
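Indeed, if you prefer to skip tgrep2 altogether, a short script along the following lines should produce the same one-tree-per-line format directly from the .mrg files. This is only a sketch based on the assumptions stated above (each tree starts with an unlabeled bracket in column 0, continuation lines are indented), not a tested replacement for the procedure:

#!/usr/bin/perl
# Sketch: flatten Penn Treebank .mrg files to one labeled tree per line.
# Assumes every tree starts with '(' in column 0 and continuation lines are indented.
use strict;
use warnings;

my $tree = '';
while (<>)
{
    chomp;
    # A new tree starts; flush the previous one.
    if (/^\(/ && $tree ne '')
    {
        print_tree($tree);
        $tree = '';
    }
    $tree .= ' '.$_;
}
print_tree($tree) if $tree ne '';

sub print_tree
{
    my $tree = shift;
    $tree =~ s/\s+/ /g;          # put the whole tree on one line
    $tree =~ s/^\s*\( /(TOP /;   # label the unlabeled top-level bracket
    print("$tree\n");
}

A hypothetical invocation (the script name is made up): perl flatten_mrg.pl /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj/23/*.mrg > ptb-wsj-23.forevalb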
Preparing the output of the Stanford Parser for evaluation
The Stanford Parser outputs an XML file that embeds the Penn-style tree in the <tree> element. Use $PARSINGROOT/stanford-parser/scripts/stanford2evalb.pl to extract the trees in a format suitable for evalb:
$PARSINGROOT/stanford-parser/scripts/stanford2evalb.pl < file.stanford.xml > file.stanford.evalb
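If the script is missing from your checkout, a rough equivalent might look like the sketch below. It assumes that each bracketed tree is stored verbatim between <tree ...> and </tree> tags, which is an assumption about the XML layout; check a sample output file before relying on it:

#!/usr/bin/perl
# Sketch: pull bracketed trees out of Stanford Parser XML output, one per line.
# The <tree>...</tree> layout is assumed, not verified against the real format.
use strict;
use warnings;

my $xml = join('', <>);   # slurp the whole input
while ($xml =~ m{<tree[^>]*>(.*?)</tree>}sg)
{
    my $tree = $1;
    $tree =~ s/\s+/ /g;   # one tree per line
    $tree =~ s/^\s+|\s+$//g;
    print("$tree\n");
}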
Running evalb on the two files
setenv EVALB $PARSINGROOT/evalb
$EVALB/evalb -p $EVALB/COLLINS.prm file.gold.evalb file.stanford.evalb > evaluation.txt
Note: If you are evaluating the output of the Charniak parser, use $EVALB/charniak.prm instead of $EVALB/COLLINS.prm. It differs in that it does not evaluate the extra top-level S1 nonterminal that the parser outputs.
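If you want to pull the headline numbers out of evaluation.txt programmatically, a sketch like the following may help. It assumes the usual evalb summary layout with "-- All --" and "-- len<=40 --" blocks and "Bracketing Recall/Precision = ..." lines; if your version prints something different, adjust the regular expressions:

#!/usr/bin/perl
# Sketch: extract bracketing precision and recall from evalb output.
# The summary format (block headers and line labels) is an assumption.
use strict;
use warnings;

my $block = '';
while (<>)
{
    $block = $1 if /^--\s*(.+?)\s*--/;    # e.g. "All" or "len<=40"
    if (/^Bracketing (Recall|Precision)\s*=\s*([\d.]+)/)
    {
        print("$block: $1 = $2\n");
    }
}

A hypothetical invocation (the script name is made up): perl summarize.pl evaluation.txt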
Results
To make sure your working copy of the parsing tools performs at least at the level we have been able to obtain before, compare your results to the following. The parsers were tested on Section 23 of the English Penn Treebank. We assume that you have the gold-standard data (see above on how to get it) and that you invoke evalb with the COLLINS.prm parameters, also described above. All results mentioned here are for sentences of 40 words or fewer. This is in line with the majority of the parsing literature, although evalb computes the overall performance as well.
The Stanford parser from revision 7 (equivalent to the official release 1.5.1 of 2006-06-11) with default English settings (the englishPCFG.ser.gz grammar) achieves P=87.23 %, R=85.83 %, average crossing=1.07.
The Stanford parser from revision 8 (an unofficial release that Chris Manning sent us on 2006-11-01) with default English settings (specifically, the englishPCFG.ser.gz grammar) achieves P=87.20 %, R=85.81 %, average crossing=1.08.
The Charniak parser with default English settings (/DATA/EN) achieves F=89.89 % (P=89.89 %, R=89.88 %, average crossing=0.77).
The Brown reranking parser with default English settings achieves F=91.98 % (P=92.36 %, R=91.61 %, average crossing=0.6).
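The Stanford figures above report only precision and recall; the corresponding F-measure is their harmonic mean, F = 2PR/(P+R), which comes out at about 86.5 % for the revision 7 numbers:
perl -e '($p, $r) = (87.23, 85.83); printf("F = %.2f %%\n", 2*$p*$r/($p+$r));'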
Be careful to have the same tokenization in your gold-standard parses and in the parser input/output. Otherwise evalb complains about “sentence length unmatch” or “word unmatch” and excludes the whole sentence from the evaluation.
Even if your tokenization matches perfectly, you may occasionally face the length-unmatch error. I observed the following instance: there was a single-quote (') token. It was tagged as '' (punctuation) in the gold standard and as POS (possessive ending) in the parser output. Punctuation is excluded from the standard evaluation; unfortunately, evalb recognizes punctuation by the tag rather than by the token. So the gold-standard sentence had 19 words and the parser output had 20 words. I believe this is a bug in evalb: the parser should be penalized in tagging accuracy, but the sentence evaluation should not fail because of the tagging error.
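If you run into such a mismatch, a quick way to locate the offending sentences is to redo the tag-based punctuation removal yourself and compare sentence lengths. The following sketch assumes that both files are in the one-tree-per-line evalb format and that the deleted punctuation tags are the ones listed earlier (as in COLLINS.prm); the script name in the usage comment is made up:

#!/usr/bin/perl
# Sketch: report sentences whose lengths differ after tag-based punctuation
# removal, i.e. the cases where evalb will complain about a length mismatch.
# Hypothetical usage: perl lengthcheck.pl file.gold.evalb file.stanford.evalb
use strict;
use warnings;

my %punct = map { ($_, 1) } (',', ':', '``', "''", '.');

sub lengths
{
    my $file = shift;
    open(my $fh, '<', $file) or die("Cannot read $file: $!");
    my @len;
    while (<$fh>)
    {
        # Terminals appear as "(TAG token)"; count those whose tag is not punctuation.
        my @tags = /\(([^()\s]+)\s+[^()\s]+\)/g;
        push(@len, scalar(grep { !$punct{$_} } @tags));
    }
    close($fh);
    return @len;
}

my @gold = lengths($ARGV[0]);
my @test = lengths($ARGV[1]);
for my $i (0 .. $#gold)
{
    my $g = $gold[$i];
    my $t = defined($test[$i]) ? $test[$i] : 0;
    printf("Sentence %d: gold %d vs. test %d words\n", $i+1, $g, $t) if $g != $t;
}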
Testing statistical significance
Statistical significance of the differences between two parsers can be tested using Dan Bikel's Comparator.
– Dan Zeman - 17 Oct 2006