To evaluate parsing results (constituents, not dependencies!) by comparing them against a treebank, the standard program is [[http://nlp.cs.nyu.edu/evalb/|evalb]]. This program by Satoshi Sekine and Michael Collins is now part of our parsing repository and can be found in ''$PARSINGROOT/evalb'' (besides that, other copies are mentioned on the [[Main.ClipResources|Resources]] page). It can be configured to use exactly the same algorithm as (Collins, 1997). Note that ''evalb'' is a C program, so:

# if you forgot to run ''make'' after checking out from the parsing repository, you don't have it;
# if you ran ''make'' on a different architecture than the one you are using now, you probably need to rebuild it.

Parsing papers often refer to the "PARSEVAL" metrics described in ([[http://acl.ldc.upenn.edu/H/H91/H91-1060.pdf|Black et al., 1991]]). However, many of the preprocessing steps suggested by that paper are nowadays simplified or not done at all. PARSEVAL's original purpose was to compare parsers built on various grammars of English; we want to compare proposed and gold-standard parses that are based on the same grammar (treebank). So the standard way of evaluating parsers differs from the original PARSEVAL paper, although people still tend to call the evaluation PARSEVAL. The main differences between (Black et al., 1991) and the Collins-configured ''evalb'' are the following:

* Auxiliaries, "not", pre-infinitival "to", and possessive endings are not removed in ''evalb''. Even if we wanted to remove them, some of them would be very difficult to recognize.
* According to (Black et al., 1991), word-external punctuation should be removed. Collins removed only the most frequent punctuation tags: the comma ('',''), the colon ('':''), the period (''.''), and the opening and closing quote tags. Many others, such as the exclamation mark, the question mark or parentheses, are **not** removed. What is removed, on the other hand, are traces (null categories and the like).
* Nonterminals are not removed, so **labeled** precision and recall are reported in parser evaluation. (A proposed constituent is considered identical to a treebank constituent if it covers the same span of the sentence and has the same label, i.e. the same nonterminal.) However, functional tags are removed (the parts of nonterminals starting with "-" or "=", as in NP-SUBJ).

====Creating the gold standard parses====

Besides the parser output, we need the gold standard trees to compare the parser output to. The documentation of ''evalb'' makes things complicated by describing only one way of obtaining the gold standard in the correct format: using the ''tgrep'' utility. Most likely the only thing that matters is that every tree occupies exactly one line (and possibly also that top-level brackets are labeled by a nonterminal), but let's take the longer way for now.

''tgrep'' is a treebank search utility supplied with the Penn Tree Bank. I don't know whether we have it, but we do have ''tgrep2'', its younger brother. An older version with sources, some documentation and a //Sun Solaris// executable is in ''/fs/circle/resnik/ftp/pkg/tgrep2/Tgrep2''. A newer version without sources and documentation but compiled for the Linux x86 architecture is in ''/fs/LSE/lse_project/sw/tgrep2''.
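If one tree per line (with a labeled top-level bracket) is really all that ''evalb'' requires, the gold file could also be produced without ''tgrep'' or ''tgrep2'' at all. The following Perl script is only an untested sketch of that shortcut (the name ''flatten_mrg.pl'' is made up); it assumes standard PTB bracketing, where literal parentheses in the text are escaped as ''-LRB-''/''-RRB-'', so raw brackets only delimit constituents.

  #!/usr/bin/env perl
  # Hypothetical shortcut: flatten PTB *.mrg files so that every tree occupies
  # exactly one line and the unlabeled top-level bracket gets a label (TOP).
  # Untested sketch; assumes raw parentheses only delimit constituents.
  use strict;
  use warnings;
  
  my $depth = 0;   # current bracket nesting depth
  my $tree  = '';  # buffer for the tree currently being read
  while (my $line = <>) {
      chomp $line;
      $tree .= ' ' . $line;
      $depth += () = $line =~ /\(/g;   # count opening brackets
      $depth -= () = $line =~ /\)/g;   # count closing brackets
      if ($depth == 0 && $tree =~ /\(/) {
          $tree =~ s/\s+/ /g;          # put the whole tree on one line
          $tree =~ s/^\s*\( /(TOP /;   # label the top-level bracket
          $tree =~ s/\s+$//;
          print "$tree\n";
          $tree = '';
      }
  }

Something like ''perl flatten_mrg.pl /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj/23/*.mrg > ptb-wsj-23.forevalb'' should then produce the gold file in one step. The rest of this section nevertheless documents the longer, ''tgrep2''-based route.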
The copy of Penn Tree Bank / Wall Street Journal that we have in ''/fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj'' has an unlabeled top-level bracket around each tree. This is indigestible for ''tgrep2'', so the first thing we have to do is

  perl -pe 's/^\( /(TOP /' /fs/clip2/Corpora/Treebank-3/parsed/mrg/wsj/23/*.mrg > ptb-wsj-23.top.mrg

Then the corpus must be preprocessed before ''tgrep2'' can search it. This is done by

  tgrep2 -p ptb-wsj-23.top.mrg ptb-wsj-23.t2c

The authors of ''evalb'' advise calling ''tgrep -wn '/.*/' | tgrep_proc.prl'' to get the required format. The ''-n'' option is not recognized by ''tgrep2''. ''tgrep_proc.prl'' is a very simple Perl script accompanying ''evalb''; its sole purpose is to filter out blank lines, so it can easily be replaced by the inline Perl statement below:

  tgrep2 -w '/.*/' -c ptb-wsj-23.t2c | perl -pe '$_ = "" if(m/^\s*$/)' > ptb-wsj-23.forevalb

Finally, after all this effort, we have a file that could probably have been created much more simply (see the sketch above). If we work with the standard Penn Tree Bank test data, section 23, the file has 2416 trees (and lines).

====Preparing the output of the Stanford Parser for evaluation====

The Stanford Parser outputs an XML file that embeds the Penn-style trees in XML elements. Use ''$PARSINGROOT/stanford-parser/scripts/stanford2evalb.pl'' to extract the trees in a format suitable for ''evalb'':

  $PARSINGROOT/stanford-parser/scripts/stanford2evalb.pl < file.stanford.xml > file.stanford.evalb

====Running evalb on the two files====

  setenv EVALB $PARSINGROOT/evalb
  $EVALB/evalb -p $EVALB/COLLINS.prm file.gold.evalb file.stanford.evalb > evaluation.txt

**Note:** If you are evaluating the output of the Charniak parser, use ''$EVALB/charniak.prm'' instead of ''$EVALB/COLLINS.prm''. The only difference is that it does not evaluate the extra top-level ''S1'' nonterminals that parser outputs.

====Results====

To make sure your working copy of the parsing tools performs at least at the level we have been able to obtain before, compare with the following results. The parsers were tested on section 23 of the English Penn Treebank. We assume that you have the gold standard data (see above for how to get it) and that you invoke ''evalb'' with the ''COLLINS.prm'' parameters, also described above. All results mentioned here are for sentences 40 words long or shorter. This is in harmony with the majority of the parsing literature, although ''evalb'' computes the overall performance as well.

The Stanford parser from revision 7 (equivalent to the official release 1.5.1 of 2006-06-11) with default English settings (''englishPCFG.ser.gz'') achieves **P=87.23 %, R=85.83 %**, average crossing=1.07.

The Stanford parser from revision 8 (unofficial release that Chris Manning sent us on 2006-11-01) with default English settings (specifically, the ''englishPCFG.ser.gz'' grammar) achieves **P=87.20 %, R=85.81 %**, average crossing=1.08.

The Charniak parser with default English settings (''/DATA/EN'') achieves **F=89.89 %** (P=89.89 %, R=89.88 %, average crossing=0.77).

The Brown reranking parser with default English settings achieves **F=91.98 %** (P=92.36 %, R=91.61 %, average crossing=0.6).

Be careful to have the same tokenization in your gold standard parses and in the parser input/output. Otherwise, ''evalb'' complains about "sentence length unmatch" or "word unmatch" and excludes the whole sentence from evaluation.
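Before running ''evalb'' it may be worth verifying that the two files are token-aligned, so that such sentences do not silently drop out of the evaluation. The following Perl sketch is a hypothetical helper (it is not part of the repository, and the name ''check_tokens.pl'' is made up): it counts the leaves of each tree in the gold file and in the parser output and reports every tree whose token counts differ.

  #!/usr/bin/env perl
  # Hypothetical sanity check before running evalb: compare the number of
  # terminals per tree in the gold file and in the parser output.
  # Assumes both files are in the one-tree-per-line evalb format and that
  # every leaf looks like "(TAG token)".
  use strict;
  use warnings;
  
  die "Usage: $0 gold-file test-file\n" unless @ARGV == 2;
  my ($goldfile, $testfile) = @ARGV;
  open(my $gold, '<', $goldfile) or die "Cannot read $goldfile: $!";
  open(my $test, '<', $testfile) or die "Cannot read $testfile: $!";
  
  my $n = 0;
  while (1) {
      my $g = <$gold>;
      my $t = <$test>;
      last if !defined $g && !defined $t;   # both files exhausted
      if (!defined $g || !defined $t) {
          print "warning: the two files contain different numbers of trees\n";
          last;
      }
      $n++;
      my $gw = () = $g =~ /\([^()\s]+\s+[^()\s]+\)/g;   # leaves in the gold tree
      my $tw = () = $t =~ /\([^()\s]+\s+[^()\s]+\)/g;   # leaves in the test tree
      print "tree $n: gold has $gw tokens, parser output has $tw\n" if $gw != $tw;
  }

A possible invocation would be ''perl check_tokens.pl file.gold.evalb file.stanford.evalb''.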
Even if your tokenization matches perfectly, you may occasionally face the length-unmatch error anyway. I observed the following instance: there was a token consisting of a single quote (apostrophe). It was tagged as closing-quote punctuation in the gold standard, and as ''POS'' (possessive ending) in the parser output. Punctuation is excluded from the standard evaluation. Unfortunately, ''evalb'' recognizes punctuation by the tag rather than by the token, so the gold standard sentence had 19 words while the parser output had 20. I believe this is a bug in ''evalb'': the parser should be penalized in tagging accuracy, but the sentence evaluation should not fail because of the tagging error.

=====Testing statistical significance=====

Statistical significance of the differences between two parsers can be tested using Dan Bikel's [[http://www.cis.upenn.edu/~dbikel/software.html#comparator|Comparator]].

-- [[User:Zeman|Dan Zeman]] - 17 Oct 2006