
Institute of Formal and Applied Linguistics Wiki



parsery [2007/02/09 23:10] zeman
parsery [2007/10/16 21:55] (current) zeman: N-best parsing with Charniak.
 make all
</code>

==== Brown Reranking Parser ====

Brown Reranking Parser is my name for the combination of Eugene Charniak's parser and Mark Johnson's reranker (both gentlemen work at Brown University). What follows is a description of the wrapper scripts from the Maryland wiki.

The currently best English parser is the combination of the N-best [[ftp://ftp.cs.brown.edu/pub/nlparser/|parser]] by Eugene Charniak and Mark Johnson's [[http://www.cog.brown.edu/~mj/Software.htm|reranker]]. After realizing that changes to the Brown source code might be desirable, the sources of this software were added to our SVN parsing repository.

There are two subfolders of ''$PARSINGROOT'' related to the Brown software: ''charniak-parser'' is for those who want to run the parser without the reranker; ''brown-reranking-parser'' is for those who want both. In fact, ''charniak-parser'' contains only the calling scripts, while the code of both the parser and the reranker is in ''brown-reranking-parser''. The Charniak parser code is in ''brown-reranking-parser/first-stage''; Johnson's reranker is in ''brown-reranking-parser/second-stage''.

Remember that after checking out your working copy you need to go to ''$PARSINGROOT'' and call ''make''. Note that the Makefile is currently set up to optimize the code for the C cluster machines. I am not sure whether you can //compile// it elsewhere, but you definitely won't be able to //run// it elsewhere. Further note that the external libraries PETSc and TAO are needed. They are neither standard Linux components nor part of our SVN repository; paths to our installations are in the Makefile.

Our SVN version of the Brown parser has some advantages over the standard distribution:
  * The number of output sentences equals the number of input sentences. If parsing of a sentence fails, the output line is "<nowiki>__FAILED__</nowiki>...something", which can easily be fixed by one of our scripts. The original parser did not tell you //where// it failed, which made failures very difficult to fix.
  * The parser no longer says just //Segmentation fault// when it hits the vocabulary size limit. Moreover, the limit has been raised from 50,000 words to 1,000,000 words.
  * The reranker has been freed from its dependency on the Penn Treebank. The various data relations, originally wired deeply into its Makefile, are now generalized to the extent that we can call a training script and supply the training data on standard input. Not all parts of the Makefile have been generalized yet.
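The <nowiki>__FAILED__</nowiki> convention makes failed sentences easy to patch downstream. A minimal Python sketch of such a fix-up (the flat fallback tree, the function name, and the assumption that the rest of the failed line carries the original tokens are all illustrative; this is not the actual repository script):

```python
def fix_failed(lines):
    """Replace '__FAILED__...' parser output lines with a flat dummy
    parse, so downstream tools still see one tree per input sentence."""
    fixed = []
    for line in lines:
        if line.startswith("__FAILED__"):
            # Assume the rest of the line carries the original tokens.
            words = line[len("__FAILED__"):].split()
            flat = " ".join("(X %s)" % w for w in words) or "(X ?)"
            fixed.append("(S1 (X %s))" % flat)
        else:
            fixed.append(line)
    return fixed

output = ["(S1 (S (NP (DT A)) (VP (VBZ works))))",
          "__FAILED__ the cat sleeps"]
print(fix_failed(output)[1])  # prints: (S1 (X (X the) (X cat) (X sleeps)))
```

The flat tree keeps the one-tree-per-sentence invariant, so evaluation tools downstream do not lose their alignment with the input.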

Both folders, ''charniak-parser'' and ''brown-reranking-parser'', have a ''scripts'' subfolder with the basic set of ''parse.pl'', ''cluster-parse.pl'', and ''train.pl''. These scripts are invoked in much the same fashion as for the Stanford parser (see above).

The ''parse.pl'' and ''cluster-parse.pl'' scripts of the Charniak parser accept the ''-Nbest'' option in addition to their standard options. ''-Nbest 50'' translates to ''-N50'' on Charniak's ''parseIt'' command line. It asks the parser to output the N (here 50) best parses instead of just one. The output format for N>1 differs from the default: the set of parses is preceded by a line with the number of parses and the ID (number) of the sentence, and every parse is preceded by a line with the weight (log probability) of the parse. This option only applies to ''charniak-parser''; it is ignored by ''brown-reranking-parser''.
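Read this way, the N-best output can be consumed by a short script such as the following Python sketch (the helper name and the sample trees are mine; the format, a header line with the parse count and sentence ID followed by weight/tree pairs, is as described above):

```python
def read_nbest(lines):
    """Parse Charniak N-best output: for each sentence, a header line
    '<n_parses> <sentence_id>', then n_parses pairs of lines, each pair
    being the log probability of a parse and the parse itself."""
    it = iter(lines)
    sentences = []
    for header in it:
        n_parses, sent_id = header.split()
        parses = []
        for _ in range(int(n_parses)):
            weight = float(next(it))   # log probability of this parse
            tree = next(it)            # bracketed parse tree
            parses.append((weight, tree))
        sentences.append((sent_id, parses))
    return sentences

sample = [
    "2 1",
    "-42.17",
    "(S1 (S (NP (DT The) (NN cat)) (VP (VBZ sleeps))))",
    "-45.03",
    "(S1 (FRAG (NP (DT The) (NN cat) (NNS sleeps))))",
]
sent_id, parses = read_nbest(sample)[0]
best_weight, best_tree = max(parses)   # highest log probability wins
print(sent_id, best_weight)            # prints: 1 -42.17
```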

=== Training ===

To train only the Charniak parser, call

<code>$PARSINGROOT/charniak-parser/scripts/train.pl < trainingdata.penn > modeldata.tgz</code>

You will find the Charniak DATA folder, tgzipped, on standard output. You can pass the tgzipped file directly to the ''parse.pl'' and ''cluster-parse.pl'' scripts as //the grammar//.

To train the parser //and// the reranker, call

<code>$PARSINGROOT/brown-reranking-parser/scripts/train.pl < trainingdata.penn > modeldata.tgz</code>

Again you will get a tgzipped DATA folder, but this time it will additionally contain two reranker files, ''features.gz'' and ''weights.gz''. The Brown parsing scripts require this tgzipped file as //the grammar//.
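Given a model file, you can check whether it contains the reranker components. A small Python sketch (the archive layout, a DATA folder plus ''features.gz'' and ''weights.gz'', is assumed from the description above; the helper name is mine):

```python
import io
import tarfile

def has_reranker(model_path):
    """Return True if the tgzipped model also contains the reranker
    files features.gz and weights.gz (a parser-only model has neither)."""
    with tarfile.open(model_path, "r:gz") as tar:
        names = {name.rsplit("/", 1)[-1] for name in tar.getnames()}
    return {"features.gz", "weights.gz"} <= names

# Build a tiny fake model archive just to demonstrate the check.
with tarfile.open("model.tgz", "w:gz") as tar:
    for name in ("DATA/terms.txt", "features.gz", "weights.gz"):
        data = b"dummy"
        info = tarfile.TarInfo(name)
        info.size = len(data)
        tar.addfile(info, io.BytesIO(data))

print(has_reranker("model.tgz"))  # prints: True
```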

Options:
  * ''-nick da-delex''
    * Assigns a nickname (''da-delex'' in this example) to the intermediate files that the training procedure creates in the ''$PARSINGROOT'' subtree. This ensures that older models do not get overwritten. It is only needed if you want to reuse the intermediate files or if you want to train two different models in parallel. The resulting tgzipped model appears on standard output in any case, and it is your responsibility to save it.
  * ''-reuse''
    * Reuses old intermediate files if they are available and up to date (make-wise). In other words, ''make clean'' is not performed at the beginning.

If you have two tgzipped models and want to use the first-stage parser from the first model plus the reranker from the second model, call

<code>$PARSINGROOT/brown-reranking-parser/scripts/combine_brown_models.pl first.tgz second.tgz > combined.tgz</code>
  
