
Institute of Formal and Applied Linguistics Wiki



name = Penn Treebank | version = 14 Feb 2001 | path = /fs/clip2/Corpora/Treebank-3 | owner = hwa

The Penn Treebank has become a standard data set for training and testing English parsers. It contains constituent analyses of texts from various sources, but the most widely used part is the Wall Street Journal data. The usual split is to train on sections 02-21, use sections 00, 01, 22, and 24 for development testing and tuning, and reserve section 23 for final evaluation.
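
The conventional split described above can be written down as a small lookup. This is only a sketch: the names `TRAIN_SECTIONS`, `DEV_SECTIONS`, `TEST_SECTIONS`, and `split_for_section` are made up for illustration and are not part of the Treebank distribution, and sections are assumed to be identified by two-digit strings such as "02".

```python
# Conventional WSJ section split (hypothetical helper, not shipped with the corpus).
# Sections are assumed to be named by two-digit strings, "00" through "24".
TRAIN_SECTIONS = [f"{n:02d}" for n in range(2, 22)]  # 02-21
DEV_SECTIONS = ["00", "01", "22", "24"]              # development tests and tuning
TEST_SECTIONS = ["23"]                               # final evaluation only


def split_for_section(section: str) -> str:
    """Return 'train', 'dev', or 'test' for a two-digit WSJ section name."""
    if section in TRAIN_SECTIONS:
        return "train"
    if section in DEV_SECTIONS:
        return "dev"
    if section in TEST_SECTIONS:
        return "test"
    raise ValueError(f"unknown WSJ section: {section!r}")


if __name__ == "__main__":
    for sec in ("00", "02", "21", "23", "24"):
        print(sec, split_for_section(sec))
```

Keeping section 23 in its own list makes it harder to evaluate on it by accident during development.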

This is a local copy of the Penn Treebank (English). The script /fs/cliplab/software/utils/untokenizePenn.pl converts tokenized surface forms to untokenized versions.

