name = North American News Text Corpus | path = /fs/clip-corpora/north_american_news | owner = zeman
The North American News Text Corpus (NANTC) is a large collection of English newswire text, published by the Linguistic Data Consortium (Graff, 1995; LDC95T21). Our copy resides in
/fs/clip-corpora/north_american_news
Note: In the following text, $PARSINGROOT is the path to your working copy of the parsing SVN repository. $TOOLS refers to $PARSINGROOT/tools.
Corpus preprocessing
We have used the North American News Text Corpus as a source of large amounts of English text for the Self-Training Experiment. McClosky et al. (who were the first to do the experiment) describe it as “24,000,000 unlabeled sentences from various news sources,” and they use only the Los Angeles Times / Washington Post (LATWP) part of it.
The NANTC consists of 3602 gzipped files. When gunzipped, the total size is 1.76 GB.
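Assuming our copy keeps the LDC layout with one subdirectory per news source (e.g. a latwp directory for the Los Angeles Times / Washington Post files; the actual layout may differ), the LATWP part can be concatenated into a single SGML file with something like:
zcat /fs/clip-corpora/north_american_news/latwp/*.gz > latwp.sgml
This latwp.sgml is the input to the conversion step below.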
The LATWP part is almost 500 MB. There are 1,746,626 SGML-marked paragraphs. I converted the SGML to plain text where each line corresponds to one paragraph:
$TOOLS/nant2txt.pl latwp.sgml > latwp.txt
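nant2txt.pl lives in the repository and is not reproduced here; a minimal sketch of the same idea, assuming the LDC markup where <p> tags delimit paragraphs inside <TEXT> sections, might look like this (the real script may handle more cases):
#!/usr/bin/perl
# Sketch: convert NANTC SGML to plain text, one paragraph per line.
# Assumes <p> marks paragraph boundaries and all other SGML tag lines
# can be dropped; the real nant2txt.pl may differ.
use strict;
use warnings;

my @par;
sub flush {
    my $text = join(' ', @par);
    $text =~ s/\s+/ /g;
    print "$text\n" if $text =~ /\S/;
    @par = ();
}
while (<>) {
    chomp;
    if (m{^\s*<}) {   # an SGML tag line
        flush() if m{^\s*<p>}i or m{^\s*</?doc}i or m{^\s*</text>}i;
        next;         # drop the tag itself
    }
    push @par, $_;
}
flush();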
There are a number of SGML entities in the text. I do not have their definitions and thus cannot convert them to meaningful characters, but I want to get rid of them in a way that hurts parsing the least. Tokenization would split each of them into three tokens (&, ENTITY, ;); one token would be better. My guess is that the entities have functions similar to punctuation, so I replace each of them by a single dash. Only &AMP; is replaced by &. The distinct entities occurring in the corpus can be listed with:
perl -pe 'while(s/(&\S*?;)//) { print "$1\n" } $_ = ""' < north_american_news_text.sgml | sort -u
&2$; &AMP; &Cx05; &Cx06; &Cx15; &Cx17; &Cx18; &Cx1a; &Cx1b; &D0; &D1; &D2; &D3; &D4; &FS; &G; &Gov; &Gr; &HT; &Inc; &L;
&LR; &MD; &P; &P); &QC; &QL; &QR; &Reed:Growth; &Reed:Intl; &T; &TF; &TL; &T:SmallCoGrwth; &UR; &x28; &xb0; &xb1; &xb2; &xb3; &xb4; &xb5; &xb6; &xb7; &xb8; &xb9; &xba; &xbb; &xbc; &xbd; &xbe; &xc6; &xd0; &xd7; &xde; &xe6; &xf0; &xfe;
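A sketch of the replacement step itself (whether it happens inside nant2txt.pl or as a separate pass is not recorded here, and the file names below are only placeholders): every entity becomes a single dash, except &AMP;, which becomes a literal ampersand.
perl -pe 's/&AMP;/&/g; s/&\S*?;/-/g' < latwp.sgml > latwp.noent.sgml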
Tokenization using $HIEROROOT/preprocess/tokenizeE.pl takes a long time. I modified the script to autoflush its output in the hope that I could parallelize it using Adam's $HIEROROOT/cluster_tools/parallelize.sh. Unfortunately, the parallel output had about 180,000 fewer lines than the input, so I am running it again, non-parallel.
$TOOLS/tokenizeE.pl - - < latwp.txt > latwp.02.tok.txt
$TOOLS/count_words.pl < latwp.02.tok.txt
There are 90,471,928 words (token occurrences) and 442,423 word types.
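A minimal equivalent of count_words.pl (the repository script may report more detail) is just a token and type counter over whitespace-tokenized input:
#!/usr/bin/perl
# Count token occurrences and distinct word types in tokenized text.
use strict;
use warnings;

my %type;
my $tokens = 0;
while (<>) {
    for my $w (split) {
        $tokens++;
        $type{$w}++;
    }
}
printf "%d tokens, %d types\n", $tokens, scalar keys %type;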
Sentence boundaries are not tagged. We used a simple rule-based script to find them. Every line in its output contains exactly one sentence.
$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03.sent.txt
There are 3,133,554 sentences. The average sentence length is 29 tokens, higher than in the Penn Treebank Wall Street Journal data. There are some very long sentences: 18,222 sentences have more than 100 tokens, and the longest sentence has 450 tokens.
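find_sentences.pl is not reproduced here; a rough sketch of the kind of rule it applies (assuming tokenized input with one paragraph per line) follows. The real script is more careful, e.g. about abbreviations and the period-plus-quote case addressed next.
#!/usr/bin/perl
# Sketch of a rule-based sentence splitter for tokenized text:
# start a new sentence after a sentence-final punctuation token.
# The actual find_sentences.pl has more rules (abbreviations, quotes etc.).
use strict;
use warnings;

while (<>) {
    my @tokens = split;
    my @sentence;
    for my $t (@tokens) {
        push @sentence, $t;
        if ($t =~ /^[.!?]+$/) {
            print join(' ', @sentence), "\n";
            @sentence = ();
        }
    }
    print join(' ', @sentence), "\n" if @sentence;
}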
Improved sentence delimiting (period + quote etc.):
$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03b.sent.txt
Now there are 3,722,125 sentences; 2,527 sentences have more than 100 tokens and 2 sentences have more than 400 tokens. We discard all sentences longer than 40 tokens, as well as sentences in which more than 40% of the words contain dashes or digits:
$TOOLS/discard_long_bad_sentences.pl < latwp.03b.sent.txt > latwp.04.clean.txt
The new corpus contains 61,260,818 words, 273,704 word types, and 3,143,433 sentences. The longest sentence has 40 tokens; the average is 19 tokens per sentence.
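For reference, the filtering criterion can be sketched like this (one tokenized sentence per line on input; discard_long_bad_sentences.pl in the repository is the authoritative version):
#!/usr/bin/perl
# Sketch of the sentence filter: drop sentences longer than 40 tokens
# and sentences in which more than 40% of the tokens contain a dash or a digit.
# The actual discard_long_bad_sentences.pl may differ in details.
use strict;
use warnings;

while (<>) {
    my @tokens = split;
    next if @tokens > 40;
    my $bad = grep { /[-0-9]/ } @tokens;
    next if @tokens && $bad / @tokens > 0.4;
    print;
}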