{{Infobox Resource | name = North American News Text Corpus
| path = /fs/clip-corpora/north_american_news
| owner = zeman
}}
The **North American News Text Corpus (NANTC)** is a large collection of English newswire text, published by the Linguistic Data Consortium (Graff, 1995; LDC95T21). Our copy resides in
/fs/clip-corpora/north_american_news
Note: In the following text, ''$PARSINGROOT'' is the path to your working copy of the parsing SVN repository. ''$TOOLS'' refers to ''$PARSINGROOT/tools''.
=====Corpus preprocessing=====
We have used the North American News Text Corpus as a source of large amounts of English text for the [[Self-Training|Self-Training Experiment]]. McClosky et al., who ran the experiment first, describe it as "24,000,000 unlabeled sentences from various news sources," and they use only the //Los Angeles Times / Washington Post// (LATWP) part of it.
The NANTC consists of 3602 gzipped files. When gunzipped, the total size is 1.76 GB.
The LATWP part alone is almost 500 MB and contains 1,746,626 SGML-marked paragraphs. I converted the SGML to plain text so that each output line corresponds to one paragraph:
$TOOLS/nant2txt.pl latwp.sgml > latwp.txt
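Purely for illustration, a minimal sketch of this kind of SGML-to-text step could look as follows. It assumes each paragraph is enclosed in ''<p>'' ... ''</p>'' tags (the tag names are my assumption, and the real ''nant2txt.pl'' may differ):
 #!/usr/bin/perl
 # Minimal SGML-to-text sketch: keep only paragraph text, print one
 # paragraph per output line, discard all other markup.
 # Assumption: paragraphs are enclosed in <p> ... </p> tags.
 use strict;
 use warnings;
 my @par;
 my $inside = 0;
 while (<>) {
     chomp;
     if (m{^\s*<p\b}i) {                # paragraph opens
         $inside = 1;
         @par = ();
     }
     elsif (m{^\s*</p>}i) {             # paragraph closes: emit one line
         print join(' ', @par), "\n" if @par;
         $inside = 0;
     }
     elsif ($inside && !m{^\s*<}) {     # text line inside the paragraph
         push @par, $_;
     }
 }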
The text contains a number of SGML entities. I do not have their definitions and thus cannot convert them to meaningful characters, but I want to get rid of them in a way that hurts parsing the least. Tokenization would split each entity into three tokens (''& ENTITY ;''); one token would be better. My guess is that the entities have functions similar to punctuation, so I replace each of them with a single dash. Only ''&amp;'' is replaced by a literal ''&''. The following one-liner lists the distinct entities occurring in the corpus:
perl -pe 'while(s/(&\S*?;)//) { print "$1\n" } $_ = ""' < north_american_news_text.sgml | sort -u
&2$; &amp; &Cx05; &Cx06; &Cx15; &Cx17; &Cx18; &Cx1a; &Cx1b; &D0; &D1; &D2; &D3; &D4; &FS; &G; &Gov; &Gr; &HT; &Inc; &L;
&LR; &MD; &P; &P); &QC; &QL; &QR; &Reed:Growth; &Reed:Intl; &T; &TF; &TL; &T:SmallCoGrwth; &UR; &x28; &xb0; &xb1; &xb2; &xb3; &xb4; &xb5; &xb6; &xb7; &xb8; &xb9; &xba; &xbb; &xbc; &xbd; &xbe; &xc6; &xd0; &xd7; &xde; &xe6; &xf0; &xfe;
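For illustration, the replacement just described can be done with a short stand-alone filter like the following (a sketch, not necessarily how it is actually wired into our pipeline):
 #!/usr/bin/perl
 # Sketch of the entity clean-up: &amp; becomes a literal ampersand,
 # every other SGML entity becomes a single dash, so the tokenizer
 # later sees one punctuation-like token instead of three.
 use strict;
 use warnings;
 while (<>) {
     s/&amp;/&/g;      # keep real ampersands
     s/&\S*?;/-/g;     # any other entity -> a single dash
     print;
 }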
Tokenization using ''$HIEROROOT/preprocess/tokenizeE.pl'' takes a //long time//. I modified the script to autoflush its output in the hope that I could [[ClusterExampleScripts#Automatic_Parallelization|parallelize]] it using Adam's ''$HIEROROOT/cluster_tools/parallelize.sh''. Unfortunately, the parallelized output had about 180,000 fewer lines than the input, so I am running it again, non-parallelized.
$TOOLS/tokenizeE.pl - - < latwp.txt > latwp.02.tok.txt
$TOOLS/count_words.pl < latwp.02.tok.txt
There are 90,471,928 words (token occurrences) and 442,423 word types.
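For reference, counting tokens and types on whitespace-tokenized input amounts to something like the following (a sketch, not necessarily identical to ''count_words.pl''):
 #!/usr/bin/perl
 # Count token occurrences and distinct word types in whitespace-tokenized text.
 use strict;
 use warnings;
 my %type;
 my $tokens = 0;
 while (<>) {
     for my $w (split) {
         $tokens++;
         $type{$w}++;
     }
 }
 printf "%d tokens, %d types\n", $tokens, scalar(keys %type);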
Sentence boundaries are not tagged. We used a simple rule-based script to find them. Every line in its output contains exactly one sentence.
$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03.sent.txt
There are 3,133,554 sentences. The average sentence length is 29 tokens (higher than in the Penn Treebank Wall Street Journal data). There are some very long sentences: 18,222 sentences have more than 100 tokens, and the longest has 450 tokens.
Improved sentence delimiting (a sentence may also end with a period followed by a closing quote, etc.):
$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03b.sent.txt
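For illustration only, a very crude splitter in this spirit might look as follows; ''find_sentences.pl'' is rule-based and handles more cases (abbreviations etc.), so treat this as a sketch of the idea (sentence-final punctuation, optionally followed by a closing quote, followed by a capitalized token):
 #!/usr/bin/perl
 # Crude sentence-splitting sketch: end a sentence after . ! or ?,
 # optionally followed by a closing quote, when the next token starts
 # with a capital letter. Input: one tokenized paragraph per line.
 use strict;
 use warnings;
 while (<>) {
     chomp;
     my @tokens = split;
     my @sentence;
     for my $i (0 .. $#tokens) {
         push @sentence, $tokens[$i];
         my $end = $tokens[$i] =~ /^[.!?]$/
             || ($tokens[$i] =~ /^['"]+$/ && @sentence > 1 && $sentence[-2] =~ /^[.!?]$/);
         my $next_cap = $i == $#tokens || $tokens[$i+1] =~ /^[A-Z]/;
         if ($end && $next_cap) {
             print join(' ', @sentence), "\n";
             @sentence = ();
         }
     }
     print join(' ', @sentence), "\n" if @sentence;
 }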
Now there are 3,722,125 sentences; 2,527 of them have more than 100 tokens and 2 have more than 400 tokens. We discard all sentences longer than 40 tokens, as well as sentences in which more than 40% of the tokens contain dashes or digits:
$TOOLS/discard_long_bad_sentences.pl < latwp.03b.sent.txt > latwp.04.clean.txt
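A sketch of such a filter, following my reading of the two conditions above (not necessarily identical to ''discard_long_bad_sentences.pl''):
 #!/usr/bin/perl
 # Sketch of the clean-up filter: drop sentences longer than 40 tokens
 # and sentences where more than 40% of the tokens contain a dash or a digit.
 use strict;
 use warnings;
 while (<>) {
     my @tokens = split;
     next unless @tokens;
     next if @tokens > 40;
     my $suspicious = grep { /[-0-9]/ } @tokens;   # tokens with dashes or digits
     next if $suspicious / @tokens > 0.4;
     print;
 }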
The new corpus contains 61,260,818 words, 273,704 word types, and 3,143,433 sentences. The longest sentence has 40 tokens; the average is 19 tokens per sentence.