{{Infobox Resource
| name = North American News Text Corpus
| path = /fs/clip-corpora/north_american_news
| owner = zeman
}}

The '''North American News Text Corpus (NANTC)''' is a large collection of English newswire text, published by the Linguistic Data Consortium (Graff, 1995; LDC95T21). Our copy resides in

 /fs/clip-corpora/north_american_news

Note: in the following text, ''$PARSINGROOT'' is the path to your working copy of the parsing SVN repository, and ''$TOOLS'' refers to ''$PARSINGROOT/tools''.

=====Corpus preprocessing=====

We used the North American News Text Corpus as a source of large English text for the [[Self-Training|Self-Training Experiment]]. McClosky et al. (who were the first to run this experiment) describe it as "24,000,000 unlabeled sentences from various news sources," and they use only the ''Los Angeles Times / Washington Post'' (LATWP) part of it.

The NANTC consists of 3,602 gzipped files; when gunzipped, the total size is 1.76 GB. The LATWP part is almost 500 MB. There are 1,746,626 SGML-marked paragraphs. I converted the SGML to plain text, with each output line corresponding to one paragraph:

 $TOOLS/nant2txt.pl latwp.sgml > latwp.txt

The text contains a number of SGML entities. I do not have their definitions and thus cannot convert them to meaningful characters, but I want to get rid of them in a way that hurts parsing the least. Tokenization would split each entity into three tokens (''& ENTITY ;''); one token would be better. My guess is that the entities have functions similar to punctuation, so I replace each of them with a single dash. Only ''&AMP;'' is replaced by ''&''. The following one-liner lists all entities that occur in the corpus:

 perl -pe 'while(s/(&\S*?;)//) { print "$1\n" } $_ = ""' < north_american_news_text.sgml | sort -u

 &2$; &AMP; &Cx05; &Cx06; &Cx15; &Cx17; &Cx18; &Cx1a; &Cx1b;
 &D0; &D1; &D2; &D3; &D4; &FS; &G; &Gov; &Gr; &HT; &Inc; &L; &LR; &MD;
 &P; &P); &QC; &QL; &QR; &Reed:Growth; &Reed:Intl; &T; &TF; &TL; &T:SmallCoGrwth; &UR;
 &x28; &xb0; &xb1; &xb2; &xb3; &xb4; &xb5; &xb6; &xb7; &xb8; &xb9; &xba; &xbb;
 &xbc; &xbd; &xbe; &xc6; &xd0; &xd7; &xde; &xe6; &xf0; &xfe;

Tokenization using ''$HIEROROOT/preprocess/tokenizeE.pl'' takes a ''long time''. I modified the script to autoflush its output in the hope that I could [[ClusterExampleScripts#Automatic_Parallelization|parallelize]] it using Adam's ''$HIEROROOT/cluster_tools/parallelize.sh''. Unfortunately, the parallelized output had about 180,000 fewer lines than the input. I am running it again, non-parallel.

 $TOOLS/tokenizeE.pl - - < latwp.txt > latwp.02.tok.txt
 $TOOLS/count_words.pl < latwp.02.tok.txt

There are 90,471,928 words (token occurrences) and 442,423 word types.

Sentence boundaries are not tagged. We used a simple rule-based script to find them; every line of its output contains exactly one sentence.

 $TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03.sent.txt

There are 3,133,554 sentences. The average number of tokens per sentence is 29, higher than in the Penn Treebank Wall Street Journal data. There are some very long sentences: 18,222 sentences have more than 100 tokens, and the longest sentence has 450 tokens.

Improved sentence delimiting (handling a period followed by a quote, etc.):

 $TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03b.sent.txt

Now there are 3,722,125 sentences; 2,527 sentences have more than 100 tokens, and 2 sentences have more than 400 tokens.
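The rules used by ''$TOOLS/find_sentences.pl'' are not reproduced here. Purely as an illustration, a minimal rule-based splitter for the tokenized text (one paragraph per input line, one sentence per output line) might look like the sketch below: it ends a sentence at a token consisting of sentence-final punctuation and pulls a following closing quote or bracket into the same sentence, which mirrors the "period + quote" improvement above. The actual script may use different rules.

 #!/usr/bin/perl
 # Sketch of a rule-based sentence splitter for tokenized text:
 # one paragraph per input line, one sentence per output line.
 # This is NOT the actual find_sentences.pl, only an illustration.
 use strict;
 use warnings;
 while (my $line = <STDIN>) {
     chomp $line;
     my @tokens = split /\s+/, $line;
     my @sentence;
     my $i = 0;
     while ($i <= $#tokens) {
         push @sentence, $tokens[$i];
         if ($tokens[$i] =~ /^[.!?]+$/) {
             # Attach a following closing quote or bracket to this sentence.
             while ($i < $#tokens && $tokens[$i + 1] =~ /^["')\]]+$/) {
                 $i++;
                 push @sentence, $tokens[$i];
             }
             print join(' ', @sentence), "\n";
             @sentence = ();
         }
         $i++;
     }
     # Flush trailing material that lacks sentence-final punctuation.
     print join(' ', @sentence), "\n" if @sentence;
 }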
We discard all sentences longer than 40 tokens, as well as sentences in which more than 40 % of the words contain dashes or numbers:

 $TOOLS/discard_long_bad_sentences.pl < latwp.03b.sent.txt > latwp.04.clean.txt

The cleaned corpus contains 61,260,818 words, 273,704 word types, and 3,143,433 sentences. The longest sentence has 40 tokens; the average is 19 tokens per sentence.
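For completeness, here is a minimal sketch of such a filter. It is not the actual ''discard_long_bad_sentences.pl''; in particular, treating a "bad" word as any token containing a dash or a digit is my reading of the rule above, and the thresholds are the ones stated there.

 #!/usr/bin/perl
 # Sketch of the sentence filter described above: drop sentences with more
 # than 40 tokens, and sentences in which more than 40 % of the tokens
 # contain a dash or a digit. NOT the actual discard_long_bad_sentences.pl.
 use strict;
 use warnings;
 my $max_tokens    = 40;
 my $max_bad_ratio = 0.4;
 while (my $sentence = <STDIN>) {
     chomp $sentence;
     my @tokens = split /\s+/, $sentence;
     next if @tokens == 0 || @tokens > $max_tokens;
     # Count tokens containing a dash or a digit.
     my $bad = grep { /[-0-9]/ } @tokens;
     next if $bad / @tokens > $max_bad_ratio;
     print "$sentence\n";
 }

Like the commands above, the sketch reads from standard input and writes to standard output, so it fits the same pipeline style.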