The **North American News Text Corpus (NANTC)** is a large collection of English newswire text, published by the Linguistic Data Consortium (Graff, 1995; LDC95T21). Our copy resides in
  
<code>/fs/clip-corpora/north_american_news</code>
  
Note: In the following text, ''$PARSINGROOT'' is the path to your working copy of the parsing SVN repository. ''$TOOLS'' refers to ''$PARSINGROOT/tools''.
The LATWP part is almost 500&nbsp;MB. There are 1,746,626 SGML-marked paragraphs. I converted the SGML to plain text where each line corresponds to one paragraph:
  
<code>$TOOLS/nant2txt.pl latwp.sgml > latwp.txt</code>
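
For orientation, a minimal sketch of what such a conversion could do, assuming paragraphs are introduced by ''<p>'' tags as in other LDC newswire SGML (the tag case and the exact markup are assumptions; the real ''nant2txt.pl'' may differ):

<code>#!/usr/bin/env perl
# Sketch: emit one paragraph per output line, stripping SGML tags.
# Assumes <p> marks the start of each paragraph (an assumption about
# the NANTC markup); document headers would need extra handling.
$/ = "<p>";                 # read the input in paragraph-sized chunks
while (<>) {
    s/<[^>]*>//g;           # drop the delimiter and any other tags
    s/\s+/ /g;              # collapse whitespace into single spaces
    s/^ //; s/ $//;         # trim
    print "$_\n" if length;
}</code>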
  
There are a number of SGML entities in the text. I do not have their definitions and thus cannot convert them to meaningful characters, but I want to get rid of them in a way that hurts parsing the least. Tokenization would split each of them into three tokens (''& ENTITY ;''); one token would be better. My guess is that the entities have functions similar to punctuation, so I replace them by a single dash. Only ''&AMP;'' is replaced by ''&''.
  
<code>perl -pe 'while(s/(&\S*?;)//) { print "$1\n" } $_ = ""' < north_american_news_text.sgml | sort -u</code>
  
&2$; &AMP; &Cx05; &Cx06; &Cx15; &Cx17; &Cx18; &Cx1a; &Cx1b; &D0; &D1; &D2; &D3; &D4; &FS; &G; &Gov; &Gr; &HT; &Inc; &L;
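
The replacement described above can be done with a one-liner along these lines (a sketch; the output file name is made up for illustration):

<code>perl -pe 's/&AMP;/&/g;   # restore the literal ampersand first
          s/&\S*?;/-/g;  # replace every other entity by a dash
         ' < latwp.txt > latwp.noent.txt</code>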
Tokenization using ''$HIEROROOT/preprocess/tokenizeE.pl'' takes a //long time//. I modified the script to autoflush its output in the hope that I could [[ClusterExampleScripts#Automatic_Parallelization|parallelize]] it using Adam's ''$HIEROROOT/cluster_tools/parallelize.sh''. Unfortunately, the output had about 180,000 fewer lines than the input, so I am running it again, non-parallel.
  
<code>$TOOLS/tokenizeE.pl - - < latwp.txt > latwp.02.tok.txt
$TOOLS/count_words.pl < latwp.02.tok.txt</code>
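
''count_words.pl'' is a local script; a minimal equivalent that counts token occurrences and distinct word types over whitespace-tokenized input might look like this (a sketch, not the actual script):

<code>#!/usr/bin/env perl
# Sketch: count tokens and word types in whitespace-tokenized text.
my %types;
my $tokens = 0;
while (<>) {
    for my $w (split) {
        $tokens++;
        $types{$w}++;
    }
}
printf "%d words (token occurrences), %d word types\n",
       $tokens, scalar(keys %types);</code>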
  
There are 90,471,928 words (token occurrences) and 442,423 word types.
Sentence boundaries are not tagged. We used a simple rule-based script to find them. Every line in its output contains exactly one sentence.
  
<code>$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03.sent.txt</code>
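
''find_sentences.pl'' is the local rule-based splitter; a toy version of the core rule, breaking after sentence-final punctuation that is followed by a capitalized token, could look like this (a real splitter also needs rules for abbreviations, initials, numbers, and the like):

<code>#!/usr/bin/env perl
# Toy sentence splitter for tokenized text: break after . ! or ?
# when the following token starts with an uppercase letter.
while (<>) {
    chomp;
    s/([.!?]) (?=[A-Z])/$1\n/g;
    print "$_\n";
}</code>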
  
There are 3,133,554 sentences. The average number of tokens per sentence is 29 (higher than in the Penn Treebank Wall Street Journal data). There are some very long sentences: 18,222 sentences have more than 100 tokens. The longest sentence has 450 tokens.
Improved sentence delimiting (period + quote etc.):
  
<code>$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03b.sent.txt</code>
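
In terms of the toy splitter sketched above, the improvement roughly corresponds to an extra rule that also breaks when the terminal punctuation is followed by a closing quote token (again a sketch, not the actual change to the script):

<code>#!/usr/bin/env perl
# Sketch: break after . ! or ? even when a closing quote token
# ('' or ") stands between the punctuation and the next sentence.
while (<>) {
    chomp;
    s/([.!?]) ((?:''|")) (?=[A-Z])/$1 $2\n/g;  # punctuation + quote
    s/([.!?]) (?=[A-Z])/$1\n/g;                # plain boundary
    print "$_\n";
}</code>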
  
Now there are 3,722,125 sentences; 2,527 sentences have more than 100 tokens and 2 sentences have more than 400 tokens. We discard all sentences longer than 40 tokens, as well as sentences in which more than 40&nbsp;% of the words contain dashes or numbers:
  
<code>$TOOLS/discard_long_bad_sentences.pl < latwp.03b.sent.txt > latwp.04.clean.txt</code>
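
The filtering rule can be sketched as follows (the real ''discard_long_bad_sentences.pl'' may differ in details such as what exactly counts as a bad token):

<code>#!/usr/bin/env perl
# Sketch: drop sentences with more than 40 tokens, or where more
# than 40 % of the tokens contain a dash or a digit.
while (<>) {
    my @tokens = split;
    next if @tokens > 40;
    my $bad = grep { /[-0-9]/ } @tokens;
    next if @tokens && $bad / @tokens > 0.4;
    print;   # $_ still carries its newline
}</code>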
  
The new corpus contains 61,260,818 words, 273,704 word types, and 3,143,433 sentences. The longest sentence has 40 tokens; the average is 19 tokens per sentence.
  
