The **North American News Text Corpus (NANTC)** is a large collection of English newswire text, published by the Linguistic Data Consortium (Graff, 1995; LDC95T21). Our copy resides in
  
<code>/fs/clip-corpora/north_american_news</code>
  
Note: In the following text, ''$PARSINGROOT'' is the path to your working copy of the parsing SVN repository. ''$TOOLS'' refers to ''$PARSINGROOT/tools''.
The LATWP part is almost 500&nbsp;MB. There are 1,746,626 SGML-marked paragraphs. I converted the SGML to plain text where each line corresponds to one paragraph:
  
<code>$TOOLS/nant2txt.pl latwp.sgml > latwp.txt</code>
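
For orientation, a minimal sketch of what such a conversion could do, assuming paragraphs are introduced by ''<p>'' tags as in other LDC newswire SGML (the tag case and the exact markup are assumptions; the real ''nant2txt.pl'' may differ):

<code>#!/usr/bin/env perl
# Sketch: emit one paragraph per output line, stripping SGML tags.
# Assumes <p> marks the start of each paragraph (an assumption about
# the NANTC markup); document headers would need extra handling.
$/ = "<p>";                 # read the input in paragraph-sized chunks
while (<>) {
    s/<[^>]*>//g;           # drop the delimiter and any other tags
    s/\s+/ /g;              # collapse whitespace into single spaces
    s/^ //; s/ $//;         # trim
    print "$_\n" if length;
}</code>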
  
There are a number of SGML entities in the text. I do not have their definitions and thus cannot convert them to meaningful characters, but I want to get rid of them in a way that hurts parsing the least. Tokenization would split each of them into three tokens (''& ENTITY ;''); one token would be better. My guess is that the entities have functions similar to punctuation, so I replace them by a single dash. Only ''&AMP;'' is replaced by ''&''.
  
<code>perl -pe 'while(s/(&\S*?;)//) { print "$1\n" } $_ = ""' < north_american_news_text.sgml | sort -u</code>
  
&2$; &AMP; &Cx05; &Cx06; &Cx15; &Cx17; &Cx18; &Cx1a; &Cx1b; &D0; &D1; &D2; &D3; &D4; &FS; &G; &Gov; &Gr; &HT; &Inc; &L;
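
The replacement described above can be done with a one-liner along these lines (a sketch; the output file name is made up for illustration):

<code>perl -pe 's/&AMP;/&/g;   # restore the literal ampersand first
          s/&\S*?;/-/g;  # replace every other entity by a dash
         ' < latwp.txt > latwp.noent.txt</code>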
Tokenization using ''$HIEROROOT/preprocess/tokenizeE.pl'' takes a //long time//. I modified the script to autoflush its output in the hope that I could [[ClusterExampleScripts#Automatic_Parallelization|parallelize]] it using Adam's ''$HIEROROOT/cluster_tools/parallelize.sh''. Unfortunately, the output had about 180,000 fewer lines than the input, so I am running it again, non-parallel.
  
<code>$TOOLS/tokenizeE.pl - - < latwp.txt > latwp.02.tok.txt
$TOOLS/count_words.pl < latwp.02.tok.txt</code>
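
''count_words.pl'' is a local script; a minimal equivalent that counts token occurrences and distinct word types over whitespace-tokenized input might look like this (a sketch, not the actual script):

<code>#!/usr/bin/env perl
# Sketch: count tokens and word types in whitespace-tokenized text.
my %types;
my $tokens = 0;
while (<>) {
    for my $w (split) {
        $tokens++;
        $types{$w}++;
    }
}
printf "%d words (token occurrences), %d word types\n",
       $tokens, scalar(keys %types);</code>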
  
There are 90,471,928 words (token occurrences) and 442,423 word types.
Sentence boundaries are not tagged. We used a simple rule-based script to find them. Every line in its output contains exactly one sentence.
  
<code>$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03.sent.txt</code>
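
''find_sentences.pl'' is the local rule-based splitter; a toy version of the core rule, breaking after sentence-final punctuation that is followed by a capitalized token, could look like this (a real splitter also needs rules for abbreviations, initials, numbers, and the like):

<code>#!/usr/bin/env perl
# Toy sentence splitter for tokenized text: break after . ! or ?
# when the following token starts with an uppercase letter.
while (<>) {
    chomp;
    s/([.!?]) (?=[A-Z])/$1\n/g;
    print "$_\n";
}</code>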
  
There are 3,133,554 sentences. The average number of tokens per sentence is 29 (higher than in the Penn Treebank Wall Street Journal data). There are some very long sentences: 18,222 sentences have more than 100 tokens. The longest sentence has 450 tokens.
Improved sentence delimiting (period + quote etc.):
  
<code>$TOOLS/find_sentences.pl < latwp.02.tok.txt > latwp.03b.sent.txt</code>
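
In terms of the toy splitter sketched above, the improvement roughly corresponds to an extra rule that also breaks when the terminal punctuation is followed by a closing quote token (again a sketch, not the actual change to the script):

<code>#!/usr/bin/env perl
# Sketch: break after . ! or ? even when a closing quote token
# ('' or ") stands between the punctuation and the next sentence.
while (<>) {
    chomp;
    s/([.!?]) ((?:''|")) (?=[A-Z])/$1 $2\n/g;  # punctuation + quote
    s/([.!?]) (?=[A-Z])/$1\n/g;                # plain boundary
    print "$_\n";
}</code>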
  
Now there are 3,722,125 sentences; 2,527 sentences have more than 100 tokens and 2 sentences have more than 400 tokens. We discard all sentences longer than 40 tokens, as well as sentences in which more than 40&nbsp;% of the words contain dashes or numbers:
  
<code>$TOOLS/discard_long_bad_sentences.pl < latwp.03b.sent.txt > latwp.04.clean.txt</code>
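
The filtering rule can be sketched as follows (the real ''discard_long_bad_sentences.pl'' may differ in details such as what exactly counts as a bad token):

<code>#!/usr/bin/env perl
# Sketch: drop sentences with more than 40 tokens, or where more
# than 40 % of the tokens contain a dash or a digit.
while (<>) {
    my @tokens = split;
    next if @tokens > 40;
    my $bad = grep { /[-0-9]/ } @tokens;
    next if @tokens && $bad / @tokens > 0.4;
    print;   # $_ still carries its newline
}</code>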
  
The new corpus contains 61,260,818 words, 273,704 word types, and 3,143,433 sentences. The longest sentence has 40 tokens; the average is 19 tokens per sentence.
  
