The **North American News Text Corpus (NANTC)** is a large collection of English newswire text, published by the Linguistic Data Consortium (Graff, 1995; LDC95T21). Our copy resides in
<code>
/…
</code>
Note: In the following text, ''$TOOLS'' refers to the directory containing the processing scripts.
The LATWP (Los Angeles Times / Washington Post) part is almost 500 MB.
<code>
$TOOLS/…
</code>
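The command above is truncated in this copy of the page. Judging from the following step, it presumably extracts the plain text from the SGML markup; a hypothetical sketch (the file names and the assumption that each tag sits on a line of its own are mine) could look like this:
<code>
# Hypothetical sketch: drop lines that consist only of an SGML tag,
# keeping the text lines (assumes one tag per line, as in the LDC markup).
perl -ne 'print unless /^\s*<[^>]*>\s*$/' < latwp.sgml > latwp.txt
</code>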
There are a number of SGML entities in the text. I do not have their definitions and thus cannot convert them to meaningful characters, but I want to get rid of them in a way that hurts parsing the least. Tokenization would split each entity into three tokens (''&'', the entity name, and '';''), which we want to avoid:
<code>
perl -pe '…'
</code>
&2$; &…
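The substitution command is truncated above; a minimal sketch of this kind of entity removal (the pattern and the choice of a single space as replacement are assumptions, not the original command):
<code>
# Hypothetical sketch: collapse every SGML entity (&NAME;) into a single
# space so the tokenizer never sees the three-part & / name / ; sequence.
perl -pe 's/&[^;\s]+;/ /g' < latwp.txt > latwp.noent.txt
</code>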
Tokenization using ''…'':
<code>
$TOOLS/…
$TOOLS/…
</code>
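The tokenizer itself is not reproduced here; a crude sketch of the kind of tokenization usually applied to newswire text (file names hypothetical) might be:
<code>
# Hypothetical sketch: put spaces around punctuation so that it becomes
# separate tokens, then squeeze repeated spaces.
perl -pe 's/([.,;:!?"()])/ $1 /g; s/  +/ /g' < latwp.noent.txt > latwp.tok.txt
</code>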
There are 90,471,928 words (token occurrences) and 442,423 word types.
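Counts like these can be obtained with a standard one-liner over the tokenized, space-separated text (not necessarily the command used originally):
<code>
# Count token occurrences and distinct word types in one pass.
perl -lane '$tok += @F; $type{$_}++ for @F;
            END { print "$tok tokens, ", scalar(keys %type), " types" }' latwp.tok.txt
</code>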
Sentence boundaries are not tagged. We used a simple rule-based script to find them. Every line in its output contains exactly one sentence.
<code>
$TOOLS/…
</code>
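The script is not shown on this page; a minimal rule-based splitter in the same spirit (the single heuristic below is an assumption, not the actual rule set) would be:
<code>
#!/usr/bin/perl
# Hypothetical sketch: start a new line after sentence-final punctuation
# whenever the next word begins with a capital letter.
while (<>) {
    s/([.!?])\s+(?=[A-Z])/$1\n/g;
    print;
}
</code>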
There are 3,133,554 sentences. The average number of tokens per sentence is 29 (higher than in the Penn Treebank Wall Street Journal data). There are some very long sentences: 18,222 sentences have more than 100 tokens. The longest sentence has 450 tokens.
Improved sentence delimiting (period + quote etc.):
<code>
$TOOLS/…
</code>
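The refinement is again only hinted at by the truncated command; presumably it allows closing quotes or parentheses between the final punctuation and the break. A hypothetical sketch:
<code>
#!/usr/bin/perl
# Hypothetical sketch: also break when closing quotes or parentheses
# follow the sentence-final punctuation (possibly space-separated
# after tokenization).
while (<>) {
    s/([.!?](?:\s?["')])*)\s+(?=[A-Z])/$1\n/g;
    print;
}
</code>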
Now there are 3,722,125 sentences. 2,527 sentences have more than 100 tokens and 2 sentences have more than 400 tokens. We discard all sentences longer than 40 tokens and sentences containing more than 40 …
<code>
$TOOLS/…
</code>
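Only the first condition of the filter can be sketched here (the second condition is truncated in this copy; file names are hypothetical):
<code>
# Hypothetical sketch: keep only sentences with at most 40 tokens.
perl -ne 'my @t = split; print if @t <= 40' < latwp.sent.txt > latwp.filtered.txt
</code>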
The new corpus contains 61,260,818 words, 273,704 word types, and 3,143,433 sentences. The longest sentence has 40 tokens; the average is 19 tokens per sentence.