Differences

This shows you the differences between two versions of the page.

--- user:zeman:joshua [2010/03/07 22:37]
zeman Troubleshooter.
+++ user:zeman:joshua [2010/03/08 15:46]
zeman Long sentences are problem.
@@ Line 307: / Line 307: @@
 ===== Troubleshooter =====
 ==== Grammar extraction: Negative array size ====
 If you encounter this exception during corpus binarization or (in older releases of Joshua) during grammar extraction, check whether your alignment file matches your source and target corpus. Did you accidentally switch the translation direction? The alignment file must have the same number of lines as your source and target corpus, one line per sentence (segment) pair. The "tokens" on each line are pairs of numbers, such as "0-0 1-2 2-2 3-5". The first number in each pair is an index into the source sentence (the first token has index 0) and the second number is an index into the target sentence. If you switch the source and the target, some indices are likely to point outside the sentence, and you are in trouble.
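The consistency check described above can be sketched in a few lines of Python. This is only an illustration and not part of Joshua: the function name `check_alignment` and its return format are hypothetical, and it assumes whitespace-tokenized text with one sentence per line.

```python
def check_alignment(src_lines, tgt_lines, align_lines):
    """Return a list of (line_number, message) problems; empty if all is well.

    Hypothetical helper, not a Joshua tool. Line numbers are 1-based;
    line number 0 is used for problems that concern the whole file.
    """
    problems = []
    if not (len(src_lines) == len(tgt_lines) == len(align_lines)):
        problems.append((0, "line counts differ: source=%d target=%d alignment=%d"
                         % (len(src_lines), len(tgt_lines), len(align_lines))))
        return problems
    for n, (src, tgt, ali) in enumerate(zip(src_lines, tgt_lines, align_lines), start=1):
        src_len = len(src.split())
        tgt_len = len(tgt.split())
        for pair in ali.split():
            try:
                s, t = pair.split("-")
                s, t = int(s), int(t)
            except ValueError:
                problems.append((n, "malformed pair: %r" % pair))
                continue
            # Token indices are 0-based, so a valid index is strictly below the length.
            if s >= src_len or t >= tgt_len:
                problems.append((n, "pair %s points outside the sentence "
                                 "(source has %d tokens, target has %d)"
                                 % (pair, src_len, tgt_len)))
    return problems
```

To use it on real files, read each file with `open(path, encoding="utf-8").read().splitlines()` and pass the three lists in. If the check passes in one direction but reports many out-of-range pairs with source and target exchanged, the translation direction was probably switched.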
+Another possible source of this error is sentences with 100 or more words. This is not a strict limit: I was often able to extract grammars from corpora that had not been checked for such sentences. According to Lance Schwartz, however, long sentences can cause problems. They are suspicious anyway, and their contribution to the learnt model is doubtful.
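A minimal sketch for locating such sentences before extraction follows. The function name `long_sentence_pairs` and the default `limit` of 100 are assumptions chosen to match the text above, not a Joshua setting, and words are counted by splitting on whitespace.

```python
def long_sentence_pairs(src_lines, tgt_lines, limit=100):
    """Return 1-based line numbers where either side has `limit` or more tokens.

    Hypothetical helper, not a Joshua tool; `limit` is an assumed threshold.
    """
    flagged = []
    for n, (src, tgt) in enumerate(zip(src_lines, tgt_lines), start=1):
        if len(src.split()) >= limit or len(tgt.split()) >= limit:
            flagged.append(n)
    return flagged
```

If you decide to drop the flagged pairs, remove the same line numbers from the source corpus, the target corpus and the alignment file, so that all three keep the same number of lines.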
 ==== ZMERT: corrupted temp file ====

Institute of Formal and Applied Linguistics Wiki