Differences
This shows you the differences between two versions of the page.
user:zeman:joshua [2009/06/10 00:05] zeman Mismatch between the sources and documentation from svn and the release package. |
user:zeman:joshua [2010/08/31 15:55] (current) zeman Typo. |
||
---|---|---|---|
Line 9: | Line 9: | ||
* http:// | * http:// | ||
* http:// | * http:// | ||
+ | * http:// | ||
===== Installation ===== | ===== Installation ===== | ||
Line 95: | Line 95: | ||
* Tokenized and segmented text in the target language (hi). | * Tokenized and segmented text in the target language (hi). | ||
* We produce the alignment with [[Giza++|Giza++]]. A correct alignment file has the same number of lines as the source and target texts (one sentence per line), but instead of words each line holds a sequence of number pairs (e.g. "2-0 2-1 2-2 2-3 1-4"). The numbers express, | * We produce the alignment with [[Giza++|Giza++]]. A correct alignment file has the same number of lines as the source and target texts (one sentence per line), but instead of words each line holds a sequence of number pairs (e.g. "2-0 2-1 2-2 2-3 1-4"). The numbers express, | ||
+ | |||
+ | ==== Outdated instructions ==== | ||
+ | |||
+ | **Warning: the following notes come from | ||
And this is how we run Joshua to extract a grammar from the training data. For some reason Joshua also requires a test file in the source language. Judging by the example they supplied, it is enough to copy the first sentence of the source training data. The grammar then still has to be sorted, stripped of duplicate rules, and gzipped. | And this is how we run Joshua to extract a grammar from the training data. For some reason Joshua also requires a test file in the source language. Judging by the example they supplied, it is enough to copy the first sentence of the source training data. The grammar then still has to be sorted, stripped of duplicate rules, and gzipped. | ||
Line 142: | Line 146: | ||
--output=model/ | --output=model/ | ||
--maxPhraseLength=5</ | --maxPhraseLength=5</ | ||
+ | |||
+ | ==== New instructions for Joshua 1.3 ==== | ||
+ | |||
+ | The following is an excerpt from | ||
+ | |||
+ | The recommended way to extract a grammar is to configure an ant XML file for ExtractRules. All available parameters can be configured using that technique. The main method is meant now to just be a simple version for use if you don't need any custom configuration. | ||
+ | |||
+ | The current version of ExtractRules and its parameters are documented in my and Chris' | ||
+ | http:// | ||
+ | |||
+ | The ant file should look something like this: | ||
+ | |||
+ | extract.xml: | ||
+ | |||
+ | <code xml>< | ||
+ | |||
+ | <!-- Define the path to Joshua class files --> | ||
+ | < | ||
+ | value="/ | ||
+ | |||
+ | <!-- Define the ant task to compile a corpus into binary memory-mappable files --> | ||
+ | < | ||
+ | classname=" | ||
+ | classpath=" | ||
+ | |||
+ | <!-- Define the ant task to extract rules --> | ||
+ | < | ||
+ | classname=" | ||
+ | classpath=" | ||
+ | |||
+ | |||
+ | <!-- Declare a target to compile a corpus --> | ||
+ | <target name=" | ||
+ | description=" | ||
+ | < | ||
+ | sourceCorpus="/ | ||
+ | targetCorpus="/ | ||
+ | alignments="/ | ||
+ | outputDir="/ | ||
+ | /> | ||
+ | </ | ||
+ | |||
+ | |||
+ | <!-- Declare a target to extract a grammar --> | ||
+ | <target name=" | ||
+ | description=" | ||
+ | < | ||
+ | joshDir="/ | ||
+ | outputFile="/ | ||
+ | testFile="/ | ||
+ | /> | ||
+ | </ | ||
+ | |||
+ | |||
+ | <!-- Declare a target to extract a grammar with other parameters--> | ||
+ | <target name=" | ||
+ | description=" | ||
+ | < | ||
+ | joshDir="/ | ||
+ | outputFile="/ | ||
+ | testFile="/ | ||
+ | maxPhraseSpan=" | ||
+ | maxPhraseLength=" | ||
+ | requireTightSpans=" | ||
+ | edgeXViolates=" | ||
+ | sentenceInitialX=" | ||
+ | sentenceFinalX=" | ||
+ | ruleSampleSize=" | ||
+ | maxNonterminals=" | ||
+ | /> | ||
+ | </ | ||
+ | |||
+ | </ | ||
+ | |||
+ | You can call this, with any of the targets that you define in extract.xml, | ||
+ | |||
+ | <code bash># Compile the corpus | ||
+ | ant -f extract.xml compile_de-en | ||
+ | |||
+ | # Extract rules using defaults | ||
+ | ant -f extract.xml extract_de-en | ||
+ | |||
+ | # Extract rules using custom settings | ||
+ | ant -f extract.xml extract_de-en-custom</ | ||
===== Decoding ===== | ===== Decoding ===== | ||
Line 305: | Line 393: | ||
$HINDI/ | $HINDI/ | ||
> $HINDI/ | > $HINDI/ | ||
+ | |||
+ | ===== Troubleshooter ===== | ||
+ | |||
+ | |||
+ | |||
+ | ==== Grammar extraction: Negative array size ==== | ||
+ | |||
+ | If you encounter this exception during corpus binarization or (in older releases of Joshua) during grammar extraction, check whether your alignment file matches your source and target corpus. Did you accidentally switch the translation direction? | ||
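
The check described above can be sketched in Python. This is only an illustration, not part of Joshua: the function names are made up, and it assumes that each alignment pair is written as source index, hyphen, target index, with both indices counted from zero over whitespace-separated tokens. If your aligner writes the pairs the other way round, swap the two comparisons.

```python
def parse_alignment(line):
    """Parse one alignment line such as '2-0 2-1 1-4' into (i, j) integer pairs."""
    pairs = []
    for token in line.split():
        left, right = token.split("-")
        pairs.append((int(left), int(right)))
    return pairs


def find_mismatches(source_lines, target_lines, alignment_lines):
    """Return (line_number, message) for every sentence whose alignment
    points outside the source or target sentence.

    Assumption (hypothetical convention): pairs are source-target, zero-based.
    """
    counts = (len(source_lines), len(target_lines), len(alignment_lines))
    if len(set(counts)) != 1:
        raise ValueError("line counts differ (source/target/alignment): %d/%d/%d" % counts)

    problems = []
    for number, (src, tgt, ali) in enumerate(
            zip(source_lines, target_lines, alignment_lines), start=1):
        n_src = len(src.split())
        n_tgt = len(tgt.split())
        for i, j in parse_alignment(ali):
            if i >= n_src or j >= n_tgt:
                message = "pair %d-%d exceeds %d source / %d target tokens" % (
                    i, j, n_src, n_tgt)
                problems.append((number, message))
                break  # one report per sentence is enough
    return problems
```

Many out-of-range pairs reported across the corpus would suggest that the alignment was produced for the opposite translation direction or for a different version of the corpus; a clean result does not prove that the files match.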
+ | |||
+ | Another source of this error could be sentences with 100 or more words. This is not a strict limit: I was often able to extract grammars from corpora that had not been checked for such sentences, but according to Lane Schwartz, long sentences can cause problems. (For me, filtering out such sentences helped with the es-en WMT08 training data.) Such sentences are suspicious anyway, and their contribution to the learnt model is doubtful. | ||
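
Filtering out long sentences could look roughly like the following Python sketch. The function name and the default threshold of 99 tokens (that is, dropping sentences with 100 or more words) are assumptions made for illustration. The sketch removes the same lines from the source text, the target text and the alignment so that the three files stay parallel; in practice it may be cleaner to filter the corpus before running the aligner.

```python
def filter_long_sentences(source_lines, target_lines, alignment_lines, max_tokens=99):
    """Drop every sentence pair in which either side has more than max_tokens
    whitespace-separated tokens. Returns three new, still parallel lists."""
    kept_source, kept_target, kept_alignment = [], [], []
    for src, tgt, ali in zip(source_lines, target_lines, alignment_lines):
        if len(src.split()) > max_tokens or len(tgt.split()) > max_tokens:
            continue  # too long on at least one side: skip the whole pair
        kept_source.append(src)
        kept_target.append(tgt)
        kept_alignment.append(ali)
    return kept_source, kept_target, kept_alignment
```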
+ | |||
+ | ==== ZMERT: corrupted temp file ==== | ||
+ | |||
+ | Hi all, | ||
+ | |||
+ | does the following ZMERT exception look familiar to anyone? My only idea was that the nbest output from the decoder is corrupted somehow. However, I cannot find anything strange in it, such as a sequence of more than three " | ||
+ | |||
+ | Thanks, | ||
+ | Dan | ||
+ | |||
+ | zmert.out: | ||
+ | ----- | ||
+ | < | ||
+ | Processed the following args array: | ||
+ | -dir / | ||
+ | |||
+ | ---------------------------------------------------- | ||
+ | Initializing... | ||
+ | ---------------------------------------------------- | ||
+ | |||
+ | Random number generator initialized using seed: 12341234 | ||
+ | |||
+ | Number of sentences: 2051 | ||
+ | Number of documents: 1 | ||
+ | Optimizing BLEU | ||
+ | docSubsetInfo: | ||
+ | Number of features: 5 | ||
+ | Feature names: {" | ||
+ | |||
+ | c Default value Optimizable? | ||
+ | 1 | ||
+ | 2 | ||
+ | 3 | ||
+ | 4 | ||
+ | 5 | ||
+ | |||
+ | Weight vector normalization method: weights will be scaled so that the " | ||
+ | |||
+ | ---------------------------------------------------- | ||
+ | |||
+ | ---------------------------------------------------- | ||
+ | Z-MERT run started @ Sat Mar 06 23:52:57 CET 2010 | ||
+ | ---------------------------------------------------- | ||
+ | |||
+ | Initial lambda[]: {1.0, 1.066893, 0.752247, 0.589793, -2.844814} | ||
+ | |||
+ | --- Starting Z-MERT iteration #1 @ Sat Mar 06 23:52:57 CET 2010 --- | ||
+ | Decoding using initial weight vector {1.0, 1.066893, 0.752247, 0.589793, -2.844814} | ||
+ | Running external decoder... | ||
+ | ...finished decoding @ Sun Mar 07 00:02:33 CET 2010 | ||
+ | Reading candidate translations from iterations 1-1 | ||
+ | (and computing BLEU sufficient statistics for previously unseen candidates) | ||
+ | | ||
+ | Exception in thread " | ||
+ | at java.lang.NumberFormatException.forInputString(NumberFormatException.java: | ||
+ | at java.lang.Integer.parseInt(Integer.java: | ||
+ | at java.lang.Integer.parseInt(Integer.java: | ||
+ | at joshua.zmert.MertCore.run_single_iteration(MertCore.java: | ||
+ | at joshua.zmert.MertCore.main(MertCore.java: | ||
+ | Z-MERT exiting prematurely (MertCore returned 1)...</ | ||
+ | ----- | ||
+ | |||
+ | Omar's response: | ||
+ | |||
+ | Hi Dan, | ||
+ | |||
+ | The " | ||
+ | if there are any *temp* (or *tmp*) files in the folder from earlier | ||
+ | runs, make sure you delete them first, then try launching Z-MERT | ||
+ | again. | ||
+ | not delete them because they can be used to restart Z-MERT from the | ||
+ | point where it crashed. | ||
+ | loss or an interrupted job, etc. In your case, I think what happened | ||
+ | is that a prior run crashed because of an external problem in the | ||
+ | setup itself, which you fixed and tried to restart Z-MERT. | ||
+ | reason, Z-MERT should not be using those temp files in the first | ||
+ | place, but when it sees them there, it assumes it can use them because | ||
+ | the user did not delete them. | ||
+ | |||
+ | Let me know if that's not the case. | ||
+ | |||
+ | O.Z. | ||
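
Before relaunching Z-MERT it may help to list the leftover files that match the patterns mentioned in the reply. The following Python sketch only reports them and deletes nothing; the function name is hypothetical, and which directory to scan depends on your own setup.

```python
import fnmatch
import os


def find_stale_temp_files(directory):
    """Return sorted paths of regular files in `directory` whose names match
    *temp* or *tmp* (the patterns named in the reply above)."""
    matches = []
    for name in os.listdir(directory):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # ignore subdirectories
        if fnmatch.fnmatch(name, "*temp*") or fnmatch.fnmatch(name, "*tmp*"):
            matches.append(path)
    return sorted(matches)
```

Review the returned list by hand before removing anything, since a working directory can contain unrelated files whose names happen to match.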
+ |