Differences
This shows you the differences between two versions of the page.
user:zeman:joshua [2009/06/04 11:19] zeman Preparing to launch Z-MERT. |
user:zeman:joshua [2010/03/07 22:37] zeman Troubleshooter. |
||
---|---|---|---|
Line 9: | Line 9: | ||
* http:// | * http:// | ||
* http:// | * http:// | ||
+ | * http:// | ||
===== Installation ===== | ===== Installation ===== | ||
Line 27: | Line 28: | ||
< | < | ||
svn co https:// | svn co https:// | ||
+ | |||
+ | Note: In early June 2009 I had problems with the sources obtained directly from SVN (the SuffixArray class did not contain a main() method), but perhaps it was only a temporary lapse in the documentation, | ||
+ | |||
+ | http:// | ||
+ | |||
+ | went to the Download link and downloaded the current release version.
Compile Joshua: | Compile Joshua: | ||
Line 79: | Line 86: | ||
qstat -u ' | qstat -u ' | ||
- | ===== Usage | + |
+ | ===== Grammar extraction | ||
Joshua is installed and working. Now we need to learn how to train it and how to use it for translation. | Joshua is installed and working. Now we need to learn how to train it and how to use it for translation. | ||
Line 86: | Line 94: | ||
* Tokenized and segmented text in the source language (en). | * Tokenized and segmented text in the source language (en). | ||
* Tokenized and segmented text in the target language (hi). | * Tokenized and segmented text in the target language (hi). | ||
- | * The alignment is produced with [[Giza++|Giza++]]. | + | * The alignment is produced with [[Giza++|Giza++]]. The correct alignment file |
- | + | ||
- | The correct alignment file | + |
- | + | ||
- | < | + | |
- | 0-3 7-4 8-5 9-6 10-7 11-8 12-9 13-10 14-11 15-12 16-13 4-15 2-17 3-18 20-19 18-21 21-22 22-23 22-24 22-25 19-26 23-27 | + | |
- | 0-0 1-1 2-1 3-2 4-4 5-5 7-9 8-16 9-17 10-17 12-17 13-17 14-17 15-17 17-17 18-17 11-18 18-19 18-20 19-21 | + | |
- | 1-0 4-2 6-4 7-5 7-6 5-7 7-7 6-8 8-9 7-10 8-11 8-12 8-13 11-14 12-17 | + | |
- | 0-0 1-1 2-1 3-1 7-2 8-3 9-4 6-5 11-6 11-7 12-10 13-11 14-12 15-13 16-14 22-15 23-15 21-16 26-17 17-20 28-22 29-23 27-26 25-28 30-29 31-30 32-30 33-30 33-31 33-32 34-33</ | + |
This is how we run Joshua to extract a grammar from the training data. For some reason Joshua also requires a test file in the source language. Judging by the example they provided, it is enough to copy the first sentence of the source training data. The grammar then still has to be sorted, stripped of duplicate rules, and gzipped. | This is how we run Joshua to extract a grammar from the training data. For some reason Joshua also requires a test file in the source language. Judging by the example they provided, it is enough to copy the first sentence of the source training data. The grammar then still has to be sorted, stripped of duplicate rules, and gzipped. | ||
Line 106: | Line 105: | ||
setenv GRM en-hi.grammar | setenv GRM en-hi.grammar | ||
head -1 $SRC > $TST | head -1 $SRC > $TST | ||
- | java -cp $JOSHUA/bin joshua.prefix_tree.ExtractRules --source=$SRC --target=$TGT --alignments=$ALI --test=$TST --output=$GRM.unsorted --maxPhraseLength=5 | + | java -cp $JOSHUA/bin joshua.prefix_tree.ExtractRules |
+ | | ||
+ | | ||
sort -u $GRM.unsorted > $GRM | sort -u $GRM.unsorted > $GRM | ||
gzip $GRM</ | gzip $GRM</ | ||
Line 116: | Line 117: | ||
Binarize the source side of the corpus. | Binarize the source side of the corpus. | ||
- | < | + | < |
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | java -cp $JOSHUA/bin joshua.corpus.suffix_array.SuffixArray \ | ||
+ | $WORK/ | ||
+ | $WORK/ | ||
+ | $WORK/ | ||
+ | $WORK/ | ||
+ | java -cp $JOSHUA/bin joshua.corpus.alignment.AlignmentGrids \ | ||
+ | $WORK/ | ||
+ | $WORK/ | ||
This is how a grammar is extracted for specific test data using the binarized corpus: | This is how a grammar is extracted for specific test data using the binarized corpus: | ||
- | < | + | < |
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
+ | | ||
===== Decoding ===== | ===== Decoding ===== | ||
Line 273: | Line 293: | ||
< | < | ||
-s src.txt # source sentences file name | -s src.txt # source sentences file name | ||
- | -r ref | + | -r ref.txt # target sentences file name |
-rps 1 # references per sentence | -rps 1 # references per sentence | ||
-maxIt 5 # maximum MERT iterations | -maxIt 5 # maximum MERT iterations | ||
Line 285: | Line 305: | ||
$HINDI/ | $HINDI/ | ||
> $HINDI/ | > $HINDI/ | ||
+ | |||
+ | ===== Troubleshooter ===== | ||
+ | |||
+ | ==== Grammar extraction: Negative array size ==== | ||
+ | |||
+ | If you encounter this exception during corpus binarization or (in older releases of Joshua) during grammar extraction, check whether your alignment file matches your source and target corpus. Did you accidentally switch the translation direction? | ||
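
One way to check whether an alignment file matches the corpus is to verify that every aligned index falls inside the corresponding sentence. The following is only a sketch under stated assumptions: each alignment line is taken to hold whitespace-separated pairs in the form source-target (as in 0-3 7-4), indices are zero-based, and sentences are tokenized on whitespace. The function names are hypothetical and not part of Joshua.

```python
from typing import List, Tuple


def parse_alignment(line: str) -> List[Tuple[int, int]]:
    """Parse a line such as "0-3 7-4" into (source, target) index pairs."""
    pairs = []
    for token in line.split():
        src, tgt = token.split("-")
        pairs.append((int(src), int(tgt)))
    return pairs


def check_line(src_sentence: str, tgt_sentence: str, alignment_line: str) -> str:
    """Return "ok", "swapped?" or "mismatch" for one sentence pair.

    Assumption: the first index of each pair refers to the source sentence
    and the second to the target sentence, both zero-based.
    """
    src_len = len(src_sentence.split())
    tgt_len = len(tgt_sentence.split())
    pairs = parse_alignment(alignment_line)
    if all(s < src_len and t < tgt_len for s, t in pairs):
        return "ok"
    # The indices do not fit; see whether they would fit in the other direction.
    if all(s < tgt_len and t < src_len for s, t in pairs):
        return "swapped?"
    return "mismatch"


def check_corpus(src_lines: List[str], tgt_lines: List[str],
                 ali_lines: List[str]) -> List[Tuple[int, str]]:
    """Return (1-based line number, verdict) for every line that is not "ok"."""
    if not (len(src_lines) == len(tgt_lines) == len(ali_lines)):
        raise ValueError("source, target and alignment files differ in line count")
    problems = []
    for number, (src, tgt, ali) in enumerate(zip(src_lines, tgt_lines, ali_lines), 1):
        verdict = check_line(src, tgt, ali)
        if verdict != "ok":
            problems.append((number, verdict))
    return problems
```

Many lines reported as "swapped?" would suggest that the alignment was produced for the opposite translation direction. A line that passes is not proven correct, since short index values fit in either direction.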
+ | |||
+ | ==== ZMERT: corrupted temp file ==== | ||
+ | |||
+ | Hi all, | ||
+ | |||
+ | does the following ZMERT exception look familiar to anyone? My only idea was that the nbest output from the decoder is corrupted somehow. However, I cannot find anything strange in it, such as a sequence of more than three " | ||
+ | |||
+ | Thanks, | ||
+ | Dan | ||
+ | |||
+ | zmert.out: | ||
+ | ----- | ||
+ | < | ||
+ | Processed the following args array: | ||
+ | -dir / | ||
+ | |||
+ | ---------------------------------------------------- | ||
+ | Initializing... | ||
+ | ---------------------------------------------------- | ||
+ | |||
+ | Random number generator initialized using seed: 12341234 | ||
+ | |||
+ | Number of sentences: 2051 | ||
+ | Number of documents: 1 | ||
+ | Optimizing BLEU | ||
+ | docSubsetInfo: | ||
+ | Number of features: 5 | ||
+ | Feature names: {" | ||
+ | |||
+ | c Default value Optimizable? | ||
+ | 1 | ||
+ | 2 | ||
+ | 3 | ||
+ | 4 | ||
+ | 5 | ||
+ | |||
+ | Weight vector normalization method: weights will be scaled so that the " | ||
+ | |||
+ | ---------------------------------------------------- | ||
+ | |||
+ | ---------------------------------------------------- | ||
+ | Z-MERT run started @ Sat Mar 06 23:52:57 CET 2010 | ||
+ | ---------------------------------------------------- | ||
+ | |||
+ | Initial lambda[]: {1.0, 1.066893, 0.752247, 0.589793, -2.844814} | ||
+ | |||
+ | --- Starting Z-MERT iteration #1 @ Sat Mar 06 23:52:57 CET 2010 --- | ||
+ | Decoding using initial weight vector {1.0, 1.066893, 0.752247, 0.589793, -2.844814} | ||
+ | Running external decoder... | ||
+ | ...finished decoding @ Sun Mar 07 00:02:33 CET 2010 | ||
+ | Reading candidate translations from iterations 1-1 | ||
+ | (and computing BLEU sufficient statistics for previously unseen candidates) | ||
+ | | ||
+ | Exception in thread " | ||
+ | at java.lang.NumberFormatException.forInputString(NumberFormatException.java: | ||
+ | at java.lang.Integer.parseInt(Integer.java: | ||
+ | at java.lang.Integer.parseInt(Integer.java: | ||
+ | at joshua.zmert.MertCore.run_single_iteration(MertCore.java: | ||
+ | at joshua.zmert.MertCore.main(MertCore.java: | ||
+ | Z-MERT exiting prematurely (MertCore returned 1)...</ | ||
+ | ----- | ||
+ | |||
+ | Omar's response: | ||
+ | |||
+ | Hi Dan, | ||
+ | |||
+ | The " | ||
+ | if there are any *temp* (or *tmp*) files in the folder from earlier | ||
+ | runs, make sure you delete them first, then try launching Z-MERT | ||
+ | again. | ||
+ | not delete them because they can be used to restart Z-MERT from the | ||
+ | point where it crashed. | ||
+ | loss or an interrupted job, etc. In your case, I think what happened | ||
+ | is that a prior run crashed because of an external problem in the | ||
+ | setup itself, which you fixed and tried to restart Z-MERT. | ||
+ | reason, Z-MERT should not be using those temp files in the first | ||
+ | place, but when it sees them there, it assumes it can use them because | ||
+ | the user did not delete them. | ||
+ | |||
+ | Let me know if that's not the case. | ||
+ | |||
+ | O.Z. | ||
+ |
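
Following Omar's advice, leftover files can be listed before Z-MERT is launched again. This is a minimal sketch, assuming the leftovers sit directly in the Z-MERT working folder and that their names match the patterns he mentions, *temp* or *tmp*. The function names are hypothetical. Deletion is off by default, because the same files can be used to resume a run that crashed for an unrelated reason.

```python
from pathlib import Path
from typing import List


def find_stale_temp_files(workdir: str) -> List[Path]:
    """List regular files in workdir whose names contain "temp" or "tmp"."""
    folder = Path(workdir)
    matches = set(folder.glob("*temp*")) | set(folder.glob("*tmp*"))
    return sorted(path for path in matches if path.is_file())


def remove_stale_temp_files(workdir: str, dry_run: bool = True) -> List[Path]:
    """Return the stale files; delete them only when dry_run is False."""
    stale = find_stale_temp_files(workdir)
    if not dry_run:
        for path in stale:
            path.unlink()
    return stale
```

Calling remove_stale_temp_files on the working folder with the default dry_run=True only reports what would be deleted, which leaves room to keep the files when a restart from the crash point is actually wanted.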