Institute of Formal and Applied Linguistics Wiki



courses:rg:2012:encouraging-consistent-translation
  * They define 3 features that are designed to be biased towards consistency -- or are they?
    * If e.g. two variants are used 2 times each, they will have roughly the same score
  * The BM25 function is a refined version of the [[http://en.wikipedia.org/wiki/TF-IDF|TF-IDF]] score (see the sketch below)
  * The exact parameter values are probably not tuned, left at a default value (and maybe they don't have much influence anyway)
    * See NPFL103 for details on Information Retrieval; it's largely black magic
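
A rough sketch of how BM25 refines plain TF-IDF, for orientation only. The formula and the defaults k1 = 1.2, b = 0.75 are the usual Okapi BM25 ones from the IR literature, not values taken from the paper:

<code python>
import math

def tf_idf(tf, df, n_docs):
    """Plain TF-IDF: term frequency times inverse document frequency."""
    return tf * math.log(n_docs / df)

def bm25(tf, df, n_docs, doc_len, avg_doc_len, k1=1.2, b=0.75):
    """Okapi BM25 term weight: TF-IDF with term-frequency saturation (k1)
    and document-length normalization (b); k1 and b are common IR defaults,
    not values from the paper."""
    idf = math.log(1.0 + (n_docs - df + 0.5) / (df + 0.5))
    saturated_tf = (tf * (k1 + 1)) / (tf + k1 * (1 - b + b * doc_len / avg_doc_len))
    return idf * saturated_tf

# A term occurring twice in a 100-token document, found in 10 of 1000 documents:
print(tf_idf(2, 10, 1000))                               # ~9.21
print(bm25(2, 10, 1000, doc_len=100, avg_doc_len=120))   # ~6.57
</code>
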
  * The resulting weights are not mentioned, but if the weight is < 0, will this favor different translation choices?
  
**Meaning of the individual features**
  * C1 indicates that a certain Hiero rule was used frequently
    * but rules are very similar, so we also need something less fine-grained
  * C2 is a target-side feature, just counts the target side tokens (only the "most important" ones, in terms of TF-IDF)
    * It may be compared to Language Model features, but is trained only on the target part of the bilingual training data.
  * C3 counts occurrences of source-target token pairs (and again uses the "most important" term pair for each rule); see the sketch below
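
A rough sketch of how the three counts might be computed for one rule from the first-pass translation of the current document. The data structures (the rule record with its "most important" terms, the per-document rule/token/pair lists) are invented for illustration; the paper's exact definitions may differ:

<code python>
from collections import Counter

def consistency_features(rule, doc_rules, doc_target_tokens, doc_pairs):
    """Document-level counts for one Hiero rule, as read from the notes above.
    All doc_* arguments describe what the first decoding pass produced for
    the current document (hypothetical data structures)."""
    rule_counts = Counter(doc_rules)            # C1: how often each rule id was applied
    target_counts = Counter(doc_target_tokens)  # C2: target-side token counts
    pair_counts = Counter(doc_pairs)            # C3: (source term, target term) counts

    c1 = rule_counts[rule["id"]]
    c2 = target_counts[rule["target_term"]]
    c3 = pair_counts[(rule["source_term"], rule["target_term"])]
    return c1, c2, c3

rule = {"id": "X -> <Wahl ; election>", "source_term": "Wahl", "target_term": "election"}
print(consistency_features(
    rule,
    doc_rules=["X -> <Wahl ; election>", "X -> <Wahl ; election>", "X -> <Stimme ; vote>"],
    doc_target_tokens=["the", "election", "was", "held", "before", "the", "election"],
    doc_pairs=[("Wahl", "election"), ("Wahl", "election"), ("Stimme", "vote")],
))  # -> (2, 2, 2)
</code>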
  
**Requirements of the new features**
  * They need two passes through the data (see the sketch below)
  * You need to have document segmentation
    * Since the frequencies are trained on the training set, you can just translate one document at a time, no need to have full sets of documents
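
A rough sketch of the two-pass, one-document-at-a-time setup implied by these points; decode() stands for a hypothetical wrapper around the baseline decoder, not a real API:

<code python>
from collections import Counter, namedtuple

# Hypothetical decoder output: translation text plus the Hiero rules used.
Hypothesis = namedtuple("Hypothesis", "text rules_used")

def translate_document(sentences, decode):
    """Pass 1 translates the document with the baseline features only;
    the rules used there give the document-level counts; pass 2 re-decodes
    with those counts available to the consistency features."""
    first_pass = [decode(s, doc_counts=None) for s in sentences]
    doc_counts = Counter(r for hyp in first_pass for r in hyp.rules_used)
    return [decode(s, doc_counts=doc_counts) for s in sentences]

# Dummy stand-in for the decoder, just to show the calling pattern:
def dummy_decode(sentence, doc_counts):
    return Hypothesis(text=sentence.upper(), rules_used=["X -> <" + sentence + ">"])

print(translate_document(["ein satz", "noch ein satz"], dummy_decode))
</code>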
