DBMT

Czech-English Dependency-based Machine Translation – Čmejrek, Cuřín, and Havelka 03
PCEDT BLEU dtest/etest: 0.1974 0.1704

It is the Magenta pipeline, except that generation is rule-based (instead of statistical tree-to-tree transduction followed by an LM).

Run on the Czech translation of the Penn Treebank.

  1. tokenization and tagging [Hajic 98]
  2. parsing into a_trees [Hajic 98, Charniak 99]
  3. afun assignment [ZZ 02]
  4. a_tree → t_tree [Bohmova 01]
  5. functor assignment with C4.5 [ZZ 02]
  6. dictionary built with GIZA++ [Och and Ney 02]: only the single most probable translation is kept, 1-2 handled as 1-1 multiword
    - generator
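
The dictionary step keeps only the most probable translation per source lemma. A minimal sketch of that selection, assuming a hypothetical table of (Czech lemma, English translation, probability) triples. The function name `best_translations` and the toy entries are invented for illustration, and this is not the actual GIZA++ output format:

```python
from typing import Dict, Iterable, Tuple

def best_translations(
    entries: Iterable[Tuple[str, str, float]]
) -> Dict[str, str]:
    """Keep the single most probable translation for each source lemma."""
    best: Dict[str, Tuple[str, float]] = {}
    for src, tgt, prob in entries:
        if src not in best or prob > best[src][1]:
            best[src] = (tgt, prob)
    return {src: tgt for src, (tgt, _) in best.items()}

# Invented toy entries. A two-word target is stored as one multiword
# entry, which is one possible reading of "1-2 as 1-1 multiword".
toy_table = [
    ("pes", "dog", 0.8),
    ("pes", "hound", 0.2),
    ("nádraží", "railway station", 0.6),
    ("nádraží", "station", 0.4),
]
lexicon = best_translations(toy_table)
```

Keeping one translation per lemma makes the transfer deterministic, at the cost of ignoring context-dependent word senses.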

    ====== Generator ======
    receives TGTS without tfa; what about coreference? :?:

    == 1. determining contextual boundness ==
    in Czech, whatever stands to the left of the verb is considered CB → definite article
    whatever stands to the right of the verb is treated as not bound → indefinite article
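
The position heuristic above can be sketched as follows. This is a simplification under stated assumptions: it works on a flat token list with a known verb index rather than on tree nodes, the label `UB` is assumed to mean "not contextually bound", and the names `mark_boundness` and `ARTICLE_FOR` are hypothetical:

```python
from typing import List, Optional

def mark_boundness(tokens: List[str], verb_index: int) -> List[Optional[str]]:
    """Label tokens left of the verb 'CB' and right of it 'UB'; the verb gets None."""
    labels: List[Optional[str]] = []
    for i, _ in enumerate(tokens):
        if i < verb_index:
            labels.append("CB")
        elif i > verb_index:
            labels.append("UB")
        else:
            labels.append(None)
    return labels

# Default article suggested by each label (later rules may override it).
ARTICLE_FOR = {"CB": "the", "UB": "a"}

tokens = ["pes", "honil", "kočku"]          # toy Czech sentence, verb at index 1
labels = mark_boundness(tokens, verb_index=1)
articles = [ARTICLE_FOR.get(label) for label in labels]
```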


    == 2. reordering of constituents ==
    based on CB, the Sb is selected from ACT|PAT|ADDR
    declarative sentence: CB adjuncts + Sb + V + direct/indirect Obj + UB adjuncts
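
The declarative-sentence template can be read as a fixed slot order. A minimal sketch, assuming the constituents have already been sorted into named slots; the slot names and the function `linearize` are hypothetical, and the example phrases are invented:

```python
from typing import Dict, List

# Slot order taken from the template:
# CB adjuncts + Sb + V + direct/indirect Obj + UB adjuncts
SLOT_ORDER = ["cb_adjuncts", "subject", "verb", "objects", "ub_adjuncts"]

def linearize(slots: Dict[str, List[str]]) -> List[str]:
    """Concatenate the filled slots in template order; missing slots are skipped."""
    out: List[str] = []
    for name in SLOT_ORDER:
        out.extend(slots.get(name, []))
    return out

sentence = linearize({
    "verb": ["bought"],
    "subject": ["Peter"],
    "objects": ["a book"],
    "cb_adjuncts": ["yesterday"],
    "ub_adjuncts": ["in Prague"],
})
```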

    == 3. generation of verb forms ==
    passive/active voice is chosen according to the functor of the Sb
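
The notes do not spell out the mapping from functor to voice. One plausible reading, stated here purely as an assumption, is that an ACT subject gives active voice and any other subject functor (PAT, ADDR) gives passive. The function `choose_voice` is hypothetical:

```python
def choose_voice(subject_functor: str) -> str:
    """Assumed mapping: ACT subject -> active, any other functor -> passive."""
    return "active" if subject_functor == "ACT" else "passive"

voices = {functor: choose_voice(functor) for functor in ("ACT", "PAT", "ADDR")}
```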

    == 4. insertion of prepositions and articles ==
    preps: chosen according to the Czech preposition and the EN noun

    articles: definite for a postmodified NP, or one premodified by a superlative or an ordinal numeral
    article prevented: uncountable and proper nouns, or predetermination by possessive and demonstrative pronouns
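
The article rules can be sketched as a small decision function. The `NounPhrase` fields and the function `choose_article` are hypothetical, and so is the rule priority (blocking conditions checked first, then the definite triggers, then the CB-based default), which the notes do not state:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class NounPhrase:
    head: str
    contextually_bound: bool = False
    postmodified: bool = False
    superlative_premodifier: bool = False
    ordinal_premodifier: bool = False
    uncountable: bool = False
    proper_noun: bool = False
    has_possessive: bool = False
    has_demonstrative: bool = False

def choose_article(phrase: NounPhrase) -> Optional[str]:
    """Return 'the', 'a', or None (no article)."""
    # Conditions that prevent any article (assumed to take priority).
    if (phrase.uncountable or phrase.proper_noun
            or phrase.has_possessive or phrase.has_demonstrative):
        return None
    # Conditions that force the definite article.
    if (phrase.postmodified or phrase.superlative_premodifier
            or phrase.ordinal_premodifier):
        return "the"
    # Otherwise fall back to the contextual-boundness default.
    return "the" if phrase.contextually_bound else "a"
```

For example, `choose_article(NounPhrase("floor", ordinal_premodifier=True))` yields `"the"`, while `choose_article(NounPhrase("water", uncountable=True))` yields `None`.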

    == 5. morphology ==
    probably not morpha :!:
    they look the form up in a table
    ^ word form ^ morphological tag ^ lemma ^
    if the form is not found, simple rules are applied
    also vocalization for the indefinite article (a vs. an)
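
A minimal sketch of table lookup with a rule-based fallback, plus the a/an choice. Everything concrete here is an assumption: the table is keyed by (lemma, tag), the tags follow Penn Treebank style, the entries are invented, and the fallback rules are deliberately crude stand-ins for the "simple rules" mentioned in the notes:

```python
from typing import Dict, Tuple

# (lemma, morphological tag) -> word form; invented toy entries.
FORM_TABLE: Dict[Tuple[str, str], str] = {
    ("go", "VBD"): "went",
    ("child", "NNS"): "children",
}

def fallback_form(lemma: str, tag: str) -> str:
    """Crude regular-inflection rules used when the table has no entry."""
    if tag == "NNS":
        sibilant = lemma.endswith(("s", "x", "z", "ch", "sh"))
        return lemma + ("es" if sibilant else "s")
    if tag == "VBD":
        return lemma + ("d" if lemma.endswith("e") else "ed")
    return lemma

def inflect(lemma: str, tag: str) -> str:
    """Look the form up in the table first, then fall back to rules."""
    form = FORM_TABLE.get((lemma, tag))
    return form if form is not None else fallback_form(lemma, tag)

def indefinite_article(next_word: str) -> str:
    """Spelling-based approximation; the real choice depends on pronunciation."""
    return "an" if next_word[:1].lower() in {"a", "e", "i", "o", "u"} else "a"
```

The spelling test gets words such as "hour" or "university" wrong, so a real system would need a pronunciation-aware check or an exception list.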