Institute of Formal and Applied Linguistics Wiki


This is an old revision of the document!


Khresmoi

Medical Information Analysis & Retrieval
http://www.khresmoi.eu/

People and contacts


Data

MT training data available for KHRESMOI

Corpus | Source | Domain | Sizes (across EN-FR, EN-DE, EN, FR, DE) | Note
TDA translation memory | TDA | in | 13517 kw, 6797 kw | 8-) PP
CESTA Evaluation Package | ELRA | in | 38 kw | waiting
EQueR Evaluation Package | ELRA | in | 140 MiB | waiting
CESART Evaluation Package | ELRA | in | 9000 kw | waiting
French Gigaword | LDC | news | 863 kw | 8-) DVD
Acquis | JRC | law | 1.25 Ms (?3.034 Ms) (3.128 Ms) | 8-) JHla
EMEA | European Medicines Agency | in | 373 ks, 12 Mw, 26.34 Mw, 14.9 Mw | 8-) JHla, JB
MESH | U.S. National Library of Medicine | in | 838 kw | 8-) JHla
OrphaNet | OrphaNet | in | ? | Wien will do
Europarl | WMT12 | parl | 1.8Ms/47Mw, 1.7Ms/43Mw | 8-) JHla
News Commentary | WMT12 | news | 43ks/0.9Mw, 60ks/1.2Mw | 8-) JHla
News monolingual | WMT12 | news | 181kw, 147kw, 162kw | 8-) JHla
United Nations | WMT12 | news | 12.3Ms | 8-) JHla
French-English 10^9 corpus | WMT12 | web | 22.5Ms | 8-) JHla
Medpedia wiki | Medpedia | in | ? | only EN found
Corpus Of Parallel Patent Applications (Coppa) | WIPO | in/all | 1.6Mf/170Mw | waiting for DVD, JHla
Springer Bilingual Corpus | much.more | in | 1.09 Mw | 8-) JB

k, M … thousand, million
w, s, f … words, sentences, files (for parallel data, only the source (English) words are counted)
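
The size notation above can be expanded mechanically. A minimal sketch in Python, assuming each entry is a number followed by a k/M multiplier and a w/s/f unit, with several measurements of the same corpus separated by "/". The function name `parse_size` is hypothetical and not part of any project tooling:

```python
import re

# Multipliers and units as defined in the legend.
MULTIPLIERS = {"k": 1_000, "M": 1_000_000}
UNITS = {"w": "words", "s": "sentences", "f": "files"}

# A number, optional space, a multiplier (upper-case K is accepted as a
# variant of k), and a unit letter.
_SIZE = re.compile(r"^\s*([0-9]+(?:\.[0-9]+)?)\s*([kKM])([wsf])\s*$")


def parse_size(text):
    """Expand e.g. '1.8Ms/47Mw' into {'sentences': 1800000, 'words': 47000000}."""
    result = {}
    for part in text.split("/"):
        match = _SIZE.match(part)
        if match is None:
            raise ValueError(f"unrecognised size: {part!r}")
        number, multiplier, unit = match.groups()
        factor = MULTIPLIERS["k" if multiplier == "K" else multiplier]
        # round() guards against floating-point noise such as 26.34 * 1e6.
        result[UNITS[unit]] = round(float(number) * factor)
    return result
```

Entries that do not follow this pattern, such as "140 MiB" or "?", raise a `ValueError` instead of being guessed at.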

Line, word, and byte counts for the monolingual News Commentary v6 files (wc-style output):

161805 3419087 25531801 training-monolingual/news-commentary-v6.de
180657 3798233 23801236 training-monolingual/news-commentary-v6.en
147251 3588247 23741477 training-monolingual/news-commentary-v6.fr
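
Counts of this kind can be reproduced for any corpus file. A minimal sketch in Python, assuming the three numbers are lines, words, and bytes, and that the files are plain text with one sentence per line. The function name is hypothetical, and splitting words on ASCII whitespace may differ slightly from whatever tool produced the numbers above:

```python
def count_lines_words_bytes(path):
    """Return (lines, words, bytes) for one file, in the same order as the listing."""
    lines = words = size = 0
    # Read in binary mode so the byte count does not depend on the encoding.
    with open(path, "rb") as handle:
        for raw in handle:
            if raw.endswith(b"\n"):
                lines += 1  # count newline characters, like wc -l
            words += len(raw.split())  # whitespace-separated tokens
            size += len(raw)
    return lines, words, size
```

For example, `count_lines_words_bytes("training-monolingual/news-commentary-v6.en")` returns the three counts for the English file; for one-sentence-per-line data, the first number is the sentence count.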

JRC Acquis should have over 3 Ms:
http://optima.jrc.it/Acquis/JRC-Acquis.3.0/alignmentsHunAlign/index.html

Sources

MAREC
A61 (MEDICAL OR VETERINARY SCIENCE; HYGIENE): 1,589,849 files
The word count is unknown; the data does not come as a single package.

Khresmoi wiki
http://wiki.khresmoi.eu/index.php5/Data_sets_used
http://wiki.khresmoi.eu/index.php5/Data_sets

WMT workshop web page
http://www.statmt.org/wmt12/
http://www.statmt.org/wmt11/translation-task.html … everything is gathered here

OPUS corpus
http://opus.lingfil.uu.se/

JRC Acquis
http://langtech.jrc.it/JRC-Acquis.html

ELDA

We have ordered several packages with in-domain data (EN-FR, FR).

TDA

We have credit to download 1 billion words. So far, the EN-FR and EN-DE in-domain data have been downloaded.

LDC

Parallel data

EN-FR
EN-DE

Monolingual data

FR
DE
EN


Documents


SVN

PP, please fill this in.


