Khresmoi
Medical Information Analysis & Retrieval
http://www.khresmoi.eu/
People and contacts
- JH = Jan Hajič <hajic (at) ufal.mff.cuni.cz>
- PP = Pavel Pecina <pecina (at) ufal.mff.cuni.cz>
- JHla = Jaroslava Hlaváčová <hlava (at) ufal.mff.cuni.cz>
- JD = Jan Dědek <dedek (at) ksi.mff.cuni.cz>
- JB = Jakub Bystroň <jb.elitecode (at) gmail.com>
Data
MT training data available for KHRESMOI
Corpus | Source | Domain | EN-FR | EN-DE | EN | FR | DE | Note |
---|---|---|---|---|---|---|---|---|
TDA translation memory | TDA | in | 13517 Kw | 6797 Kw | PP | |||
CESTA Evaluation Package | ELRA | in | 38 Kw | waiting | ||||
EQueR Evaluation Package | ELRA | in | 140 MiB | waiting | ||||
CESART Evaluation Package | ELRA | in | 9000 Kw | waiting | ||||
French Gigaword | LDC | news | 863 Kw | DVD | ||||
Acquis | JRC | law | 1.25 Ms (?3.034 Ms) | (3.128 Ms) | JHla | |||
EMEA | European Medicines Agency | in | 373 Ks | 12 Mw | 26.34 Mw | 14.9Mw | JHla, JB | |
MESH | U.S. National Library of Medicine | in | 838 kw | JHla | ||||
OrphaNet | OrphaNet | in | ? | Wien will do | ||||
Europarl | WMT12 | parl | 1.8Ms/47Mw | 1.7Ms/43Mw | JHla | |||
News Commentary | WMT12 | news | 43ks/0.9Mw | 60ks/1.2Mw | JHla | |||
News monolingual | WMT12 | news | 181kw | 147kw | 162kw | JHla | ||
United Nations | WMT12 | ? | ? | JHla | ||||
French-English 10^9 corpus | WMT12 | web | 22.5Ms | JHla | ||||
Medpedia wiki | Medpedia | in | ? | only EN found | ||||
Corpus Of Parallel Patent Applications (Coppa) | WIPO | in/all | 1.6Mf/170Mw | waiting for DVD JHla | ||||
Springer Bilingual Corpus | much.more | in | 1.09 Mw | JB |
k, M … thousand, million
w, s, f … words, sentences, files (for parallel data, only source (English) words are counted)
161805 3419087 25531801 training-monolingual/news-commentary-v6.de
180657 3798233 23801236 training-monolingual/news-commentary-v6.en
147251 3588247 23741477 training-monolingual/news-commentary-v6.fr
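The three figures per file above look like default `wc` output (lines, words, bytes). As a minimal sketch under that assumption, the following reproduces such counts and abbreviates them with the k/M and w/s/f notation from the legend. The function names `corpus_counts` and `abbreviate` are hypothetical helpers, not part of any project tooling, and the word count splits on ASCII whitespace only, so it may differ slightly from `wc` on some inputs.

```python
import sys


def corpus_counts(path):
    """Return (lines, words, bytes) for a text file, roughly like `wc`."""
    lines = words = size = 0
    with open(path, "rb") as handle:
        for raw in handle:
            lines += raw.count(b"\n")   # newline characters, as wc -l counts them
            words += len(raw.split())   # whitespace-separated tokens
            size += len(raw)            # raw bytes, as wc -c counts them
    return lines, words, size


def abbreviate(count, unit):
    """Format a count with k/M prefixes and a unit letter (w, s or f)."""
    if count >= 1_000_000:
        return f"{count / 1_000_000:.1f}M{unit}"
    if count >= 1_000:
        return f"{count / 1_000:.0f}k{unit}"
    return f"{count}{unit}"


if __name__ == "__main__":
    # Usage (hypothetical script name): python corpus_counts.py FILE [FILE ...]
    for name in sys.argv[1:]:
        n_lines, n_words, _ = corpus_counts(name)
        # One line per sentence is assumed, so lines are reported as sentences.
        print(f"{name}: {abbreviate(n_lines, 's')}/{abbreviate(n_words, 'w')}")
```

For parallel data, running this on the English side only would match the convention that only source words are counted.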
JRC Acquis should have over 3 Ms:
http://optima.jrc.it/Acquis/JRC-Acquis.3.0/alignmentsHunAlign/index.html
Sources
MAREC
A61 (MEDICAL OR VETERINARY SCIENCE; HYGIENE): 1,589,849 files
Word count unknown; the data does not come in a single package.
Khresmoi wiki
http://wiki.khresmoi.eu/index.php5/Data_sets_used
http://wiki.khresmoi.eu/index.php5/Data_sets
WMT workshop website
http://www.statmt.org/wmt12/
http://www.statmt.org/wmt11/translation-task.html … everything is in one place here
OPUS corpus
http://opus.lingfil.uu.se/
JRC Acquis
http://langtech.jrc.it/JRC-Acquis.html
ELDA
We have ordered several packages with in-domain data (EN-FR, FR).
TDA
We have credit to download 1 billion words. So far, the EN-FR and EN-DE in-domain data have been downloaded.
LDC
Parallel data
Monolingual data
Documents
SVN
PP, please fill this in.