Khresmoi
Medical Information Analysis & Retrieval
http://www.khresmoi.eu/
People and contacts
- JH = Jan Hajič <hajic (at) ufal.mff.cuni.cz>
- PP = Pavel Pecina <pecina (at) ufal.mff.cuni.cz>
- JHla = Jaroslava Hlaváčová <hlava (at) ufal.mff.cuni.cz>
- JD = Jan Dědek <dedek (at) ksi.mff.cuni.cz>
- JB = Jakub Bystroň <jb.elitecode (at) gmail.com>
Data
MT training data available for KHRESMOI
Corpus | Source | Domain | EN-FR | EN-DE | EN | FR | DE | Note |
---|---|---|---|---|---|---|---|---|
TDA translation memory | TDA | in | 13517 kw | 6797 kw | PP | |||
CESTA Evaluation Package | ELRA | in | 38 kw | waiting | ||||
EQueR Evaluation Package | ELRA | in | 140 MiB | waiting | ||||
CESART Evaluation Package | ELRA | in | 9000 kw | waiting | ||||
French Gigaword | LDC | news | 863 kw | DVD | ||||
Acquis | JRC | law | 1.25 Ms (?3.034 Ms) | (3.128 Ms) | JHla | |||
EMEA | European Medicines Agency | in | 373 ks | JHla | ||||
EMEA | European Medicines Agency | in | 14.9 Mw | JHla | ||||
EMEA | European Medicines Agency | in | 26.34 Mw | JHla | ||||
MESH | U.S. National Library of Medicine | in | 838 kw | JHla | ||||
OrphaNet | OrphaNet | in | ? | negotiating | ||||
Europarl | WMT12 | parl | ? | ? | JHla | |||
News Commentary | WMT12 | news | ? | ? | JHla | |||
News monolingual | WMT12 | news | JHla | |||||
United Nations | WMT12 | ? | ? | JHla | ||||
French-English 10^9 corpus | WMT12 | web | ? | JHla | ||||
Medpedia wiki | Medpedia | in | ? | only EN found | ||||
MAREC | IPC | in | ? | ? | ? | contacted JHla | ||
Springer Bilingual Corpus | much.more | in | 1.09 Mw | JB |
k, M … thousand, million
w, s … words, sentences (for parallel data, only source-side (English) words are counted)
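The size notation above can be read mechanically when the figures need to be summed or compared. The following is a minimal sketch, not part of the project tooling: the names `parse_size` and `count_source_words` are hypothetical, and the parser assumes that a comma inside a number is a decimal comma and that an upper-case "K" also means thousand.

```python
import re

# Multipliers from the legend: k = thousand, M = million.
# Assumption: an upper-case "K" is treated the same as "k".
MULTIPLIERS = {"k": 1_000, "K": 1_000, "M": 1_000_000}
UNITS = {"w": "words", "s": "sentences"}

# A number, an optional space, a multiplier prefix and a unit letter.
SIZE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([kKM])([ws])\s*$")


def parse_size(text):
    """Return (count, unit) for a size cell such as '26.34 Mw', or None."""
    match = SIZE_RE.match(text)
    if match is None:
        return None  # cells such as '?', '140 MiB' or 'waiting'
    number, prefix, unit = match.groups()
    # Assumption: a comma inside the number is a decimal comma.
    value = float(number.replace(",", "."))
    return round(value * MULTIPLIERS[prefix]), UNITS[unit]


def count_source_words(sentence_pairs):
    """Count whitespace-separated tokens on the source (English) side only."""
    return sum(len(source.split()) for source, _target in sentence_pairs)
```

For example, `parse_size("373 ks")` gives `(373000, "sentences")`, while cells that do not follow the notation yield `None` and can be skipped. Whitespace tokenization in `count_source_words` is only a rough stand-in; the counts in the table may have been produced with a different tokenizer.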
JRC Acquis should contain over 3 Ms:
http://optima.jrc.it/Acquis/JRC-Acquis.3.0/alignmentsHunAlign/index.html
Sources
MAREC
A61 (MEDICAL OR VETERINARY SCIENCE; HYGIENE): 1,589,849 files
Word count unknown; the data does not come as a single package.
Khresmoi wiki
http://wiki.khresmoi.eu/index.php5/Data_sets_used
http://wiki.khresmoi.eu/index.php5/Data_sets
WMT workshop website
http://www.statmt.org/wmt12/
OPUS corpus
http://opus.lingfil.uu.se/
JRC Acquis
http://langtech.jrc.it/JRC-Acquis.html
ELDA
We have ordered several packages of in-domain data (EN-FR, FR).
TDA
We have credit to download 1 billion words. So far, the EN-FR and EN-DE in-domain data have been downloaded.
LDC
Parallel data
Monolingual data
Documents
SVN
PP, please fill this in.