Khresmoi
Medical Information Analysis & Retrieval
http://www.khresmoi.eu/
People and contacts
- JH = Jan Hajič <hajic (at) ufal.mff.cuni.cz>
- PP = Pavel Pecina <pecina (at) ufal.mff.cuni.cz>
- JHla = Jaroslava Hlaváčová <hlava (at) ufal.mff.cuni.cz>
- JD = Jan Dědek <dedek (at) ksi.mff.cuni.cz>
- JB = Jakub Bystroň <jb.elitecode (at) gmail.com>
- ZU = Zdeňka Urešová <uresova (at) ufal.mff.cuni.cz>
Data
MT training data available for KHRESMOI
Corpus | Source | Domain | EN-FR | EN-DE | alignment | EN | FR | DE | Note |
---|---|---|---|---|---|---|---|---|---|
TDA translation memory | TDA | in | 13517 Kw | 6797 Kw | sent | DONE | |||
CESTA Evaluation Package | ELRA | in | 38 Kw | sent | PROCESSING | ||||
EQueR Evaluation Package | ELRA | in | 140 MiB | PROCESSING | |||||
CESART Evaluation Package | ELRA | in | 9000 Kw | PROCESSING | |||||
French Gigaword | LDC | news | 863 Kw | DVD | |||||
Acquis | JRC | law | 1.25 Ms | 1.33 Ms | sent | JHla (FR only) | |||
EMEA | European Medicines Agency | in | 373 Ks | 12 Mw | 26.34 Mw | 14.9 Mw | DONE (CS as well) | ||
MESH | U.S. National Library of Medicine | in | 838 kw | DONE* | |||||
OrphaNet | OrphaNet | in | ? | Wien will do | |||||
Europarl | WMT12 | parl | 1.8Ms | 1.7Ms | sent | DONE | |||
News Commentary | WMT12 | news | 43ks | 60ks | sent | DONE | |||
News monolingual | WMT12 | news | 181kw | 147kw | 162kw | DONE | |||
United Nations | WMT12 | un | 12.3Ms | DONE | |||||
French-English 10^9 corpus | WMT12 | web | 22.5Ms | sent | DONE | ||||
Medpedia wiki | Medpedia | in | ? | only EN found | |||||
Corpus Of Parallel Patent Applications (Coppa) | WIPO | in | 24.8Mw = 1.2Ms | sent | JHla | ||||
Corpus Of Parallel Patent Applications (Coppa) | WIPO | in | 33.5Mw | par | JHla | ||||
Corpus Of Parallel Patent Applications (Coppa) | WIPO | tech | 153.8Mw = 7.5Ms | sent | JHla | ||||
Corpus Of Parallel Patent Applications (Coppa) | WIPO | tech | 178.8Mw | par | JHla | ||||
MAREC | Wien TU | in | ? | ? | ? | see below | |||
Springer Bilingual Corpus | much.more | in | 1.09 Mw | sent | JB | ||||
Europarl3 | OPUS | 1.3 Ms | sent | not needed | |||||
OpenSubtitles2011 | OPUS | 5 Ms | sent | JB |
k, M … thousand, million
w, s, f … words, sentences, files (for parallel data, only source-side (English) words are counted)
* see the subsections for more detailed information
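The size notation in the legend above can be converted to plain counts with a few lines of Python. This is only a minimal sketch: the function name `parse_size` and the decision to accept either a decimal point or a decimal comma are assumptions made for illustration, and cells that do not follow the legend (for example `140 MiB` or `?`) are returned as `None` rather than guessed.

```python
import re

# Multipliers from the legend: k = thousand, M = million.
# Upper-case "K" is accepted too, because some table cells use it.
MULTIPLIERS = {"k": 1_000, "K": 1_000, "M": 1_000_000}

# Units from the legend: w = words, s = sentences, f = files.
UNITS = {"w": "words", "s": "sentences", "f": "files"}

# A number (decimal point or decimal comma), a multiplier, and a unit.
_SIZE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([kKM])([wsf])\s*$")


def parse_size(cell):
    """Return (count, unit) for a cell such as '1.8Ms', or None if it does not match."""
    match = _SIZE_RE.match(cell)
    if match is None:
        return None
    number, multiplier, unit = match.groups()
    value = float(number.replace(",", "."))
    return round(value * MULTIPLIERS[multiplier]), UNITS[unit]
```

For example, `parse_size("43ks")` gives `(43000, "sentences")`. Keeping the unit next to the count avoids accidentally summing words and sentences when totals are computed per language pair.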
Sources
JRC Acquis
http://optima.jrc.it/Acquis/JRC-Acquis.3.0/alignmentsHunAlign/index.html
MAREC
A61 (MEDICAL OR VETERINARY SCIENCE; HYGIENE): 1,589,849 files
The word count is unknown; the data does not come as a single package.
In response to our access request, they replied:
the IRF is not granting access to the MAREC collection anymore. However, the access for research purposes should be possible in a foreseeable future via the Vienna University of Technology - Allan will certainly come back to you when the legal status is cleared.
Coppa
IPC: A61, C12N, C12P … medical patents (recommended by WIPO)
Patents are organized by year and come in two versions:
- sentence-segmented, but smaller (see the table). Some patents are missing entirely, and some are truncated.
- unsegmented: each patent has 2 records, a title and an abstract, both in EN and FR, so the alignment is at the paragraph level (our estimate)
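Selecting the medical patents by the IPC classes listed above amounts to a prefix check. The sketch below assumes that each patent's IPC symbol is available as a string such as `A61K 31/00`; how Coppa actually stores the classification is not described here, and the function name `is_medical_ipc` is hypothetical.

```python
# IPC classes listed above as medical patents.
MEDICAL_IPC_PREFIXES = ("A61", "C12N", "C12P")


def is_medical_ipc(ipc_code):
    """Return True if the IPC symbol falls under one of the listed classes."""
    # Drop spaces and normalize case so "c12n 15/09" and "C12N15/09" compare equal.
    normalized = ipc_code.replace(" ", "").upper()
    return normalized.startswith(MEDICAL_IPC_PREFIXES)
```

With this check, `is_medical_ipc("A61K 31/00")` is true and `is_medical_ipc("H04L 9/00")` is false.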
Khresmoi wiki
http://wiki.khresmoi.eu/index.php5/Data_sets_used
http://wiki.khresmoi.eu/index.php5/Data_sets
WMT workshop website
http://www.statmt.org/wmt12/
http://www.statmt.org/wmt11/translation-task.html … everything is gathered here in one place
OPUS corpus
http://opus.lingfil.uu.se/
JRC Acquis
http://langtech.jrc.it/JRC-Acquis.html
ELDA
We have ordered several packages with in-domain data (EN-FR, FR).
TDA
We have credit to download 1 billion words. So far, the EN-FR and EN-DE in-domain data have been downloaded.
LDC
Parallel data
Monolingual data
Documents
Storage
/net/data/khresmoi