Khresmoi
Medical Information Analysis & Retrieval
http://www.khresmoi.eu/
People and contacts
- JH = Jan Hajič <hajic (at) ufal.mff.cuni.cz>
- PP = Pavel Pecina <pecina (at) ufal.mff.cuni.cz>
- JHla = Jaroslava Hlaváčová <hlava (at) ufal.mff.cuni.cz>
- JD = Jan Dědek <dedek (at) ksi.mff.cuni.cz>
- JB = Jakub Bystroň <jb.elitecode (at) gmail.com>
- ZU = Zdeňka Urešová <uresova (at) ufal.mff.cuni.cz>
Data
The data are located in /net/data/khresmoi
MT training data available for KHRESMOI
Clicking a corpus name (first column) takes you to the Notes on the data section.
Corpus | Source | Domain | EN-FR | EN-DE | alignment | EN | FR | DE | Note |
---|---|---|---|---|---|---|---|---|---|
TDA translation memory | TDA | in | 13517 Kw | 6797 Kw | sent | DONE:tda | |||
CESTA Evaluation Package | ELRA | in | 38 Kw | sent | DONE:cesta | ||||
EQueR Evaluation Package | ELRA | in | 140 MiB | DONE:equer | |||||
CESART Evaluation Package | ELRA | in | 9000 Kw | DONE:cesart | |||||
French Gigaword | LDC | news | 863 Kw | DONE:gigaword | |||||
Acquis | JRC | law | 1,25 Ms | 1,33 Ms | sent | DONE:jrc | |||
EMEA | European Medicines Agency | in | 373 Ks | 12 Mw | 26.34 Mw | 14.9Mw | DONE:emea | ||
MESH | U.S. National Library of Medicine | in | 838 kw | DONE:mesh* | |||||
OrphaNet | OrphaNet | in | ? | Wien will do | |||||
Europarl | WMT12 | parl | 1.8Ms | 1.7Ms | sent | DONE:europarl | |||
News Commentary | WMT12 | news | 43ks | 60ks | sent | DONE:news-com | |||
News monolingual | WMT12 | news | 181kw | 147kw | 162kw | DONE:wmt-news | |||
United Nations | WMT12 | un | 12.3Ms | DONE:undoc | |||||
French-English 10^9 corpus | WMT12 | web | 22.5Ms | sent | DONE:giga | ||||
Medpedia wiki | Medpedia | in | ? | only EN found | |||||
Coppa (patents) | WIPO | in | 24,8Mw = 1,2Ms | sent | DONE:wipo | ||||
Coppa (patents) | WIPO | in | 33,5Mw | par | DONE:wipo | ||||
Coppa (patents) | WIPO | tech | 153,8Mw = 7,5Ms | sent | DONE:wipo | ||||
Coppa (patents) | WIPO | tech | 178,8Mw | par | DONE:wipo | ||||
MAREC | Wien TU | in | ? | ? | ? | ||||
Springer Bilingual Corpus | much.more | in | 1.09 Mw | sent | JB | ||||
Europarl3 | OPUS | 1.3 Ms | sent | not needed | |||||
OpenSubtitles2011 | OPUS | 5 Ms | sent | JB | |||||
Czeng | UFAL | sent | JB | ||||||
Drugbank | drugbank.ca | in | 624kw | DONE:drugbank | |||||
FMA | Foundational Model of Anatomy ontology | in | 855,5kw | DONE:fma |
Legend
k, M … thousand, million
w, s, f … words, sentences, files (for parallel data only source (English) words are counted)
* see the subsections for more detailed information
The Note column contains the name of the subdirectory of /net/data/khresmoi where the result is stored
data downloaded but not yet processed
we do not know whether we want it
we want to download it, but do not yet know how … for various reasons
waiting for the data
some problem; details are in the notes, reachable by clicking through from the first column
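The unit conventions above (k/M prefixes, w/s/f suffixes) and the Note column can be read mechanically. The following is a minimal sketch, not part of the project tooling: the function names `parse_size` and `note_to_path` are hypothetical, and it assumes that a Note of the form `DONE:name` refers to the subdirectory `name` and that a trailing `*` is only the footnote marker.

```python
import re
from pathlib import PurePosixPath

# Root directory named in the text; everything else here is illustrative.
DATA_ROOT = PurePosixPath("/net/data/khresmoi")

MULTIPLIERS = {"k": 1_000, "m": 1_000_000}
UNITS = {"w": "words", "s": "sentences", "f": "files"}

# A number (decimal point or decimal comma), a k/M prefix, and a w/s/f unit.
_SIZE_RE = re.compile(r"^\s*(\d+(?:[.,]\d+)?)\s*([kKmM])([wsf])\s*$")


def parse_size(cell):
    """Turn a size cell such as '12.3Ms' or '1,25 Ms' into (count, unit).

    Returns None for cells that do not follow the legend, such as '?',
    '140 MiB', or compound cells like '24,8Mw = 1,2Ms'.
    """
    match = _SIZE_RE.match(cell)
    if match is None:
        return None
    number, prefix, unit = match.groups()
    value = float(number.replace(",", "."))
    return round(value * MULTIPLIERS[prefix.lower()]), UNITS[unit]


def note_to_path(note):
    """Map a Note cell such as 'DONE:emea' to a subdirectory, else None.

    Assumption: the text after 'DONE:' is the subdirectory name.
    """
    if not note.startswith("DONE:"):
        return None
    name = note[len("DONE:"):].rstrip("*").strip()
    return DATA_ROOT / name if name else None
```

For example, `parse_size("1,25 Ms")` yields a sentence count and `note_to_path("DONE:mesh*")` yields a path under /net/data/khresmoi, while cells such as `JB` or `Wien will do` map to `None`.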
Further links
Khresmoi wiki
http://wiki.khresmoi.eu/index.php5/Data_sets_used
http://wiki.khresmoi.eu/index.php5/Data_sets
WMT workshop web page
http://www.statmt.org/wmt12/
http://www.statmt.org/wmt11/translation-task.html … everything is gathered here in one place
OPUS corpus
http://opus.lingfil.uu.se/