====== Overview of NLP/CL tools available at UFAL ======
  
===== Tokenization (word segmentation) =====
Segmentation of text into tokens (words, punctuation marks, etc.). For languages that use space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.), it is much more difficult.
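As a rough illustration of the task (not one of the UFAL tools), a naive whitespace-and-punctuation tokenizer for space-separated languages might look like the Python sketch below; note how it mishandles abbreviations and contractions, which is exactly what the nonbreaking-prefix lists of tools such as the Europarl tokenizer address.

  # Naive illustrative tokenizer, not any of the tools listed on this page.
  import re

  def naive_tokenize(text):
      """Split text into word tokens and single punctuation-mark tokens."""
      # \w+ matches runs of letters/digits; [^\w\s] matches any single punctuation character
      return re.findall(r"\w+|[^\w\s]", text)

  print(naive_tokenize("Mr. Smith arrived in Prague, didn't he?"))
  # ['Mr', '.', 'Smith', 'arrived', 'in', 'Prague', ',', 'didn', "'", 't', 'he', '?']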
  
=== Europarl tokenizer ===
  * **description:** A sample rule-based tokenizer; it can use a list of nonbreaking prefixes, i.e. abbreviations that are usually followed by a dot which does not end a sentence (see the sketch below). Distributed as a part of the Europarl tools.
  * **version:** v6 (Jan 2012)
  * **author:** Philipp Koehn and Josh Schroeder
  * **licence:** free
  * **url:** http://www.statmt.org/europarl/
  * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
  * **efficiency:** NA
  * **reference:**

  @inproceedings{Koehn:2005,
  author = {Philipp Koehn},
  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
  pages = {79--86},
  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
  address = {Phuket, Thailand},
  year = {2005}}

  * **contact:**

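The following is only a minimal Python sketch of the nonbreaking-prefix idea; the actual tool is a Perl script with per-language prefix files, and the prefix entries below are invented for the example.

  # Sketch only: the real Europarl tokenizer is a Perl script with per-language prefix files.
  NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "etc"}   # hypothetical sample entries

  def detach_final_dot(token):
      """Split a trailing dot off a token unless the token is a known nonbreaking prefix."""
      if token.endswith(".") and token[:-1] in NONBREAKING_PREFIXES:
          return [token]            # keep "Mr." together; the dot belongs to the abbreviation
      if token.endswith(".") and len(token) > 1:
          return [token[:-1], "."]  # otherwise the dot becomes a separate token
      return [token]

  tokens = []
  for word in "Mr. Smith arrived in Prague.".split():
      tokens.extend(detach_final_dot(word))
  print(tokens)   # ['Mr.', 'Smith', 'arrived', 'in', 'Prague', '.']
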
===== Language Identification =====

===== Sentence Segmentation =====

===== Morphological Segmentation =====

===== Morphological Analysis =====

===== Part-of-Speech Tagging =====

===== Lemmatization =====

===== Analytical Parsing =====

===== Tectogrammatical Parsing =====

===== Named Entity Recognition =====

===== Machine Translation =====

===== Coreference Resolution =====

===== Spell Checking =====

===== Text Similarity =====

===== Recasing =====

===== Diacritic Reconstruction =====

====== Other tasks ======

  * Word Sense Disambiguation
  * Relationship Extraction
  * Topic Segmentation
  * Information Retrieval
  * Information Extraction
  * Text Summarization
  * Speech Reconstruction
  * Question Answering
  * Sentiment Analysis
  
