Differences

This shows you the differences between two versions of the page.

--- ufal:tasks [2012/01/18 12:59]
ufal created
+++ ufal:tasks [2012/01/23 11:15] (current)
ufal
@@ Line 1: / Line 1: @@
 ====== Overview of NLP/CL tools available at UFAL ======
-Tokenization
+===== Tokenization (word segmentation) =====
-Language Identification
+Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For languages written without spaces between words (Chinese, Japanese, etc.), the task is much more difficult.
-Sentence Segmentation
-Morphological Segmentation
-Morphological Analysis
-Part-of-Speech Tagging
-Lemmatization
-Analytical Parsing
-Tectogrammatical Parsing
-Named Entity Recognition
-Machine Translation
-Coreference resolution
-Spell Checking
-Text Similarity
-Recasing
-Diacritic Reconstruction
+=== Europarl tokenizer ===
+  * **description:** A sample rule-based tokenizer. It can use a list of nonbreaking prefixes, i.e. strings that are usually followed by a dot but do not end a sentence. Distributed as a part of the Europarl tools.
+  * **version:** v6 (Jan 2012)
+  * **author:** Philipp Koehn and Josh Schroeder
+  * **licence:** free
+  * **url:** http://www.statmt.org/europarl/
+  * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
+  * **efficiency:** NA
+  * **reference:**
+  @inproceedings{Koehn:2005,
+  author = {Philipp Koehn},
+  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
+  pages = {79--86},
+  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
+  address = {Phuket, Thailand},
+  year = {2005}}
+  * **contact:**
+=== Europarl tokenizer ===
+| **description:** | A sample rule-based tokenizer. It can use a list of nonbreaking prefixes, i.e. strings that are usually followed by a dot but do not end a sentence. Distributed as a part of the Europarl tools. |
+| **version:** | v6 (Jan 2012)  |
+| **author:** | Philipp Koehn and Josh Schroeder |
+| **licence:** | free |
+| **url:** | http://www.statmt.org/europarl/ |
+| **languages:** | in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV. |
+| **efficiency:** | NA  |
+| **reference:** |
+  @inproceedings{Koehn:2005,
+  author = {Philipp Koehn},
+  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
+  pages = {79--86},
+  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
+  address = {Phuket, Thailand},
+  year = {2005}}
+|
+| **contact:** | |
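The nonbreaking-prefix idea described above can be illustrated with a minimal Python sketch. This is not the Europarl tokenizer's code: the function name `tokenize`, the tiny `NONBREAKING_PREFIXES` set, and the rule that keeps a final dot on words with an internal dot are all assumptions made for illustration. The real tools ship per-language prefix files.

```python
import re

# Hypothetical, deliberately tiny prefix list (an assumption for this sketch).
NONBREAKING_PREFIXES = frozenset({"Mr", "Mrs", "Dr", "Prof", "etc", "e.g", "i.e"})

# Any symbol that is not a word character, whitespace, period,
# apostrophe, or hyphen is split off as its own token.
_PUNCT = re.compile(r"([^\w\s.'-])")


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split space-separated text into tokens.

    A trailing period is detached from a word unless the word is a known
    nonbreaking prefix or already contains an internal period.
    """
    # Step 1: surround punctuation (except periods) with spaces.
    spaced = _PUNCT.sub(r" \1 ", text)

    tokens = []
    for word in spaced.split():
        if len(word) > 1 and word.endswith("."):
            stem = word[:-1]
            if stem in prefixes or "." in stem:
                # Abbreviation-like word: keep the dot attached.
                tokens.append(word)
            else:
                # Ordinary word: the dot becomes a separate token.
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived, e.g. at noon."))
```

Here "Dr." and "e.g." keep their dots because they are listed as prefixes, while the dot after "noon" is split off as a separate token. A language without a prefix list would simply be tokenized with an empty set, at the cost of splitting abbreviations.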
+=== Tokenizers integrated in Treex ===
+  * rule-based (regular-expression) tokenizers
+  * trainable tokenizer TextSeg
+===== Language Identification =====
+Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf
+===== Sentence Segmentation =====
+=== Segmenters integrated in Treex ===
+  * rule-based segmenters
+  * TextSeg (trainable)
+===== Morphological Segmentation =====
+===== Morphological Analysis =====
+=== Morphological Analyzers integrated in Treex ===
+  * Jan Hajič's Czech morphological analyzer
+  * toy analyzers for about ten languages (student homework assignments)
+===== Part-of-Speech Tagging =====
+=== POS Taggers integrated in Treex ===
+  * Featurama
+  * Morce
+  * MxPost tagger
+  * Tree tagger
+  * TnT tagger
+  * Jan Hajič's tagger
+  * a number of toy tagger prototypes (students' assignments) for about ten languages
+=== Details on Czech Tagging ===
+A Guide to Czech Language Tagging at UFAL  http://ufal.mff.cuni.cz/czech-tagging/
+===== Lemmatization =====
+=== Lemmatizers integrated in Treex ===
+  * Martin Popel's lemmatizer for English
+  * a number of toy lemmatizers for about ten languages (student homework assignments)
+  * for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
+===== Analytical Parsing =====
+=== Analytical parsers integrated in Treex ===
+  * Ryan McDonald's MST parser
+  * Rudolf Rosa's MST parser
+  * MALT parser
+  * ZPar
+  * Stanford parser
+=== Details on Czech parsing ===
+A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
+===== Tectogrammatical Parsing =====
+=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===
+  * a scenario for rule-based tree transformation
+  * Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
+===== Named Entity Recognition =====
+=== NE recognizers integrated in Treex ===
+  * Jana Straková's SVM-based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
+  * Stanford Named Entity Recognizer for Czech
+===== Machine Translation =====
+=== MT implemented in Treex ===
+  * elaborate English->Czech tecto-based translation
+  * prototype of Czech->English tecto-based translation
+===== Coreference resolution =====
+=== Coreference resolvers integrated in Treex ===
+  * simple rule-based baseline resolvers for Czech and English
+  * Michal Novák's trainable resolvers
+  * Ngụy Giang Linh's trainable (perceptron-based) resolver
+===== Spell Checking =====
+===== Text Similarity =====
+===== Recasing =====
+===== Diacritic Reconstruction =====
+====== Other tasks ======
+Word Sense Disambiguation
+Relationship Extraction
+Topic Segmentation
+Information Retrieval
+Information Extraction
+Text Summarization
+Speech Reconstruction
+Question Answering
+Sentiment Analysis


Institute of Formal and Applied Linguistics Wiki
