Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

ufal:tasks [2012/01/18 12:59]
ufal created
ufal:tasks [2012/01/23 11:15] (current)
ufal
Line 1: Line 1:
 ====== Overview of NLP/CL tools available at UFAL ====== ====== Overview of NLP/CL tools available at UFAL ======
  
-Tokenization
-Language Identification
-Sentence Segmentation
-Morphological Segmentation
-Morphological Analysis
-Part-of-Speech Tagging
-Lemmatization
-Analytical Parsing
-Tectogrammatical Parsing
-Named Entity Recognition
-Machine Translation
-Coreference resolution
-Spell Checking
-Text Similarity
-Recasing
-Diacritic Reconstruction
 +===== Tokenization (word segmentation) =====
 +Segmentation of text into tokens (words, punctuation marks, etc.). For languages that separate words with spaces (English, Czech, etc.), the task is relatively easy. For languages that do not (Chinese, Japanese, etc.), the task is much more difficult.
 +
  
 +=== Europarl tokenizer ===
 +  * **description:** A sample rule-based tokenizer that can use a list of nonbreaking prefixes, i.e. prefixes that are usually followed by a dot but do not end a sentence. Distributed as part of the Europarl tools.
 +  * **version:** v6 (Jan 2012) 
 +  * **author:** Philipp Koehn and Josh Schroeder
 +  * **licence:** free
 +  * **url:** http://www.statmt.org/europarl/
 +  * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
 +  * **efficiency:** NA
 +  * **reference:**
 +
 +  @inproceedings{Koehn:2005,
 +  author = {Philipp Koehn},
 +  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
 +  pages = {79--86},
 +  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
 +  address = {Phuket, Thailand},
 +  year = {2005}}
 +
 +  * **contact:**
 +
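The idea of nonbreaking prefixes described above can be illustrated with a minimal Python sketch. This is not the Europarl tokenizer itself: the function names, the punctuation sets, and the tiny prefix list are hypothetical and chosen only for illustration (the real tools ship per-language prefix files).

```python
# Hypothetical, very small prefix list for illustration only.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e", "etc"}

OPENING = "\"'([{"        # punctuation peeled off the front of a chunk
CLOSING = "\"')]},;:!?"   # punctuation peeled off the end (period handled separately)


def split_chunk(chunk, prefixes):
    """Split one whitespace-delimited chunk into word and punctuation tokens."""
    head, tail = [], []
    while chunk and chunk[0] in OPENING:
        head.append(chunk[0])
        chunk = chunk[1:]
    while chunk and chunk[-1] in CLOSING:
        tail.insert(0, chunk[-1])
        chunk = chunk[:-1]
    # Detach a final period unless the word before it is a nonbreaking prefix.
    if len(chunk) > 1 and chunk.endswith("."):
        stem = chunk[:-1]
        if stem not in prefixes:
            tail.insert(0, ".")
            chunk = stem
    core = [chunk] if chunk else []
    return head + core + tail


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Tokenize text that uses space-separated words."""
    tokens = []
    for chunk in text.split():
        tokens.extend(split_chunk(chunk, prefixes))
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived (late), e.g. at noon."))
```

In this sketch, `Dr.` and `e.g.` keep their dots because they are on the prefix list, while the period after `noon` becomes a separate token. A language without spaces between words would need a different approach, since `text.split()` would return the whole sentence as one chunk.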
 +
 +=== Europarl tokenizer ===
 +| **description:** | A sample rule-based tokenizer that can use a list of nonbreaking prefixes, i.e. prefixes that are usually followed by a dot but do not end a sentence. Distributed as part of the Europarl tools. |
 +| **version:** | v6 (Jan 2012)  |
 +| **author:** | Philipp Koehn and Josh Schroeder |
 +| **licence:** | free |
 +| **url:** | http://www.statmt.org/europarl/ |
 +| **languages:** | in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV. |
 +| **efficiency:** | NA  |
 +| **reference:** |
 +  @inproceedings{Koehn:2005,
 +  author = {Philipp Koehn},
 +  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
 +  pages = {79--86},
 +  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
 +  address = {Phuket, Thailand},
 +  year = {2005}}
 +|
 +| **contact:** | |
 +
 +=== Tokenizers integrated in Treex ===
 +  * rule-based (regular-expression) tokenizers
 +  * trainable tokenizer TextSeg
 +
 +===== Language Identification =====
 +Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf
 +
 +===== Sentence Segmentation =====
 +=== Segmenters integrated in Treex ===
 +  * rule-based segmenters
 +  * TextSeg (trainable)
 +
 +===== Morphological Segmentation =====
 +
 +===== Morphological Analysis =====
 +=== Morphological Analyzers integrated in Treex ===
 +  * Jan Hajič's Czech morphological analyzer
 +  * toy analyzers for about ten languages (student homework assignments)
 +
 +===== Part-of-Speech Tagging =====
 +
 +=== POS Taggers integrated in Treex ===
 +  * Featurama
 +  * Morce
 +  * MxPost tagger
 +  * Tree tagger
 +  * TnT tagger
 +  * Jan Hajič's tagger
 +  * a number of toy tagger prototypes (students' assignments) for about ten languages
 +
 +=== Details on Czech Tagging ===
 +A Guide to Czech Language Tagging at UFAL  http://ufal.mff.cuni.cz/czech-tagging/
 +
 +===== Lemmatization =====
 +=== Lemmatizers integrated in Treex ===
 +  * Martin Popel's lemmatizer for English
 +  * a number of toy lemmatizers for about ten languages (student homework assignments)
 +  * for Czech, lemmatization is traditionally treated as part of POS disambiguation, so almost all Czech taggers can also lemmatize
 +
 +===== Analytical Parsing =====
 +=== Analytical parsers integrated in Treex ===
 +  * Ryan McDonald's MST parser
 +  * Rudolf Rosa's MST parser
 +  * MALT parser
 +  * ZPar
 +  * Stanford parser
 +
 +=== Details on Czech parsing ===
 +A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
 +
 +
 +===== Tectogrammatical Parsing =====
 +=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===
 +  * a scenario for rule-based tree transformation
 +  * Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
 +
 +===== Named Entity Recognition =====
 +=== NE recognizers integrated in Treex ===
 +  * Jana Straková's SVM-based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
 +  * Stanford Named Entity Recognizer for Czech
 +
 +===== Machine Translation =====
 +
 +=== MT implemented in Treex ===
 +  * elaborate English->Czech tecto-based translation
 +  * prototype of Czech->English tecto-based translation
 +
 +===== Coreference resolution =====
 +=== Coreference resolvers integrated in Treex ===
 +  * simple rule-based baseline resolvers for Czech and English
 +  * Michal Novák's trainable resolvers
 +  * Ngụy Giang Linh's trainable (perceptron-based) resolver
 +
 +===== Spell Checking =====
 +
 +===== Text Similarity =====
 +
 +===== Recasing =====
 +
 +===== Diacritic Reconstruction =====
 +
 +====== Other tasks ======
 +
 +Word Sense Disambiguation
 +Relationship Extraction
 +Topic Segmentation
 +Information Retrieval
 +Information Extraction
 +Text Summarization
 +Speech Reconstruction
 +Question Answering
 +Sentiment Analysis
  
