====== Overview of NLP/CL tools available at UFAL ======

===== Tokenization (word segmentation) =====
Segmentation of text into tokens (words, punctuation marks, etc.). For languages that separate words with spaces (English, Czech, etc.), the task is relatively easy. For languages that do not (Chinese, Japanese, etc.), it is much more difficult.

=== Europarl tokenizer ===
  * **description:** A sample rule-based tokenizer. It can use a list of nonbreaking prefixes, i.e. strings that are usually followed by a period but do not end a sentence. Distributed as part of the Europarl tools.
  * **version:** v6 (Jan 2012) 
  * **author:** Philipp Koehn and Josh Schroeder
  * **licence:** free
  * **url:** http://www.statmt.org/europarl/
  * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
  * **efficiency:** NA
  * **reference:**

  @inproceedings{Koehn:2005,
  author = {Philipp Koehn},
  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
  pages = {79--86},
  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
  address = {Phuket, Thailand},
  year = {2005}}

  * **contact:**
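
The idea of nonbreaking prefixes described above can be illustrated with a minimal Python sketch. This is not the Europarl tokenizer's code: the function name ''tokenize'' and the tiny prefix set are hypothetical and chosen only for illustration, and the real tool ships much longer per-language prefix lists and handles many more cases.

```python
import re

# Hypothetical, tiny prefix set for illustration only (an assumption,
# not the list distributed with the Europarl tools).
NONBREAKING_PREFIXES = frozenset({"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"})


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split space-separated text into word and punctuation tokens.

    A period stays attached to the preceding word when that word is a
    known nonbreaking prefix; otherwise punctuation is split off.
    """
    tokens = []
    for chunk in text.split():
        # Leading punctuation, the core word, and trailing punctuation.
        lead, core, trail = re.fullmatch(r"(\W*)(.*?)(\W*)", chunk).groups()
        tokens.extend(lead)  # each leading punctuation mark is its own token
        if core:
            if trail.startswith(".") and core in prefixes:
                # Keep the period with the prefix, e.g. "Dr." or "e.g."
                core += "."
                trail = trail[1:]
            tokens.append(core)
        tokens.extend(trail)  # each trailing punctuation mark is its own token
    return tokens


print(tokenize("Dr. Smith arrived, e.g. yesterday."))
# ['Dr.', 'Smith', 'arrived', ',', 'e.g.', 'yesterday', '.']
```

Without the prefix list, the periods in "Dr." and "e.g." would be split off as separate tokens and could later be mistaken for sentence boundaries.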



=== Tokenizers integrated in Treex ===
  * rule-based (regular-expression) tokenizers
  * TextSeg, a trainable tokenizer

===== Language Identification =====
Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf

===== Sentence Segmentation =====
=== Segmenters integrated in Treex ===
  * rule-based segmenters
  * TextSeg (trainable)

===== Morphological Segmentation =====

===== Morphological Analysis =====
=== Morphological Analyzers integrated in Treex ===
  * Jan Hajič's Czech morphological analyzer
  * toy analyzers for about ten languages (students' assignments)

===== Part-of-Speech Tagging =====

=== POS Taggers integrated in Treex ===
  * Featurama
  * Morce
  * MxPost tagger
  * Tree tagger
  * TnT tagger
  * Jan Hajič's tagger
  * a number of toy tagger prototypes (students' assignments) for about ten languages

=== Details on Czech Tagging ===
A Guide to Czech Language Tagging at UFAL  http://ufal.mff.cuni.cz/czech-tagging/

===== Lemmatization =====
=== Lemmatizers integrated in Treex ===
  * Martin Popel's lemmatizer for English
  * a number of toy lemmatizers for about ten languages (students' assignments)
  * for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization

===== Analytical Parsing =====
=== Analytical parsers integrated in Treex ===
  * Ryan McDonald's MST parser
  * Rudolf Rosa's MST parser
  * MALT parser
  * ZPar
  * Stanford parser

=== Details on Czech parsing ===
A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/


===== Tectogrammatical Parsing =====
=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===
  * a scenario for rule-based tree transformation
  * Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT

===== Named Entity Recognition =====
=== NE recognizers integrated in Treex ===
  * Jana Straková's SVM-based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
  * Stanford Named Entity Recognizer for Czech

===== Machine Translation =====

=== MT implemented in Treex ===
  * a well-developed English->Czech tecto-based translation system
  * a prototype of Czech->English tecto-based translation

===== Coreference resolution =====
=== Coreference resolvers integrated in Treex ===
  * simple rule-based baseline resolvers for Czech and English
  * Michal Novák's trainable resolvers
  * Ngụy Giang Linh's trainable (perceptron-based) resolver

===== Spell Checking =====

===== Text Similarity =====

===== Recasing =====

===== Diacritic Reconstruction =====

====== Other tasks ======

  * Word Sense Disambiguation
  * Relationship Extraction
  * Topic Segmentation
  * Information Retrieval
  * Information Extraction
  * Text Summarization
  * Speech Reconstruction
  * Question Answering
  * Sentiment Analysis