Overview of NLP/CL tools available at UFAL
Tokenization (word segmentation)
Segmentation of text into tokens (words, punctuation marks, etc.). For languages that separate words with spaces (English, Czech, etc.), the task is relatively easy. For languages written without spaces between words (Chinese, Japanese, etc.), it is much more difficult.
Europarl tokenizer
- description: A sample rule-based tokenizer. It can use a list of nonbreaking prefixes, i.e. abbreviations that are usually followed by a dot that does not end a sentence. Distributed as part of the Europarl tools.
- version: v6 (Jan 2012)
- author: Philipp Koehn and Josh Schroeder
- licence: free
- url: http://www.statmt.org/europarl/
- languages: in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
- efficiency: NA
- reference:
@inproceedings{Koehn:2005, author = {Philipp Koehn}, booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}}, pages = {79--86}, title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}}, address = {Phuket, Thailand}, year = {2005}}
- contact:
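To illustrate the idea of nonbreaking prefixes mentioned in the description above, here is a minimal Python sketch of a rule-based tokenizer for space-separated text. It is not the code of the Europarl tool; the function name `tokenize`, the tiny prefix set `NONBREAKING_PREFIXES`, and the punctuation rule are assumptions made for this example only.

```python
import re

# Hypothetical, very small sample of nonbreaking prefixes (an assumption for
# this sketch); real per-language lists are much longer.
NONBREAKING_PREFIXES = frozenset({"Mr", "Mrs", "Dr", "Prof", "etc", "e.g", "i.e"})

# Punctuation that this sketch always splits off as a separate token.
_PUNCT = re.compile(r'([,;:!?()"])')


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split space-separated text into tokens, keeping known abbreviations whole."""
    # Step 1: put spaces around unambiguous punctuation marks.
    text = _PUNCT.sub(r" \1 ", text)

    tokens = []
    for word in text.split():
        # Step 2: a trailing dot is split off unless the word before it
        # is a known nonbreaking prefix.
        if len(word) > 1 and word.endswith("."):
            stem = word[:-1]
            if stem in prefixes:
                tokens.append(word)          # abbreviation: keep the dot attached
            else:
                tokens.extend([stem, "."])   # ordinary word: dot is its own token
        else:
            tokens.append(word)
    return tokens


print(tokenize("Mr. Smith arrived, e.g. by train."))
# ['Mr.', 'Smith', 'arrived', ',', 'e.g.', 'by', 'train', '.']
```

The sketch deliberately ignores ellipses, numbers with decimal points, apostrophes, and other cases that a production tokenizer has to handle; it only shows why a prefix list prevents "Mr." from being split into two tokens.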
Tokenizers integrated in Treex
* rule-based (reg.exp.) tokenizers
* trainable tokenizer TextSeg
Language Identification
Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf
Sentence Segmentation
Morphological Segmentation
Morphological Analysis
Morphological Analyzers integrated in Treex
* Jan Hajič's Czech morphological analyzer
* toy analyzers for about ten languages (student homework assignments)
Part-of-Speech Tagging
POS Taggers integrated in Treex
- Featurama
- Morce
- MxPost tagger
- Tree tagger
- TnT tagger
- Jan Hajič's tagger
- a number of toy tagger prototypes (students' assignments) for about ten languages
Details on Czech Tagging
A Guide to Czech Language Tagging at UFAL http://ufal.mff.cuni.cz/czech-tagging/
Lemmatization
Lemmatizers integrated in Treex
* Martin Popel's lemmatizer for English
* a number of toy lemmatizers for about ten languages (student homework assignments)
* for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
Analytical Parsing
Analytical parsers integrated in Treex
* Ryan McDonald's MST parser
* Rudolf Rosa's MST parser
* MALT parser
* ZPar
* Stanford parser
Details on Czech parsing
A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
Tectogrammatical Parsing
Conversion of analytical trees to tectogrammatical trees integrated in Treex
* a scenario for rule-based tree transformation
* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
Named Entity Recognition
NE recognizers integrated in Treex
* Jana Straková's SVM based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
* Stanford Named Entity Recognizer for Czech
Machine Translation
MT implemented in Treex
* elaborate English→Czech tecto-based translation
* prototype of Czech→English tecto-based translation
Coreference resolution
Coreference resolvers integrated in Treex
* simple rule-based baseline resolvers for Czech and English
* Michal Novák's trainable resolvers
* Ngụy Giang Linh's trainable (perceptron-based) resolver
Spell Checking
Text Similarity
Recasing
Diacritic Reconstruction
Other tasks
Word Sense Disambiguation
Relationship Extraction
Topic Segmentation
Information Retrieval
Information Extraction
Text Summarization
Speech Reconstruction
Question Answering
Sentiment Analysis