Overview of NLP/CL tools available at UFAL
Tokenization (word segmentation)
Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.), the task is much more difficult.
Europarl tokenizer
- description: A sample rule-based tokenizer. It can use a list of nonbreaking prefixes, i.e. strings that are usually followed by a dot but do not end a sentence. Distributed as part of the Europarl tools.
- version: v6 (Jan 2012)
- author: Philipp Koehn and Josh Schroeder
- licence: free
- url: http://www.statmt.org/europarl/
- languages: in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
- efficiency: NA
- reference:
@inproceedings{Koehn:2005, author = {Philipp Koehn}, booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}}, pages = {79--86}, title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}}, address = {Phuket, Thailand}, year = {2005}}
- contact:
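The idea of a rule-based tokenizer with nonbreaking prefixes can be illustrated by the following minimal Python sketch. It is not the Europarl tokenizer itself and does not reproduce its rules: the prefix list, the punctuation set, and the function name `tokenize` are assumptions made up for this example. It splits on whitespace, separates common punctuation marks, and detaches a trailing dot unless the preceding string is a listed prefix or already contains a dot.

```python
import re

# Hypothetical, very short prefix list used only for illustration;
# real prefix lists are language-specific and much longer.
NONBREAKING_PREFIXES = frozenset({"Mr", "Mrs", "Dr", "Prof", "etc"})

# Punctuation that is always split off as a separate token.
# The dot is handled separately below.
_PUNCT = re.compile(r"([,;:!?()\"])")


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split text into tokens, keeping dots attached to nonbreaking prefixes."""
    tokens = []
    # Surround punctuation with spaces, then split on whitespace.
    for chunk in _PUNCT.sub(r" \1 ", text).split():
        if chunk.endswith(".") and len(chunk) > 1:
            stem = chunk[:-1]
            # Keep the dot when the stem is a known prefix or itself
            # contains a dot (e.g. an abbreviation such as "U.S.").
            if stem in prefixes or "." in stem:
                tokens.append(chunk)
            else:
                tokens.extend([stem, "."])
        else:
            tokens.append(chunk)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived, finally."))
```

For languages without space-separated words (Chinese, Japanese, etc.), a sketch of this kind does not apply, because the initial whitespace split yields no word boundaries.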
Language Identification
Sentence Segmentation
Morphological Segmentation
Morphological Analysis
Part-of-Speech Tagging
POS Taggers integrated in Treex
- Featurama
- Morce
- MxPost tagger
- Tree tagger
- TnT tagger
- Jan Hajič's tagger
- a number of toy tagger prototypes (students' assignments) for about ten languages
Details on Czech Tagging
A Guide to Czech Language Tagging at UFAL http://ufal.mff.cuni.cz/czech-tagging/
Lemmatization
Lemmatizers integrated in Treex
* Martin Popel's lemmatizer for English
* a number of toy lemmatizers for about ten languages (students' homework assignments)
* for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
Analytical Parsing
Analytical parsers integrated in Treex
* Ryan McDonald's MST parser
* Rudolf Rosa's MST parser
* MALT parser
* ZPar
* Stanford parser
Details on Czech Parsing
A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
Tectogrammatical Parsing
Conversion of analytical trees to tectogrammatical trees integrated in Treex
* a scenario for rule-based tree transformation
* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
Named Entity Recognition
Machine Translation
Coreference Resolution
Spell Checking
Text Similarity
Recasing
Diacritic Reconstruction
Other tasks
Word Sense Disambiguation
Relationship Extraction
Topic Segmentation
Information Retrieval
Information Extraction
Text Summarization
Speech Reconstruction
Question Answering
Sentiment Analysis