Overview of NLP/CL tools available at UFAL
Tokenization (word segmentation)
Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.) the task is much more difficult.
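As a minimal sketch of the basic task for a space-separated language (this is an illustration only, not any particular UFAL tool): punctuation is separated into its own tokens and the text is split on whitespace.

```python
import re

def tokenize(text):
    """Very naive tokenizer for space-separated languages:
    puts spaces around punctuation marks, then splits on whitespace."""
    text = re.sub(r'([.,!?;:()"])', r' \1 ', text)
    return text.split()

print(tokenize("Hello, world! This is a test."))
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```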
Europarl tokenizer
- info: A simple rule-based tokenizer; it can use a list of nonbreaking prefixes, i.e. words that are usually followed by a dot but do not end a sentence (see the sketch after this list). Distributed as a part of the Europarl tools.
- version: v6 (Jan 2012)
- author: Philipp Koehn and Josh Schroeder
- licence: free
- languages: applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
- efficiency: NA
- contact:
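To illustrate the nonbreaking-prefix idea, here is a rough sketch (not the actual Europarl Perl code; the prefix set below is a made-up fragment): a final period stays attached to tokens found in the prefix list instead of being split off.

```python
import re

# Hypothetical fragment of a nonbreaking-prefix list (the real Europarl
# tools ship per-language prefix files for the languages listed above).
NONBREAKING_PREFIXES = {"Dr", "Mr", "Mrs", "etc", "e.g", "i.e"}

def tokenize(text):
    """Split on whitespace and detach punctuation, but keep the period
    attached to tokens listed as nonbreaking prefixes."""
    tokens = []
    for word in text.split():
        if word.endswith(".") and word[:-1] in NONBREAKING_PREFIXES:
            tokens.append(word)  # e.g. "Dr." stays one token
        else:
            # detach punctuation into separate tokens
            tokens.extend(t for t in re.split(r'([.,!?;:()"])', word) if t)
    return tokens

print(tokenize("Dr. Smith arrived in Prague."))
# ['Dr.', 'Smith', 'arrived', 'in', 'Prague', '.']
```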