Institute of Formal and Applied Linguistics Wiki



Table of Contents

Overview of NLP/CL tools available at UFAL

Tokenization (word segmentation)

Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.) the task is much more difficult.
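For space-separated languages, a minimal sketch of such a tokenizer can be a single regular expression that splits off punctuation from words (this is an illustrative assumption, not how the Europarl tokenizer below is implemented; languages like Chinese or Japanese need a dictionary- or model-based segmenter instead):

```python
import re

def tokenize(text):
    """Split space-delimited text into word and punctuation tokens."""
    # \w+ matches runs of word characters; [^\w\s] matches each
    # punctuation character separately, so "world!" becomes two tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

A real tokenizer also has to handle abbreviations, numbers, URLs and language-specific clitics, which is why dedicated tools such as the Europarl tokenizer exist.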

Europarl tokenizer

Language Identification

Sentence Segmentation

Morphological Segmentation

Morphological Analysis

Part-of-Speech Tagging

Lemmatization

Analytical Parsing

Tectogrammatical Parsing

Named Entity Recognition

Machine Translation

Coreference resolution

Spell Checking

Text Similarity

Recasing

Diacritic Reconstruction

Other tasks

Word Sense Disambiguation
Relationship Extraction
Topic Segmentation
Information Retrieval
Information Extraction
Text Summarization
Speech Reconstruction
Question Answering
Sentiment Analysis
