Overview of NLP/CL tools available at UFAL
Tokenization (word segmentation)
Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.) the task is much more difficult.
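As a minimal sketch of the basic task for a space-separated language (this is an illustration only, not any particular UFAL tool): punctuation is separated into its own tokens and the text is split on whitespace.

```python
import re

def tokenize(text):
    """Very naive tokenizer for space-separated languages:
    puts spaces around punctuation marks, then splits on whitespace."""
    text = re.sub(r'([.,!?;:()"])', r' \1 ', text)
    return text.split()

print(tokenize("Hello, world! This is a test."))
# ['Hello', ',', 'world', '!', 'This', 'is', 'a', 'test', '.']
```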
Europarl tokenizer
- info: A simple rule-based tokenizer; it can use a list of nonbreaking prefixes, i.e. words that are usually followed by a dot but do not end a sentence (see the sketch after this list). Distributed as a part of the Europarl tools.
- version: v6 (Jan 2012)
- author: Philipp Koehn and Josh Schroeder
- licence: free
- languages: applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
- efficiency: NA
- contact:
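To illustrate the nonbreaking-prefix idea, here is a rough sketch (not the actual Europarl Perl code; the prefix set below is a made-up fragment): a final period stays attached to tokens found in the prefix list instead of being split off.

```python
import re

# Hypothetical fragment of a nonbreaking-prefix list (the real Europarl
# tools ship per-language prefix files for the languages listed above).
NONBREAKING_PREFIXES = {"Dr", "Mr", "Mrs", "etc", "e.g", "i.e"}

def tokenize(text):
    """Split on whitespace and detach punctuation, but keep the period
    attached to tokens listed as nonbreaking prefixes."""
    tokens = []
    for word in text.split():
        if word.endswith(".") and word[:-1] in NONBREAKING_PREFIXES:
            tokens.append(word)  # e.g. "Dr." stays one token
        else:
            # detach punctuation into separate tokens
            tokens.extend(t for t in re.split(r'([.,!?;:()"])', word) if t)
    return tokens

print(tokenize("Dr. Smith arrived in Prague."))
# ['Dr.', 'Smith', 'arrived', 'in', 'Prague', '.']
```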