Overview of NLP/CL tools available at UFAL
Tokenization (word segmentation)
Segmentation of text into tokens (words, punctuation marks, etc.). For languages that use space-separated words (English, Czech, etc.), the task is relatively easy; for other languages (Chinese, Japanese, etc.) it is much more difficult. A small illustrative sketch of the easy case follows below.
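For illustration only, here is a minimal Python sketch of rule-based tokenization for space-separated languages. It is not the Europarl tokenizer described below; it merely shows the basic idea of separating words from punctuation, and its naive handling of "Dr." shows why real tokenizers also need nonbreaking-prefix lists for abbreviations.

  import re

  # Match either a run of word characters or a single non-space,
  # non-word character (punctuation mark).
  TOKEN_RE = re.compile(r"\w+|[^\w\s]", re.UNICODE)

  def tokenize(text: str) -> list[str]:
      """Split text into word and punctuation tokens."""
      return TOKEN_RE.findall(text)

  if __name__ == "__main__":
      print(tokenize("Dr. Smith arrived, didn't he?"))
      # ['Dr', '.', 'Smith', 'arrived', ',', 'didn', "'", 't', 'he', '?']
      # Note: "Dr." is split naively; nonbreaking-prefix lists (as used by
      # the Europarl tokenizer) are what prevent this in practice.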
Europarl tokenizer
- info: A sample tokenizer, distributed as part of the Europarl tools
- version: v6 (Jan 2012)
- author: Philipp Koehn and Josh Schroeder
- licence: free
- languages: applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
- performance:
- contact: