Overview of NLP/CL tools available at UFAL
Tokenization (word segmentation)
Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.), the task is much more difficult.
Europarl tokenizer
- info: A sample tokenizer
- author: Philipp Koehn and Josh Schroeder
- licensing: free
- languages: applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
- contact:
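For space-separated languages, tokenization mostly amounts to splitting off punctuation while keeping abbreviations (the "nonbreaking prefixes" mentioned above) intact. The following is a minimal illustrative sketch of that idea in Python; it is not the Europarl tokenizer's actual code, and the prefix list is a tiny hypothetical subset.

```python
import re

# Illustrative subset of nonbreaking prefixes (abbreviations whose
# trailing period should stay attached to the word).
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "etc"}

def tokenize(text):
    """Split whitespace-delimited chunks, detaching trailing punctuation
    unless it is a period following a known nonbreaking prefix."""
    tokens = []
    for chunk in text.split():
        m = re.match(r"^(.*?)([.,!?;:]+)$", chunk)
        if m and not (m.group(2) == "." and m.group(1) in NONBREAKING_PREFIXES):
            if m.group(1):
                tokens.append(m.group(1))
            tokens.extend(m.group(2))  # each punctuation mark as its own token
        else:
            tokens.append(chunk)
    return tokens

print(tokenize("Dr. Smith arrived, finally."))
# → ['Dr.', 'Smith', 'arrived', ',', 'finally', '.']
```

A real tokenizer additionally handles quotes, dashes, numbers with internal periods, and per-language prefix files, which is where the language-specific resources (DE, EL, EN, ...) come in.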
Language Identification
Sentence Segmentation
Morphological Segmentation
Morphological Analysis
Part-of-Speech Tagging
Lemmatization
Analytical Parsing
Tectogrammatical Parsing
Named Entity Recognition
Machine Translation
Coreference resolution
Spell Checking
Text Similarity
Recasing
Diacritics Restoration