Overview of NLP/CL tools available at UFAL

Tokenization (word segmentation)

Segmentation of text into tokens (words, punctuation marks, etc.). For languages that separate words with spaces (English, Czech, etc.), the task is relatively easy. For languages written without spaces between words (Chinese, Japanese, etc.), the task is much more difficult.
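
For space-separated languages, a first approximation of the task can be sketched with a single regular expression. The Python snippet below is only an illustration: the name simple_tokenize and the pattern are assumptions made for this example and do not come from any tool listed on this page.

```python
import re

# Hypothetical pattern, for illustration only: a run of word characters is one
# token, and every other non-space character is a token of its own.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def simple_tokenize(text):
    """Split space-separated text into word and punctuation tokens."""
    return TOKEN_RE.findall(text)

print(simple_tokenize("Hello, world! Dobrý den."))
```

A pattern like this detaches the dot from abbreviations such as "Mr." and returns an unsegmented run of Chinese characters as a single token, which illustrates why the second group of languages needs a different approach.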

Europarl tokenizer

description: A sample rule-based tokenizer that can use a list of nonbreaking prefixes, i.e. strings that are usually followed by a dot but do not end a sentence. Distributed as part of the Europarl tools.
version: v6 (Jan 2012)
author: Philipp Koehn and Josh Schroeder
licence: free
url: http://www.statmt.org/europarl/
languages: in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
efficiency: NA
reference:

@inproceedings{Koehn:2005,
author = {Philipp Koehn},
booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
pages = {79--86},
title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
address = {Phuket, Thailand},
year = {2005}}

contact:
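
The role of a nonbreaking prefix list can be sketched as follows. This is a simplified Python illustration, not the code of the Europarl tokenizer; the prefix set and the function name tokenize_with_prefixes are assumptions made for the example, and only a trailing dot is handled.

```python
# Tiny illustrative prefix set (an assumption for this example); real
# per-language lists are much longer.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"}

def tokenize_with_prefixes(text, prefixes=NONBREAKING_PREFIXES):
    """Detach a trailing dot from a word unless the word is a known prefix."""
    tokens = []
    for word in text.split():
        if word.endswith(".") and len(word) > 1 and word[:-1] not in prefixes:
            # Ordinary word followed by a full stop: split the dot off.
            tokens.extend([word[:-1], "."])
        else:
            # Known abbreviation, or no trailing dot: keep the word intact.
            tokens.append(word)
    return tokens

print(tokenize_with_prefixes("Dr. Smith met Mr. Jones."))
# ['Dr.', 'Smith', 'met', 'Mr.', 'Jones', '.']
```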

Tokenizers integrated in Treex

* rule-based (reg.exp.) tokenizers
* trainable tokenizer TextSeg

Language Identification

Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf

Sentence Segmentation

Segmenters integrated in Treex

* rule-based segmenters
* TextSeg (trainable)
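
To give an idea of what a rule-based segmenter does, here is a minimal Python sketch. It is not the Treex implementation; the single rule (a sentence ends at ".", "!" or "?" when followed by whitespace and an uppercase letter, or by the end of the text) and the name split_sentences are assumptions made for illustration.

```python
def split_sentences(text):
    """Split text at '.', '!' or '?' followed by whitespace + uppercase."""
    sentences, start = [], 0
    for i, ch in enumerate(text):
        if ch in ".!?":
            rest = text[i + 1:]
            stripped = rest.lstrip()
            # Boundary if nothing but whitespace follows, or if whitespace
            # is followed by an uppercase letter (str.isupper is Unicode-aware).
            if not stripped or (rest != stripped and stripped[0].isupper()):
                sentences.append(text[start:i + 1].strip())
                start = i + 1
    tail = text[start:].strip()
    if tail:
        # Keep any trailing text that has no final punctuation.
        sentences.append(tail)
    return sentences

print(split_sentences("Version 3.5 is out. Does it work? Yes!"))
# ['Version 3.5 is out.', 'Does it work?', 'Yes!']
```

A rule this simple would wrongly split after abbreviations such as "Dr." in "Dr. Smith", which is the kind of case that abbreviation lists or a trainable segmenter are meant to handle.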

Morphological Segmentation

Morphological Analysis

Morphological Analyzers integrated in Treex

* Jan Hajič's Czech morphological analyzer
* toy analyzers for about ten languages (student homework assignments)

Part-of-Speech Tagging

POS Taggers integrated in Treex

* Featurama
* Morce
* MxPost tagger
* Tree tagger
* TnT tagger
* Jan Hajič's tagger
* a number of toy tagger prototypes (students' assignments) for about ten languages

Details on Czech Tagging

A Guide to Czech Language Tagging at UFAL http://ufal.mff.cuni.cz/czech-tagging/

Lemmatization

Lemmatizers integrated in Treex

* Martin Popel's lemmatizer for English
* a number of toy lemmatizers for about ten languages (student homework assignments)
* for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
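
The last point can be illustrated with a toy example: when a morphological analyzer offers several (lemma, tag) candidates for a word form, choosing the tag also fixes the lemma. The Python sketch below is hypothetical; the two-entry lexicon, the coarse NOUN/VERB tags, and the function name lemma_for are assumptions for illustration and are not the output or interface of any tool listed here.

```python
# Toy analyses: each word form maps to its candidate (lemma, tag) pairs.
# The entries and the coarse tags are illustrative assumptions.
ANALYSES = {
    "ženu": [("žena", "NOUN"), ("hnát", "VERB")],
    "vidím": [("vidět", "VERB")],
}

def lemma_for(form, chosen_tag, analyses=ANALYSES):
    """Return the lemma of the analysis whose tag the tagger selected."""
    for lemma, tag in analyses.get(form, []):
        if tag == chosen_tag:
            return lemma
    return None  # no analysis of this form carries the chosen tag

print(lemma_for("ženu", "NOUN"))  # žena
print(lemma_for("ženu", "VERB"))  # hnát
```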

Analytical Parsing

Analytical parsers integrated in Treex

* Ryan McDonald's MST parser
* Rudolf Rosa's MST parser
* MALT parser
* ZPar
* Stanford parser

Details on Czech parsing

A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/

Tectogrammatical Parsing

Conversion of analytical trees to tectogrammatical trees integrated in Treex

* a scenario for rule-based tree transformation
* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT

Named Entity Recognition

NE recognizers integrated in Treex

* Jana Straková's SVM based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
* Stanford Named Entity Recognizer for Czech

Machine Translation

MT implemented in Treex

* elaborate English→Czech tecto-based translation
* prototype of Czech→English tecto-based translation

Coreference resolution

Coreference resolvers integrated in Treex

* simple rule-based baseline resolvers for Czech and English
* Michal Novák's trainable resolvers
* Ngụy Giang Linh's trainable (perceptron-based) resolver

Spell Checking

Text Similarity

Recasing

Diacritic Reconstruction

Other tasks

Word Sense Disambiguation
Relationship Extraction
Topic Segmentation
Information Retrieval
Information Extraction
Text Summarization
Speech Reconstruction
Question Answering
Sentiment Analysis

Institute of Formal and Applied Linguistics Wiki
