
Overview of NLP/CL tools available at UFAL

Tokenization (word segmentation)

Segmentation of text into tokens (words, punctuation marks, etc.). For languages with space-separated words (English, Czech, etc.), the task is relatively easy; for other languages (Chinese, Japanese, etc.), it is much more difficult.
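
A minimal Python illustration of the difference (the regular expression below is a simplified sketch, not taken from any of the tools listed on this page):

  import re

  # For space-separated languages a single regular expression already gives a
  # reasonable first approximation: runs of word characters, or single
  # punctuation marks.
  print(re.findall(r"\w+|[^\w\s]", "Mr. Smith bought 300 shares."))
  # ['Mr', '.', 'Smith', 'bought', '300', 'shares', '.']

  # Chinese or Japanese text has no spaces to exploit, so the same approach
  # returns the whole clause as one "token" and a real segmenter is needed:
  print(re.findall(r"\w+|[^\w\s]", "我喜欢自然语言处理。"))
  # ['我喜欢自然语言处理', '。']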

Europarl tokenizer

description: A sample rule-based tokenizer; it can use a list of nonbreaking prefixes, i.e. prefixes that are usually followed by a dot but do not end a sentence (see the sketch after this entry). Distributed as part of the Europarl tools.
version: v6 (Jan 2012)
author: Philipp Koehn and Josh Schroeder
licence: free
url: http://www.statmt.org/europarl/
languages: in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
efficiency: NA
reference:
@inproceedings{Koehn:2005,
author = {Philipp Koehn},
booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
pages = {79--86},
title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
address = {Phuket, Thailand},
year = {2005}}
contact:
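
A minimal Python sketch of the nonbreaking-prefix idea described above (the actual tool is a Perl script; the prefix list and the regular expression here are simplified illustrations, not the Europarl code):

  import re

  # Illustrative prefix list (not the actual Europarl one): tokens that usually
  # carry a trailing dot without marking a sentence boundary.
  NONBREAKING_PREFIXES = {"Dr", "Mr", "Mrs", "etc", "e.g", "i.e"}

  def tokenize(text):
      """A tiny rule-based tokenizer in the spirit of the Europarl script."""
      tokens = []
      for word in text.split():
          if word.endswith(".") and word[:-1] in NONBREAKING_PREFIXES:
              # Keep the dot attached to a known nonbreaking prefix.
              tokens.append(word)
          else:
              # Otherwise separate punctuation marks from word characters.
              tokens.extend(re.findall(r"\w+|[^\w\s]", word))
      return tokens

  print(tokenize("Dr. Smith arrived late."))
  # ['Dr.', 'Smith', 'arrived', 'late', '.']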

Tokenizers integrated in Treex

* rule-based (regular expression) tokenizers
* trainable tokenizer TextSeg

Language Identification

Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf

Sentence Segmentation

Segmenters integrated in Treex

* rule-based segmenters
* TextSeg (trainable)

Morphological Segmentation

Morphological Analysis

Morphological Analyzers integrated in Treex

* Jan Hajič's Czech morphological analyzer
* toy analyzers for about ten languages (students' homework assignments)

Part-of-Speech Tagging

POS Taggers integrated in Treex

Details on Czech Tagging

A Guide to Czech Language Tagging at UFAL http://ufal.mff.cuni.cz/czech-tagging/

Lemmatization

Lemmatizers integrated in Treex

* Martin Popel's lemmatizer for English
* a number of toy lemmatizers for about ten languages (students' homework assignments)
* for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization as well (see the sketch after this list)
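
As an illustration of that last point, here is a minimal sketch assuming the ufal.morphodita Python bindings to MorphoDiTa (one UFAL Czech tagger of this kind; the model file name below is a placeholder): a single tagging call assigns each token its lemma and its positional tag at the same time.

  from ufal.morphodita import Tagger, Forms, TaggedLemmas, TokenRanges

  # Placeholder model name; any czech-morfflex-pdt tagger model works here.
  tagger = Tagger.load("czech-morfflex-pdt.tagger")

  forms, lemmas, tokens = Forms(), TaggedLemmas(), TokenRanges()
  tokenizer = tagger.newTokenizer()
  tokenizer.setText("Viděl jsem psy.")
  while tokenizer.nextSentence(forms, tokens):
      # Disambiguation picks one analysis per token: the lemma comes with the tag.
      tagger.tag(forms, lemmas)
      for i in range(len(lemmas)):
          print(forms[i], lemmas[i].lemma, lemmas[i].tag)
  # prints form, lemma and positional tag per token, e.g.  psy  pes  NNMP4-----A----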

Analytical Parsing

Analytical parsers integrated in Treex

* Ryan McDonald's MST parser
* Rudolf Rosa's MST parser
* MALT parser
* ZPar
* Stanford parser

Details on Czech parsing

A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/

Tectogrammatical Parsing

Conversion of analytical trees to tectogrammatical trees integrated in Treex

* a scenario for rule-based tree transformation
* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT

Named Entity Recognition

NE recognizers integrated in Treex

* Jana Straková's SVM-based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
* Stanford Named Entity Recognizer for Czech

Machine Translation

MT implemented in Treex

* an elaborate English→Czech tecto-based translation pipeline
* a prototype of Czech→English tecto-based translation

Coreference resolution

Coreference resolvers integrated in Treex

* simple rule-based baseline resolvers for Czech and English
* Michal Novák's trainable resolvers
* Ngụy Giang Linh's trainable (perceptron-based) resolver

Spell Checking

Text Similarity

Recasing

Diacritic Reconstruction

Other tasks

Word Sense Disambiguation
Relationship Extraction
Topic Segmentation
Information Retrieval
Information Extraction
Text Summarization
Speech Reconstruction
Question Answering
Sentiment Analysis

