Segmentation of text into tokens (words, punctuation marks, etc.). For languages that use space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.), the task is much more difficult.
description: | A sample rule-based tokenizer that can use a list of prefixes which are usually followed by a dot but do not end a sentence. Distributed as part of the Europarl tools. |
version: | v6 (Jan 2012) |
author: | Philipp Koehn and Josh Schroeder |
licence: | free |
url: | http://www.statmt.org/europarl/ |
languages: | in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV. |
efficiency: | NA |
reference: | @inproceedings{Koehn:2005, author = {Philipp Koehn}, booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}}, pages = {79--86}, title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}}, address = {Phuket, Thailand}, year = {2005}} |
contact: |
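The nonbreaking-prefix idea described in the table above can be illustrated with a minimal sketch. This is not the code of the Europarl tokenizer; the prefix list, the regular expression, and the function name `tokenize` are assumptions made up for this example, and real prefix lists are language-specific and much longer.

```python
import re

# Hypothetical, tiny prefix list for illustration only.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"}

# A word (optionally with internal or trailing . ' -) or a single punctuation mark.
PIECE_RE = re.compile(r"\w[\w.'-]*|[^\w\s]")


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split text into tokens, keeping the dot attached to known prefixes."""
    tokens = []
    for piece in PIECE_RE.findall(text):
        if piece.endswith(".") and len(piece) > 1:
            stem = piece[:-1]
            if stem in prefixes:
                # Known abbreviation: the dot belongs to the token.
                tokens.append(piece)
            else:
                # Ordinary word: the dot is split off as its own token.
                tokens.extend([stem, "."])
        else:
            tokens.append(piece)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived, e.g. by train."))
```

In this sketch a trailing dot is split off unless the word before it is in the prefix list, so "Dr." and "e.g." stay whole while "train." becomes two tokens. A production tokenizer would need further rules, for example for numbers, URLs, and sentence-final abbreviations.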
* rule-based (reg.exp.) tokenizers
* trainable tokenizer TextSeg
Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf
* rule-based segmenters
* TextSeg (trainable)
* Jan Hajič's Czech morphological analyzer
* toy analyzers for about ten languages (student homework)
A Guide to Czech Language Tagging at UFAL http://ufal.mff.cuni.cz/czech-tagging/
* Martin Popel's lemmatizer for English
* a number of toy lemmatizers for about ten languages (student homework)
* for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
* Ryan McDonald's MST parser
* Rudolf Rosa's MST parser
* MALT parser
* ZPar
* Stanford parser
A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
* a scenario for rule-based tree transformation
* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
* Jana Straková's SVM based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
* Stanford Named Entity Recognizer for Czech
* elaborated English→Czech tecto-based translation
* prototype of Czech→English tecto-based translation
* simple rule-based baseline resolvers for Czech and English
* Michal Novák's trainable resolvers
* Ngụy Giang Linh's trainable (perceptron-based) resolver
Word Sense Disambiguation
Relationship Extraction
Topic Segmentation
Information Retrieval
Information Extraction
Text Summarization
Speech Reconstruction
Question Answering
Sentiment Analysis