====== Overview of NLP/CL tools available at UFAL ======

===== Tokenization (word segmentation) =====

Segmentation of text into tokens (words, punctuation marks, etc.). For languages that use space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.), it is much more difficult.

=== Europarl tokenizer ===

  * **description:** A sample rule-based tokenizer; it can use a list of prefixes that are usually followed by a dot but do not end a sentence. Distributed as part of the Europarl tools.
  * **version:** v6 (Jan 2012)
  * **author:** Philipp Koehn and Josh Schroeder
  * **licence:** free
  * **url:** http://www.statmt.org/europarl/
  * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
  * **efficiency:** NA
  * **reference:** @inproceedings{Koehn:2005, author = {Philipp Koehn}, booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}}, pages = {79--86}, title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}}, address = {Phuket, Thailand}, year = {2005}}
  * **contact:**

=== Tokenizers integrated in Treex ===

  * rule-based (reg.exp.) tokenizers
  * trainable tokenizer TextSeg

===== Language Identification =====

Martin Majliš's language identifier (covers about 100 languages): http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf

===== Sentence Segmentation =====

=== Segmenters integrated in Treex ===

  * rule-based segmenters
  * TextSeg (trainable)

===== Morphological Segmentation =====

===== Morphological Analysis =====

=== Morphological Analyzers integrated in Treex ===

  * Jan Hajič's Czech morphological analyzer
  * toy analyzers for about ten languages (student homework assignments)

===== Part-of-Speech Tagging =====

=== POS Taggers integrated in Treex ===

  * Featurama
  * Morce
  * MxPost tagger
  * Tree tagger
  * TnT tagger
  * Jan Hajič's tagger
  * a number of toy tagger prototypes (student assignments) for about ten languages

=== Details on Czech Tagging ===

A Guide to Czech Language Tagging at UFAL: http://ufal.mff.cuni.cz/czech-tagging/

===== Lemmatization =====

=== Lemmatizers integrated in Treex ===

  * Martin Popel's lemmatizer for English
  * a number of toy lemmatizers for about ten languages (student homework assignments)
  * for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization

===== Analytical Parsing =====

=== Analytical parsers integrated in Treex ===

  * Ryan McDonald's MST parser
  * Rudolf Rosa's MST parser
  * MALT parser
  * ZPar
  * Stanford parser

=== Details on Czech parsing ===

A Complete Guide to Czech Language Parsing: http://ufal.mff.cuni.cz/czech-parsing/

===== Tectogrammatical Parsing =====

=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===

  * a scenario for rule-based tree transformation
  * Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT

===== Named Entity Recognition =====

=== NE recognizers integrated in Treex ===

  * Jana Straková's SVM-based recognizer for Czech: http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
  * Stanford Named Entity Recognizer for Czech

===== Machine Translation =====

=== MT implemented in Treex ===

  * elaborate English->Czech tecto-based translation
  * prototype of Czech->English tecto-based translation

===== Coreference resolution =====

=== Coreference resolvers integrated in Treex ===

  * simple rule-based baseline resolvers for Czech and English
  * Michal Novák's trainable resolvers
  * Ngụy Giang Linh's trainable (perceptron-based) resolver

===== Spell Checking =====

===== Text Similarity =====

===== Recasing =====

===== Diacritic Reconstruction =====

====== Other tasks ======

  * Word Sense Disambiguation
  * Relationship Extraction
  * Topic Segmentation
  * Information Retrieval
  * Information Extraction
  * Text Summarization
  * Speech Reconstruction
  * Question Answering
  * Sentiment Analysis
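
As an illustration of the rule-based approach described for the Europarl tokenizer in the Tokenization section (a list of prefixes that are usually followed by a dot but do not end a sentence), here is a minimal Python sketch. It is not the Europarl script itself: the function name ''tokenize'' and the tiny prefix list are assumptions made for this example, and the real tools ship per-language prefix files that are not reproduced here.

```python
import re

# Hypothetical prefix list for illustration only; not taken from the
# Europarl nonbreaking-prefix files.
NONBREAKING_PREFIXES = frozenset({"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"})


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Tokenize space-separated text, keeping the dot attached to known
    nonbreaking prefixes and splitting it off everywhere else."""
    tokens = []
    for word in text.split():
        # Chunks are either runs of word characters and dots, or single
        # punctuation marks other than the dot (commas, brackets, ...).
        for chunk in re.findall(r"[\w.]+|[^\w\s.]", word):
            stem = chunk[:-1]
            if not chunk.endswith(".") or not stem.strip("."):
                # No trailing dot, or the chunk consists only of dots.
                tokens.append(chunk)
            elif stem in prefixes:
                # Known prefix: the dot belongs to the abbreviation.
                tokens.append(chunk)
            else:
                # Ordinary word: the dot becomes a separate token.
                tokens.extend([stem, "."])
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived, e.g. at home."))
```

The sketch only covers the prefix rule; a production tokenizer would also need to handle cases such as ellipses attached to words, numbers followed by a dot, and language-specific apostrophe rules.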