====== Overview of NLP/CL tools available at UFAL ======
===== Tokenization =====

Segmentation of text into tokens (words, punctuation marks, etc.). For languages using space-separated words (English, Czech, etc.), the task is relatively easy. For other languages (Chinese, Japanese, etc.), the task is much more difficult.
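As a minimal illustration of the rule-based approach for space-delimited languages (a sketch only, not any of the UFAL tools listed below), tokenization can be approximated with a single regular expression that keeps word characters together and splits off each punctuation mark:

```python
import re

# Minimal regex-based tokenizer sketch for space-delimited languages.
# Illustrative only; not the Europarl or Treex implementation.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")

def tokenize(text):
    """Return word tokens and individual punctuation marks."""
    return TOKEN_RE.findall(text)

print(tokenize("Hello, world!"))  # ['Hello', ',', 'world', '!']
```

Real tokenizers such as the one below add language-specific rules (abbreviations, clitics, non-breaking prefixes) on top of this basic idea.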
=== Europarl tokenizer ===
  * **description:**
  * **version:**
  * **author:** Philipp Koehn and Josh Schroeder
  * **licence:**
  * **url:** http://
  * **languages:**
  * **efficiency:**
  * **reference:**

  @inproceedings{Koehn:
    author = {Philipp Koehn},
    booktitle = {{Conference Proceedings:
    pages = {79--86},
    title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
    address = {Phuket, Thailand},
    year = {2005}}

  * **contact:**

=== Tokenizers integrated in Treex ===
  * rule-based (regular-expression) tokenizers
  * trainable tokenizer TextSeg

===== Language Identification =====
Martin Majliš'

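The standard technique behind language identifiers is character n-gram profiling: compare the n-gram distribution of the input against per-language profiles learned from training text. A toy sketch of that idea (illustrative only, not the identifier referenced above; the training strings are made-up miniature samples):

```python
from collections import Counter

def profile(text, n=2):
    """Character-bigram frequency profile of a text."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Toy training data; a real identifier is trained on large corpora.
TRAIN = {
    "en": "the quick brown fox jumps over the lazy dog and then runs away",
    "cs": "příliš žluťoučký kůň úpěl ďábelské ódy a pak utekl pryč",
}
PROFILES = {lang: profile(t) for lang, t in TRAIN.items()}

def identify(text):
    """Pick the language whose bigram profile overlaps the input most."""
    p = profile(text)
    return max(PROFILES, key=lambda lang: sum((p & PROFILES[lang]).values()))

print(identify("the dog runs"))   # en
print(identify("žluťoučký kůň"))  # cs
```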
===== Sentence Segmentation =====
=== Segmenters integrated in Treex ===
  * rule-based segmenters
  * TextSeg (trainable)

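The rule-based approach can be sketched in a few lines (a sketch only, not the Treex segmenters): split after sentence-final punctuation when it is followed by whitespace and an uppercase letter.

```python
import re

# Rule-based sentence segmenter sketch; illustrative only.
# Split after . ! or ? when followed by whitespace and a capital letter.
SENT_RE = re.compile(r"(?<=[.!?])\s+(?=[A-Z])")

def segment(text):
    return SENT_RE.split(text)

print(segment("It works. Really! Does it? Yes."))
# ['It works.', 'Really!', 'Does it?', 'Yes.']
```

Such rules break on abbreviations ("Dr. Smith" is wrongly split), which is exactly where a trainable segmenter like TextSeg pays off.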
===== Morphological Segmentation =====

===== Morphological Analysis =====
=== Morphological Analyzers integrated in Treex ===
  * Jan Hajič'
  * toy analyzers for about ten languages (students'

===== Part-of-Speech Tagging =====

=== POS Taggers integrated in Treex ===
  * Featurama
  * Morce
  * MxPost tagger
  * TreeTagger
  * TnT tagger
  * Jan Hajič'
  * a number of toy tagger prototypes (students'

=== Details on Czech Tagging ===
A Guide to Czech Language Tagging at UFAL http://

===== Lemmatization =====
=== Lemmatizers integrated in Treex ===
  * Martin Popel'
  * a number of toy lemmatizers for about ten languages (students'
  * for Czech, lemmatization is traditionally treated as a part of POS disambiguation

===== Analytical Parsing =====
=== Analytical parsers integrated in Treex ===
  * Ryan McDonald'
  * Rudolf Rosa's MST parser
  * MALT parser
  * ZPar
  * Stanford parser

=== Details on Czech parsing ===
A Complete Guide to Czech Language Parsing http://

===== Tectogrammatical Parsing =====
=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===
  * a scenario for rule-based tree transformation
  * Ondřej Dušek'

===== Named Entity Recognition =====
=== NE recognizers integrated in Treex ===
  * Jana Straková'
  * Stanford Named Entity Recognizer for Czech

===== Machine Translation =====

=== MT implemented in Treex ===
  * elaborated English->
  * prototype of Czech->

===== Coreference Resolution =====
=== Coreference resolvers integrated in Treex ===
  * simple rule-based baseline resolvers for Czech and English
  * Michal Novák'
  * Ngụy Giang Linh's trainable (perceptron-based) resolver

===== Spell Checking =====

===== Text Similarity =====

===== Recasing =====

===== Diacritic Reconstruction =====

====== Other tasks ======

  * Word Sense Disambiguation
  * Relationship Extraction
  * Topic Segmentation
  * Information Retrieval
  * Information Extraction
  * Text Summarization
  * Speech Reconstruction
  * Question Answering
  * Sentiment Analysis
