
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

ufal:tasks [2012/01/18 15:38]
ufal
ufal:tasks [2012/01/23 11:15] (current)
ufal
Line 5: Line 5:
  
 === Europarl tokenizer === === Europarl tokenizer ===
-  * **info:** A sample rule-based tokenizer, can use a list of prefixes which are usually followed by a dot but don't break a sentence. Distributed as a part of the Europarl tools.+  * **description:** A sample rule-based tokenizer, can use a list of prefixes which are usually followed by a dot but don't break a sentence. Distributed as a part of the Europarl tools.
   * **version:** v6 (Jan 2012)    * **version:** v6 (Jan 2012) 
   * **author:** Philipp Koehn and Josh Schroeder   * **author:** Philipp Koehn and Josh Schroeder
Line 12: Line 12:
   * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.   * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
   * **efficiency**: NA    * **efficiency**: NA 
 +  * **reference**: 
 +
 +  @inproceedings{Koehn:2005,
 +  author = {Philipp Koehn},
 +  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
 +  pages = {79--86},
 +  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
 +  address = {Phuket, Thailand},
 +  year = {2005}}
 +
   * **contact:**   * **contact:**
 +
 +
 +=== Europarl tokenizer ===
 +| **description:** | A sample rule-based tokenizer, can use a list of prefixes which are usually followed by a dot but don't break a sentence. Distributed as a part of the Europarl tools. |
 +| **version:** | v6 (Jan 2012)  |
 +| **author:** | Philipp Koehn and Josh Schroeder |
 +| **licence:** | free |
 +| **url:** | http://www.statmt.org/europarl/ |
 +| **languages:** | in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV. |
 +| **efficiency**: | NA  |
 +| **reference**: |
 +  @inproceedings{Koehn:2005,
 +  author = {Philipp Koehn},
 +  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
 +  pages = {79--86},
 +  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
 +  address = {Phuket, Thailand},
 +  year = {2005}}
 +|
 +| **contact:** | |
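The description above mentions a list of prefixes that are usually followed by a dot but do not end a sentence. The following is a minimal sketch of that idea in Python, not the Europarl tool itself: the prefix set, the function name `tokenize`, and the punctuation classes are all assumptions made for illustration, and the real tools ship much longer per-language prefix lists.

```python
import re

# Hypothetical, very short prefix list for illustration only.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e", "etc"}

# Either a run of non-punctuation characters (periods stay attached for now)
# or a single punctuation mark that always becomes its own token.
PIECE_RE = re.compile(r'[^\s,;:!?()"]+|[,;:!?()"]')


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split text into tokens, keeping the dot attached to known prefixes."""
    tokens = []
    for word in text.split():
        for piece in PIECE_RE.findall(word):
            if piece.endswith(".") and len(piece) > 1:
                stem = piece[:-1]
                # Keep the dot if the stem is a listed prefix or already
                # contains a dot (e.g. "U.S."); otherwise split it off.
                if stem in prefixes or "." in stem:
                    tokens.append(piece)
                else:
                    tokens.extend([stem, "."])
            else:
                tokens.append(piece)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived, e.g. at noon."))
```

With this sketch, `Dr.` and `e.g.` stay single tokens, while the final dot of `noon.` is split off as a separate token.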
 +
 +=== Tokenizers integrated in Treex ===
 +* rule-based (reg.exp.) tokenizers
 +* trainable tokenizer TextSeg
  
 ===== Language Identification ===== ===== Language Identification =====
 +Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf
  
 ===== Sentence Segmentation ===== ===== Sentence Segmentation =====
 +=== Segmenters integrated in Treex ===
 +* rule-based segmenters
 +* TextSeg (trainable)
  
 ===== Morphological Segmentation ===== ===== Morphological Segmentation =====
  
 ===== Morphological Analysis ===== ===== Morphological Analysis =====
 +=== Morphological Analyzers integrated in Treex ===
 +* Jan Hajič's Czech morphological analyzer
 +* toy analyzers for about ten languages (student homework assignments)
  
 ===== Part-of-Speech Tagging ===== ===== Part-of-Speech Tagging =====
 +
 +=== POS Taggers integrated in Treex ===
 +  * Featurama
 +  * Morce
 +  * MxPost tagger
 +  * Tree tagger
 +  * TnT tagger
 +  * Jan Hajič's tagger
 +  * a number of toy tagger prototypes (students' assignments) for about ten languages
 +
 +=== Details on Czech Tagging ===
 +A Guide to Czech Language Tagging at UFAL  http://ufal.mff.cuni.cz/czech-tagging/
  
 ===== Lemmatization ===== ===== Lemmatization =====
 +=== Lemmatizers integrated in Treex ===
 +* Martin Popel's lemmatizer for English
 +* a number of toy lemmatizers for about ten languages (student homework assignments)
 +* for Czech, lemmatization is traditionally treated as a part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
  
 ===== Analytical Parsing ===== ===== Analytical Parsing =====
 +=== Analytical parsers integrated in Treex ===
 +* Ryan McDonald's MST parser
 +* Rudolf Rosa's MST parser
 +* MALT parser
 +* ZPar
 +* Stanford parser
 +
 +=== Details on Czech parsing ===
 +A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
 +
  
 ===== Tectogrammatical Parsing ===== ===== Tectogrammatical Parsing =====
 +=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===
 +* a scenario for rule-based tree transformation
 +* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
  
 ===== Named Entity Recognition ===== ===== Named Entity Recognition =====
 +=== NE recognizers integrated in Treex ===
 +* Jana Straková's SVM based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
 +* Stanford Named Entity Recognizer for Czech
  
 ===== Machine Translation ===== ===== Machine Translation =====
 +
 +=== MT implemented in Treex ===
 +* elaborated English->Czech tecto-based translation
 +* prototype of Czech->English tecto-based translation
  
 ===== Coreference resolution ===== ===== Coreference resolution =====
 +=== Coreference resolvers integrated in Treex ===
 +* simple rule-based baseline resolvers for Czech and English
 +* Michal Novák's trainable resolvers
 +* Ngụy Giang Linh's trainable (perceptron-based) resolver
  
 ===== Spell Checking ===== ===== Spell Checking =====
