Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

ufal:tasks [2012/01/18 13:41]
ufal
ufal:tasks [2012/01/23 11:15]
ufal
Line 5: Line 5:
  
 === Europarl tokenizer ===
-  * **info:** A sample tokenizer, distributed as a part of the Europarl tools
+  * **description:** A sample rule-based tokenizer, can use a list of prefixes which are usually followed by a dot but don't break a sentence. Distributed as a part of the Europarl tools.
   * **version:** v6 (Jan 2012)
   * **author:** Philipp Koehn and Josh Schroeder
   * **licence:** free
   * **url:** http://www.statmt.org/europarl/
-  * **languages:** applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
+  * **languages:** in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV.
 +  * **efficiency**: NA  
 +  * **reference**:  
 + 
 +  @inproceedings{Koehn:2005, 
 +  author = {Philipp Koehn}, 
 +  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}}, 
 +  pages = {79--86}, 
 +  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}}, 
 +  address = {Phuket, Thailand}, 
 +  year = {2005}} 
   * **contact:**
 +
 +
 +=== Europarl tokenizer ===
 +| **description:** | A sample rule-based tokenizer, can use a list of prefixes which are usually followed by a dot but don't break a sentence. Distributed as a part of the Europarl tools. |
 +| **version:** | v6 (Jan 2012)  |
 +| **author:** | Philipp Koehn and Josh Schroeder |
 +| **licence:** | free |
 +| **url:** | http://www.statmt.org/europarl/ |
 +| **languages:** | in principle applicable to all languages using space-separated words; nonbreaking prefixes available for DE, EL, EN, ES, FR, IT, PT, SV. |
 +| **efficiency**: | NA  |
 +| **reference**: |
 +  @inproceedings{Koehn:2005,
 +  author = {Philipp Koehn},
 +  booktitle = {{Conference Proceedings: the tenth Machine Translation Summit}},
 +  pages = {79--86},
 +  title = {{Europarl: A Parallel Corpus for Statistical Machine Translation}},
 +  address = {Phuket, Thailand},
 +  year = {2005}}
 +|
 +| **contact:** | |
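
The nonbreaking-prefix idea in the description above can be illustrated with a minimal Python sketch. This is not the Europarl tokenizer itself: the prefix set, the punctuation rule, and the function name `tokenize` are assumptions made for illustration only. The real tools read per-language prefix lists and handle many more cases.

```python
import re

# Hypothetical, tiny prefix list for illustration only; the real tools
# use per-language lists of nonbreaking prefixes.
NONBREAKING_PREFIXES = {"Mr", "Mrs", "Dr", "Prof", "e.g", "i.e"}


def tokenize(text, prefixes=NONBREAKING_PREFIXES):
    """Split text into tokens, keeping the dot attached to known prefixes."""
    # Surround common punctuation (everything except the period) with spaces
    # so that it becomes a separate token after whitespace splitting.
    text = re.sub(r'([,;:!?()"])', r" \1 ", text)

    tokens = []
    for word in text.split():
        if word.endswith(".") and len(word) > 1:
            stem = word[:-1]
            if stem in prefixes:
                # Known abbreviation: the dot does not end a sentence.
                tokens.append(word)
            else:
                # Otherwise treat the final dot as a separate token.
                tokens.extend([stem, "."])
        else:
            tokens.append(word)
    return tokens


if __name__ == "__main__":
    print(tokenize("Dr. Smith arrived, e.g. yesterday."))
```

With this sketch, "Dr." and "e.g." stay intact, while the sentence-final dot after "yesterday" is split off as its own token.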
 +
 +=== Tokenizers integrated in Treex ===
 +* rule-based (regular-expression) tokenizers
 +* trainable tokenizer TextSeg
  
 ===== Language Identification =====
 +Martin Majliš's language identifier (covers about 100 languages) http://wiki.ufal.ms.mff.cuni.cz/~majlis/publications/master-thesis.pdf
  
 ===== Sentence Segmentation =====
Line 20: Line 56:
  
 ===== Morphological Analysis =====
 +=== Morphological Analyzers integrated in Treex ===
 +* Jan Hajič's Czech morphological analyzer
 +* toy analyzers for about ten languages (student homework assignments)
  
 ===== Part-of-Speech Tagging =====
 +
 +=== POS Taggers integrated in Treex ===
 +  * Featurama
 +  * Morce
 +  * MxPost tagger
 +  * Tree tagger
 +  * TnT tagger
 +  * Jan Hajič's tagger
 +  * a number of toy tagger prototypes (students' assignments) for about ten languages
 +
 +=== Details on Czech Tagging ===
 +A Guide to Czech Language Tagging at UFAL  http://ufal.mff.cuni.cz/czech-tagging/
  
 ===== Lemmatization =====
 +=== Lemmatizers integrated in Treex ===
 +* Martin Popel's lemmatizer for English
 +* a number of toy lemmatizers for about ten languages (student homework assignments)
 +* for Czech, lemmatization is traditionally treated as part of POS disambiguation, so almost all Czech taggers are capable of lemmatization
  
 ===== Analytical Parsing =====
 +=== Analytical parsers integrated in Treex ===
 +* Ryan McDonald's MST parser
 +* Rudolf Rosa's MST parser
 +* MALT parser
 +* ZPar
 +* Stanford parser
 +
 +=== Details on Czech parsing ===
 +A Complete Guide to Czech Language Parsing http://ufal.mff.cuni.cz/czech-parsing/
 +
  
 ===== Tectogrammatical Parsing =====
 +=== Conversion of analytical trees to tectogrammatical trees integrated in Treex ===
 +* a scenario for rule-based tree transformation
 +* Ondřej Dušek's tools for functor assignment trained on PDT and PCEDT
  
 ===== Named Entity Recognition =====
 +=== NE recognizers integrated in Treex ===
 +* Jana Straková's SVM based recognizer for Czech http://www.aclweb.org/anthology/W/W09/W09-3538.pdf
 +* Stanford Named Entity Recognizer for Czech
  
 ===== Machine Translation =====
 +
 +=== MT implemented in Treex ===
 +* elaborate English->Czech tecto-based translation
 +* prototype of Czech->English tecto-based translation
  
 ===== Coreference resolution =====
 +=== Coreference resolvers integrated in Treex ===
 +* simple rule-based baseline resolvers for Czech and English
 +* Michal Novák's trainable resolvers
 +* Ngụy Giang Linh's trainable (perceptron-based) resolver
  
 ===== Spell Checking =====
Line 41: Line 120:
 ===== Recasing =====
  
-===== Rekonstrukce diakritiky =====
+===== Diacritic Reconstruction =====
 + 
 +====== Other tasks ======
  
 +Word Sense Disambiguation
 +Relationship Extraction
 +Topic Segmentation
 +Information Retrieval
 +Information Extraction
 +Text Summarization
 +Speech Reconstruction
 +Question Answering
 +Sentiment Analysis
  
