
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ro [2012/01/12 12:56]
zeman created
user:zeman:treebanks:ro [2012/01/12 17:11]
zeman References, domain and size.
Line 9: Line 9:
 ==== Obtaining and License ==== ==== Obtaining and License ====
  
-The syntactically annotated Romanian texts are available at http://www.phobos.ro/roric/texts/xml/. This is a bash script that will download the corpus:+The syntactically annotated Romanian texts are available at http://www.phobos.ro/roric/texts/xml/. This is a tcsh script that will download the corpus:
  
 <code bash>#!/bin/tcsh -f <code bash>#!/bin/tcsh -f
Line 26: Line 26:
  
 ==== References ==== ==== References ====
- 
-http://www.phobos.ro/roric/ 
-http://www.phobos.ro/roric/Ro/dg.html 
-http://www.phobos.ro/roric/Ro/DGA/dga.html 
-http://www.phobos.ro/roric/texts/indexro.html 
-http://www.phobos.ro/roric/texts/xml/ 
  
   * Website   * Website
-    * http://www.linguateca.pt/Floresta/principal.html (Floresta) +    * http://www.phobos.ro/roric/ 
-    * http://ilk.uvt.nl/conll/free_data.html (CoNLL 2006)+    * http://www.phobos.ro/roric/Ro/dg.html 
 +    * http://www.phobos.ro/roric/Ro/DGA/dga.html 
 +    * http://www.phobos.ro/roric/texts/indexro.html 
 +    * http://www.phobos.ro/roric/texts/xml/
   * Data   * Data
     * //no separate citation//     * //no separate citation//
   * Principal publications   * Principal publications
-    * Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos: [[http://www.linguateca.pt/Diana/download/AfonsoetalAPL2001.rtf|Floresta sintá(c)tica: um treebank para o português]]. In: Encontro da associação portuguesa de linguística, XVII, Lisboa, 2001. +    * Florentina Hristea, Marius Popescu: [[http://www.phobos.ro/roric/papers/dgro.doc|Gramatici de dependenţă şi gramatici WG]], pp. 233-246.
-    * Cláudia Freitas, Paulo Rocha, Eckhard Bick: [[http://www.linguateca.pt/documentos/FreitasetAl2008Calidoscopio.pdf|Um mundo novo na Floresta Sintá(c)tica - o treebank para Português]]. Calidoscópio - Revista de Pós Graduação em Lingüística Aplicada da Unisinos, Rio Grande do Sul 6.3 (2008), pp. 142-148.+
   * Documentation   * Documentation
-    * [[http://www.linguateca.pt/Floresta/documentacao.html|Documentation]] 
-    * Cláudia Freitas, Susana Afonso: [[http://www.linguateca.pt/Floresta/BibliaFlorestal/|Bíblia Florestal: Um manual lingüístico da Floresta Sintá(c)tica]], 2008 
-    * [[http://www.linguateca.pt/Floresta/BibliaFlorestal/anexo1.html|Glossário de etiquetas florestais]] (glossary of tags) 
-    * [[http://www.linguateca.pt/Floresta/BibliaFlorestal/anexo4.html|Statistics of morphosyntactic tags]] 
  
 ==== Domain ==== ==== Domain ====
  
-Newspaper. Bosque contains 9368 sentences mostly from two primary sources, the CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian journal Folha de São Paulo, year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the Portuguese (European) journal Público, April 2000).+Newspaper.
  
 ==== Size ==== ==== Size ====
  
-The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test).+The corpus contains 36150 tokens in 4042 clauses, yielding 8.94 tokens per clause on average. There is no official training-test data split. We use the files ''t1.xml'' – ''t10.xml'' (2640 tokens / 266 clauses) for testing and the rest (33510 tokens / 3776 clauses) for training in our HamleDT experiments.
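The figures given for the Romanian corpus in the new revision can be checked with a few lines of arithmetic. The following is a minimal Python sketch and not part of the wiki page itself: the constant and function names are hypothetical, and only the numbers are taken from the text above.

```python
# Figures quoted in the Size paragraph of the new revision.
TOTAL_TOKENS, TOTAL_CLAUSES = 36150, 4042
TEST_TOKENS, TEST_CLAUSES = 2640, 266      # files t1.xml to t10.xml
TRAIN_TOKENS, TRAIN_CLAUSES = 33510, 3776  # all remaining files


def tokens_per_clause(tokens, clauses):
    """Average clause length, rounded to two decimals as in the text."""
    return round(tokens / clauses, 2)


def split_is_consistent():
    """True if the test and training parts add up to the whole corpus."""
    return (TEST_TOKENS + TRAIN_TOKENS == TOTAL_TOKENS
            and TEST_CLAUSES + TRAIN_CLAUSES == TOTAL_CLAUSES)


if __name__ == "__main__":
    print(tokens_per_clause(TOTAL_TOKENS, TOTAL_CLAUSES))  # 8.94
    print(split_is_consistent())                           # True
```

Both checks agree with the text: the average is 8.94 tokens per clause, and the test and training parts sum to the full corpus.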
  
 ==== Inside ==== ==== Inside ====
