Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ro [2012/01/12 12:59]
zeman References.
user:zeman:treebanks:ro [2012/01/12 17:11]
zeman References, domain and size.
Line 36: Line 36:
     * //no separate citation//
   * Principal publications
-    * Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos: [[http://www.linguateca.pt/Diana/download/AfonsoetalAPL2001.rtf|Floresta sintá(c)tica: um treebank para o português]]. In: Encontro da associação portuguesa de linguística, XVII, Lisboa, 2001.
-    * Cláudia Freitas, Paulo Rocha, Eckhard Bick: [[http://www.linguateca.pt/documentos/FreitasetAl2008Calidoscopio.pdf|Um mundo novo na Floresta Sintá(c)tica - o treebank para Português]]. Calidoscópio - Revista de Pós Graduação em Lingüística Aplicada da Unisinos, Rio Grande do Sul 6.3 (2008), pp. 142-148.
+    * Florentina Hristea, Marius Popescu: [[http://www.phobos.ro/roric/papers/dgro.doc|Gramatici de dependenţă şi gramatici WG]], pp. 233-246.
   * Documentation
-    * [[http://www.linguateca.pt/Floresta/documentacao.html|Documentation]]
-    * Cláudia Freitas, Susana Afonso: [[http://www.linguateca.pt/Floresta/BibliaFlorestal/|Bíblia Florestal: Um manual lingüístico da Floresta Sintá(c)tica]], 2008
-    * [[http://www.linguateca.pt/Floresta/BibliaFlorestal/anexo1.html|Glossário de etiquetas florestais]] (glossary of tags)
-    * [[http://www.linguateca.pt/Floresta/BibliaFlorestal/anexo4.html|Statistics of morphosyntactic tags]]
  
 ==== Domain ====
  
-Newspaper. Bosque contains 9368 sentences mostly from two primary sources, the CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian journal Folha de São Paulo, year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the Portuguese (European) journal Público, April 2000).
+Newspaper.
  
 ==== Size ====
  
-The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test).
+The corpus contains 36150 tokens in 4042 clauses, yielding 8.94 tokens per clause on average. There is no official training-test data split. We use the files ''t1.xml'' – ''t10.xml'' (2640 tokens / 266 clauses) for testing and the rest (33510 tokens / 3776 clauses) for training in our HamleDT experiments.
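
The figures in the new Size paragraph can be cross-checked, and the described split reproduced, with a short sketch like the one below. It is only an illustration: the helper names `is_test_file` and `split_files` are hypothetical, and the file names other than ''t1.xml'' to ''t10.xml'' are assumed for the example, since the page does not say how the remaining files are named.

```python
import re

# Figures quoted in the new revision of the Size paragraph.
TOTAL_TOKENS, TOTAL_CLAUSES = 36150, 4042
TEST_TOKENS, TEST_CLAUSES = 2640, 266
TRAIN_TOKENS, TRAIN_CLAUSES = 33510, 3776


def is_test_file(name: str) -> bool:
    """Return True for t1.xml ... t10.xml; all other files go to training."""
    match = re.fullmatch(r"t(\d+)\.xml", name)
    return match is not None and 1 <= int(match.group(1)) <= 10


def split_files(names):
    """Partition file names into (test, train) lists, preserving order."""
    test = [n for n in names if is_test_file(n)]
    train = [n for n in names if not is_test_file(n)]
    return test, train


# The test and training parts add up to the whole corpus.
assert TEST_TOKENS + TRAIN_TOKENS == TOTAL_TOKENS
assert TEST_CLAUSES + TRAIN_CLAUSES == TOTAL_CLAUSES

# Average clause length, rounded to two decimals as in the text.
average = round(TOTAL_TOKENS / TOTAL_CLAUSES, 2)

# Hypothetical file names: only t1.xml to t10.xml are named on the page.
example = ["t1.xml", "t10.xml", "t11.xml", "t42.xml"]
test_files, train_files = split_files(example)
```

Matching on the number parsed from the file name, rather than on a fixed list of ten strings, keeps the rule stated once; whether the actual corpus uses exactly this naming beyond the first ten files is not confirmed by the page.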
  
 ==== Inside ====