Differences

This shows you the differences between two versions of the page.

--- user:zeman:treebanks:ro [2012/01/12 12:59]
zeman References.
+++ user:zeman:treebanks:ro [2012/01/12 17:11]
zeman References, domain and size.
@@ Line 36: / Line 36: @@
     * //no separate citation//
   * Principal publications
-    * Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos: [[http://www.linguateca.pt/Diana/download/AfonsoetalAPL2001.rtf|Floresta sintá(c)tica: um treebank para o português]]. In: Encontro da associação portuguesa de linguística, XVII, Lisboa, 2001.
+    * Florentina Hristea, Marius Popescu: [[http://www.phobos.ro/roric/papers/dgro.doc|Gramatici de dependenţă şi gramatici WG]], pp. 233-246.
-    * Cláudia Freitas, Paulo Rocha, Eckhard Bick: [[http://www.linguateca.pt/documentos/FreitasetAl2008Calidoscopio.pdf|Um mundo novo na Floresta Sintá(c)tica - o treebank para Português]]. Calidoscópio - Revista de Pós Graduação em Lingüística Aplicada da Unisinos, Rio Grande do Sul 6.3 (2008), pp. 142-148.
   * Documentation
-    * [[http://www.linguateca.pt/Floresta/documentacao.html|Documentation]]
-    * Cláudia Freitas, Susana Afonso: [[http://www.linguateca.pt/Floresta/BibliaFlorestal/|Bíblia Florestal: Um manual lingüístico da Floresta Sintá(c)tica]], 2008
-    * [[http://www.linguateca.pt/Floresta/BibliaFlorestal/anexo1.html|Glossário de etiquetas florestais]] (glossary of tags)
-    * [[http://www.linguateca.pt/Floresta/BibliaFlorestal/anexo4.html|Statistics of morphosyntactic tags]]
 ==== Domain ====
-Newspaper. Bosque contains 9368 sentences mostly from two primary sources, the CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian journal Folha de São Paulo, year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the Portuguese (European) journal Público, April 2000).
+Newspaper.
 ==== Size ====
-The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test).
+The corpus contains 36150 tokens in 4042 clauses, yielding 8.94 tokens per clause on average. There is no official training-test data split. In our HamleDT experiments, we use the files ''t1.xml'' – ''t10.xml'' (2640 tokens / 266 clauses) for testing and the rest (33510 tokens / 3776 clauses) for training.
 ==== Inside ====
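
The figures in the added Size paragraph of this hunk can be cross-checked with a few lines of Python. This is only a sketch: the counts are copied from the text, while the helper names `is_test_file` and `tokens_per_clause` are hypothetical, and the rule that every file other than `t1.xml` to `t10.xml` belongs to training is an assumption drawn from the phrase "the rest".

```python
# Counts quoted in the Size paragraph (not recomputed from the corpus files).
TOTAL_TOKENS, TOTAL_CLAUSES = 36150, 4042
TEST_TOKENS, TEST_CLAUSES = 2640, 266
TRAIN_TOKENS, TRAIN_CLAUSES = 33510, 3776

# Assumption: the test set is exactly the ten files t1.xml .. t10.xml.
TEST_FILES = {f"t{i}.xml" for i in range(1, 11)}


def is_test_file(name: str) -> bool:
    """Return True for t1.xml .. t10.xml; any other file counts as training."""
    return name in TEST_FILES


def tokens_per_clause(tokens: int, clauses: int) -> float:
    """Average clause length, rounded to two decimals as in the text."""
    return round(tokens / clauses, 2)


# The two parts of the split should add up to the whole corpus.
split_is_consistent = (
    TEST_TOKENS + TRAIN_TOKENS == TOTAL_TOKENS
    and TEST_CLAUSES + TRAIN_CLAUSES == TOTAL_CLAUSES
)

average = tokens_per_clause(TOTAL_TOKENS, TOTAL_CLAUSES)
```

With the quoted counts, the test and training portions sum to the corpus totals and the average matches the 8.94 tokens per clause stated in the paragraph.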

Institute of Formal and Applied Linguistics Wiki