Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:eu [2011/11/29 09:24]
zeman Documentation of part of speech tags.
user:zeman:treebanks:eu [2011/11/29 09:38]
zeman License.
Line 6: Line 6:
  
   * CoNLL 2007
-  * Extended version (2011?)
+  * BDT-II (obtained per e-mail in 2011)
  
 ==== Obtaining and License ====
  
 There does not seem to be any regular distribution channel for the Basque Dependency Treebank. The CoNLL 2007 version had a restricted license for the duration of the shared task only. Republication of the CoNLL version in LDC is planned but it has not happened yet. In the meantime, one can ask Koldo Gojenola (koldo (dot) gojenola (at) ehu (dot) es) about the availability of the corpus.
+
+Informally agreed upon terms:
+  * no redistribution
+  * cite the principal publication (see below) in publications
  
 BDT was created by members of the [[http://ixa.si.ehu.es/|IXA Group]] (IXA taldea), University of the Basque Country (Euskal Herriko Unibertsitatea), 649 Posta kutxa, E-20080 Donostia, Spain.
Line 28: Line 32:
 ==== Domain ====
  
-Mixed (“GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel.”)
+Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).
  
 ==== Size ====
