
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:eu [2011/11/28 23:37]
zeman created
user:zeman:treebanks:eu [2011/11/29 09:34]
zeman Domain.
Line 23: Line 23:
     * Itziar Aduriz, María Jesús Aranzabe, José María Arriola, Aitziber Atutxa, Arantza Díaz de Ilarraza, Aitzpea Garmendia, Maite Oronoz: [[http://w3.msi.vxu.se/~rics/TLT2003/doc/aduriz_et_al.pdf|Construction of a Basque Dependency Treebank]] In: Proceedings of The Second Workshop on Treebanks and Linguistic Theories (TLT 2003), pp. 149-160, Växjö, Sweden, 2003.
   * Documentation
-    * Description of tags and feature values is provided in the ''doc/README'' file in the CoNLL 2007 data distribution.
+    * Description of tags and feature values is hard to find; the ''doc/README'' file in the CoNLL 2007 data distribution is not very informative. See below for information obtained by e-mail.
 +    * María Jesús Aranzabe, José Mari Arriola, Aitziber Atutxa, Irene Balza, Larraitz Uria: [[http://ixa.si.ehu.es/Ixa/Argitalpenak/Barne_txostenak/1068549887/publikoak/guia.pdf|Guía para la anotación sintáctica manual de Eus3LB (corpus del euskera anotado a nivel sintáctico, semántico y pragmático)]]. UPV/EHU/LSI/TR 13-2003, Donostia, Spain, 2003.
  
 ==== Domain ====
  
-Mixed (“GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel.”)
+Newswire + unknown (“25000 word forms from EPEC (Aduriz et al., 2003) and 25000 word forms coming from newspapers that can be considered equivalent to the other corpora in the project [3LB, i.e. Catalan and Spanish]”; “EPEC, a corpus of written Basque tagged at morphological and syntactic levels for the automatic processing”).
  
 ==== Size ====
Line 36: Line 37:
  
 The syntactic annotation style and the tagset for dependency relations (analytical functions) in GDT have been modeled after the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Prague Dependency Treebank]].
 +
 +Part-of-speech tag description (obtained by e-mail from Koldo Gojenola, thanks!):
 +
 +  * IZE = noun
 +    * ARR = common
 +    * IZB = proper name
 +    * LIB = place name
 +    * ZKI = number
 +  * ADJ = adjective
 +    * ARR = common
 +    * GAL = question
 +  * ADI = verb
 +    * SIN = simple
 +    * ADK = composed
 +    * ADP = periphrastic
 +    * FAK = factitive
 +  * ADB = adverb
 +    * ARR = common
 +    * GAL = question
 +  * DET = determiner
 +    * ERKARR = demonstrative common
 +    * ERKIND = demonstrative emphatic
 +    * NOLARR = indefinite common
 +    * NOLGAL = indefinite question
 +    * ZNB = number
 +    * DZH = definite
 +    * BAN = distributive
 +    * ORD = ordinal
 +    * DZG = indefinite
 +    * ORO = general
 +  * IOR = pronoun
 +    * PERARR = personal common
 +    * PERIND = personal emphatic
 +    * IZGMGB = indefinite
 +    * IZGGAL = question
 +    * BIH = ???
 +    * ELK = ???
 +  * LOT = link
 +    * LOK = connector
 +    * JNT = conjunction
 +  * PRT = particle
 +  * ITJ = interjection
 +  * BST = other
 +  * ADL = auxiliary verb
 +  * ADT = synthetic verb
 +  * SIG = acronym
 +  * SNB = symbol
 +  * LAB = abbreviation
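
The tag list above can be kept as a small lookup table. The following is only a sketch: the names ''POS_TAGS'' and ''describe'' are made up for illustration, and how the main tag and the subtag are actually laid out in the CoNLL 2007 columns is not stated here, so the function simply takes them as two separate strings. Subtags are stored per main tag because ''ARR'' and ''GAL'' occur under several of them; the two subtags marked ''???'' are kept as unknown.

```python
# Gloss table transcribed from the tag list above.
# Each entry: main tag -> (gloss, {subtag -> gloss}).
# None marks subtags whose meaning is unknown ("???" in the list).
POS_TAGS = {
    "IZE": ("noun", {"ARR": "common", "IZB": "proper name",
                     "LIB": "place name", "ZKI": "number"}),
    "ADJ": ("adjective", {"ARR": "common", "GAL": "question"}),
    "ADI": ("verb", {"SIN": "simple", "ADK": "composed",
                     "ADP": "periphrastic", "FAK": "factitive"}),
    "ADB": ("adverb", {"ARR": "common", "GAL": "question"}),
    "DET": ("determiner", {
        "ERKARR": "demonstrative common", "ERKIND": "demonstrative emphatic",
        "NOLARR": "indefinite common", "NOLGAL": "indefinite question",
        "ZNB": "number", "DZH": "definite", "BAN": "distributive",
        "ORD": "ordinal", "DZG": "indefinite", "ORO": "general"}),
    "IOR": ("pronoun", {
        "PERARR": "personal common", "PERIND": "personal emphatic",
        "IZGMGB": "indefinite", "IZGGAL": "question",
        "BIH": None, "ELK": None}),
    "LOT": ("link", {"LOK": "connector", "JNT": "conjunction"}),
    "PRT": ("particle", {}),
    "ITJ": ("interjection", {}),
    "BST": ("other", {}),
    "ADL": ("auxiliary verb", {}),
    "ADT": ("synthetic verb", {}),
    "SIG": ("acronym", {}),
    "SNB": ("symbol", {}),
    "LAB": ("abbreviation", {}),
}


def describe(tag, subtag=None):
    """Return (main gloss, subtag gloss or None), or None for an unlisted main tag."""
    if tag not in POS_TAGS:
        return None
    gloss, subtags = POS_TAGS[tag]
    # Unlisted and "???" subtags both come back as None.
    return gloss, subtags.get(subtag)


if __name__ == "__main__":
    print(describe("IZE", "LIB"))
    print(describe("IOR", "BIH"))
```

Scoping the subtags by main tag avoids guessing what a bare ''ARR'' refers to.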
 +
 +Main features:
 +
 +  * KAS = case (ERG = ergative, ABS = absolutive, DAT = dative...)
 +  * ASP = aspect
 +  * ERL = relation (relative sentence, completive sentence, indirect question...)
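
The feature names above can be expanded in the same way. This is a hedged sketch: the function name ''gloss_features'' is hypothetical, and the serialization it expects (''NAME:VALUE'' pairs joined by ''|'', with ''_'' for no features) is an assumption about the data, not something documented on this page, so both separators are parameters. Only the three case values listed above are glossed; any other name or value is passed through unchanged.

```python
# Feature names and case values transcribed from the list above.
FEATURES = {"KAS": "case", "ASP": "aspect", "ERL": "relation"}
CASE_VALUES = {"ERG": "ergative", "ABS": "absolutive", "DAT": "dative"}


def gloss_features(feats, pair_sep="|", kv_sep=":"):
    """Map a feature string to {readable name: value}.

    ASSUMPTION: features look like 'KAS:ERG|ASP:...'; adjust the
    separators if the actual data uses a different layout.
    """
    result = {}
    if feats in ("", "_"):
        return result
    for item in feats.split(pair_sep):
        name, sep, value = item.partition(kv_sep)
        if not sep:
            # Not a NAME:VALUE pair; skip rather than guess its meaning.
            continue
        if name == "KAS":
            value = CASE_VALUES.get(value, value)
        result[FEATURES.get(name, name)] = value
    return result


if __name__ == "__main__":
    # "XYZ" is a placeholder value, not a real aspect tag.
    print(gloss_features("KAS:ERG|ASP:XYZ"))
```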
  
 ==== Sample ====
