Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks [2011/11/19 13:04]
zeman Greek documentation.
user:zeman:treebanks [2011/11/20 18:53]
zeman Somehow it no longer fits in here.
Line 1: Line 1:
 ====== Treebanks for Various Languages ====== ====== Treebanks for Various Languages ======
 +
 +  * [[user:zeman:treebanks:ar|Arabic (ar)]]
 +  * [[user:zeman:treebanks:bg|Bulgarian (bg)]]
 +  * [[user:zeman:treebanks:bn|Bengali (bn)]]
 +  * [[user:zeman:treebanks:ca|Catalan (ca)]]
 +  * [[user:zeman:treebanks:cs|Czech (cs)]]
 +  * [[user:zeman:treebanks:da|Danish (da)]]
 +  * [[user:zeman:treebanks:de|German (de)]]
 +  * [[user:zeman:treebanks:el|Greek (el)]]
 +  * [[user:zeman:treebanks:en|English (en)]]
  
 ===== Arabic (ar) ===== ===== Arabic (ar) =====
Line 1590: Line 1600:
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset. +The syntactic annotation style and the tagset for dependency relations (analytical functions) in GDT have been modeled after the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Prague Dependency Treebank]].
- +
-The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually. +
- +
-The guidelines for syntactic annotation are documented in the other [[http://www.bultreebank.org/TechRep/BTB-TR05.pdf|technical report]]. The CoNLL distribution contains the BulTreeBankReadMe.html file with a brief description of the syntactic tags (dependency relation labels).+
  
 ==== Sample ==== ==== Sample ====
  
-The first three sentences of the CoNLL 2006 training data:+The first sentence of the CoNLL 2007 training data:
  
-| 1 | Глава Nc | _ | ROOT ROOT +| 1 | PUNCT PUNCT | _ | 10 AuxG 
-| 2 | трета Mo gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 2 | Τα ο At AtDf Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Nm Atr 
-| |||||||||| +αντισώματα αντίσωμα No NoCm Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Nm Sb | _ | _ 
-НАРОДНО | _ | An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +IgG | IgG | Rg | RgFwOr | _ | Atr _ | _ | 
-СЪБРАНИЕ | _ | Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT +| 5 | είναι | είμαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 10 | Obj_Co | _ | _ 
-| |||||||||| +σαν | σαν | Ad | Ad | Ba | 5 | Adv | _ | _ | 
-Народното | _ | An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d mod mod +| 7 | μακροπρόθεσμη | μακροπρόθεσμος | Aj Aj Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm Atr _ | _ 
-събрание | _ | Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i subj subj +μνήμη μνήμη No NoCm Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm Adv | _ | _ 
-осъществява | _ | Vpi trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s ROOT ROOT +, | , | PUNCT | PUNCT | _ | 10 AuxX _ | _ | 
-законодателната | _ | Af gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d mod mod +| 10 | ενώ | ενώ | Cj | CjCo | _ | 26 | Coord | _ | _ | 
-власт | _ | Nc | _ | obj obj +| 11 | το | ο | At | AtDf | Ne<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm 12 Atr 
-и | _ | Cp | _ | conj conj +12 IgA | IgA | Rg | RgFwOr | _ | 15 | Sb | _ | _ | 
-упражнява Vpi trans=t<nowiki>|</nowiki>mood=i<nowiki>|</nowiki>tense=r<nowiki>|</nowiki>pers=3<nowiki>|</nowiki>num=s conjarg conjarg +| 13 | πιστεύεται | πιστεύεται | Vb VbMn Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 10 | Obj_Co | _ | _ 
-парламентарен | _ | Am gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +14 ότι | ότι | Cj | CjSb | _ | 13 | AuxC | _ | _ | 
-контрол | _ | Nc gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i obj obj +| 15 | είναι | είμαι | Vb VbMn Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx 14 | Sb | _ | _ 
-10 | . | Punct Punct | _ | punct punct |+16 ένας | ένας | At | AtId | Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 18 | Atr | _ | | 
 +| 17 | συγκεκριμένος | συγκεκριμένος | Aj | Aj Ba<nowiki>|</nowiki>Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm 18 Atr _ | _ 
 +18 | δείκτης | δείκτης | No | NoCm | Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 15 Pnom | _ | | 
 +| 19 | για | για | AsPp | AsPpSp | _ | 18 AuxP 
 +20 πρόσφατες | πρόσφατος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 21 | Atr_Co | _ | | 
 +| 21 | ή | ή | Cj | CjCo | _ | 23 Coord 
 +22 χρόνιες χρόνιος Aj Aj Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 21 | Atr_Co | _ | _ | 
 +| 23 | λοιμώξεις | λοίμωξη | No | NoCm | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac 19 Atr _ | _ 
 +24 | " | " | PUNCT PUNCT | _ | 10 | AuxG | _ | _ | 
 +| 25 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | 
 +| 26 | εξηγεί | εξηγώ | Vb VbMn Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Av<nowiki>|</nowiki>Xx | 0 | Pred | _ | _ 
 +27 η | ο | At | AtDf | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 28 | Atr | _ | | 
 +28 | Δρ | Δρ | Rg | RgFwTr | _ | 26 | Sb | _ | _ | 
 +| 29 | Αρκάρι | Αρκάρι | No | NoCm | Ne<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm 28 Atr 
 +30 | . | PUNCT PUNCT | _ | AuxK |
  
-The first three sentences of the CoNLL 2006 test data:+The first sentence of the CoNLL 2007 test data:
  
-| 1 | Единственото An gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=d | 2 | mod mod +| 1 | Η ο At AtDf Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 2 | Atr 
-| 2 | решение Nc gen=n<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT +| 2 | Σίφνος Σίφνος No NoPr Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm Sb 
-| |||||||||| +φημίζεται φημίζομαι Vb VbMn Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 0 | Pred 
-| 1 | Ерик | _ | N | Np | gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i | 0 | ROOT ROOT +και | και | Cj | CjCo | _ | 5 | AuxY | _ | _ | 
-Франк | _ | Np gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 5 | για | για | AsPp | AsPpSp | _ | 3 | AuxP | _ | _ | 
-Ръсел Hm gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +| 6 | τα | ο | At AtDf Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac Atr 
-| |||||||||| +καταγάλανα καταγάλανος Aj Aj Ba<nowiki>|</nowiki>Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac Atr _ | _ 
-Пълен Am gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod +νερά νερό No NoCm Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac Obj | _ | _ 
-мрак Nc gen=m<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i ROOT ROOT +των ο At AtDf Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge 11 Atr 
-и Cp conj conj +10 πανέμορφων πανέμορφος Aj Aj Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge 11 Atr _ | _ 
-пълна Af gen=f<nowiki>|</nowiki>num=s<nowiki>|</nowiki>def=i mod mod | +11 ακτών ακτή No NoCm Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge Atr | _ | _ 
-| 5 | самота | _ | N | Nc | _ | 2 | conjarg | 2 | conjarg +12 της μου Pn PnPo Fe<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Ge<nowiki>|</nowiki>Xx 11 Atr | _ | _ | 
-| . | Punct Punct | _ | punct punct |+13 | . | PUNCT PUNCT | _ | AuxK |
  
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in BTB are rare. Only 747 of the 196,151 tokens in the CoNLL 2006 version are attached nonprojectively (0.38%).+Nonprojectivities in GDT are not frequent. Only 823 of the 70223 tokens in the CoNLL 2007 version are attached nonprojectively (1.17%).
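The notion of a nonprojectively attached token can be illustrated with a minimal sketch. It assumes one common definition (a token is attached nonprojectively if some token lying between it and its head is not dominated by that head) and a well-formed tree; the counts quoted above may have been obtained with a different tool or definition. The function names and the ''heads'' list layout are hypothetical.

```python
def is_descendant(heads, node, ancestor):
    """Return True if `ancestor` dominates `node`.

    heads[i] is the head of token i; index 0 is the artificial root.
    Assumes the structure is a well-formed tree (no cycles).
    """
    while node != 0:
        node = heads[node]
        if node == ancestor:
            return True
    return False


def nonprojective_tokens(heads):
    """Return the tokens whose arc to their head is nonprojective.

    heads[0] is an unused placeholder for the artificial root.
    """
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if any token strictly between the
        # head and the dependent is not dominated by the head.
        if any(not is_descendant(heads, k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Hypothetical 4-token tree: token 1 hangs on token 3 across token 2,
# which is attached to the root, so the arc 3 -> 1 is nonprojective.
print(nonprojective_tokens([0, 3, 0, 2, 2]))

# Shares quoted in the text, recomputed from the raw counts.
print(round(100 * 747 / 196151, 2), round(100 * 823 / 70223, 2))
```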
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Bulgarian: +The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Greek:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^ ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 87.57 | 92.04 | +| Nakagawa | 76.31 | 84.08 |
-| Malt (Nivre et al.) | 87.41 | 91.72 | +| Keith Hall et al. | 74.21 | 82.04 |
-| Nara (Yuchang Cheng) | 86.34 | 91.30 | +| Carreras | 73.56 | 81.37 |
 +| Malt (Nilsson et al.) | 74.65 | 81.22 |
 +| Titov et al. | 73.52 | 81.20 |
 +| Chen | 74.42 | 81.16 |
 +| Duan | 74.29 | 80.77 |
 +| Attardi et al. | 73.92 | 80.75 |
 +| Malt (J. Hall et al.) | 74.21 | 80.66 |
 + 
 +The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. 
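The LAS and UAS columns in the table above are labeled and unlabeled attachment scores. The following is a minimal sketch of how such scores can be computed over all tokens, punctuation included, as in the CoNLL 2007 evaluation. It is not the official evaluation script; the function name and the ''(head, deprel)'' pair representation are assumptions made for illustration, and the example analysis is invented.

```python
def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent.

    gold, predicted: lists of (head, deprel) pairs, one per token,
    in the same token order. Every token counts, punctuation included.
    """
    if len(gold) != len(predicted) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    total = len(gold)
    # UAS: the head matches. LAS: both head and relation label match.
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * las_hits / total, 100.0 * uas_hits / total


# Invented 4-token example: one wrong label, one wrong head.
gold = [(2, "Atr"), (3, "Sb"), (0, "Pred"), (3, "AuxK")]
pred = [(2, "Atr"), (3, "Obj"), (0, "Pred"), (2, "AuxK")]
las, uas = attachment_scores(gold, pred)
print(las, uas)
```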
 + 
 +===== English (en) ===== 
 + 
 +[[http://www.cis.upenn.edu/~treebank/|Penn Treebank]] 
 + 
 +==== Versions ==== 
 + 
 +  * Penn Treebank 2 (1995) 
 +  * Penn Treebank 3 (1999) 
 +  * CoNLL 2007 
 +  * CoNLL 2008 
 +  * CoNLL 2009 
 + 
 +==== Obtaining and License ==== 
 + 
 +The original Penn Treebank is distributed by the LDC under the catalogue number [[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42|LDC99T42]]. It is free for 1999 LDC members; the price for non-members is unknown (contact the LDC). The [[http://www.ldc.upenn.edu/Catalog/nonmem_agree/generic.license.html|license]] in short: 
 + 
 +  * non-commercial education and research usage 
 +  * no redistribution 
 +  * citation in publications not explicitly required but it is common decency 
 + 
 +The CoNLL 2007, 2008 and 2009 versions are also licensed by the LDC, and LDC members can keep them after the shared task. Those who have not participated in the shared task may inquire at the LDC about the availability of the datasets. Their republication by the LDC is planned but has not happened yet. 
 + 
 +The Penn Treebank was created by members of the [[http://www.cis.upenn.edu/|Department of Computer and Information Science]] (CIS), School of Engineering, University of Pennsylvania, Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104-6309, USA. The constituents-to-dependencies CoNLL 2007 conversion of the treebank was prepared by Ryan McDonald. 
 + 
 +==== References ==== 
 + 
 +  * Website 
 +    * http://www.cis.upenn.edu/~treebank/ 
 +  * Data 
 +    * Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor: //Treebank-3// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42|LDC99T42]]). Linguistic Data Consortium, Philadelphia, USA, 2001. ISBN 1-58563-163-9. 
 +  * Principal publications 
 +    * Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz: Building a large annotated corpus of English: the Penn Treebank. //Computational Linguistics,// 19(2):313-330. 1993. 
 +  * Documentation 
 +    * [[http://www.cis.upenn.edu/~treebank/tokenization.html|Tokenization]] 
 +    * Beatrice Santorini: [[ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz|Part-of-Speech Tagging Guidelines for the Penn Treebank Project]], 3rd Revision, Philadelphia, USA, 1990. 
 +    * Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre: [[ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz|Bracketing Guidelines for Treebank II Style, Penn Treebank Project]], Philadelphia, USA, 1995. 
 +    * Robert MacIntyre: [[ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2|NP Heads and Base NPs]] (Treebank FAQ) 
 +    * Richard Johansson, Pierre Nugues: [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|Extended constituent-to-dependency conversion for English]]. In: Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA), pp. 105-112, Tartu, Estonia, 2007. 
 + 
 +==== Domain ==== 
 + 
 +Financial news from the Wall Street Journal (1989). The constituent-based Treebank-3 also contains parsed versions of ATIS-3 and of the Brown Corpus. Only WSJ texts have been converted to dependencies for the CoNLL shared tasks. 
 + 
 +==== Size ==== 
 + 
 +The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train larger models. Sections 2-11 of the Wall Street Journal part of the treebank were used for training and a subset of section 23 was used for testing. 
 + 
 +^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ 
 +| CoNLL 2007 |  18577 |    446,573 |   214 |     5003 |        |          |  18791 |    451,576 |  24.03 | 
 +| CoNLL 2009 |  39279 |    958,167 |  1334 |    33368 |   2399 |    57676 |  43012 |  1,049,211 |  24.39 | 
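The totals and the average sentence length in the table above follow from the per-part counts. A minimal sketch of the arithmetic, with a hypothetical helper name:

```python
def summarize(parts):
    """parts: list of (sentences, tokens) pairs for train/test portions.

    Returns (total sentences, total tokens, mean tokens per sentence).
    """
    sentences = sum(s for s, _ in parts)
    tokens = sum(t for _, t in parts)
    return sentences, tokens, round(tokens / sentences, 2)


# Counts copied from the table: train plus test portions.
conll2007 = [(18577, 446573), (214, 5003)]
conll2009 = [(39279, 958167), (1334, 33368), (2399, 57676)]
print(summarize(conll2007))
print(summarize(conll2009))
```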
 + 
 +==== Inside ==== 
 + 
 +The original Penn Treebank uses the [[:format-penn|Penn MRG ("merged") bracketing format]]. CoNLL 2007 uses the [[:format-conll|CoNLL-X format]]; the CoNLL 2008 and 2009 formats differ slightly (in the number and meaning of columns). 
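For orientation, a minimal sketch of reading one token line in the CoNLL-X format, which has ten tab-separated columns (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL). The ''Token'' type and the function name are hypothetical. The sketch assumes that HEAD is always numeric, which may not hold for blind test files, and it does not cover the CoNLL 2008 and 2009 layouts.

```python
from typing import NamedTuple, Tuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: Tuple[str, ...]
    head: int
    deprel: str
    phead: str
    pdeprel: str


def parse_conllx_line(line):
    """Parse one non-empty CoNLL-X token line into a Token."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 columns, got %d" % len(cols))
    # An underscore marks an empty value; features are separated by '|'.
    feats = () if cols[5] == "_" else tuple(cols[5].split("|"))
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7], cols[8], cols[9])


# Token 5 of the Greek training sample shown earlier on this page.
line = "5\tείναι\tείμαι\tVb\tVbMn\tId|Pr|03|Pl|Xx|Ip|Pv|Xx\t10\tObj_Co\t_\t_"
tok = parse_conllx_line(line)
print(tok.form, tok.head, tok.deprel, tok.feats)
```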
 + 
 +Conversion for CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|(Johansson and Nugues, 2007)]]. 
 + 
 +The original Penn Treebank contains non-terminal labels, function tags and part-of-speech tags, all assigned manually. The CoNLL 2009 version contains manual and automatic disambiguation. See above for documentation of the part-of-speech tags. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=en::penn|DZ Interset]] to inspect the tagset. The original treebank and the CoNLL 2007 version do not contain lemmas. The CoNLL 2009 version includes some lemmas, but most of the time they are just lowercased word forms, e.g. nouns are not converted to singular. Nevertheless, there is some base-form normalization of verbs.
  
