Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks [2011/11/15 08:52]
zeman
user:zeman:treebanks [2011/11/20 18:02]
zeman English domain.
Line 189: Line 189:
 ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) columns of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the two CoNLL tagsets.
+The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.bultreebank.org/TechRep/BTB-TR03.pdf|BTB positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=bg::conll|DZ Interset]] to inspect the CoNLL tagset.
  
 The morphological analysis does not include lemmas. The morphosyntactic tags have been assigned (probably) manually.
Line 769: Line 769:
 ==== Versions ====
  
 +  * PDT 0.5 (1998)
   * PDT 1.0 (2001)
   * PDT 2.0 (2006)
Line 804: Line 805:
 ==== Domain ====
  
-Newswire text from press agencies (Agence France Presse, Ummah, Al Hayat, An Nahar, Xinhua 2001-2003).
+Newswire text (Lidové noviny, Mladá fronta Dnes), business weekly (Českomoravský Profit) and a scientific magazine (Vesmír).
  
 ==== Size ====
  
-According to their website, the original PADT 1.0 contains 113,500 tokens annotated analytically. The CoNLL 2006 version contains 59752 tokens in 1606 sentences, yielding 37.21 tokens per sentence on average (CoNLL 2006 data split: 54379 tokens / 1460 sentences training, 5373 tokens / 146 sentences test). The CoNLL 2007 version contains 116,793 tokens in 3043 sentences, yielding 38.38 tokens per sentence on average (CoNLL 2007 data split: 111,669 tokens / 2912 sentences training, 5124 tokens / 131 sentences test).
+All distributions of PDT are officially split into training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data that are annotated only morphologically (M-layer), those that are annotated both morphologically and analytically (surface syntax; M+A layers), and the smallest subset is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing.
  
-As noted in (Nivre et al., 2007), “the parsing units in this treebank are in many cases larger than conventional sentences, which partly explains the high average number of tokens per sentence.”
+The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train the larger models. For CoNLL 2009, only the part of PDT that also contained tectogrammatical annotation was selected, because the 2009 task included semantic learning.
 + 
 +Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences are counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 non-empty ones).
 + 
 +^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ 
 +| PDT 0.5 |     19126 |    327,597 |  3697 |    63718 |   3787 |    65390 |  26610 |    456,705 |  17.16 | 
 +| PDT 1.0 |     73088 |  1,255,590 |  7319 |  126,030 |   7507 |  125,713 |  87914 |  1,489,748 |  16.95 | 
 +| PDT 2.0 |     68562 |  1,172,299 |  9270 |  158,962 |  10148 |  173,586 |  87980 |  1,504,847 |  17.10 | 
 +| CoNLL 2006 |  72703 |  1,249,408 |   365 |     5853 |        |          |  73068 |  1,255,261 |  17.18 | 
 +| CoNLL 2007 |  25364 |    432,296 |   286 |     4724 |        |          |  25650 |    437,020 |  17.04 | 
 +| CoNLL 2009 |  38727 |    652,544 |  5228 |    87988 |   4213 |    70348 |  48168 |    810,880 |  16.83 |
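The "Sentence Length" column is the total token count divided by the total sentence count. The following minimal sketch recomputes it from the table and checks whether the per-split figures add up to the listed totals. The dictionary layout and the function names are illustrative only, and empty e-test cells are entered as zero.

```python
# Figures copied from the size table: (sentences, tokens) for train, d-test,
# e-test and the listed total. Empty e-test cells are entered as (0, 0).
ROWS = {
    "PDT 0.5":    ((19126, 327597),  (3697, 63718),  (3787, 65390),   (26610, 456705)),
    "PDT 1.0":    ((73088, 1255590), (7319, 126030), (7507, 125713),  (87914, 1489748)),
    "PDT 2.0":    ((68562, 1172299), (9270, 158962), (10148, 173586), (87980, 1504847)),
    "CoNLL 2006": ((72703, 1249408), (365, 5853),    (0, 0),          (73068, 1255261)),
    "CoNLL 2007": ((25364, 432296),  (286, 4724),    (0, 0),          (25650, 437020)),
    "CoNLL 2009": ((38727, 652544),  (5228, 87988),  (4213, 70348),   (48168, 810880)),
}


def sentence_length(sentences, tokens):
    """Average number of tokens per sentence, rounded as in the table."""
    return round(tokens / sentences, 2)


def split_sums(row):
    """Sum sentences and tokens over the train, d-test and e-test splits."""
    train, dtest, etest, _listed_total = row
    return (train[0] + dtest[0] + etest[0], train[1] + dtest[1] + etest[1])


if __name__ == "__main__":
    for name, row in ROWS.items():
        listed_total = row[3]
        consistent = split_sums(row) == listed_total
        print(name, sentence_length(*listed_total), "splits match total:", consistent)
```

Run on the figures above, the check reports a mismatch only for PDT 1.0: the three token counts sum to 1,507,333, whereas the listed total is 1,489,748 (the listed total is the one that reproduces the 16.95 average). Which of the figures is off cannot be decided from this page alone.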
  
 ==== Inside ====
  
-The original PADT 1.0 is distributed in the [[:format-fs|FS format]]. The CoNLL versions are distributed in the [[:format-conll|CoNLL-X format]]. The original PADT contains more information than the CoNLL version. There is morphological annotation (tags and lemmas) both manual and by a tagger (while only manual is in CoNLL data), glosses etc. However, the most important piece of information that got lost during the conversion to CoNLL is the FS attribute called ''parallel''. It distinguishes conjuncts from shared modifiers of coordination and thus the syntactic structure is incomplete without it.
+PDT 1.0 is distributed in the [[::format-csts|CSTS format]]. PDT 2.0 uses the [[::format-pml|PML format]]. CoNLL 2006 and 2007 use the [[:format-conll|CoNLL-X format]]. The CoNLL 2009 format is slightly different (number and meaning of columns). Unlike the other formats, the CSTS format used the ISO-8859-2 character encoding.
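To make the difference in "number and meaning of columns" concrete, here is a minimal sketch that names the columns of both formats and reads one row of each. The column names follow the CoNLL-X and CoNLL 2009 shared task descriptions; the helper ''parse_row'' is hypothetical. The sketch assumes the tab-separated layout of the distributed files rather than the wiki tables shown on this page.

```python
# Column inventories of the two formats (names as used by the shared tasks).
CONLLX_COLUMNS = [
    "ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG", "FEATS",
    "HEAD", "DEPREL", "PHEAD", "PDEPREL",
]
# CoNLL 2009 has 14 fixed columns followed by one APRED column per predicate.
CONLL2009_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED",
]


def parse_row(line, columns):
    """Map one tab-separated token line onto the given column names.

    Cells beyond the named columns (the APRED columns of CoNLL 2009)
    are collected in a list under the key "APREDS".
    """
    cells = line.rstrip("\n").split("\t")
    row = dict(zip(columns, cells))
    extra = cells[len(columns):]
    if extra:
        row["APREDS"] = extra
    return row


if __name__ == "__main__":
    conllx_line = "2\trychlejší\trychlý\tA\tA\tGen=F|Num=S|Cas=1|Gra=2|Neg=A\t0\tExD\t_\t_"
    print(parse_row(conllx_line, CONLLX_COLUMNS))
```

In CoNLL-X, PHEAD and PDEPREL hold projective variants of the head and label; in CoNLL 2009, the columns prefixed with P hold automatically predicted values, which is why the two formats cannot be read with the same column list.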
  
-Word forms and lemmas are vocalized, i.e. they contain diacritics for short vowels as well as consonant gemination and a few other things. The CoNLL 2006 version includes [[http://www.qamus.org/transliteration.htm|Buckwalter transliteration]] of the Arabic script (in the same column as Arabic, attached to the Arabic form/lemma with an underscore character).
+The CSTS format (PDT 0.5 and 1.0) contains morphological annotation (lemmas and tags) both manual and by two taggers. The CoNLL 2009 version contains manual and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology.
  
-Note that tokenization of Arabic typically includes splitting original words (inserting spaces between letters), not just separating punctuation from words. Example: وبالفالوجة = wabiālfālūjah = wa/CONJ + bi/PREP + AlfAlwjp/NOUN_PROP = and in al-Falujah. In PADT, conjunctions and prepositions are separate tokens and nodes.
+The original PDT uses 15-character positional morphological tags. The CoNLL versions convert the tags to the two/three CoNLL columns, CPOS, POS and FEAT. In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to the lemma in PDT (see [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Hana and Zeman, 2005]]).
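The conversion of a positional tag to the CoNLL columns can be sketched as follows. This is an illustration inferred from the samples on this page, not the converter actually used for the shared tasks. The feature names Gen, Num, Cas, Per, Ten, Gra, Neg and Voi appear in the samples; the names PGe, PNu and Var are assumptions, and the Sem feature (which comes from the lemma, not from the tag) is left out.

```python
# One entry per tag position. None marks positions that do not become a
# feature: positions 1 and 2 go to CPOS and POS, positions 13 and 14 are
# reserved. "PGe", "PNu" and "Var" are assumed names (not attested in the
# samples on this page).
FEATURE_NAMES = [
    None, None, "Gen", "Num", "Cas", "PGe", "PNu", "Per",
    "Ten", "Gra", "Neg", "Voi", None, None, "Var",
]


def split_positional_tag(tag):
    """Split a 15-character PDT tag into (CPOS, POS, FEAT) strings."""
    if len(tag) != 15:
        raise ValueError("expected a 15-character positional tag: %r" % tag)
    feats = [
        "%s=%s" % (name, value)
        for name, value in zip(FEATURE_NAMES, tag)
        if name is not None and value != "-"
    ]
    # An empty feature set is written as an underscore in the CoNLL format.
    return tag[0], tag[1], "|".join(feats) or "_"


if __name__ == "__main__":
    for tag in ("AAFS1----2A----", "VB-P---3P-AA---", "Cv-------------"):
        print(tag, split_positional_tag(tag))
```

For the tag of "rychlejší" (''AAFS1----2A----'') this yields CPOS ''A'', POS ''A'' and the feature string found in the CoNLL 2006 sample for the same word.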
  
-The original PADT 1.0 uses 10-character positional morphological tags whose documentation is hard to find. The CoNLL 2006 version converts the tags to the three CoNLL columns, CPOS, POS and FEAT, most of the information being encoded as pipe-separated attribute-value assignments in FEAT. There //should// be a 1-1 mapping between the PADT positional tags and the CoNLL 2006 annotation. The CoNLL 2007 version uses a tag conversion different from CoNLL 2006. Both CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=ar::conll|DZ Interset]] to inspect the two CoNLL tagsets. Also look at the [[http://quest.ms.mff.cuni.cz/cgi-bin/elixir/index.fcgi|Elixir FM online interface]] for a later development of the morphological analyzer created along with PADT.
+See above for documentation of the morphological tags. All CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=cs::pdt|DZ Interset]] to inspect the PDT and the CoNLL tagsets.
  
-The guidelines for syntactic annotation are documented in the [[http://ufal.mff.cuni.cz/padt/PADT_1.0/docs/guides/PADT_Analytical.pdf|PADT annotation manual]] (only peculiarities of Arabic are documented, otherwise it is referenced to the annotation manual for the Czech treebank). The list and brief description of syntactic tags (dependency relation labels) can be found in [[http://ufal.mff.cuni.cz/padt/PADT_1.0/docs/papers/2002-flm-padt.pdf|(Smrž, Šnaidauf and Zemánek, 2002)]].
+The guidelines for syntactic annotation are documented in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|PDT annotation manual]].
  
 ==== Sample ====
  
-The first two sentences of the CoNLL 2006 training data:
+The first sentence of the PDT 1.0 training data:
  
-| 1 | غِيابُ_giyAbu غِياب_giyAb case=1<nowiki>|</nowiki>def=| 0 | ExD | _ | _ | +<code xml><csts lang=cs> 
-فُؤاد_fu&Ad فُؤاد_fu&Ad | _ | Atr | _ | _ | +<h> 
-كَنْعان_kanoEAn كَنْعان_kanoEAn | 1 | Atr | _ | _ |+<source>Českomoravský profit</source> 
 +<markup> 
 +<mauth>js 
 +<mdate>1996-2000 
 +<mdesc>Manual analytical annotation 
 +</markup> 
 +<markup> 
 +<mauth>kk,lk 
 +<mdate>1996-2000 
 +<mdesc>Manual morphological annotation 
 +</markup> 
 +</h> 
 +<doc file="s/inf/j/1994/cmpr9406" id="001"> 
 +<a> 
 +<mod>
 +<txtype>inf 
 +<genre>mix 
 +<med>
 +<temp>1994 
 +<authname>
 +<opus>cmpr9406 
 +<id>001 
 +</a> 
 +<c> 
 +<p n=1> 
 +<s id="cmpr9406:001-p1s1"> 
 +<p n=2> 
 +<s id="cmpr9406:001-p2s1"> 
 +<f cap>Třikrát<l>třikrát`3<t>Cv-------------<MDl src="a">třikrát`3<MDt src="a">Cv-------------<MDl src="b">třikrát`3<MDt src="b">Cv-------------<A>Adv<r>1<g>
 +<f>rychlejší<l>rychlý<t>AAFS1----2A----<MDl src="a">rychlý<MDt src="a">AANS1----2A----<MDl src="b">rychlý<MDt src="b">AAFS1----2A----<A>ExD<r>2<g>
 +<f>než<l>než-2<t>J,-------------<MDl src="a">než-2<MDt src="a">J,-------------<MDl src="b">než-2<MDt src="b">J,-------------<A>AuxC<r>3<g>
 +<f>slovo<l>slovo<t>NNNS1-----A----<MDl src="a">slovo<MDt src="a">NNNS4-----A----<MDl src="b">slovo<MDt src="b">NNNS1-----A----<A>ExD<r>4<g>3</code> 
 + 
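Each token line of the CSTS sample carries the word form, the manually assigned lemma and tag, the analytical function, the word order and the governor. The following sketch pulls these fields out with regular expressions. It is only an approximation based on the sample above, not a full CSTS (SGML) parser; the function name and the returned dictionary keys are illustrative.

```python
import re

# A token line starts with <f ...> (word) or <d ...> (punctuation).
# Lowercase only: <D> (no preceding space) and <doc ...> must not match.
TOKEN_RE = re.compile(r"^<[fd](?:\s[^>]*)?>(?P<form>[^<]*)")

# Markers observed in the sample: <l> lemma, <t> tag, <A> analytical
# function, <r> word order, <g> governor (0 would be the root).
FIELD_RES = {
    "lemma": re.compile(r"<l>([^<]*)"),
    "tag": re.compile(r"<t>([^<]*)"),
    "afun": re.compile(r"<A>([^<]*)"),
    "ord": re.compile(r"<r>(\d+)"),
    "head": re.compile(r"<g>(\d+)"),
}


def parse_csts_token(line):
    """Return a dict of fields for a token line, or None for other lines."""
    match = TOKEN_RE.match(line)
    if match is None:
        return None
    token = {"form": match.group("form")}
    for name, regex in FIELD_RES.items():
        hit = regex.search(line)
        token[name] = hit.group(1) if hit else None
    for key in ("ord", "head"):
        if token[key] is not None:
            token[key] = int(token[key])
    return token


if __name__ == "__main__":
    sample = ('<f>slovo<l>slovo<t>NNNS1-----A----<MDl src="a">slovo'
              '<MDt src="a">NNNS4-----A----<A>ExD<r>4<g>3')
    print(parse_csts_token(sample))
```

Fields that are absent from a line come back as ''None''. The tagger output in the ''<MDl>'' and ''<MDt>'' elements is ignored by this sketch.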
 +The first two sentences of the PDT 1.0 d-test data: 
 + 
 +<code xml><csts lang=cs> 
 +<h> 
 +<source>Lidové noviny</source> 
 +<markup> 
 +<mauth>zu 
 +<mdate>1996-2000 
 +<mdesc>Manual analytical annotation 
 +</markup> 
 +</h> 
 +<doc file="s/pub/nws/1994/ln94206" id="1"> 
 +<a> 
 +<mod>
 +<txtype>pub 
 +<genre>mix 
 +<med>nws 
 +<temp>1994 
 +<authname>
 +<opus>ln94206 
 +<id>
 +</a> 
 +<c> 
 +<p n=1> 
 +<s id="ln94206:1-p1s1"> 
 +<i>ti 
 +<f cap>Lidé<MDl src="a">člověk<MDt src="a">NNMP1-----A---1<MDl src="b">člověk<MDt src="b">NNMP1-----A---1<A>ExD<r>1<g>
 +<p n=2> 
 +<s id="ln94206:1-p2s1"> 
 +<f upper.abbr>ING<MDl src="a">Ing-1_:B_^(inženýr)<MDt src="a">NNMXX-----A---8<MDl src="b">Ing-1_:B_^(inženýr)<MDt src="b">NNMXX-----A---8<A>Atr<r>1<g>
 +<D> 
 +<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxG<r>2<g>
 +<f upper>PETR<MDl src="a">Petr_;Y<MDt src="a">NNMS1-----A----<MDl src="b">Petr_;Y<MDt src="b">NNMS1-----A----<A>Atr<r>3<g>
 +<f upper>KARAS<MDl src="a">karas<MDt src="a">NNMS1-----A----<MDl src="b">karas<MDt src="b">NNMS1-----A----<A>Sb_Ap<r>4<g>11 
 +<D> 
 +<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>AuxX<r>5<g>
 +<f mixed>CSc<MDl src="a">CSc-1_:B_^(kandidát_věd)<MDt src="a">NNMXX-----A---8<MDl src="b">CSc-1_:B_^(kandidát_věd)<MDt src="b">NNMXX-----A---8<A>Atr<r>6<g>
 +<D> 
 +<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxG<r>7<g>
 +<d>(<MDl src="a">(<MDt src="a">Z:-------------<MDl src="b">(<MDt src="b">Z:-------------<A>ExD<r>8<g>
 +<D> 
 +<f num>53<MDl src="a">53<MDt src="a">C=-------------<MDl src="b">53<MDt src="b">C=-------------<A>ExD_Pa<r>9<g>
 +<D> 
 +<d>)<MDl src="a">)<MDt src="a">Z:-------------<MDl src="b">)<MDt src="b">Z:-------------<A>ExD<r>10<g>
 +<D> 
 +<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>Apos<r>11<g>20 
 +<f>generální<MDl src="a">generální<MDt src="a">AAMS1----1A----<MDl src="b">generální<MDt src="b">AAMS1----1A----<A>Atr<r>12<g>13 
 +<f>ředitel<MDl src="a">ředitel<MDt src="a">NNMS1-----A----<MDl src="b">ředitel<MDt src="b">NNMS1-----A----<A>Sb_Co<r>13<g>15 
 +<f upper>ČEZ<MDl src="a">ČEZ-1_:B_;K_^(České_energetické_závody)<MDt src="a">NNIPX-----A---8<MDl src="b">ČEZ-1_:B_;K_^(České_energetické_závody)<MDt src="b">NNIPX-----A---8<A>Atr<r>14<g>13 
 +<f>a<MDl src="a">a-1<MDt src="a">J^-------------<MDl src="b">a-1<MDt src="b">J^-------------<A>Coord_Ap<r>15<g>11 
 +<f>předseda<MDl src="a">předseda<MDt src="a">NNMS1-----A----<MDl src="b">předseda<MDt src="b">NNMS1-----A----<A>Sb_Co<r>16<g>15 
 +<f>jeho<MDl src="a">jeho_^(přivlast.)<MDt src="a">PSXXXZS3-------<MDl src="b">jeho_^(přivlast.)<MDt src="b">PSXXXZS3-------<A>Atr<r>17<g>18 
 +<f>představenstva<MDl src="a">představenstvo<MDt src="a">NNNS2-----A----<MDl src="b">představenstvo<MDt src="b">NNNS2-----A----<A>Atr<r>18<g>16 
 +<D> 
 +<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>AuxX<r>19<g>11 
 +<f>je<MDl src="a">být<MDt src="a">VB-S---3P-AA---<MDl src="b">být<MDt src="b">VB-S---3P-AA---<A>Pred<r>20<g>
 +<f>absolventem<MDl src="a">absolvent<MDt src="a">NNMS7-----A----<MDl src="b">absolvent<MDt src="b">NNMS7-----A----<A>Pnom<r>21<g>20 
 +<f>elektrotechnické<MDl src="a">elektrotechnický<MDt src="a">AAFS2----1A----<MDl src="b">elektrotechnický<MDt src="b">AAFS2----1A----<A>Atr<r>22<g>23 
 +<f>fakulty<MDl src="a">fakulta<MDt src="a">NNFS2-----A----<MDl src="b">fakulta<MDt src="b">NNFS2-----A----<A>Atr_Co<r>23<g>25 
 +<f upper>ČVUT<MDl src="a">ČVUT-1_:B_;K_^(České_vysoké_učení_technické)<MDt src="a">NNNXX-----A---8<MDl src="b">ČVUT-1_:B_;K_^(České_vysoké_učení_technické)<MDt src="b">NNNXX-----A---8<A>Atr<r>24<g>23 
 +<f>a<MDl src="a">a-1<MDt src="a">J^-------------<MDl src="b">a-1<MDt src="b">J^-------------<A>Coord<r>25<g>21 
 +<f>postgraduálního<MDl src="a">postgraduální<MDt src="a">AANS2----1A----<MDl src="b">postgraduální<MDt src="b">AANS2----1A----<A>Atr<r>26<g>27 
 +<f>studia<MDl src="a">studium<MDt src="a">NNNS2-----A----<MDl src="b">studium<MDt src="b">NNNS2-----A----<A>Atr_Co<r>27<g>25 
 +<f>v<MDl src="a">v-1<MDt src="a">RR--6----------<MDl src="b">v-1<MDt src="b">RR--6----------<A>AuxP<r>28<g>29 
 +<f>oboru<MDl src="a">obor_^(lidské_činnosti)<MDt src="a">NNIS6-----A----<MDl src="b">obor_^(lidské_činnosti)<MDt src="b">NNIS6-----A----<A>AuxP<r>29<g>27 
 +<f>metod<MDl src="a">metoda<MDt src="a">NNFP2-----A----<MDl src="b">metoda<MDt src="b">NNFP2-----A----<A>Atr<r>30<g>29 
 +<f>operační<MDl src="a">operační<MDt src="a">AAFS2----1A----<MDl src="b">operační<MDt src="b">AAFS2----1A----<A>Atr<r>31<g>32 
 +<f>analýzy<MDl src="a">analýza<MDt src="a">NNFS2-----A----<MDl src="b">analýza<MDt src="b">NNFS2-----A----<A>Atr<r>32<g>30 
 +<D> 
 +<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxK<r>33<g>0</code> 
 + 
 +The first sentence of the PDT 1.0 e-test data: 
 + 
 +<code xml><csts lang=cs> 
 +<h> 
 +<source>Lidové noviny</source> 
 +<markup> 
 +<mauth>zu 
 +<mdate>1996-2000 
 +<mdesc>Manual analytical annotation 
 +</markup> 
 +</h> 
 +<doc file="s/pub/nws/1994/ln94209" id="1"> 
 +<a> 
 +<mod>
 +<txtype>pub 
 +<genre>mix 
 +<med>nws 
 +<temp>1994 
 +<authname>
 +<opus>ln94209 
 +<id>
 +</a> 
 +<c> 
 +<p n=1> 
 +<s id="ln94209:1-p1s1"> 
 +<f cap>Přádelny<MDl src="a">přádelna<MDt src="a">NNFP1-----A----<MDl src="b">přádelna<MDt src="b">NNFP1-----A----<A>Sb<r>1<g>
 +<f>mají<MDl src="a">mít<MDt src="a">VB-P---3P-AA---<MDl src="b">mít<MDt src="b">VB-P---3P-AA---<A>Pred<r>2<g>
 +<f>dvojnásob<MDl src="a">dvojnásob<MDt src="a">Db-------------<MDl src="b">dvojnásob<MDt src="b">Db-------------<A>Obj<r>3<g>
 +<f>vad<MDl src="a">vada<MDt src="a">NNFP2-----A----<MDl src="b">vada<MDt src="b">NNFP2-----A----<A>Atr<r>4<g>3</code> 
 + 
 +Morphological annotation of the first amw training file of the PDT 2.0: 
 + 
 +<code xml><mdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> 
 + <head> 
 +  <schema href="mdata_schema.xml" /> 
 +  <references> 
 +   <reffile id="w" name="wdata" href="cmpr9406_001.w.gz" /> 
 +  </references> 
 + </head> 
 + <meta> 
 +  <lang>cs</lang> 
 +  <annotation_info id="manual"> 
 +   <desc>Manual annotation</desc> 
 +  </annotation_info> 
 + </meta> 
 + <s id="m-cmpr9406-001-p2s1"> 
 +  <m id="m-cmpr9406-001-p2s1w1"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w1</w.rf> 
 +   <form>Třikrát</form> 
 +   <lemma>třikrát`3</lemma> 
 +   <tag>Cv-------------</tag> 
 +  </m> 
 +  <m id="m-cmpr9406-001-p2s1w2"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w2</w.rf> 
 +   <form>rychlejší</form> 
 +   <lemma>rychlý</lemma> 
 +   <tag>AAFS1----2A----</tag> 
 +  </m> 
 +  <m id="m-cmpr9406-001-p2s1w3"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w3</w.rf> 
 +   <form>než</form> 
 +   <lemma>než-2</lemma> 
 +   <tag>J,-------------</tag> 
 +  </m> 
 +  <m id="m-cmpr9406-001-p2s1w4"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w4</w.rf> 
 +   <form>slovo</form> 
 +   <lemma>slovo</lemma> 
 +   <tag>NNNS1-----A----</tag> 
 +  </m> 
 + </s></code> 
 + 
 +Analytical (surface-syntactic) annotation of the first amw training file of the PDT 2.0: 
 + 
 +<code xml><adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> 
 + <head> 
 +  <schema href="adata_schema.xml" /> 
 +  <references> 
 +   <reffile id="m" name="mdata" href="cmpr9406_001.m.gz" /> 
 +   <reffile id="w" name="wdata" href="cmpr9406_001.w.gz" /> 
 +  </references> 
 + </head> 
 + <meta> 
 +  <annotation_info> 
 +   <desc>Manual annotation</desc> 
 +  </annotation_info> 
 + </meta> 
 + <trees> 
 +  <LM id="a-cmpr9406-001-p2s1"> 
 +   <s.rf>m#m-cmpr9406-001-p2s1</s.rf> 
 +   <ord>0</ord> 
 +   <children> 
 +    <LM id="a-cmpr9406-001-p2s1w2"> 
 +     <m.rf>m#m-cmpr9406-001-p2s1w2</m.rf> 
 +     <afun>ExD</afun> 
 +     <ord>2</ord> 
 +     <children> 
 +      <LM id="a-cmpr9406-001-p2s1w1"> 
 +       <m.rf>m#m-cmpr9406-001-p2s1w1</m.rf> 
 +       <afun>Adv</afun> 
 +       <ord>1</ord> 
 +      </LM> 
 +      <LM id="a-cmpr9406-001-p2s1w3"> 
 +       <m.rf>m#m-cmpr9406-001-p2s1w3</m.rf> 
 +       <afun>AuxC</afun> 
 +       <ord>3</ord> 
 +       <children> 
 +        <LM id="a-cmpr9406-001-p2s1w4"> 
 +         <m.rf>m#m-cmpr9406-001-p2s1w4</m.rf> 
 +         <afun>ExD</afun> 
 +         <ord>4</ord> 
 +        </LM> 
 +       </children> 
 +      </LM> 
 +     </children> 
 +    </LM> 
 +   </children> 
 +  </LM></code> 
 + 
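In the analytical PML file the dependency tree is encoded by nesting: every node is an ''LM'' element and its dependents sit inside its ''children'' element. The following sketch flattens such a tree into (order, afun, parent order) triples using the Python standard library. It assumes that every node is wrapped in ''LM'', as in the sample above, and it works on a shortened copy of that sample with the references left out.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the sample above.
NS = {"pml": "http://ufal.mff.cuni.cz/pdt/pml/"}

# Shortened copy of the sample tree (ids and references omitted).
SAMPLE = """<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
 <trees>
  <LM><ord>0</ord>
   <children>
    <LM><afun>ExD</afun><ord>2</ord>
     <children>
      <LM><afun>Adv</afun><ord>1</ord></LM>
      <LM><afun>AuxC</afun><ord>3</ord>
       <children>
        <LM><afun>ExD</afun><ord>4</ord></LM>
       </children>
      </LM>
     </children>
    </LM>
   </children>
  </LM>
 </trees>
</adata>"""


def walk(node, parent_ord, out):
    """Collect (ord, afun, parent ord) for the node and all its descendants."""
    order = int(node.findtext("pml:ord", namespaces=NS))
    afun = node.findtext("pml:afun", namespaces=NS)
    if afun is not None:  # the technical root has no afun and is skipped
        out.append((order, afun, parent_ord))
    children = node.find("pml:children", NS)
    if children is not None:
        for child in children.findall("pml:LM", NS):
            walk(child, order, out)


def flatten(xml_text):
    """Return the dependency triples of every tree, sorted by word order."""
    root = ET.fromstring(xml_text)
    trees = []
    for top in root.find("pml:trees", NS).findall("pml:LM", NS):
        triples = []
        walk(top, 0, triples)
        trees.append(sorted(triples))
    return trees


if __name__ == "__main__":
    print(flatten(SAMPLE))
```

The heads obtained this way (2, 0, 2, 3) agree with the ''<g>'' value that survives in the CSTS sample of the same sentence.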
 +The first two sentences of the CoNLL 2006 and 2007 training data: 
 + 
 +| 1 | Třikrát | třikrát`3 | C | v | _ | 2 | Adv | _ | _ |
 +| 2 | rychlejší | rychlý | A | A | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=2<nowiki>|</nowiki>Neg=A | 0 | ExD | _ | _ |
 +| 3 | než | než-2 | J | , | _ | 2 | AuxC | _ | _ |
 +| 4 | slovo | slovo | N | N | Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 3 | ExD | _ | _ |
 | |||||||||| | ||||||||||
-| 1 | فُؤاد_fu&Ad فُؤاد_fu&Ad | Z | Z | _ | 2 | Atr | _ | _ | +| 1 | Faxu fax | N | N | Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=3<nowiki>|</nowiki>Neg=A 2 | Obj | _ | _ | 
-| 2 | كَنْعان_kanoEAn | كَنْعان_kanoEAn | Z | Z | _ | 9 | Sb | _ | _ | +škodí škodit Num=P<nowiki>|</nowiki>Per=3<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A 0 | Pred | _ | _ | 
-| 3 | ،_, | ،_, | G | G | _ | 2 | AuxG | _ | _ | +především především | _ | AuxZ | _ | _ | 
-| 4 | رائِد_rA}id | رائِد_rA}id | N | N | _ | 2 | Atr | _ | _ | +přetížené přetížený Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=Atr | _ | _ | 
-| 5 | القِصَّة_AlqiS~ap | قِصَّة_qiS~ap | N | N | gen=F<nowiki>|</nowiki>num=S<nowiki>|</nowiki>def=Atr | _ | _ | +telefonní telefonní Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | Atr | _ | _ | 
-القَصِيرَةِ_AlqaSiyrapi قَصِير_qaSiyr gen=F<nowiki>|</nowiki>num=S<nowiki>|</nowiki>case=2<nowiki>|</nowiki>def=Atr | _ | _ | +linky linka | N | N | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=Sb | _ | _ | 
-فِي_fiy فِي_fiy | _ | AuxP | _ | _ | +| _ | AuxG | _ | _ |
-لُبْنانِ_lubonAni لُبْنان_lubonAn case=2<nowiki>|</nowiki>def=7 | Atr | _ | _ | +
-| 9 | رَحَلَ_raHala | رَحَل-َ_raHal-a | V | VP | pers=3<nowiki>|</nowiki>gen=M<nowiki>|</nowiki>num=Pred | _ | _ | +
-10 مَساءَ_masA'مَساء_masA' Adv _ | _ | +
-| 11 | أَمْسِ_>amosi أَمْسِ_>amosi D | D | _ | 10 | Atr | _ | _ | +
-12 عَن_Ean عَن_Ean | P | P | _ | 9 | AuxP | _ | _ | +
-| 13 | 81_81 | 81_81 | Q | Q | _ | 12 | Adv | _ | _ | +
-| 14 | عاماً_EAmAF | عام_EAm | N | N | gen=M<nowiki>|</nowiki>num=S<nowiki>|</nowiki>case=4<nowiki>|</nowiki>def=13 Atr | _ | _ | +
-15 ._. ._. | _ | AuxK | _ | _ |+
  
 The first sentence of the CoNLL 2006 test data:
  
-| 1 | اِتِّفاقٌ_Ait~ifAqN اِتِّفاق_Ait~ifAq | N | N | case=1<nowiki>|</nowiki>def=ExD | _ | _ | +| 1 | Podobně podobně | D | g | Gra=1<nowiki>|</nowiki>Neg=A | 5 | Adv | _ | _ | 
-بَيْنَ_bayona بَيْنَ_bayona | P | P | _ | 1 | AuxP | _ | _ | +| 2 | , | , | Z | : | _ | 3 | AuxX | _ | _ | 
-لُبْنانِ_lubonAni لُبْنان_lubonAn | Z | case=2<nowiki>|</nowiki>def=| 4 | Atr | _ | _ | +| 3 | myslím | myslit | V | B | Num=S<nowiki>|</nowiki>Per=1<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 5 | Pred_Pa | _ | _ | 
-وَ_wa وَ_wa | _ | 2 | Coord | _ | _ | +| 4 | , | , | Z | : | _ | 3 | AuxX | _ | _ | 
-سُورِيَّةٍ_suwriy~apK سُورِيا_suwriyA gen=F<nowiki>|</nowiki>num=S<nowiki>|</nowiki>case=2<nowiki>|</nowiki>def=I | 4 | Atr | _ | _ | +| 5 | postupuje | postupovat | V | B | Num=S<nowiki>|</nowiki>Per=3<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 0 | Pred | _ | _ | 
-عَلَى_EalaY عَلَى_EalaY | _ | 1 | AuxP | _ | _ | +| 6 | většina | většina | N | N | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=Sb | _ | _ | 
-| 7 | رَفْعِ_rafoEi رَفْع_rafoE | N | N | case=2<nowiki>|</nowiki>def=R | 6 | Atr | _ | _ | +českých český A | A | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 8 | Atr | _ | _ | 
-مُسْتَوَى_musotawaY مُسْتَوَى_musotawaY | N | N | _ | | Atr | _ | _ | +| 8 | bank | banka | N | N | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Neg=A | 6 | Atr | _ | _ | 
-التَبادُلِ_AltabAduli تَبادُل_tabAdul | N | N | case=2<nowiki>|</nowiki>def=Atr | _ | _ | +| 9 | , | , | Z | : | _ | 11 | AuxX | _ | _ | 
-10 التِجارِيِّ_AltijAriy~i | تِجارِيّ_tijAriy~ | A | A | case=2<nowiki>|</nowiki>def=Atr | _ | _ | +| 10 | zejména | zejména | D | b | _ | 12 | AuxZ | _ | _ | 
-11 إِلَى_<ilaY إِلَى_<ilaY | _ | 7 | AuxP | _ | _ | +| 11 | v | v-| R | R | Cas=6 | 5 | AuxP | _ | _ | 
-12 500_500 500_500 | _ | 11 Atr | _ | _ | +12 případech případ N | N | Gen=I<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=6<nowiki>|</nowiki>Neg=A | 11 | Adv | _ | _ | 
-13 مِلْيُونِ_miloyuwni مِلْيُون_miloyuwn | N | N | case=2<nowiki>|</nowiki>def=| 12 | Atr | _ | _ | +| 13 | , | , | Z | : | _ | 17 | AuxX | _ | _ | 
-| 14 | دُولارٍ_duwlArK دُولار_duwlAr | N | N | case=2<nowiki>|</nowiki>def=| 13 | Atr | _ | _ |+| 14 | kdy | kdy | D | b | _ | 17 | Adv | _ | _ | 
 +| 15 | by | být | V | c | Num=X<nowiki>|</nowiki>Per=17 | AuxV | _ | _ | 
 +| 16 | se | se | P | 7 | Num=X<nowiki>|</nowiki>Cas=| 18 | AuxT | _ | _ | 
 +| 17 | mělo | mít | V | p | Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Per=X<nowiki>|</nowiki>Ten=R<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 12 | Atr | _ | _ | 
 +18 jednat jednat f | Neg=A | 17 | Obj | _ | _ | 
 +| 19 | o | o-1 | R | R | Cas=4 | 18 | AuxP | _ | _ | 
 +| 20 | větší | velký | A | A | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Gra=2<nowiki>|</nowiki>Neg=A | 21 | Atr | _ | _ | 
 +21 částky částka Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 19 | Obj | _ | _ | 
 +| 22 | . | . | Z | : | _ | 0 | AuxK | _ | _ | 
 + 
 +The first sentence of the CoNLL 2007 test data: 
 + 
 +| 1 | Proč | proč | D | b | _ | | Adv | _ | _ | 
 +| 2 | mají | mít | V | B | Num=P<nowiki>|</nowiki>Per=3<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 0 | Pred | _ | _ | 
 +| 3 | každý | každý | A | A | Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 4 | Atr | _ | _ | 
 +| 4 | rok | rok | N | N | Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 5 | Adv | _ | _ |
 +| 5 | fasovat | fasovat | V | f | Neg=A | 2 | Obj | _ | _ | 
 +| 6 | speciální | speciální | A | A | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 7 | Atr | _ | _ | 
 +| 7 | taxu | taxa | N | N | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 5 | Obj | _ | _ |
 +| 8 | na | na | R | R | Cas=4 | 7 | AuxP | _ | _ | 
 +| 9 | oblečení | oblečení | N | N | Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 8 | AtrAdv | _ | _ | 
 +| 10 | ? | ? | Z | : | _ | 0 | AuxK | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 training data: 
 + 
 +| 1 | Celní | celní | celní | A | A | SubPOS=A<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | SubPOS=A<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | | 2 | Atr | Atr | Y | celní | _ | RSTR | _ | 
 +| 2 | unie | unie | unie | N | N | SubPOS=N<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 0 | 0 | ExD | ExD | Y | unie | _ | _ | _ | 
 +| 3 | v | v | v | R | R | SubPOS=R<nowiki>|</nowiki>Cas=6 | SubPOS=R<nowiki>|</nowiki>Cas=6 | 2 | 2 | AuxP | AuxP | _ | _ | _ | _ | _ | 
 +| 4 | ohrožení | ohrožení | ohrožení | N | N | SubPOS=N<nowiki>|</nowiki>Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=6<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=6<nowiki>|</nowiki>Neg=A | 3 | 3 | Atr | Atr | Y | v-w3017f1 | _ | _ | _ |
 + 
 +The first sentence of the CoNLL 2009 development data: 
 + 
 +| 1 | <nowiki>|</nowiki> | <nowiki>|</nowiki> | <nowiki>|</nowiki> | Z | Z | SubPOS=: | SubPOS=: | 0 | 3 | ExD | AuxG | _ | _ | _ | _ | 
 +| 2 | Daňový | daňový | daňový | A | A | SubPOS=A<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | SubPOS=A<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 3 | 3 | Atr | Atr | Y | daňový | _ | RSTR | 
 +| 3 | poradce | poradce | poradce | N | N | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 0 | 0 | ExD | ExD | Y | poradce | _ | _ | 
 +<nowiki>|</nowiki> | <nowiki>|</nowiki> | <nowiki>|</nowiki> | Z | Z | SubPOS=: | SubPOS=: | 0 | 3 | AuxK | AuxG | _ | _ | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 test data: 
 + 
 +| 1 | Názor | názor | názor | N | N | SubPOS=N<nowiki>|</nowiki>Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | _ | _ | _ | _ | Y | 
 +| experta | expert | expert | N | N | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Neg=A | _ | _ | _ | _ | Y 
 + 
 +==== Parsing ==== 
 + 
 +PDT is a mildly nonprojective treebank. 8351 of the 437,020 tokens in the CoNLL 2007 version are attached nonprojectively (1.91%). 
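The share of nonprojective attachments can be counted as sketched below. The sketch uses one common definition, which is an assumption here because the page does not say which definition the figure above is based on: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The input is assumed to be a well-formed tree given as a list of head indices.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens that are attached nonprojectively.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    The input is assumed to be a tree (no cycles).
    """
    def is_dominated_by(node, ancestor):
        # Climb towards the root and see whether we pass through `ancestor`.
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        low, high = sorted((dependent, head))
        for between in range(low + 1, high):
            if not is_dominated_by(between, head):
                result.append(dependent)
                break
    return result


if __name__ == "__main__":
    # Heads of the four-token training sentence shown in the samples.
    print(nonprojective_tokens([2, 0, 2, 3]))
    # A made-up tree in which token 2 attaches to token 4 across the root.
    print(nonprojective_tokens([3, 4, 0, 3]))
    # The percentage quoted above.
    print(round(100 * 8351 / 437020, 2))
```

Summing the length of the returned list over all sentences and dividing by the token count gives the percentage under this definition.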
 + 
 +There is an [[http://ufal.mff.cuni.cz/czech-parsing/|online summary]] of known results in Czech parsing. 
 + 
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Czech: 
 + 
 +^ Parser (Authors) ^ LAS ^ UAS ^ 
 +| MST (McDonald et al.) | 80.18 | 87.30 | 
 +| Basis (O'Neil) | 76.60 | 85.58 | 
 +| Malt (Nivre et al.) | 78.42 | 84.80 | 
 +| Nara (Yuchang Cheng) | 76.24 | 83.40 | 
 + 
 +The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Czech: 
 + 
 +^ Parser (Authors) ^ LAS ^ UAS ^ 
 +| Nakagawa | 80.19 | 86.28 | 
 +| Carreras | 78.60 | 85.16 | 
 +| Titov et al. | 77.94 | 84.19 | 
 +| Malt (Nilsson et al.) | 77.98 | 83.59 | 
 +| Attardi et al. | 77.37 | 83.40 | 
 +| Malt (Hall et al.) | 77.22 | 82.35 | 
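Because the 2006 scores exclude punctuation tokens and the 2007 scores include them, the two tables are not directly comparable. The following sketch shows how the two settings differ when computing labeled (LAS) and unlabeled (UAS) attachment scores. It is not the official evaluation script; the punctuation test (every character of the form belongs to a Unicode punctuation category) is an assumption about how such tokens are identified, and the example tokens are made up.

```python
import unicodedata


def is_punctuation(form):
    """True if every character of the form is Unicode punctuation."""
    return all(unicodedata.category(char).startswith("P") for char in form)


def attachment_scores(gold, predicted, skip_punctuation=False):
    """Return (LAS, UAS) in percent.

    gold and predicted are parallel lists of (form, head, deprel) tuples.
    With skip_punctuation=True, punctuation tokens are not scored.
    """
    total = labeled = unlabeled = 0
    for (form, gold_head, gold_rel), (_, pred_head, pred_rel) in zip(gold, predicted):
        if skip_punctuation and is_punctuation(form):
            continue
        total += 1
        if gold_head == pred_head:
            unlabeled += 1
            if gold_rel == pred_rel:
                labeled += 1
    return 100 * labeled / total, 100 * unlabeled / total


if __name__ == "__main__":
    # Made-up gold and predicted analyses of a three-token sentence.
    gold = [("slovo", 2, "Sb"), ("škodí", 0, "Pred"), (".", 0, "AuxK")]
    pred = [("slovo", 2, "Obj"), ("škodí", 0, "Pred"), (".", 2, "AuxK")]
    print(attachment_scores(gold, pred, skip_punctuation=True))
    print(attachment_scores(gold, pred, skip_punctuation=False))
```

In this toy case the misattached full stop lowers both scores only when punctuation is counted, which illustrates why the scoring setting has to be stated next to any reported number.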
 + 
 +The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. 
 + 
 +The results of the CoNLL 2009 shared task are [[http://ufal.mff.cuni.cz/conll2009-st/results/results.php|available online]]. They have been published in [[http://aclweb.org/anthology/W/W09/W09-1201.pdf|(Hajič et al., 2009)]]. Unlabeled attachment score was not published. These are the best results for Czech: 
 + 
 +^ Parser (Authors) ^ LAS ^ 
 +| Merlo (Gesmundo et al.) | 80.38 | 
 +| Bohnet | 80.11 | 
 +| Che et al. | 80.01 | 
 + 
 +===== Danish (da) ===== 
 + 
 +[[http://www.buch-kromann.dk/matthias/treebank/|Danish Dependency Treebank]] (DDT) 
 + 
 +==== Versions ==== 
 + 
 +  * Original DDT 1.0 in the [[http://www.tei-c.org/index.xml|TEI-based]] [[http://www.buch-kromann.dk/matthias/dtag/|DTAG]] or [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|Tiger-XML]] format 
 +  * CoNLL 2006 
 + 
 +The original DDT is based on [[http://www.buch-kromann.dk/matthias/files/diss-Dec05.pdf|Discontinuous Grammar]]. It natively encodes dependencies and other relations such as anaphora. The CoNLL version contains only the dependency relations. 
 + 
 +==== Obtaining and License ==== 
 + 
 +DDT is available under the [[http://www.gnu.org/licenses/gpl-2.0.html|GNU General Public License version 2]]. Download the original distribution (DTAG + TIGER-XML formats) from http://www.buch-kromann.dk/matthias/treebank/. Download the CoNLL 2006 conversion from http://ilk.uvt.nl/conll/free_data.html. The license in short: 
 + 
 +  * any usage, commercial or not 
 +  * modification and redistribution under same license permitted 
 +  * citation in publications not required (but it is common decency) 
 + 
 +DDT was created by members of the [[http://www.cbs.dk/en/Research/Departments-Centres/Institutter/ISV|Department of International Language Studies and Computational Linguistics]], Copenhagen Business School (Handelshøjskolen København), Dalgas Have 15, DK-2000 Frederiksberg, Denmark. The underlying [[http://korpus.dsl.dk/e-resurser/vilkaar.php?lang=|PAROLE]] corpus (morphologically annotated) was created by the [[http://www.dsl.dk/|Society for Danish Language and Literature]] (Det Danske Sprog- og Litteraturselskab), Christians Brygge 1, DK-1219 København K, Denmark. 
 + 
 +==== References ==== 
 + 
 +  * Website 
 +    * http://www.buch-kromann.dk/matthias/treebank/ (the old and no longer accessible website from <nowiki>http://www.id.cbs.dk/~mtk/</nowiki> has been moved here) 
 +  * Data 
 +    * //no separate citation// 
 +  * Principal publications 
 +    * Matthias Trautner Kromann: [[http://www.buch-kromann.dk/matthias/files/030730-tlt-norfa.pdf|The Danish Dependency Treebank and the DTAG Treebank Tool]]. In: Proceedings of Treebanks and Linguistic Theories, Växjö, Sweden, 2003. 
 +  * Documentation 
 +    * //see the left-hand-side links at the treebank website, e.g.:// 
 +    * [[http://www.buch-kromann.dk/matthias/treebank/theory.html|Dependency theory and list of dependency relation labels]] 
 +    * Britt Keson: [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|Vejledning til det danske morfosyntaktisk taggede PAROLE-korpus]] (morphosyntactic tags). Det Danske Sprog- og Litteraturselskab (DSL) 
 + 
 +==== Domain ==== 
 + 
 +Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from a wide range of randomly selected linguistically representative Danish texts from 1983-1992.”) 
 + 
 +==== Size ==== 
 + 
 +The CoNLL 2006 version contains 100,238 tokens in 5512 sentences, yielding 18.19 tokens per sentence on average (CoNLL 2006 data split: 94386 tokens / 5190 sentences training, 5852 tokens / 322 sentences test). 
 + 
 +==== Inside ==== 
 + 
 +The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset. 
 + 
 +The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually. 
 + 
 +Some multi-word expressions have been collapsed into one token, using the underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = last Saturday) but not named entities. 
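
A minimal sketch of how the FEAT column and the collapsed tokens described above could be unpacked. The function names are hypothetical, and reading a slash as a list of alternative values (compare ''[CN]'' in the DTAG ''msd'' attribute with ''gender=common/neuter'') is an assumption drawn from the samples, not from the DDT documentation.

```python
def parse_feats(feats):
    """Turn a CoNLL FEAT string such as 'gender=common|number=plur' into a dict.

    Each value becomes a list, because values like 'common/neuter' appear to
    enumerate alternatives (assumption based on the sample rows).
    """
    if feats == "_":
        return {}
    result = {}
    for item in feats.split("|"):
        name, _, value = item.partition("=")
        result[name] = value.split("/")
    return result


def split_mwe(form):
    """Split a collapsed multi-word token on the joining underscore."""
    if form == "_" or "_" not in form:
        return [form]
    return form.split("_")


if __name__ == "__main__":
    print(parse_feats("gender=common/neuter|number=plur|case=unmarked"))
    print(split_mwe("i_lørdags"))
```

Note that the wiki tables wrap the pipe in ''nowiki'' tags; in the data files the features are separated by a bare pipe.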
 + 
 +==== Sample ==== 
 + 
 +The first sentence of DDT 1.0 in the DTAG format: 
 + 
 +<code xml><tei.2> 
 +  <teiHeader type=text> 
 +    <fileDesc> 
 +      <titleStmt> 
 +        <title>Tagged sample of: 'Jeltsins skæbnetime'</title> 
 +      </titleStmt> 
 +      <extent words=158>158 running words</extent> 
 +      <publicationStmt> 
 +         <distributor>PAROLE-DK</distributor> 
 +         <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> 
 +         <date>1998-06-02</date> 
 +         <availability status=restricted><p>by agreement with distributor</availability> 
 +      </publicationStmt> 
 +      <sourceDesc> 
 +        <biblStruct> 
 +          <analytic> 
 +            <title>Jeltsins skæbnetime</title> 
 +            <author gender=m born=1925>Nikulin, Leon</author> 
 +          </analytic> 
 +          <monogr> 
 +            <imprint><pubPlace>Denmark</pubPlace> 
 +              <publisher>Det Fri Aktuelt</publisher> 
 +              <date>1992-12-01</date> 
 +            </imprint> 
 +          </monogr> 
 +        </biblStruct> 
 +      </sourceDesc> 
 +    </fileDesc> 
 +    <profileDesc> 
 +      <creation>1992-12-01</creation> 
 +      <langUsage><language>Danish</langUsage> 
 +      <textClass> 
 +        <catRef target="P.M2"> 
 +        <catRef target="P.G4.8"> 
 +        <catRef target="P.T9.3"> 
 +      </textClass> 
 +    </profileDesc> 
 +  </teiHeader> 
 +<text id=AJK> 
 +<body> 
 +<div1 type=main> 
 +<p> 
 +<s> 
 +<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> 
 +<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> 
 +<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> 
 +<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> 
 +<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> 
 +<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> 
 +<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> 
 +<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> 
 +<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> 
 +<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> 
 +<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> 
 +<W lemma="," msd="XP" in="-2:pnct" out="">,</W> 
 +<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> 
 +<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> 
 +<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> 
 +<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> 
 +<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> 
 +<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> 
 +<W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W> 
 +<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> 
 +<W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W> 
 +<W lemma="." msd="XP" in="-12:pnct" out="">.</W> 
 +</s></code> 
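
Comparing the DTAG sample with the CoNLL test sentence below suggests that the ''in'' attribute stores the head as an offset relative to the current word (''To'' has ''in="9:subj"'' and head 10 in CoNLL). The following sketch converts such offsets into absolute head indices. It is an illustration under that assumption: the regular expression, the function name and the treatment of bracketed labels such as ''[subj]'' as secondary edges to be skipped are not taken from the DTAG documentation.

```python
import re

# Matches one <W ...>form</W> element and captures the "in" attribute and the form.
W_TAG = re.compile(r'<W [^>]*\bin="([^"]*)"[^>]*>([^<]*)</W>')


def dtag_heads(lines):
    """Return (form, head, label) triples with 1-based absolute heads; 0 marks the root."""
    rows = []
    for line in lines:
        match = W_TAG.search(line)
        if not match:
            continue
        incoming, form = match.groups()
        position = len(rows) + 1
        head, label = 0, "ROOT"  # default when the word has no incoming edge
        for edge in incoming.split("|"):
            if not edge:
                continue
            offset, _, relation = edge.partition(":")
            if relation.startswith("["):
                continue  # bracketed label: assumed to be a secondary edge
            head, label = position + int(offset), relation
            break
        rows.append((form, head, label))
    return rows


if __name__ == "__main__":
    sample = [
        '<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W>',
        '<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W>',
    ]
    for row in dtag_heads(sample):
        print(row)
```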
 + 
 +The first sentence of the CoNLL 2006 training data: 
 + 
 +| 1 | Samme | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | _ | _ | 
 +| 2 | cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | 
 +| 3 | , | _ | X | XP | _ | 1 | pnct | _ | _ | 
 +| 4 | de | _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 7 | subj | _ | _ | 
 +| 5 | norske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 4 | mod | _ | _ | 
 +| 6 | piger | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 4 | nobj | _ | _ | 
 +| 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active | 1 | rel | _ | _ | 
 +| 8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ | 
 +| 9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ | 
 +| 10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ | 
 +| 11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ | 
 +| 12 | værtsnation | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ | 
 +| 13 | . | _ | X | XP | _ | 1 | pnct | _ | _ | 
 + 
 +The first sentence of the CoNLL 2006 test data: 
 + 
 +| 1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ | 
 +| 2 | kendte | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | 
 +| 3 | russiske | _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 1 | mod | _ | _ | 
 +| 4 | historikere | _ | N | NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 1 | nobj | _ | _ | 
 +| 5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ | 
 +| 6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ | 
 +| 7 | og | _ | C | CC | _ | 6 | coord | _ | _ | 
 +| 8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ | 
 +| 9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ | 
 +| 10 | tror | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 0 | ROOT | _ | _ | 
 +| 11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ | 
 +| 12 | , | _ | X | XP | _ | 10 | pnct | _ | _ | 
 +| 13 | at | _ | C | CS | _ | 10 | dobj | _ | _ | 
 +| 14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ | 
 +| 15 | kan | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj | _ | _ | 
 +| 16 | udvikles | _ | V | VA | mood=infin<nowiki>|</nowiki>voice=passive | 15 | vobj | _ | _ | 
 +| 17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ | 
 +| 18 | en | _ | P | PI | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj | _ | _ | 
 +| 19 | " | _ | X | XP | _ | 20 | pnct | _ | _ | 
 +| 20 | jernnæve | _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 18 | nobj | _ | _ | 
 +| 21 | " | _ | X | XP | _ | 20 | pnct | _ | _ | 
 +| 22 | . | _ | X | XP | _ | 10 | pnct | _ | _ | 
 + 
 +==== Parsing ==== 
 + 
 +Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%). 
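
The count above depends on what exactly is called a nonprojective attachment. A minimal sketch, assuming the common definition that a token is attached nonprojectively when some token lying between it and its head is not dominated by that head; the function name is hypothetical and this is not necessarily the script that produced the figures on this page.

```python
def nonprojective_tokens(heads):
    """heads[i - 1] is the head of token i (1-based ids, 0 = artificial root).

    Returns the ids of tokens whose incoming arc is nonprojective.
    """
    def dominated_by(node, ancestor):
        steps = 0
        # Walk up towards the root; the step limit guards against cyclic input.
        while node != 0 and steps <= len(heads):
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
        return False

    result = []
    for dependent, head in enumerate(heads, start=1):
        if head == 0:
            continue  # arcs from the artificial root are not checked here
        low, high = sorted((dependent, head))
        if any(not dominated_by(k, head) for k in range(low + 1, high)):
            result.append(dependent)
    return result


if __name__ == "__main__":
    # Token 2 depends on token 4, but token 3 in between is the root.
    print(nonprojective_tokens([3, 4, 0, 3]))
```

Dividing the number of returned ids, summed over all sentences, by the total token count gives a percentage comparable to the one quoted above.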
 + 
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish: 
 + 
 +^ Parser (Authors) ^ LAS ^ UAS ^ 
 +| MST (McDonald et al.) | 84.79 | 90.58 | 
 +| Malt (Nivre et al.) | 84.77 | 89.80 | 
 +| Riedel et al. | 83.63 | 89.66 | 
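
For readers who want to reproduce scores of this kind, the following sketch computes LAS and UAS while skipping punctuation tokens. It assumes that a token counts as punctuation when every character of its form belongs to a Unicode punctuation category; the official evaluation script may differ in detail, and the function names are hypothetical.

```python
import unicodedata


def is_punctuation(form):
    """True if every character of the form is in a Unicode punctuation category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """gold and predicted are parallel lists of (form, head, deprel) triples.

    Returns (LAS, UAS) in percent, computed over non-punctuation tokens only.
    """
    scored = uas = las = 0
    for (form, gold_head, gold_rel), (_, pred_head, pred_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if gold_head == pred_head:
            uas += 1
            if gold_rel == pred_rel:
                las += 1
    if scored == 0:
        return 0.0, 0.0
    return 100.0 * las / scored, 100.0 * uas / scored


if __name__ == "__main__":
    gold = [("tror", 0, "ROOT"), ("ikke", 1, "mod"), (",", 1, "pnct")]
    pred = [("tror", 0, "ROOT"), ("ikke", 1, "dobj"), (",", 2, "pnct")]
    print(attachment_scores(gold, pred))
```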
 + 
 +===== German (de) ===== 
 + 
 +[[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/|TIGER Treebank]] 
 + 
 +==== Versions ==== 
 + 
 +  * TIGER Treebank 1 (2003) 
 +  * TIGER Treebank 2 (2005) 
 +  * TIGER Treebank 2.1 (2007) in [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] or Negra export (text) format 
 +  * CoNLL 2006 
 +  * CoNLL 2009 
 + 
 +==== Obtaining and License ==== 
 + 
 +The TIGER Treebank is freely downloadable after you accept the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/license/htmllicense.shtml|license terms]] by pressing a button. 
 + 
 +Republication of the two CoNLL versions in LDC is planned but it has not happened yet. 
 + 
 +The license in short: 
 + 
 +  * non-commercial research and evaluation usage by academic or educational institutions 
 +  * no redistribution 
 +  * acknowledge the use of the corpus in publications 
 + 
 +The TIGER Treebank was created by members of three institutes: 
 +  * [[http://www.coli.uni-saarland.de/|Department of Computational Linguistics and Phonetics]] (Computerlinguistik, CoLi), Saarland University (Universität des Saarlandes), Postfach 151150, D-66041 Saarbrücken, Germany. 
 +  * [[http://www.ims.uni-stuttgart.de/|Institute for Natural Language Processing]] (Institut für Maschinelle Sprachverarbeitung, IMS), University of Stuttgart (Universität Stuttgart), Azenbergstraße 12, D-70174 Stuttgart, Germany. 
 +  * [[http://www.uni-potsdam.de/germanistik/|German Department]] (Institut für Germanistik), Philosophische Fakultät, Universität Potsdam, Am Neuen Palais 10, Haus 05, D-14469 Potsdam, Germany. 
 + 
 +==== References ==== 
 + 
 +  * Website 
 +    * http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/ 
 +  * Data 
 +    * //no separate citation// 
 +  * Principal publications 
 +    * Sabine Brants, Stefanie Dipper, Silvia Hansen, Wolfgang Lezius, George Smith: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/paper/treeling2002.pdf|The TIGER Treebank]]. In: Proceedings of the Workshop on Treebanks and Linguistic Theories (TLT), Sozopol, Bulgaria, 2002. 
 +    * [[http://www.ims.uni-stuttgart.de/projekte/TIGER/paper/|List of publications]] 
 +  * [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/annotation/|Documentation]] 
 +    * [[http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html|Stuttgart-Tübingen Tagset]] (part of speech) 
 +    * Berthold Crysmann, Silvia Hansen-Schirra, George Smith, Dorothea Ziegler-Eisele: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/annotation/tiger_scheme-morph.pdf|TIGER Morphologie-Annotationsschema]], 2005. 
 +    * Stefanie Albert, Jan Anderssen, Regine Bader, Stephanie Becker, Tobias Bracht, Sabine Brants, Thorsten Brants, Vera Demberg, Stefanie Dipper, Peter Eisenberg, Silvia Hansen, Hagen Hirschmann, Juliane Janitzek, Carolin Kirstein, Robert Langner, Lukas Michelbacher, Oliver Plaehn, Cordula Preis, Marcus Pußel, Marco Rower, Bettina Schrader, Anne Schwartz, George Smith, Hans Uszkoreit: [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/annotation/tiger_scheme-syntax.pdf|TIGER Annotationsschema]] //(syntax)//, 2003. 
 +    * The header of the XML version of the TIGER Treebank contains lists of various sorts of tags with brief explanations. 
 + 
 +==== Domain ==== 
 + 
 +Mostly newswire (Frankfurter Rundschau). 
 + 
 +==== Size ==== 
 + 
 +According to their website, the TIGER Treebank version 1 contains approximately 700,000 tokens in 40,000 sentences. Version 2.1 contains approximately 900,000 tokens in 50,000 sentences. 
 + 
 +The CoNLL 2006 version contains 705,304 tokens in 39573 sentences, yielding 17.82 tokens per sentence on average (CoNLL 2006 data split: 699,610 tokens / 39216 sentences training, 5694 tokens / 357 sentences test). 
 + 
 +The CoNLL 2009 version contains 712,332 tokens in 40020 sentences, yielding 17.80 tokens per sentence on average (CoNLL 2009 data split: 648,677 tokens / 36020 sentences training, 32033 tokens / 2000 sentences development, 31622 tokens / 2000 sentences test). 
 + 
 +==== Inside ==== 
 + 
 +All versions contain //semi-automatic// part of speech tags ([[http://www.ims.uni-stuttgart.de/projekte/corplex/TagSets/stts-table.html|Stuttgart-Tübingen Tagset]], STTS) and syntactic structure. Lemmas and morphosyntactic features are available only for newer versions (TIGER Treebank version 2 and onwards, and CoNLL 2009). The parts of speech are heavily context-dependent, e.g. many words can be used both substantively (pronouns) and attributively (determiners), which is distinguished by different POS tags. 
 + 
 +It is not clear what the //semi-automatic// annotation means (probably first auto-tagging, then manual correction?) and whether it also applies to the morphosyntactic annotation. The CoNLL 2009 version also contains automatically disambiguated lemmas, tags and features. 
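
The gold-standard and the automatically predicted values sit in separate columns of the CoNLL 2009 files. The sketch below names the columns following the CoNLL 2009 shared task format (tab-separated fields; predicted columns carry a ''P'' prefix) and reads one row of the training sample shown further down. The function name and the ''APREDS'' key are illustrative choices.

```python
CONLL2009_COLUMNS = [
    "ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
    "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED",
]


def parse_conll2009_line(line):
    """Map the tab-separated fields of one token line to column names.

    Any fields beyond the 14 named columns are kept as a list under 'APREDS'.
    """
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(CONLL2009_COLUMNS, fields))
    row["APREDS"] = fields[len(CONLL2009_COLUMNS):]
    return row


if __name__ == "__main__":
    line = "2\tRoss\tRoss\tRoß\tNE\tNN\tNom|Sg|Masc\t_\t3\t3\tPNC\tPNC\t_\t_"
    row = parse_conll2009_line(line)
    # Gold lemma and tag versus the predicted ones.
    print(row["LEMMA"], row["PLEMMA"], row["POS"], row["PPOS"])
```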
 + 
 +The original treebank is phrase-based. The dependencies in the CoNLL versions must therefore have been derived using a head-selection procedure. Besides the CoNLL data, the TIGER project also provides a subset of the TIGER Treebank in a dependency format. 
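
To illustrate what a head-selection procedure does, here is a small sketch over the nonterminals of the first TIGER sentence (see the Sample section). The rule used is a guess, not the procedure of the CoNLL organizers: take the child with the ''HD'' edge as head, otherwise fall back to the rightmost child. With this rule the output happens to agree with the heads in the CoNLL 2009 training sample; the CoNLL 2006 sample treats the proper noun phrase differently (''Ross'' heads ''Perot'' there).

```python
# Nonterminals of the sample sentence: id -> (category, [(edge label, child id), ...]).
# The VROOT node with the quotation marks is left out for brevity.
NONTERMINALS = {
    "s1_500": ("PN", [("PNC", "s1_2"), ("PNC", "s1_3")]),
    "s1_501": ("NP", [("NK", "s1_6"), ("NK", "s1_7"), ("NK", "s1_8")]),
    "s1_502": ("S", [("SB", "s1_500"), ("HD", "s1_4"), ("MO", "s1_5"), ("PD", "s1_501")]),
}


def lexical_head(node):
    """Follow head children down to a terminal id."""
    while node in NONTERMINALS:
        _, edges = NONTERMINALS[node]
        head_children = [child for label, child in edges if label == "HD"]
        # Fallback when no HD edge exists: rightmost child (assumption).
        node = head_children[0] if head_children else edges[-1][1]
    return node


def dependencies(root):
    """Return {terminal id: (head terminal id or None, label)} for the subtree under root."""
    deps = {lexical_head(root): (None, "ROOT")}
    stack = [root]
    while stack:
        node = stack.pop()
        if node not in NONTERMINALS:
            continue
        head_terminal = lexical_head(node)
        for label, child in NONTERMINALS[node][1]:
            child_terminal = lexical_head(child)
            if child_terminal != head_terminal:
                # The non-head child attaches to the head of the phrase,
                # keeping the label of its edge in the phrase structure.
                deps[child_terminal] = (head_terminal, label)
            stack.append(child)
    return deps


if __name__ == "__main__":
    for terminal, (head, label) in sorted(dependencies("s1_502").items()):
        print(terminal, head, label)
```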
 + 
 +==== Sample ==== 
 + 
 +The first sentence of TIGER Treebank 2.1 in the TIGER-XML format: 
 + 
 +<code xml><s id="s1"> 
 +  <graph root="s1_VROOT"> 
 +    <terminals> 
 +      <t id="s1_1" word="``" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" /> 
 +      <t id="s1_2" word="Ross" lemma="Ross" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> 
 +      <t id="s1_3" word="Perot" lemma="Perot" pos="NE" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> 
 +      <t id="s1_4" word="wäre" lemma="sein" pos="VAFIN" morph="3.Sg.Past.Subj" case="--" number="Sg" gender="--" person="3" degree="--" tense="Past" mood="Subj" /> 
 +      <t id="s1_5" word="vielleicht" lemma="vielleicht" pos="ADV" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" /> 
 +      <t id="s1_6" word="ein" lemma="ein" pos="ART" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> 
 +      <t id="s1_7" word="prächtiger" lemma="prächtig" pos="ADJA" morph="Pos.Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="Pos" tense="--" mood="--" /> 
 +      <t id="s1_8" word="Diktator" lemma="Diktator" pos="NN" morph="Nom.Sg.Masc" case="Nom" number="Sg" gender="Masc" person="--" degree="--" tense="--" mood="--" /> 
 +      <t id="s1_9" word="''" lemma="--" pos="$(" morph="--" case="--" number="--" gender="--" person="--" degree="--" tense="--" mood="--" /> 
 +    </terminals> 
 +    <nonterminals> 
 +      <nt id="s1_500" cat="PN"> 
 +        <edge label="PNC" idref="s1_2" /> 
 +        <edge label="PNC" idref="s1_3" /> 
 +      </nt> 
 +      <nt id="s1_501" cat="NP"> 
 +        <edge label="NK" idref="s1_6" /> 
 +        <edge label="NK" idref="s1_7" /> 
 +        <edge label="NK" idref="s1_8" /> 
 +      </nt> 
 +      <nt id="s1_502" cat="S"> 
 +        <edge label="SB" idref="s1_500" /> 
 +        <edge label="HD" idref="s1_4" /> 
 +        <edge label="MO" idref="s1_5" /> 
 +        <edge label="PD" idref="s1_501" /> 
 +      </nt> 
 +      <nt id="s1_VROOT" cat="VROOT"> 
 +        <edge label="--" idref="s1_1" /> 
 +        <edge label="--" idref="s1_502" /> 
 +        <edge label="--" idref="s1_9" /> 
 +      </nt> 
 +    </nonterminals> 
 +  </graph> 
 +</s></code> 
 + 
 +The first sentence of the CoNLL 2006 training data: 
 + 
 +| 1 | `` | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | 
 +| 2 | Ross | _ | NE | NE | _ | 4 | SB | 4 | SB | 
 +| 3 | Perot | _ | NE | NE | _ | 2 | PNC | 2 | PNC | 
 +| 4 | wäre | _ | VAFIN | VAFIN | _ | 0 | ROOT | 0 | ROOT | 
 +| 5 | vielleicht | _ | ADV | ADV | _ | 4 | MO | 4 | MO | 
 +| 6 | ein | _ | ART | ART | _ | 8 | NK | 8 | NK | 
 +| 7 | prächtiger | _ | ADJA | ADJA | _ | 8 | NK | 8 | NK | 
 +| 8 | Diktator | _ | NN | NN | _ | 4 | PD | 4 | PD | 
 +| 9 | <nowiki>''</nowiki> | _ | $( | $( | _ | 4 | PUNC | 4 | PUNC | 
 + 
 +The first sentence of the CoNLL 2006 test data: 
 + 
 +| 1 | Zwei | _ | CARD | CARD | _ | 2 | NK | 2 | NK | 
 +| 2 | Themen | _ | NN | NN | _ | 14 | SB | 14 | SB | 
 +| 3 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | 
 +| 4 | die | _ | PRELS | PRELS | _ | 8 | OA | 8 | OA | 
 +| 5 | Perot | _ | NE | NE | _ | 8 | SB | 8 | SB | 
 +| 6 | immer | _ | ADV | ADV | _ | 7 | MO | 7 | MO | 
 +| 7 | wieder | _ | ADV | ADV | _ | 8 | MO | 8 | MO | 
 +| 8 | anspricht | _ | VVFIN | VVFIN | _ | 2 | RC | 2 | RC | 
 +| 9 | , | _ | $, | $, | _ | 2 | PUNC | 2 | PUNC | 
 +| 10 | Rezession | _ | NN | NN | _ | 2 | APP | 2 | APP | 
 +| 11 | und | _ | KON | KON | _ | 10 | CD | 10 | CD | 
 +| 12 | Bürokratie | _ | NN | NN | _ | 10 | CJ | 10 | CJ | 
 +| 13 | , | _ | $, | $, | _ | 14 | PUNC | 14 | PUNC | 
 +| 14 | machen | _ | VVFIN | VVFIN | _ | 0 | ROOT | 0 | ROOT | 
 +| 15 | ihnen | _ | PPER | PPER | _ | 18 | DA | 18 | DA | 
 +| 16 | besonders | _ | ADV | ADV | _ | 18 | MO | 18 | MO | 
 +| 17 | zu | _ | PTKZU | PTKZU | _ | 18 | PM | 18 | PM | 
 +| 18 | schaffen | _ | VVINF | VVINF | _ | 14 | OC | 14 | OC | 
 +| 19 | . | _ | $. | $. | _ | 14 | PUNC | 14 | PUNC | 
 + 
 +The first sentence of the CoNLL 2009 training data: 
 + 
 +| 1 | `` | _ | `` | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | 
 +| 2 | Ross | Ross | Roß | NE | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | _ | 3 | 3 | PNC | PNC | _ | _ | 
 +| 3 | Perot | Perot | Perot | NE | NE | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | _ | 4 | 4 | SB | SB | _ | _ | 
 +| 4 | wäre | sein | sein | VAFIN | VAFIN | 3<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Past<nowiki>|</nowiki>Subj | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Past<nowiki>|</nowiki>Subj | 0 | 0 | ROOT | ROOT | _ | _ | 
 +| 5 | vielleicht | vielleicht | vielleicht | ADV | ADV | _ | _ | 4 | 4 | MO | MO | _ | _ | 
 +| 6 | ein | ein | ein | ART | ART | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | 8 | 8 | NK | NK | _ | _ | 
 +| 7 | prächtiger | prächtig | prächtig | ADJA | ADJA | Pos<nowiki>|</nowiki>Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>*<nowiki>|</nowiki>* | 8 | 8 | NK | NK | _ | _ | 
 +| 8 | Diktator | Diktator | Diktator | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | 4 | 4 | PD | PD | _ | _ | 
 +| 9 | <nowiki>''</nowiki> | _ | <nowiki>''</nowiki> | $( | $( | _ | _ | 4 | 4 | PUNC | PUNC | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 development data: 
 + 
 +| 1 | Maschinenbau | Maschinenbau | Maschinenbau | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Masc | 0 | 4 | ROOT | NK | _ | _ | 
 +| 2 | / | _ | / | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | 
 +| 3 | ( | _ | ( | $( | $( | _ | _ | 0 | 4 | PUNC | PUNC | _ | _ | 
 +| 4 | Zusammenfassung | Zusammenfassung | Zusammenfassung | NN | NN | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | 0 | 0 | ROOT | ROOT | _ | _ | 
 +| 5 | ) | _ | ) | $( | $( | _ | _ | 0 | 1 | PUNC | PUNC | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 test data: 
 + 
 +| 1 | Gegen | gegen | gegen | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
 +| 2 | eine | ein | ein | ART | ART | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
 +| 3 | Erweiterung | Erweiterung | Erweiterung | NN | NN | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
 +| 4 | ihrer | ihr | ihr | PPOSAT | PPOSAT | Gen<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 5 | Organisation | Organisation | Organisation | NN | NN | Gen<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
 +| 6 | zu | zu | zu | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
 +| 7 | einem | ein | ein | ART | ART | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 8 | sicherheitspolitischen | sicherheitspolitisch | sicherheitspolitisch | ADJA | ADJA | Pos<nowiki>|</nowiki>Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Pos<nowiki>|</nowiki>*<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 9 | Forum | Forum | Forum | NN | NN | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ | 
 +| 10 | sprachen | sprechen | sprechen | VVFIN | VVFIN | 3<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Past<nowiki>|</nowiki>Ind | *<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Past<nowiki>|</nowiki>Ind | _ | _ | _ | _ | Y | 
 +| 11 | sich | sich | er<nowiki>|</nowiki>es<nowiki>|</nowiki>sie<nowiki>|</nowiki>Sie | PRF | PRF | 3<nowiki>|</nowiki>Acc<nowiki>|</nowiki>Pl | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 12 | die | der | d | ART | ART | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 13 | meisten | meister | meist | PIAT | PIAT | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 14 | Staaten | Staat | Staat | NN | NN | Nom<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | *<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Masc | _ | _ | _ | _ | _ | 
 +| 15 | beim | bei | beim | APPRART | APPRART | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 16 | Gipfeltreffen | Gipfeltreffen | Gipfeltreffen | NN | NN | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>*<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ | 
 +| 17 | für | für | für | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
 +| 18 | Asiatisch-Pazifische | asiatisch-pazifisch | Asiatisch-Pazifische | ADJA | NN | Pos<nowiki>|</nowiki>Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>*<nowiki>|</nowiki>* | _ | _ | _ | _ | _ | 
 +| 19 | Wirtschaftskooperation | Wirtschaftskooperation | Wirtschaftskooperation | NN | NN | Acc<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | 
 +| 20 | ( | _ | ( | $( | $( | _ | _ | _ | _ | _ | _ | _ | 
 +| 21 | Apec | Apec | _ | NE | NE | Nom<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Fem | _ | _ | _ | _ | _ | _ | 
 +| 22 | ) | _ | ) | $( | $( | _ | _ | _ | _ | _ | _ | _ | 
 +| 23 | in | in | in | APPR | APPR | _ | _ | _ | _ | _ | _ | _ | 
 +| 24 | Osaka | Osaka | Osaka | NE | NE | Dat<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | *<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Neut | _ | _ | _ | _ | _ | 
 +| 25 | aus | aus | aus | PTKVZ | PTKVZ | _ | _ | _ | _ | _ | _ | _ | 
 +| 26 | . | _ | . | $. | $. | _ | _ | _ | _ | _ | _ | _ | 
 + 
 +==== Parsing ==== 
 + 
 +TIGER is a mildly nonprojective treebank. 15875 of the 680,710 tokens in the CoNLL 2009 training+development datasets are attached nonprojectively (2.33%). 
 + 
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for German: 
 + 
 +^ Parser (Authors) ^ LAS ^ UAS ^ 
 +| MST (McDonald et al.) | 87.34 | 90.38 | 
 +| Riedel et al. | 86.24 | 89.76 | 
 +| Basis (O'Neil) | 85.36 | 89.16 | 
 +| Malt (Nivre et al.) | 85.82 | 88.76 | 
 + 
 +The results of the CoNLL 2009 shared task are [[http://ufal.mff.cuni.cz/conll2009-st/results/results.php|available online]]. They have been published in [[http://aclweb.org/anthology/W/W09/W09-1201.pdf|(Hajič et al., 2009)]]. Unlabeled attachment score was not published. These are the best results for German: 
 + 
 +^ Parser (Authors) ^ LAS ^ 
 +| Bohnet | 87.48 | 
 +| Merlo | 87.29 | 
 +| Chen | 86.24 | 
 +| Che | 86.19 | 
 + 
 +===== Greek (el) ===== 
 + 
 +Greek Dependency Treebank (GDT) 
 + 
 +==== Versions ==== 
 + 
 +  * CoNLL 2007 
 + 
 +==== Obtaining and License ==== 
 + 
 +There does not seem to be any regular distribution channel for the Greek Dependency Treebank. The CoNLL 2007 version had a restricted license for the duration of the shared task only. Republication of the CoNLL version in LDC is planned but it has not happened yet. In the meantime, one can ask Prokopis Prokopidis (prokopis (at) ilsp (dot) gr) about availability of the corpus. 
 + 
 +GDT was created by members of the [[http://www.ilsp.gr/|Institute for Language and Speech Processing]] (Ινστιτούτο Επεξεργασίας του Λόγου, ILSP/ΙΕΛ), Επιδαύρου & Αρτέμιδος 6, Παράδεισος Αμαρουσίου, GR-15125 Αθήνα, Greece. 
 + 
 +==== References ==== 
 + 
 +  * Website 
 +    * //no website dedicated to the treebank// 
 +  * Data 
 +    * //no separate citation// 
 +  * Principal publications 
 +    * Prokopis Prokopidis, Elina Desipri, Maria Koutsombogera, Harris Papageorgiou, Stelios Piperidis: [[http://www.ilsp.gr/homepages/prokopidis/documents/gdt_tlt2005.pdf|Theoretical and Practical Issues in the Construction of a Greek Dependency Corpus]] In: Montserrat Civit, Sandra Kübler, Ma. Antònia Martí (eds.), Proceedings of The Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005), pp. 149-160, Barcelona, Spain, 2005. 
 +  * Documentation 
 +    * Description of tags and feature values is provided in the ''doc/README'' file in the CoNLL 2007 data distribution. 
 + 
 +==== Domain ==== 
 + 
 +Mixed (“GDT consists of randomly selected textual fragments and texts in three domains: politics (current affairs, manual transcripts and minutes of European parliamentary sessions), health, and travel.”) 
 + 
 +==== Size ==== 
 + 
 +The CoNLL 2007 version contains 70223 tokens in 2902 sentences, yielding 24.20 tokens per sentence on average (CoNLL 2007 data split: 65419 tokens / 2705 sentences training, 4804 tokens / 197 sentences test). 
 + 
 +==== Inside ==== 
 + 
 +The syntactic annotation style and the tagset for dependency relations (analytical functions) in GDT have been modeled after the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|Prague Dependency Treebank]]. 
 + 
 +==== Sample ====
  
 The first sentence of the CoNLL 2007 training data:
  
-| 1 | تَعْدادُ تَعْداد_1 N- Case=1<nowiki>|</nowiki>Defin=R Sb | _ | _ |
-سُكّانِ ساكِن_1 N- Case=2<nowiki>|</nowiki>Defin=R Atr | _ | _ |
-22 [DEFAULT] Q- | _ | | Atr | _ | _ |
-دَوْلَةً دَوْلَة_1 N- Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=4<nowiki>|</nowiki>Defin=I Atr | _ | _ |
-| 5 | عَرَبِيَّةً عَرَبِيّ_1 A- Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=4<nowiki>|</nowiki>Defin=I | Atr | _ | _ |
-| 6 | سَ سَ_FUT F- | _ | AuxM | _ | _ |
-يَرْتَفِعُ اِرْتَفَع_1 VI Mood=I<nowiki>|</nowiki>Voice=A<nowiki>|</nowiki>Person=3<nowiki>|</nowiki>Gender=M<nowiki>|</nowiki>Number=S Pred | _ | _ |
-إِلَى إِلَى_1 P- | _ | AuxP | _ | _ |
-654 [DEFAULT] Q- Adv | _ | _ |
-10 مِلْيُونَ مِلْيُون_1 N- Case=4<nowiki>|</nowiki>Defin=R | Atr | _ | _ |
-11 نَسَمَةٍ نَسَمَة_1 N- Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=2<nowiki>|</nowiki>Defin=I 10 | Atr | _ | _ |
-12 فِي فِي_1 P- | _ | | AuxP | _ | _ |
-13 مُنْتَصَفِ مُنْتَصَف_1 N- Case=2<nowiki>|</nowiki>Defin=R 12 Adv | _ | _ |
-14 القَرْنِ قَرْن_1 N- Case=2<nowiki>|</nowiki>Defin=D 13 | Atr | _ | _ |
 +| 1 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ | 
 +| 2 | Τα | ο | At | AtDf | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Nm | 3 | Atr | _ | _ | 
 +| 3 | αντισώματα | αντίσωμα | No | NoCm | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Nm | 5 | Sb | _ | _ | 
 +| 4 | IgG | IgG | Rg | RgFwOr | _ | 3 | Atr | _ | _ | 
 +| 5 | είναι | είμαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 10 | Obj_Co | _ | _ | 
 +| 6 | σαν | σαν | Ad | Ad | Ba | 5 | Adv | _ | _ | 
 +| 7 | μακροπρόθεσμη | μακροπρόθεσμος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 8 | Atr | _ | _ | 
 +| 8 | μνήμη | μνήμη | No | NoCm | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 6 | Adv | _ | _ | 
 +| 9 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | 
 +| 10 | ενώ | ενώ | Cj | CjCo | _ | 26 | Coord | _ | _ | 
 +| 11 | το | ο | At | AtDf | Ne<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 12 | Atr | _ | _ | 
 +| 12 | IgA | IgA | Rg | RgFwOr | _ | 15 | Sb | _ | _ | 
 +| 13 | πιστεύεται | πιστεύεται | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 10 | Obj_Co | _ | _ | 
 +| 14 | ότι | ότι | Cj | CjSb | _ | 13 | AuxC | _ | _ | 
 +| 15 | είναι | είμαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 14 | Sb | _ | _ | 
 +| 16 | ένας | ένας | At | AtId | Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 18 | Atr | _ | _ | 
 +| 17 | συγκεκριμένος | συγκεκριμένος | Aj | Aj | Ba<nowiki>|</nowiki>Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 18 | Atr | _ | _ | 
 +| 18 | δείκτης | δείκτης | No | NoCm | Ma<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 15 | Pnom | _ | _ | 
 +| 19 | για | για | AsPp | AsPpSp | _ | 18 | AuxP | _ | _ | 
 +| 20 | πρόσφατες | πρόσφατος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 21 | Atr_Co | _ | _ | 
 +| 21 | ή | ή | Cj | CjCo | _ | 23 | Coord | _ | _ | 
 +| 22 | χρόνιες | χρόνιος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 21 | Atr_Co | _ | _ | 
 +| 23 | λοιμώξεις | λοίμωξη | No | NoCm | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 19 | Atr | _ | _ | 
 +| 24 | " | " | PUNCT | PUNCT | _ | 10 | AuxG | _ | _ | 
 +| 25 | , | , | PUNCT | PUNCT | _ | 10 | AuxX | _ | _ | 
 +| 26 | εξηγεί | εξηγώ | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Av<nowiki>|</nowiki>Xx | 0 | Pred | _ | _ | 
 +| 27 | η | ο | At | AtDf | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 28 | Atr | _ | _ | 
 +| 28 | Δρ | Δρ | Rg | RgFwTr | _ | 26 | Sb | _ | _ | 
 +| 29 | Αρκάρι | Αρκάρι | No | NoCm | Ne<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 28 | Atr | _ | _ | 
 +| 30 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ |
  
 The first sentence of the CoNLL 2007 test data:
  
-| 1 | مُقاوَمَةُ مُقاوَمَة_1 N- Gender=F<nowiki>|</nowiki>Number=S<nowiki>|</nowiki>Case=1<nowiki>|</nowiki>Defin=R | 0 | ExD | _ | _ | +| 1 | Η | ο | At | AtDf | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 2 | Atr | _ | _ | 
-زَواجِ زَواج_1 N- Case=2<nowiki>|</nowiki>Defin=R | Atr | _ | _ | +| 2 | Σίφνος | Σίφνος | No | NoPr | Fe<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Nm | 3 | Sb | _ | _ | 
-الطُلّابِ طالِب_1 N- Case=2<nowiki>|</nowiki>Defin=D | Atr | _ | _ | +| 3 | φημίζεται | φημίζομαι | Vb | VbMn | Id<nowiki>|</nowiki>Pr<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Xx<nowiki>|</nowiki>Ip<nowiki>|</nowiki>Pv<nowiki>|</nowiki>Xx | 0 | Pred | _ | _ | 
-العُرْفِيِّ عُرْفِيّ_1 A- Case=2<nowiki>|</nowiki>Defin=D | Atr | _ | _ | +| 4 | και | και | Cj | CjCo | _ | 5 | AuxY | _ | _ | 
 +| 5 | για | για | AsPp | AsPpSp | _ | 3 | AuxP | _ | _ | 
 +| 6 | τα | ο | At | AtDf | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 8 | Atr | _ | _ | 
 +| 7 | καταγάλανα | καταγάλανος | Aj | Aj | Ba<nowiki>|</nowiki>Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 8 | Atr | _ | _ | 
 +| 8 | νερά | νερό | No | NoCm | Ne<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ac | 5 | Obj | _ | _ | 
 +| 9 | των | ο | At | AtDf | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge | 11 | Atr | _ | _ | 
 +| 10 | πανέμορφων | πανέμορφος | Aj | Aj | Ba<nowiki>|</nowiki>Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge | 11 | Atr | _ | _ | 
 +| 11 | ακτών | ακτή | No | NoCm | Fe<nowiki>|</nowiki>Pl<nowiki>|</nowiki>Ge | 8 | Atr | _ | _ | 
 +| 12 | της | μου | Pn | PnPo | Fe<nowiki>|</nowiki>03<nowiki>|</nowiki>Sg<nowiki>|</nowiki>Ge<nowiki>|</nowiki>Xx | 11 | Atr | _ | _ | 
 +| 13 | . | . | PUNCT | PUNCT | _ | 0 | AuxK | _ | _ |
  
 ==== Parsing ====
  
-Nonprojectivities in PADT are rare. Only 431 of the 116,793 tokens in the CoNLL 2007 version are attached nonprojectively (0.37%).+Nonprojectivities in GDT are not frequent. Only 823 of the 70223 tokens in the CoNLL 2007 version are attached nonprojectively (1.17%).
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Arabic: +The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Greek:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 66.91 | 79.34 | +| Nakagawa | 76.31 | 84.08 | 
-| Basis (O'Neil) | 66.71 | 78.54 | +| Keith Hall et al. | 74.21 | 82.04 | 
-| Malt (Nivre et al.) | 66.71 | 77.52 | +| Carreras | 73.56 | 81.37 | 
-| Edinburgh (Riedel et al.) | 66.65 | 78.62 | +| Malt (Nilsson et al.) | 74.65 | 81.22 | 
 +| Titov et al. | 73.52 | 81.20 | 
 +| Chen | 74.42 | 81.16 | 
 +| Duan | 74.29 | 80.77 | 
 +| Attardi et al. | 73.92 | 80.75 | 
 +| Malt (J. Hall et al.) | 74.21 | 80.66 |
  
-The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Arabic: +The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]]. 
 + 
 +===== English (en) ===== 
 + 
 +[[http://www.cis.upenn.edu/~treebank/|Penn Treebank]] 
 + 
 +==== Versions ==== 
 + 
 +  * Penn Treebank 2 (1995) 
 +  * Penn Treebank 3 (1999) 
 +  * CoNLL 2007 
 +  * CoNLL 2008 
 +  * CoNLL 2009 
 + 
 +==== Obtaining and License ==== 
 + 
 +The original Penn Treebank is distributed by the LDC under the catalogue number [[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42|LDC99T42]]. It is free for organizations that were LDC members in 1999; the price for non-members is unknown (contact LDC). The [[http://www.ldc.upenn.edu/Catalog/nonmem_agree/generic.license.html|license]] in short: 
 + 
 +  * non-commercial education and research usage 
 +  * no redistribution 
 +  * citation in publications not explicitly required but it is common decency 
 + 
 +The CoNLL 2007, 2008 and 2009 versions are also licensed by the LDC and LDC members can keep them after the shared task. Those who have not participated in the shared task may inquire at the LDC about the availability of the datasets. Their republication by the LDC is planned but has not happened yet. 
 + 
 +The Penn Treebank was created by members of the [[http://www.cis.upenn.edu/|Department of Computer and Information Science]] (CIS), School of Engineering, University of Pennsylvania, Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104-6309, USA. The constituents-to-dependencies CoNLL 2007 conversion of the treebank was prepared by Ryan McDonald. 
 + 
 +==== References ==== 
 + 
 +  * Website 
 +    * http://www.cis.upenn.edu/~treebank/ 
 +  * Data 
 +    * Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz, Ann Taylor: //Treebank-3// ([[http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC99T42|LDC99T42]]). Linguistic Data Consortium, Philadelphia, USA, 2001. ISBN 1-58563-163-9. 
 +  * Principal publications 
 +    * Mitchell P. Marcus, Beatrice Santorini, Mary Ann Marcinkiewicz: Building a large annotated corpus of English: the Penn Treebank. //Computational Linguistics,// 19(2):313-330. 1993. 
 +  * Documentation 
 +    * [[http://www.cis.upenn.edu/~treebank/tokenization.html|Tokenization]] 
 +    * Beatrice Santorini: [[ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz|Part-of-Speech Tagging Guidelines for the Penn Treebank Project]], 3rd Revision, Philadelphia, USA, 1990. 
 +    * Ann Bies, Mark Ferguson, Karen Katz, Robert MacIntyre: [[ftp://ftp.cis.upenn.edu/pub/treebank/doc/manual/root.ps.gz|Bracketing Guidelines for Treebank II Style, Penn Treebank Project]], Philadelphia, USA, 1995. 
 +    * Robert MacIntyre: [[ftp://ftp.cis.upenn.edu/pub/treebank/doc/faq.cd2|NP Heads and Base NPs]] (Treebank FAQ) 
 +    * Richard Johansson, Pierre Nugues: [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|Extended constituent-to-dependency conversion for English]]. In: Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA), pp. 105-112, Tartu, Estonia, 2007. 
 + 
 +==== Domain ==== 
 + 
 +Financial news from the Wall Street Journal (1989). The constituent-based Treebank-3 also contains parsed versions of ATIS-3 and of the Brown Corpus. Only WSJ texts have been converted to dependencies for the CoNLL shared tasks. 
 + 
 +==== Size ==== 
 + 
 +CoNLL 2007: Wall Street Journal part of the Penn Treebank, sections 2-11 used for training, a subset of section 23 for testing. 
 + 
 +All distributions of PDT are officially split into training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data that are annotated only morphologically (M-layer), data that are annotated both morphologically and analytically (surface syntax; M+A layers), and a smallest subset that is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing. 
 + 
 +The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train the larger models. For CoNLL 2009, only the part of PDT that also contains tectogrammatical annotation was selected, because the 2009 task included semantic learning. 
 + 
 +Parts of the following table have been taken from [[http://ufal.mff.cuni.cz/~zeman/publikace/disertace/thesis.pdf|(Zeman 2004, page 21)]]. Only non-empty sentences are counted (e.g. PDT 1.0 had 81,614 sentence tags but only 73,088 non-empty ones). 
 + 
 +^ Version ^ Train Sentences ^ Train Tokens ^ D-test Sentences ^ D-test Tokens ^ E-test Sentences ^ E-test Tokens ^ Total Sentences ^ Total Tokens ^ Sentence Length ^ 
 +| PDT 0.5 |     19,126 |    327,597 |  3,697 |    63,718 |   3,787 |    65,390 |  26,610 |    456,705 |  17.16 | 
 +| PDT 1.0 |     73,088 |  1,255,590 |  7,319 |  126,030 |   7,507 |  125,713 |  87,914 |  1,489,748 |  16.95 | 
 +| PDT 2.0 |     68,562 |  1,172,299 |  9,270 |  158,962 |  10,148 |  173,586 |  87,980 |  1,504,847 |  17.10 | 
 +| CoNLL 2006 |  72,703 |  1,249,408 |   365 |     5,853 |        |          |  73,068 |  1,255,261 |  17.18 | 
 +| CoNLL 2007 |  25,364 |    432,296 |   286 |     4,724 |        |          |  25,650 |    437,020 |  17.04 | 
 +| CoNLL 2009 |  38,727 |    652,544 |  5,228 |    87,988 |   4,213 |    70,348 |  48,168 |    810,880 |  16.83 | 
 + 
 +==== Inside ==== 
 + 
 +CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in Richard Johansson, Pierre Nugues: [[http://dspace.utlib.ee/dspace/bitstream/handle/10062/2560/reg-Johansson-10.pdf;jsessionid=BB8432D9BAE4FCF9DD9BD746704E796F?sequence=1|Extended constituent-to-dependency conversion for English]]. In: Proceedings of the 16th Nordic Conference on Computational Linguistics (NODALIDA), pp. 105-112, Tartu, Estonia, 2007. 
 + 
 +PDT 1.0 is distributed in the [[::format-csts|CSTS format]]. PDT 2.0 uses the [[::format-pml|PML format]]. CoNLL 2006 and 2007 use the [[:format-conll|CoNLL-X format]]; the CoNLL 2009 format is slightly different (number and meaning of columns). Unlike the other formats, the CSTS format uses the ISO-8859-2 character encoding. 
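
To make the column layout concrete, here is a minimal Python sketch that reads one token line of the CoNLL-X format. It assumes the usual tab-separated file layout with ten columns (the wiki samples on this page show the same columns separated by pipes); the names ''Token'' and ''parse_conllx_line'' are hypothetical and not part of any official tool.

```python
from typing import NamedTuple


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feats: dict
    head: int
    deprel: str


def parse_conllx_line(line: str) -> Token:
    """Parse one tab-separated CoNLL-X token line (10 columns assumed)."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) != 10:
        raise ValueError(f"expected 10 columns, got {len(cols)}")
    feats = {}
    if cols[5] != "_":
        # FEAT is a |-separated list; Czech uses Name=Value pairs,
        # other treebanks may use bare values (stored with an empty value).
        for item in cols[5].split("|"):
            name, _, value = item.partition("=")
            feats[name] = value
    # The last two columns (PHEAD, PDEPREL) are ignored in this sketch.
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7])


# Fourth token of the first CoNLL 2006/2007 training sentence.
line = "4\tslovo\tslovo\tN\tN\tGen=N|Num=S|Cas=1|Neg=A\t3\tExD\t_\t_"
tok = parse_conllx_line(line)
print(tok.form, tok.feats, tok.head, tok.deprel)
```

The CoNLL 2009 files have more columns, so this sketch would reject them rather than misread them.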
 + 
 +The CSTS format (PDT 0.5 and 1.0) contains both manual morphological annotation (lemmas and tags) and automatic annotation by two taggers. The CoNLL 2009 version contains manual and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology. 
 + 
 +The original PDT uses 15-character positional morphological tags. The CoNLL versions convert the tags to the three CoNLL columns CPOS, POS and FEAT. In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to the lemma in PDT (see [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf|Hana and Zeman, 2005]]). 
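
The mapping from a positional tag to the three CoNLL columns can be sketched as follows. This is an illustration reconstructed from the samples on this page, not the actual conversion script: the feature names Gen, Num, Cas, Per, Ten, Gra, Neg and Voi occur in the CoNLL samples, while PGe, PNu and Var are assumed names for the remaining positions. The Sem feature is not handled because it comes from the lemma, not from the tag.

```python
# Zero-based tag position -> CoNLL feature name.
# Positions 0 and 1 become CPOS and POS; positions 12 and 13 are unused.
# PGe, PNu and Var are assumed names (they do not occur in the samples here).
FEATURE_NAMES = {
    2: "Gen", 3: "Num", 4: "Cas", 5: "PGe", 6: "PNu", 7: "Per",
    8: "Ten", 9: "Gra", 10: "Neg", 11: "Voi", 14: "Var",
}


def pdt_tag_to_conll(tag: str):
    """Split a 15-character PDT tag into (CPOS, POS, FEAT)."""
    if len(tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(tag)}")
    feats = [
        f"{name}={tag[i]}"
        for i, name in sorted(FEATURE_NAMES.items())
        if tag[i] != "-"  # a dash means the category does not apply
    ]
    return tag[0], tag[1], "|".join(feats) or "_"


print(pdt_tag_to_conll("AAFS1----2A----"))
print(pdt_tag_to_conll("Cv-------------"))
```

For the tags shown in the PDT 1.0 sample below, the output agrees with the corresponding CoNLL 2006 rows (for example ''rychlejší'' and ''Třikrát'').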
 + 
 +See above for documentation of the morphological tags. All CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=cs::pdt|DZ Interset]] to inspect the PDT and the CoNLL tagsets. 
 + 
 +The guidelines for syntactic annotation are documented in the [[http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/index.html|PDT annotation manual]]. 
 + 
 +==== Sample ==== 
 + 
 +The first sentence of the PDT 1.0 training data: 
 + 
 +<code xml><csts lang=cs> 
 +<h> 
 +<source>Českomoravský profit</source> 
 +<markup> 
 +<mauth>js 
 +<mdate>1996-2000 
 +<mdesc>Manual analytical annotation 
 +</markup> 
 +<markup> 
 +<mauth>kk,lk 
 +<mdate>1996-2000 
 +<mdesc>Manual morphological annotation 
 +</markup> 
 +</h> 
 +<doc file="s/inf/j/1994/cmpr9406" id="001"> 
 +<a> 
 +<mod>
 +<txtype>inf 
 +<genre>mix 
 +<med>
 +<temp>1994 
 +<authname>
 +<opus>cmpr9406 
 +<id>001 
 +</a> 
 +<c> 
 +<p n=1> 
 +<s id="cmpr9406:001-p1s1"> 
 +<p n=2> 
 +<s id="cmpr9406:001-p2s1"> 
 +<f cap>Třikrát<l>třikrát`3<t>Cv-------------<MDl src="a">třikrát`3<MDt src="a">Cv-------------<MDl src="b">třikrát`3<MDt src="b">Cv-------------<A>Adv<r>1<g>
 +<f>rychlejší<l>rychlý<t>AAFS1----2A----<MDl src="a">rychlý<MDt src="a">AANS1----2A----<MDl src="b">rychlý<MDt src="b">AAFS1----2A----<A>ExD<r>2<g>
 +<f>než<l>než-2<t>J,-------------<MDl src="a">než-2<MDt src="a">J,-------------<MDl src="b">než-2<MDt src="b">J,-------------<A>AuxC<r>3<g>
 +<f>slovo<l>slovo<t>NNNS1-----A----<MDl src="a">slovo<MDt src="a">NNNS4-----A----<MDl src="b">slovo<MDt src="b">NNNS1-----A----<A>ExD<r>4<g>3</code> 
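
A token line of the CSTS sample above can be picked apart with a regular expression. The sketch below is a rough illustration only: the pattern and the function name ''parse_csts_token'' are hypothetical, it reads just the elements visible in the sample (''<f>''/''<d>'', ''<l>'', ''<t>'', ''<A>'', ''<r>'', ''<g>''), and it is not a full SGML parser. Real CSTS files would have to be opened with ''encoding="iso-8859-2"''.

```python
import re

# <f ...> is a word, <d> is punctuation; <A> = analytical function,
# <r> = word order, <g> = order of the governing node.
TOKEN_RE = re.compile(
    r"^<(?P<kind>[fd])(?:\s[^>]*)?>(?P<form>[^<]*)"
    r".*?<A>(?P<afun>[^<]*)"
    r"<r>(?P<ord>\d+)"
    r"<g>(?P<head>\d+)"
)


def parse_csts_token(line: str):
    """Return a dict for a CSTS token line, or None for any other line."""
    m = TOKEN_RE.match(line.strip())
    if m is None:
        return None
    lemma = re.search(r"<l>([^<]*)", line)  # manual lemma, if present
    tag = re.search(r"<t>([^<]*)", line)    # manual tag, if present
    return {
        "form": m.group("form"),
        "lemma": lemma.group(1) if lemma else None,
        "tag": tag.group(1) if tag else None,
        "afun": m.group("afun"),
        "ord": int(m.group("ord")),
        "head": int(m.group("head")),
    }


sample = ('<f>slovo<l>slovo<t>NNNS1-----A----<MDl src="a">slovo'
          '<MDt src="a">NNNS4-----A----<MDl src="b">slovo'
          '<MDt src="b">NNNS1-----A----<A>ExD<r>4<g>3')
print(parse_csts_token(sample))
```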
 + 
 +The first two sentences of the PDT 1.0 d-test data: 
 + 
 +<code xml><csts lang=cs> 
 +<h> 
 +<source>Lidové noviny</source> 
 +<markup> 
 +<mauth>zu 
 +<mdate>1996-2000 
 +<mdesc>Manual analytical annotation 
 +</markup> 
 +</h> 
 +<doc file="s/pub/nws/1994/ln94206" id="1"> 
 +<a> 
 +<mod>
 +<txtype>pub 
 +<genre>mix 
 +<med>nws 
 +<temp>1994 
 +<authname>
 +<opus>ln94206 
 +<id>
 +</a> 
 +<c> 
 +<p n=1> 
 +<s id="ln94206:1-p1s1"> 
 +<i>ti 
 +<f cap>Lidé<MDl src="a">člověk<MDt src="a">NNMP1-----A---1<MDl src="b">člověk<MDt src="b">NNMP1-----A---1<A>ExD<r>1<g>
 +<p n=2> 
 +<s id="ln94206:1-p2s1"> 
 +<f upper.abbr>ING<MDl src="a">Ing-1_:B_^(inženýr)<MDt src="a">NNMXX-----A---8<MDl src="b">Ing-1_:B_^(inženýr)<MDt src="b">NNMXX-----A---8<A>Atr<r>1<g>
 +<D> 
 +<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxG<r>2<g>
 +<f upper>PETR<MDl src="a">Petr_;Y<MDt src="a">NNMS1-----A----<MDl src="b">Petr_;Y<MDt src="b">NNMS1-----A----<A>Atr<r>3<g>
 +<f upper>KARAS<MDl src="a">karas<MDt src="a">NNMS1-----A----<MDl src="b">karas<MDt src="b">NNMS1-----A----<A>Sb_Ap<r>4<g>11 
 +<D> 
 +<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>AuxX<r>5<g>
 +<f mixed>CSc<MDl src="a">CSc-1_:B_^(kandidát_věd)<MDt src="a">NNMXX-----A---8<MDl src="b">CSc-1_:B_^(kandidát_věd)<MDt src="b">NNMXX-----A---8<A>Atr<r>6<g>
 +<D> 
 +<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxG<r>7<g>
 +<d>(<MDl src="a">(<MDt src="a">Z:-------------<MDl src="b">(<MDt src="b">Z:-------------<A>ExD<r>8<g>
 +<D> 
 +<f num>53<MDl src="a">53<MDt src="a">C=-------------<MDl src="b">53<MDt src="b">C=-------------<A>ExD_Pa<r>9<g>
 +<D> 
 +<d>)<MDl src="a">)<MDt src="a">Z:-------------<MDl src="b">)<MDt src="b">Z:-------------<A>ExD<r>10<g>
 +<D> 
 +<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>Apos<r>11<g>20 
 +<f>generální<MDl src="a">generální<MDt src="a">AAMS1----1A----<MDl src="b">generální<MDt src="b">AAMS1----1A----<A>Atr<r>12<g>13 
 +<f>ředitel<MDl src="a">ředitel<MDt src="a">NNMS1-----A----<MDl src="b">ředitel<MDt src="b">NNMS1-----A----<A>Sb_Co<r>13<g>15 
 +<f upper>ČEZ<MDl src="a">ČEZ-1_:B_;K_^(České_energetické_závody)<MDt src="a">NNIPX-----A---8<MDl src="b">ČEZ-1_:B_;K_^(České_energetické_závody)<MDt src="b">NNIPX-----A---8<A>Atr<r>14<g>13 
 +<f>a<MDl src="a">a-1<MDt src="a">J^-------------<MDl src="b">a-1<MDt src="b">J^-------------<A>Coord_Ap<r>15<g>11 
 +<f>předseda<MDl src="a">předseda<MDt src="a">NNMS1-----A----<MDl src="b">předseda<MDt src="b">NNMS1-----A----<A>Sb_Co<r>16<g>15 
 +<f>jeho<MDl src="a">jeho_^(přivlast.)<MDt src="a">PSXXXZS3-------<MDl src="b">jeho_^(přivlast.)<MDt src="b">PSXXXZS3-------<A>Atr<r>17<g>18 
 +<f>představenstva<MDl src="a">představenstvo<MDt src="a">NNNS2-----A----<MDl src="b">představenstvo<MDt src="b">NNNS2-----A----<A>Atr<r>18<g>16 
 +<D> 
 +<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>AuxX<r>19<g>11 
 +<f>je<MDl src="a">být<MDt src="a">VB-S---3P-AA---<MDl src="b">být<MDt src="b">VB-S---3P-AA---<A>Pred<r>20<g>
 +<f>absolventem<MDl src="a">absolvent<MDt src="a">NNMS7-----A----<MDl src="b">absolvent<MDt src="b">NNMS7-----A----<A>Pnom<r>21<g>20 
 +<f>elektrotechnické<MDl src="a">elektrotechnický<MDt src="a">AAFS2----1A----<MDl src="b">elektrotechnický<MDt src="b">AAFS2----1A----<A>Atr<r>22<g>23 
 +<f>fakulty<MDl src="a">fakulta<MDt src="a">NNFS2-----A----<MDl src="b">fakulta<MDt src="b">NNFS2-----A----<A>Atr_Co<r>23<g>25 
 +<f upper>ČVUT<MDl src="a">ČVUT-1_:B_;K_^(České_vysoké_učení_technické)<MDt src="a">NNNXX-----A---8<MDl src="b">ČVUT-1_:B_;K_^(České_vysoké_učení_technické)<MDt src="b">NNNXX-----A---8<A>Atr<r>24<g>23 
 +<f>a<MDl src="a">a-1<MDt src="a">J^-------------<MDl src="b">a-1<MDt src="b">J^-------------<A>Coord<r>25<g>21 
 +<f>postgraduálního<MDl src="a">postgraduální<MDt src="a">AANS2----1A----<MDl src="b">postgraduální<MDt src="b">AANS2----1A----<A>Atr<r>26<g>27 
 +<f>studia<MDl src="a">studium<MDt src="a">NNNS2-----A----<MDl src="b">studium<MDt src="b">NNNS2-----A----<A>Atr_Co<r>27<g>25 
 +<f>v<MDl src="a">v-1<MDt src="a">RR--6----------<MDl src="b">v-1<MDt src="b">RR--6----------<A>AuxP<r>28<g>29 
 +<f>oboru<MDl src="a">obor_^(lidské_činnosti)<MDt src="a">NNIS6-----A----<MDl src="b">obor_^(lidské_činnosti)<MDt src="b">NNIS6-----A----<A>AuxP<r>29<g>27 
 +<f>metod<MDl src="a">metoda<MDt src="a">NNFP2-----A----<MDl src="b">metoda<MDt src="b">NNFP2-----A----<A>Atr<r>30<g>29 
 +<f>operační<MDl src="a">operační<MDt src="a">AAFS2----1A----<MDl src="b">operační<MDt src="b">AAFS2----1A----<A>Atr<r>31<g>32 
 +<f>analýzy<MDl src="a">analýza<MDt src="a">NNFS2-----A----<MDl src="b">analýza<MDt src="b">NNFS2-----A----<A>Atr<r>32<g>30 
 +<D> 
 +<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxK<r>33<g>0</code> 
 + 
 +The first sentence of the PDT 1.0 e-test data: 
 + 
 +<code xml><csts lang=cs> 
 +<h> 
 +<source>Lidové noviny</source> 
 +<markup> 
 +<mauth>zu 
 +<mdate>1996-2000 
 +<mdesc>Manual analytical annotation 
 +</markup> 
 +</h> 
 +<doc file="s/pub/nws/1994/ln94209" id="1"> 
 +<a> 
 +<mod>
 +<txtype>pub 
 +<genre>mix 
 +<med>nws 
 +<temp>1994 
 +<authname>
 +<opus>ln94209 
 +<id>
 +</a> 
 +<c> 
 +<p n=1> 
 +<s id="ln94209:1-p1s1"> 
 +<f cap>Přádelny<MDl src="a">přádelna<MDt src="a">NNFP1-----A----<MDl src="b">přádelna<MDt src="b">NNFP1-----A----<A>Sb<r>1<g>
 +<f>mají<MDl src="a">mít<MDt src="a">VB-P---3P-AA---<MDl src="b">mít<MDt src="b">VB-P---3P-AA---<A>Pred<r>2<g>
 +<f>dvojnásob<MDl src="a">dvojnásob<MDt src="a">Db-------------<MDl src="b">dvojnásob<MDt src="b">Db-------------<A>Obj<r>3<g>
 +<f>vad<MDl src="a">vada<MDt src="a">NNFP2-----A----<MDl src="b">vada<MDt src="b">NNFP2-----A----<A>Atr<r>4<g>3</code> 
 + 
 +Morphological annotation of the first amw training file of the PDT 2.0: 
 + 
 +<code xml><mdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> 
 + <head> 
 +  <schema href="mdata_schema.xml" /> 
 +  <references> 
 +   <reffile id="w" name="wdata" href="cmpr9406_001.w.gz" /> 
 +  </references> 
 + </head> 
 + <meta> 
 +  <lang>cs</lang> 
 +  <annotation_info id="manual"> 
 +   <desc>Manual annotation</desc> 
 +  </annotation_info> 
 + </meta> 
 + <s id="m-cmpr9406-001-p2s1"> 
 +  <m id="m-cmpr9406-001-p2s1w1"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w1</w.rf> 
 +   <form>Třikrát</form> 
 +   <lemma>třikrát`3</lemma> 
 +   <tag>Cv-------------</tag> 
 +  </m> 
 +  <m id="m-cmpr9406-001-p2s1w2"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w2</w.rf> 
 +   <form>rychlejší</form> 
 +   <lemma>rychlý</lemma> 
 +   <tag>AAFS1----2A----</tag> 
 +  </m> 
 +  <m id="m-cmpr9406-001-p2s1w3"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w3</w.rf> 
 +   <form>než</form> 
 +   <lemma>než-2</lemma> 
 +   <tag>J,-------------</tag> 
 +  </m> 
 +  <m id="m-cmpr9406-001-p2s1w4"> 
 +   <src.rf>manual</src.rf> 
 +   <w.rf>w#w-cmpr9406-001-p2s1w4</w.rf> 
 +   <form>slovo</form> 
 +   <lemma>slovo</lemma> 
 +   <tag>NNNS1-----A----</tag> 
 +  </m> 
 + </s></code> 
 + 
 +Analytical (surface-syntactic) annotation of the first amw training file of the PDT 2.0: 
 + 
 +<code xml><adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/"> 
 + <head> 
 +  <schema href="adata_schema.xml" /> 
 +  <references> 
 +   <reffile id="m" name="mdata" href="cmpr9406_001.m.gz" /> 
 +   <reffile id="w" name="wdata" href="cmpr9406_001.w.gz" /> 
 +  </references> 
 + </head> 
 + <meta> 
 +  <annotation_info> 
 +   <desc>Manual annotation</desc> 
 +  </annotation_info> 
 + </meta> 
 + <trees> 
 +  <LM id="a-cmpr9406-001-p2s1"> 
 +   <s.rf>m#m-cmpr9406-001-p2s1</s.rf> 
 +   <ord>0</ord> 
 +   <children> 
 +    <LM id="a-cmpr9406-001-p2s1w2"> 
 +     <m.rf>m#m-cmpr9406-001-p2s1w2</m.rf> 
 +     <afun>ExD</afun> 
 +     <ord>2</ord> 
 +     <children> 
 +      <LM id="a-cmpr9406-001-p2s1w1"> 
 +       <m.rf>m#m-cmpr9406-001-p2s1w1</m.rf> 
 +       <afun>Adv</afun> 
 +       <ord>1</ord> 
 +      </LM> 
 +      <LM id="a-cmpr9406-001-p2s1w3"> 
 +       <m.rf>m#m-cmpr9406-001-p2s1w3</m.rf> 
 +       <afun>AuxC</afun> 
 +       <ord>3</ord> 
 +       <children> 
 +        <LM id="a-cmpr9406-001-p2s1w4"> 
 +         <m.rf>m#m-cmpr9406-001-p2s1w4</m.rf> 
 +         <afun>ExD</afun> 
 +         <ord>4</ord> 
 +        </LM> 
 +       </children> 
 +      </LM> 
 +     </children> 
 +    </LM> 
 +   </children> 
 +  </LM></code> 
 + 
 +The first two sentences of the CoNLL 2006 and 2007 training data: 
 + 
 +| 1 | Třikrát | třikrát`3 | C | v | _ | 2 | Adv | _ | _ | 
 +| 2 | rychlejší | rychlý | A | A | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=2<nowiki>|</nowiki>Neg=A | 0 | ExD | _ | _ | 
 +| 3 | než | než-2 | J | , | _ | 2 | AuxC | _ | _ | 
 +| 4 | slovo | slovo | N | N | Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 3 | ExD | _ | _ | 
 +| |||||||||| 
 +| 1 | Faxu | fax | N | N | Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=3<nowiki>|</nowiki>Neg=A | 2 | Obj | _ | _ | 
 +| 2 | škodí | škodit | V | B | Num=P<nowiki>|</nowiki>Per=3<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 0 | Pred | _ | _ | 
 +| 3 | především | především | D | b | _ | 6 | AuxZ | _ | _ | 
 +| 4 | přetížené | přetížený | A | A | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 6 | Atr | _ | _ | 
 +| 5 | telefonní | telefonní | A | A | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 6 | Atr | _ | _ | 
 +| 6 | linky | linka | N | N | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 2 | Sb | _ | _ | 
 +| 7 | * | * | Z | : | _ | 2 | AuxG | _ | _ | 
 + 
 +The first sentence of the CoNLL 2006 test data: 
 + 
 +| 1 | Podobně | podobně | D | g | Gra=1<nowiki>|</nowiki>Neg=A | 5 | Adv | _ | _ | 
 +| 2 | , | , | Z | : | _ | 3 | AuxX | _ | _ | 
 +| 3 | myslím | myslit | V | B | Num=S<nowiki>|</nowiki>Per=1<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 5 | Pred_Pa | _ | _ | 
 +| 4 | , | , | Z | : | _ | 3 | AuxX | _ | _ | 
 +| 5 | postupuje | postupovat | V | B | Num=S<nowiki>|</nowiki>Per=3<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 0 | Pred | _ | _ | 
 +| 6 | většina | většina | N | N | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 5 | Sb | _ | _ | 
 +| 7 | českých | český | A | A | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 8 | Atr | _ | _ | 
 +| 8 | bank | banka | N | N | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Neg=A | 6 | Atr | _ | _ | 
 +| 9 | , | , | Z | : | _ | 11 | AuxX | _ | _ | 
 +| 10 | zejména | zejména | D | b | _ | 12 | AuxZ | _ | _ | 
 +| 11 | v | v-1 | R | R | Cas=6 | 5 | AuxP | _ | _ | 
 +| 12 | případech | případ | N | N | Gen=I<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=6<nowiki>|</nowiki>Neg=A | 11 | Adv | _ | _ | 
 +| 13 | , | , | Z | : | _ | 17 | AuxX | _ | _ | 
 +| 14 | kdy | kdy | D | b | _ | 17 | Adv | _ | _ | 
 +| 15 | by | být | V | c | Num=X<nowiki>|</nowiki>Per=3 | 17 | AuxV | _ | _ | 
 +| 16 | se | se | P | 7 | Num=X<nowiki>|</nowiki>Cas=4 | 18 | AuxT | _ | _ | 
 +| 17 | mělo | mít | V | p | Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Per=X<nowiki>|</nowiki>Ten=R<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 12 | Atr | _ | _ | 
 +| 18 | jednat | jednat | V | f | Neg=A | 17 | Obj | _ | _ | 
 +| 19 | o | o-1 | R | R | Cas=4 | 18 | AuxP | _ | _ | 
 +| 20 | větší | velký | A | A | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Gra=2<nowiki>|</nowiki>Neg=A | 21 | Atr | _ | _ | 
 +| 21 | částky | částka | N | N | Gen=F<nowiki>|</nowiki>Num=P<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 19 | Obj | _ | _ | 
 +| 22 | . | . | Z | : | _ | 0 | AuxK | _ | _ | 
 + 
 +The first sentence of the CoNLL 2007 test data: 
 + 
 +| 1 | Proč | proč | D | b | _ | 2 | Adv | _ | _ | 
 +| 2 | mají | mít | V | B | Num=P<nowiki>|</nowiki>Per=3<nowiki>|</nowiki>Ten=P<nowiki>|</nowiki>Neg=A<nowiki>|</nowiki>Voi=A | 0 | Pred | _ | _ | 
 +| 3 | každý | každý | A | A | Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 4 | Atr | _ | _ | 
 +| 4 | rok | rok | N | N | Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 5 | Adv | _ | _ | 
 +| 5 | fasovat | fasovat | V | f | Neg=A | 2 | Obj | _ | _ | 
 +| 6 | speciální | speciální | A | A | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 7 | Atr | _ | _ | 
 +| 7 | taxu | taxa | N | N | Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 5 | Obj | _ | _ | 
 +| 8 | na | na | R | R | Cas=4 | 7 | AuxP | _ | _ | 
 +| 9 | oblečení | oblečení | N | N | Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=4<nowiki>|</nowiki>Neg=A | 8 | AtrAdv | _ | _ | 
 +| 10 | ? | ? | Z | : | _ | 0 | AuxK | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 training data: 
 + 
 +| 1 | Celní | celní | celní | A | A | SubPOS=A<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | SubPOS=A<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 2 | 2 | Atr | Atr | Y | celní | _ | RSTR | _ | 
 +| 2 | unie | unie | unie | N | N | SubPOS=N<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=F<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 0 | 0 | ExD | ExD | Y | unie | _ | _ | _ | 
 +| 3 | v | v | v | R | R | SubPOS=R<nowiki>|</nowiki>Cas=6 | SubPOS=R<nowiki>|</nowiki>Cas=6 | 2 | 2 | AuxP | AuxP | _ | _ | _ | _ | _ | 
 +| 4 | ohrožení | ohrožení | ohrožení | N | N | SubPOS=N<nowiki>|</nowiki>Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=6<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=N<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=6<nowiki>|</nowiki>Neg=A | 3 | 3 | Atr | Atr | Y | v-w3017f1 | _ | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 development data: 
 + 
 +| 1 | <nowiki>|</nowiki> | <nowiki>|</nowiki> | <nowiki>|</nowiki> | Z | Z | SubPOS=: | SubPOS=: | 0 | 3 | ExD | AuxG | _ | _ | _ | _ | 
 +| 2 | Daňový | daňový | daňový | A | A | SubPOS=A<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | SubPOS=A<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Gra=1<nowiki>|</nowiki>Neg=A | 3 | 3 | Atr | Atr | Y | daňový | _ | RSTR | 
 +| 3 | poradce | poradce | poradce | N | N | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | 0 | 0 | ExD | ExD | Y | poradce | _ | _ | 
 +| 4 | <nowiki>|</nowiki> | <nowiki>|</nowiki> | <nowiki>|</nowiki> | Z | Z | SubPOS=: | SubPOS=: | 0 | 3 | AuxK | AuxG | _ | _ | _ | _ | 
 + 
 +The first sentence of the CoNLL 2009 test data: 
 + 
 +| 1 | Názor | názor | názor | N | N | SubPOS=N<nowiki>|</nowiki>Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=I<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=1<nowiki>|</nowiki>Neg=A | _ | _ | _ | _ | Y | 
 +| 2 | experta | expert | expert | N | N | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Neg=A | SubPOS=N<nowiki>|</nowiki>Gen=M<nowiki>|</nowiki>Num=S<nowiki>|</nowiki>Cas=2<nowiki>|</nowiki>Neg=A | _ | _ | _ | _ | Y | 
 + 
 +==== Parsing ==== 
 + 
 +PDT is a mildly nonprojective treebank. 8351 of the 437,020 tokens in the CoNLL 2007 version are attached nonprojectively (1.91%). 
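
A count of this kind can be reproduced with a short script. The sketch below assumes the common definition (a token is attached nonprojectively if some token lying between it and its head is not dominated by that head); the page does not say which definition was used for the figures above, and the function name is hypothetical. The input must be a well-formed tree without cycles.

```python
def nonprojective_tokens(heads):
    """heads[i] is the head of token i+1 (0 = artificial root).

    Returns the 1-based ids of nonprojectively attached tokens.
    """
    def dominated_by(node, ancestor):
        # Walk up from node; the root (0) dominates everything.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # Every token strictly between dep and head must descend from head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the first sentence of the CoNLL 2007 test data (Czech).
czech = [2, 0, 4, 5, 2, 7, 5, 7, 8, 0]
# Artificial example: token 2 depends on token 4 across the root (token 3).
crossing = [3, 4, 0, 3]
print(nonprojective_tokens(czech), nonprojective_tokens(crossing))
```

Dividing the number of returned ids by the number of tokens gives the percentage reported in this section.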
 + 
 +There is an [[http://ufal.mff.cuni.cz/czech-parsing/|online summary]] of known results in Czech parsing. 
 + 
 +The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Czech:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^
-| Malt (Nilsson et al.) | 76.52 | 85.81 | +| MST (McDonald et al.) | 80.18 | 87.30 | 
-| Nakagawa | 75.08 | 86.09 | +| Basis (O'Neil) | 76.60 | 85.58 | 
-| Malt (Hall et al.) | 74.75 | 84.21 | +| Malt (Nivre et al.) | 78.42 | 84.80 | 
-| Sagae | 74.71 | 84.04 | +| Nara (Yuchang Cheng) | 76.24 | 83.40 | 
-| Chen | 74.65 | 83.49 | + 
-| Titov et al. | 74.12 | 83.18 | +The results of the CoNLL 2007 shared task are [[http://nextens.uvt.nl/depparse-wiki/AllScores|available online]]. They have been published in [[http://aclweb.org/anthology-new/D/D07/D07-1096.pdf|(Nivre et al., 2007)]]. The evaluation procedure was changed to include punctuation tokens. These are the best results for Czech: 
 + 
 +^ Parser (Authors) ^ LAS ^ UAS ^ 
 +| Nakagawa | 80.19 | 86.28 | 
 +| Carreras | 78.60 | 85.16 | 
 +| Titov et al. | 77.94 | 84.19 | 
 +| Malt (Nilsson et al.) | 77.98 | 83.59 | 
 +| Attardi et al. | 77.37 | 83.40 | 
 +| Malt (Hall et al.) | 77.22 | 82.35 |
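
The difference between the 2006 and 2007 evaluation procedures can be illustrated with a small scorer. This is a sketch, not the official evaluation script: it assumes that a punctuation token is one whose characters all carry a Unicode punctuation category, and the function names and the toy data are made up for illustration.

```python
import unicodedata


def is_punctuation(form):
    """True if every character of the form is Unicode punctuation (assumed rule)."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted, include_punctuation=True):
    """gold and predicted are lists of (form, head, deprel) per token.

    Returns (LAS, UAS) in percent.
    """
    total = las = uas = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if not include_punctuation and is_punctuation(form):
            continue  # 2006-style scoring skips punctuation tokens
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    if total == 0:
        raise ValueError("no tokens to score")
    return 100 * las / total, 100 * uas / total


# Toy data: one wrong label on a word, one wrong head on the final period.
gold = [("Faxu", 2, "Obj"), ("škodí", 0, "Pred"), ("linky", 2, "Sb"), (".", 0, "AuxK")]
pred = [("Faxu", 2, "Obj"), ("škodí", 0, "Pred"), ("linky", 2, "Obj"), (".", 2, "AuxK")]
print(attachment_scores(gold, pred))
print(attachment_scores(gold, pred, include_punctuation=False))
```

Because the two procedures score different sets of tokens, the 2006 and 2007 figures in the tables above are not directly comparable.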
  
 The two Malt parser results of 2007 (single malt and blended) are described in [[http://aclweb.org/anthology-new/D/D07/D07-1097.pdf|(Hall et al., 2007)]] and the details about the parser configuration are described [[http://w3.msi.vxu.se/users/jha/conll07/|here]].
 +
 +The results of the CoNLL 2009 shared task are [[http://ufal.mff.cuni.cz/conll2009-st/results/results.php|available online]]. They have been published in [[http://aclweb.org/anthology/W/W09/W09-1201.pdf|(Hajič et al., 2009)]]. Unlabeled attachment score was not published. These are the best results for Czech:
 +
 +^ Parser (Authors) ^ LAS ^
 +| Merlo (Gesmundo et al.) | 80.38 |
 +| Bohnet | 80.11 |
 +| Che et al. | 80.01 |
  
