Institute of Formal and Applied Linguistics Wiki

Czech (cs)

Prague Dependency Treebank (PDT)


The CoNLL 2006 version is based on PDT 1.0. The CoNLL 2007 and 2009 versions are based on PDT 2.0.

Obtaining and License

The original PDT 1.0 and 2.0 are distributed by the LDC under the catalogue numbers LDC2001T10 and LDC2006T01, respectively. They are free for LDC members of 2001 and 2006, respectively; the price for non-members is unknown (contact LDC). The license in short:

The CoNLL 2006, 2007 and 2009 versions are obtainable upon request under similar license terms. Their publication in the LDC together with the other CoNLL treebanks is being prepared.

The PDT 2.5 version is available for download from the LINDAT/CLARIN repository under the Creative Commons BY-NC-SA 3.0 license, i.e.:

PDT was created by members of the Institute of Formal and Applied Linguistics (Ústav formální a aplikované lingvistiky, ÚFAL), Faculty of Mathematics and Physics (Matematicko-fyzikální fakulta), Charles University in Prague (Univerzita Karlova v Praze), Malostranské náměstí 25, Praha, CZ-11800, Czechia. The CoNLL 2006 conversion of the treebank was prepared by Yuval Krymolowski; the CoNLL 2007 and 2009 conversions were prepared by ÚFAL (Zdeněk Žabokrtský and Jan Štěpánek).



Newswire text (Lidové noviny, Mladá fronta Dnes), business weekly (Českomoravský Profit) and a scientific magazine (Vesmír).


All distributions of PDT are officially split into training, development (d-test) and test (e-test) data sets. PDT 2.0 contains data that are annotated only morphologically (M-layer), data that are annotated both morphologically and analytically (surface syntax; M+A layers), and a smallest subset that is also annotated tectogrammatically (M+A+T layers). The statistics in this section cover the M+A subset, which is relevant for surface dependency parsing.

The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train the larger models. For CoNLL 2009, only the part of PDT that also contains tectogrammatical annotation was selected, because the 2009 task included semantic learning.

Parts of the following table have been taken from (Zeman 2004, page 21). Only non-empty sentences are counted (e.g. PDT 1.0 had 81614 sentence tags but only 73088 of them were non-empty).

PDT 3.0 also distinguishes d-test and e-test but I currently have counts from train and d-test summed up. To be updated…

Version | Train Sentences | Train Tokens | D-test Sentences | D-test Tokens | E-test Sentences | E-test Tokens | Total Sentences | Total Tokens | Sentence Length
PDT 0.5 | 19126 | 327,597 | 3697 | 63718 | 3787 | 65390 | 26610 | 456,705 | 17.16
PDT 1.0 | 73088 | 1,255,590 | 7319 | 126,030 | 7507 | 125,713 | 87914 | 1,489,748 | 16.95
PDT 2.0 | 68562 | 1,172,299 | 9270 | 158,962 | 10148 | 173,586 | 87980 | 1,504,847 | 17.10
PDT 3.0 | 77765 | 1,330,152 | (in train) | (in train) | 10148 | 173,586 | 87913 | 1,503,738 | 17.10
CoNLL 2006 | 72703 | 1,249,408 | - | - | 365 | 5853 | 73068 | 1,255,261 | 17.18
CoNLL 2007 | 25364 | 432,296 | - | - | 286 | 4724 | 25650 | 437,020 | 17.04
CoNLL 2009 | 38727 | 652,544 | 5228 | 87988 | 4213 | 70348 | 48168 | 810,880 | 16.83


PDT 1.0 is distributed in the CSTS format. PDT 2.0 uses the PML format. CoNLL 2006 and 2007 use the CoNLL-X format; the CoNLL 2009 format is slightly different (number and meaning of columns). Unlike the other formats, the CSTS format used the ISO-8859-2 character encoding.

The CSTS format (PDT 0.5 and 1.0) contains morphological annotation (lemmas and tags) both manual and produced by two taggers. The CoNLL 2009 version contains manual annotation and one automatic disambiguation. The official distribution of PDT 2.0 and the CoNLL 2006 and 2007 versions contain only manual morphology.

The original PDT uses 15-character positional morphological tags. The CoNLL versions split each tag into the two or three corresponding CoNLL columns (CPOS, POS and FEAT). In addition, the CoNLL versions contain the Sem feature, which is derived from the tags attached to the lemma in PDT (see Hana and Zeman, 2005).
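The following sketch illustrates how a positional tag maps to the CoNLL 2006/2007 columns, judging from the samples on this page. It is not the conversion script used for the shared tasks. The function name is hypothetical, the feature names PGe, PNu and Var do not occur in the samples here and are assumptions, and the Sem feature (which comes from the lemma, not from the tag) is not handled.

```python
# Feature names for tag positions 3-15 (positions 1 and 2 are POS and SubPOS).
# None marks the two reserved positions, which carry no feature.
# ASSUMPTION: PGe, PNu and Var are not attested in the samples on this page.
FEATURE_NAMES = ["Gen", "Num", "Cas", "PGe", "PNu", "Per", "Ten",
                 "Gra", "Neg", "Voi", None, None, "Var"]


def pdt_tag_to_conll(tag):
    """Split a 15-character PDT tag into (CPOS, POS, FEAT)."""
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character tag, got {tag!r}")
    cpos, pos = tag[0], tag[1]
    feats = [
        f"{name}={value}"
        for name, value in zip(FEATURE_NAMES, tag[2:])
        if name is not None and value != "-"   # "-" means "not applicable"
    ]
    # CoNLL uses "_" for an empty feature column.
    return cpos, pos, "|".join(feats) or "_"


if __name__ == "__main__":
    print(pdt_tag_to_conll("AAFS1----2A----"))
```

For the tag AAFS1----2A---- this yields A, A and Gen=F|Num=S|Cas=1|Gra=2|Neg=A, which agrees with the second token of the CoNLL 2006 training sample on this page.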

See above for documentation of the morphological tags. All CoNLL distributions contain a README file with a brief description of the parts of speech and features. Use DZ Interset to inspect the PDT and the CoNLL tagsets.

The guidelines for syntactic annotation are documented in the PDT annotation manual.


The first sentence of the PDT 1.0 training data:

<csts lang=cs>
<source>Českomoravský profit</source>
<mdesc>Manual analytical annotation
<mdesc>Manual morphological annotation
<doc file="s/inf/j/1994/cmpr9406" id="001">
<p n=1>
<s id="cmpr9406:001-p1s1">
<p n=2>
<s id="cmpr9406:001-p2s1">
<f cap>Třikrát<l>třikrát`3<t>Cv-------------<MDl src="a">třikrát`3<MDt src="a">Cv-------------<MDl src="b">třikrát`3<MDt src="b">Cv-------------<A>Adv<r>1<g>2
<f>rychlejší<l>rychlý<t>AAFS1----2A----<MDl src="a">rychlý<MDt src="a">AANS1----2A----<MDl src="b">rychlý<MDt src="b">AAFS1----2A----<A>ExD<r>2<g>0
<f>než<l>než-2<t>J,-------------<MDl src="a">než-2<MDt src="a">J,-------------<MDl src="b">než-2<MDt src="b">J,-------------<A>AuxC<r>3<g>2
<f>slovo<l>slovo<t>NNNS1-----A----<MDl src="a">slovo<MDt src="a">NNNS4-----A----<MDl src="b">slovo<MDt src="b">NNNS1-----A----<A>ExD<r>4<g>3

The first two sentences of the PDT 1.0 d-test data:

<csts lang=cs>
<source>Lidové noviny</source>
<mdesc>Manual analytical annotation
<doc file="s/pub/nws/1994/ln94206" id="1">
<p n=1>
<s id="ln94206:1-p1s1">
<f cap>Lidé<MDl src="a">člověk<MDt src="a">NNMP1-----A---1<MDl src="b">člověk<MDt src="b">NNMP1-----A---1<A>ExD<r>1<g>0
<p n=2>
<s id="ln94206:1-p2s1">
<f upper.abbr>ING<MDl src="a">Ing-1_:B_^(inženýr)<MDt src="a">NNMXX-----A---8<MDl src="b">Ing-1_:B_^(inženýr)<MDt src="b">NNMXX-----A---8<A>Atr<r>1<g>4
<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxG<r>2<g>1
<f upper>PETR<MDl src="a">Petr_;Y<MDt src="a">NNMS1-----A----<MDl src="b">Petr_;Y<MDt src="b">NNMS1-----A----<A>Atr<r>3<g>4
<f upper>KARAS<MDl src="a">karas<MDt src="a">NNMS1-----A----<MDl src="b">karas<MDt src="b">NNMS1-----A----<A>Sb_Ap<r>4<g>11
<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>AuxX<r>5<g>6
<f mixed>CSc<MDl src="a">CSc-1_:B_^(kandidát_věd)<MDt src="a">NNMXX-----A---8<MDl src="b">CSc-1_:B_^(kandidát_věd)<MDt src="b">NNMXX-----A---8<A>Atr<r>6<g>4
<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxG<r>7<g>6
<d>(<MDl src="a">(<MDt src="a">Z:-------------<MDl src="b">(<MDt src="b">Z:-------------<A>ExD<r>8<g>4
<f num>53<MDl src="a">53<MDt src="a">C=-------------<MDl src="b">53<MDt src="b">C=-------------<A>ExD_Pa<r>9<g>4
<d>)<MDl src="a">)<MDt src="a">Z:-------------<MDl src="b">)<MDt src="b">Z:-------------<A>ExD<r>10<g>4
<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>Apos<r>11<g>20
<f>generální<MDl src="a">generální<MDt src="a">AAMS1----1A----<MDl src="b">generální<MDt src="b">AAMS1----1A----<A>Atr<r>12<g>13
<f>ředitel<MDl src="a">ředitel<MDt src="a">NNMS1-----A----<MDl src="b">ředitel<MDt src="b">NNMS1-----A----<A>Sb_Co<r>13<g>15
<f upper>ČEZ<MDl src="a">ČEZ-1_:B_;K_^(České_energetické_závody)<MDt src="a">NNIPX-----A---8<MDl src="b">ČEZ-1_:B_;K_^(České_energetické_závody)<MDt src="b">NNIPX-----A---8<A>Atr<r>14<g>13
<f>a<MDl src="a">a-1<MDt src="a">J^-------------<MDl src="b">a-1<MDt src="b">J^-------------<A>Coord_Ap<r>15<g>11
<f>předseda<MDl src="a">předseda<MDt src="a">NNMS1-----A----<MDl src="b">předseda<MDt src="b">NNMS1-----A----<A>Sb_Co<r>16<g>15
<f>jeho<MDl src="a">jeho_^(přivlast.)<MDt src="a">PSXXXZS3-------<MDl src="b">jeho_^(přivlast.)<MDt src="b">PSXXXZS3-------<A>Atr<r>17<g>18
<f>představenstva<MDl src="a">představenstvo<MDt src="a">NNNS2-----A----<MDl src="b">představenstvo<MDt src="b">NNNS2-----A----<A>Atr<r>18<g>16
<d>,<MDl src="a">,<MDt src="a">Z:-------------<MDl src="b">,<MDt src="b">Z:-------------<A>AuxX<r>19<g>11
<f>je<MDl src="a">být<MDt src="a">VB-S---3P-AA---<MDl src="b">být<MDt src="b">VB-S---3P-AA---<A>Pred<r>20<g>0
<f>absolventem<MDl src="a">absolvent<MDt src="a">NNMS7-----A----<MDl src="b">absolvent<MDt src="b">NNMS7-----A----<A>Pnom<r>21<g>20
<f>elektrotechnické<MDl src="a">elektrotechnický<MDt src="a">AAFS2----1A----<MDl src="b">elektrotechnický<MDt src="b">AAFS2----1A----<A>Atr<r>22<g>23
<f>fakulty<MDl src="a">fakulta<MDt src="a">NNFS2-----A----<MDl src="b">fakulta<MDt src="b">NNFS2-----A----<A>Atr_Co<r>23<g>25
<f upper>ČVUT<MDl src="a">ČVUT-1_:B_;K_^(České_vysoké_učení_technické)<MDt src="a">NNNXX-----A---8<MDl src="b">ČVUT-1_:B_;K_^(České_vysoké_učení_technické)<MDt src="b">NNNXX-----A---8<A>Atr<r>24<g>23
<f>a<MDl src="a">a-1<MDt src="a">J^-------------<MDl src="b">a-1<MDt src="b">J^-------------<A>Coord<r>25<g>21
<f>postgraduálního<MDl src="a">postgraduální<MDt src="a">AANS2----1A----<MDl src="b">postgraduální<MDt src="b">AANS2----1A----<A>Atr<r>26<g>27
<f>studia<MDl src="a">studium<MDt src="a">NNNS2-----A----<MDl src="b">studium<MDt src="b">NNNS2-----A----<A>Atr_Co<r>27<g>25
<f>v<MDl src="a">v-1<MDt src="a">RR--6----------<MDl src="b">v-1<MDt src="b">RR--6----------<A>AuxP<r>28<g>29
<f>oboru<MDl src="a">obor_^(lidské_činnosti)<MDt src="a">NNIS6-----A----<MDl src="b">obor_^(lidské_činnosti)<MDt src="b">NNIS6-----A----<A>AuxP<r>29<g>27
<f>metod<MDl src="a">metoda<MDt src="a">NNFP2-----A----<MDl src="b">metoda<MDt src="b">NNFP2-----A----<A>Atr<r>30<g>29
<f>operační<MDl src="a">operační<MDt src="a">AAFS2----1A----<MDl src="b">operační<MDt src="b">AAFS2----1A----<A>Atr<r>31<g>32
<f>analýzy<MDl src="a">analýza<MDt src="a">NNFS2-----A----<MDl src="b">analýza<MDt src="b">NNFS2-----A----<A>Atr<r>32<g>30
<d>.<MDl src="a">.<MDt src="a">Z:-------------<MDl src="b">.<MDt src="b">Z:-------------<A>AuxK<r>33<g>0

The first sentence of the PDT 1.0 e-test data:

<csts lang=cs>
<source>Lidové noviny</source>
<mdesc>Manual analytical annotation
<doc file="s/pub/nws/1994/ln94209" id="1">
<p n=1>
<s id="ln94209:1-p1s1">
<f cap>Přádelny<MDl src="a">přádelna<MDt src="a">NNFP1-----A----<MDl src="b">přádelna<MDt src="b">NNFP1-----A----<A>Sb<r>1<g>2
<f>mají<MDl src="a">mít<MDt src="a">VB-P---3P-AA---<MDl src="b">mít<MDt src="b">VB-P---3P-AA---<A>Pred<r>2<g>0
<f>dvojnásob<MDl src="a">dvojnásob<MDt src="a">Db-------------<MDl src="b">dvojnásob<MDt src="b">Db-------------<A>Obj<r>3<g>2
<f>vad<MDl src="a">vada<MDt src="a">NNFP2-----A----<MDl src="b">vada<MDt src="b">NNFP2-----A----<A>Atr<r>4<g>3
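The token lines in the CSTS samples above can be picked apart with a few regular expressions. This is a minimal sketch based only on the lines shown here, not a full CSTS reader: the function name is hypothetical, it reads only the manual lemma and tag (<l>, <t>), the analytical function (<A>), the word order (<r>) and the governor (<g>), and it ignores the tagger output in <MDl>/<MDt>.

```python
import re

# A token line starts with <f ...> (word) or <d ...> (punctuation).
# The optional group requires whitespace after the tag name so that
# <doc ...> is not mistaken for a <d> token.
TOKEN_RE = re.compile(r"^<([fd])(?:\s[^>]*)?>([^<]*)")
# Fields of interest; <MDl ...> and <MDt ...> do not match this pattern.
FIELD_RE = re.compile(r"<(l|t|A|r|g)>([^<]*)")


def parse_csts_token(line):
    """Return a dict for a CSTS token line, or None for any other line."""
    match = TOKEN_RE.match(line)
    if match is None:
        return None
    fields = dict(FIELD_RE.findall(line))
    return {
        "form": match.group(2),
        "lemma": fields.get("l"),   # missing in the d-test/e-test samples
        "tag": fields.get("t"),     # missing in the d-test/e-test samples
        "afun": fields.get("A"),
        "ord": int(fields["r"]) if "r" in fields else None,
        "head": int(fields["g"]) if "g" in fields else None,
    }


if __name__ == "__main__":
    sample = '<f>slovo<l>slovo<t>NNNS1-----A----<A>ExD<r>4<g>3'
    print(parse_csts_token(sample))
```

When reading an actual CSTS file, open it with encoding="iso-8859-2", since that is the encoding the format uses.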

Morphological annotation of the first amw training file of the PDT 2.0:

<mdata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <schema href="mdata_schema.xml" />
   <reffile id="w" name="wdata" href="cmpr9406_001.w.gz" />
  <annotation_info id="manual">
   <desc>Manual annotation</desc>
 <s id="m-cmpr9406-001-p2s1">
  <m id="m-cmpr9406-001-p2s1w1">
  <m id="m-cmpr9406-001-p2s1w2">
  <m id="m-cmpr9406-001-p2s1w3">
  <m id="m-cmpr9406-001-p2s1w4">
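A PML m-layer file is namespaced XML, so it can be read with a standard XML parser. The sketch below is hedged accordingly: the excerpt above shows only the <s> and <m> elements and their ids, so the child element names form, lemma and tag are assumptions about the full file, and the function name is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace taken from the xmlns attribute in the excerpt above.
PML_NS = {"pml": "http://ufal.mff.cuni.cz/pdt/pml/"}


def read_m_layer(xml_text):
    """Return a list of (sentence_id, tokens) pairs from m-layer XML text."""
    root = ET.fromstring(xml_text)
    sentences = []
    for s in root.iterfind(".//pml:s", PML_NS):
        tokens = []
        for m in s.iterfind(".//pml:m", PML_NS):
            tokens.append({
                "id": m.get("id"),
                # ASSUMPTION: these child element names are not visible
                # in the excerpt on this page.
                "form": m.findtext("pml:form", default=None, namespaces=PML_NS),
                "lemma": m.findtext("pml:lemma", default=None, namespaces=PML_NS),
                "tag": m.findtext("pml:tag", default=None, namespaces=PML_NS),
            })
        sentences.append((s.get("id"), tokens))
    return sentences
```

The files are referenced with a .gz suffix in the excerpt, so the text would typically be obtained with gzip.open(path, "rb").read() before being passed to the function.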

Analytical (surface-syntactic) annotation of the first amw training file of the PDT 2.0:

<adata xmlns="http://ufal.mff.cuni.cz/pdt/pml/">
  <schema href="adata_schema.xml" />
   <reffile id="m" name="mdata" href="cmpr9406_001.m.gz" />
   <reffile id="w" name="wdata" href="cmpr9406_001.w.gz" />
   <desc>Manual annotation</desc>
  <LM id="a-cmpr9406-001-p2s1">
    <LM id="a-cmpr9406-001-p2s1w2">
      <LM id="a-cmpr9406-001-p2s1w1">
      <LM id="a-cmpr9406-001-p2s1w3">
        <LM id="a-cmpr9406-001-p2s1w4">

The first two sentences of the CoNLL 2006 and 2007 training data:

1 Třikrát třikrát`3 C v _ 2 Adv _ _
2 rychlejší rychlý A A Gen=F|Num=S|Cas=1|Gra=2|Neg=A 0 ExD _ _
3 než než-2 J , _ 2 AuxC _ _
4 slovo slovo N N Gen=N|Num=S|Cas=1|Neg=A 3 ExD _ _

1 Faxu fax N N Gen=I|Num=S|Cas=3|Neg=A 2 Obj _ _
2 škodí škodit V B Num=P|Per=3|Ten=P|Neg=A|Voi=A 0 Pred _ _
3 především především D b _ 6 AuxZ _ _
4 přetížené přetížený A A Gen=F|Num=P|Cas=1|Gra=1|Neg=A 6 Atr _ _
5 telefonní telefonní A A Gen=F|Num=P|Cas=1|Gra=1|Neg=A 6 Atr _ _
6 linky linka N N Gen=F|Num=P|Cas=1|Neg=A 2 Sb _ _
7 * * Z : _ 2 AuxG _ _

The first sentence of the CoNLL 2006 test data:

1 Podobně podobně D g Gra=1|Neg=A 5 Adv _ _
2 , , Z : _ 3 AuxX _ _
3 myslím myslit V B Num=S|Per=1|Ten=P|Neg=A|Voi=A 5 Pred_Pa _ _
4 , , Z : _ 3 AuxX _ _
5 postupuje postupovat V B Num=S|Per=3|Ten=P|Neg=A|Voi=A 0 Pred _ _
6 většina většina N N Gen=F|Num=S|Cas=1|Neg=A 5 Sb _ _
7 českých český A A Gen=F|Num=P|Cas=2|Gra=1|Neg=A 8 Atr _ _
8 bank banka N N Gen=F|Num=P|Cas=2|Neg=A 6 Atr _ _
9 , , Z : _ 11 AuxX _ _
10 zejména zejména D b _ 12 AuxZ _ _
11 v v-1 R R Cas=6 5 AuxP _ _
12 případech případ N N Gen=I|Num=P|Cas=6|Neg=A 11 Adv _ _
13 , , Z : _ 17 AuxX _ _
14 kdy kdy D b _ 17 Adv _ _
15 by být V c Num=X|Per=3 17 AuxV _ _
16 se se P 7 Num=X|Cas=4 18 AuxT _ _
17 mělo mít V p Gen=N|Num=S|Per=X|Ten=R|Neg=A|Voi=A 12 Atr _ _
18 jednat jednat V f Neg=A 17 Obj _ _
19 o o-1 R R Cas=4 18 AuxP _ _
20 větší velký A A Gen=F|Num=P|Cas=4|Gra=2|Neg=A 21 Atr _ _
21 částky částka N N Gen=F|Num=P|Cas=4|Neg=A 19 Obj _ _
22 . . Z : _ 0 AuxK _ _

The first sentence of the CoNLL 2007 test data:

1 Proč proč D b _ 2 Adv _ _
2 mají mít V B Num=P|Per=3|Ten=P|Neg=A|Voi=A 0 Pred _ _
3 každý každý A A Gen=I|Num=S|Cas=4|Gra=1|Neg=A 4 Atr _ _
4 rok rok N N Gen=I|Num=S|Cas=4|Neg=A 5 Adv _ _
5 fasovat fasovat V f Neg=A 2 Obj _ _
6 speciální speciální A A Gen=F|Num=S|Cas=4|Gra=1|Neg=A 7 Atr _ _
7 taxu taxa N N Gen=F|Num=S|Cas=4|Neg=A 5 Obj _ _
8 na na R R Cas=4 7 AuxP _ _
9 oblečení oblečení N N Gen=N|Num=S|Cas=4|Neg=A 8 AtrAdv _ _
10 ? ? Z : _ 0 AuxK _ _
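The CoNLL 2006 and 2007 samples above follow the ten-column CoNLL-X layout (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), with sentences separated by blank lines. A minimal reader might look as follows. The function name is hypothetical, and the sketch splits on any whitespace so that it also works on the samples as displayed on this page; the distributed files are tab-separated.

```python
CONLLX_COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
                  "feats", "head", "deprel", "phead", "pdeprel"]


def read_conllx(lines):
    """Group CoNLL-X token lines into sentences (lists of token dicts)."""
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line:
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split()
        if len(fields) != len(CONLLX_COLUMNS):
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        token = dict(zip(CONLLX_COLUMNS, fields))
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])   # 0 means the artificial root
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


if __name__ == "__main__":
    sample = [
        "1 Proč proč D b _ 2 Adv _ _",
        "2 mají mít V B Num=P|Per=3|Ten=P|Neg=A|Voi=A 0 Pred _ _",
        "",
    ]
    for sentence in read_conllx(sample):
        print([(t["form"], t["head"], t["deprel"]) for t in sentence])
```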

The first sentence of the CoNLL 2009 training data:

1 Celní celní celní A A SubPOS=A|Gen=F|Num=S|Cas=1|Gra=1|Neg=A SubPOS=A|Gen=F|Num=S|Cas=1|Gra=1|Neg=A 2 2 Atr Atr Y celní _ RSTR _
2 unie unie unie N N SubPOS=N|Gen=F|Num=S|Cas=1|Neg=A SubPOS=N|Gen=F|Num=S|Cas=1|Neg=A 0 0 ExD ExD Y unie _ _ _
3 v v v R R SubPOS=R|Cas=6 SubPOS=R|Cas=6 2 2 AuxP AuxP _ _ _ _ _
4 ohrožení ohrožení ohrožení N N SubPOS=N|Gen=N|Num=S|Cas=6|Neg=A SubPOS=N|Gen=N|Num=S|Cas=6|Neg=A 3 3 Atr Atr Y v-w3017f1 _ _ _

The first sentence of the CoNLL 2009 development data:

1 | | | Z Z SubPOS=: SubPOS=: 0 3 ExD AuxG _ _ _ _
2 Daňový daňový daňový A A SubPOS=A|Gen=M|Num=S|Cas=1|Gra=1|Neg=A SubPOS=A|Gen=M|Num=S|Cas=1|Gra=1|Neg=A 3 3 Atr Atr Y daňový _ RSTR
3 poradce poradce poradce N N SubPOS=N|Gen=M|Num=S|Cas=1|Neg=A SubPOS=N|Gen=M|Num=S|Cas=1|Neg=A 0 0 ExD ExD Y poradce _ _
4 | | | Z Z SubPOS=: SubPOS=: 0 3 AuxK AuxG _ _ _ _

The first sentence of the CoNLL 2009 test data:

1 Názor názor názor N N SubPOS=N|Gen=I|Num=S|Cas=1|Neg=A SubPOS=N|Gen=I|Num=S|Cas=1|Neg=A _ _ _ _ Y
2 experta expert expert N N SubPOS=N|Gen=M|Num=S|Cas=2|Neg=A SubPOS=N|Gen=M|Num=S|Cas=2|Neg=A _ _ _ _ Y
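The CoNLL 2009 rows differ from CoNLL-X in both the number and the meaning of columns: gold and predicted values are paired (LEMMA/PLEMMA, POS/PPOS, FEAT/PFEAT, HEAD/PHEAD, DEPREL/PDEPREL), followed by FILLPRED, PRED and one APRED column per predicate in the sentence. The sketch below maps a row to named fields; the function name is hypothetical, and it tolerates rows that are shorter than 14 columns, as the test rows displayed on this page are.

```python
CONLL09_COLUMNS = ["id", "form", "lemma", "plemma", "pos", "ppos",
                   "feat", "pfeat", "head", "phead", "deprel", "pdeprel",
                   "fillpred", "pred"]


def parse_conll09_row(line):
    """Map one CoNLL 2009 row to a dict; extra columns go to 'apreds'."""
    fields = line.split()
    # zip stops at the shorter sequence, so truncated rows simply
    # lack the trailing keys.
    row = dict(zip(CONLL09_COLUMNS, fields))
    # Everything after PRED is one argument column per predicate.
    row["apreds"] = fields[len(CONLL09_COLUMNS):]
    return row


if __name__ == "__main__":
    sample = ("1 Celní celní celní A A "
              "SubPOS=A|Gen=F|Num=S|Cas=1|Gra=1|Neg=A "
              "SubPOS=A|Gen=F|Num=S|Cas=1|Gra=1|Neg=A "
              "2 2 Atr Atr Y celní _ RSTR _")
    row = parse_conll09_row(sample)
    print(row["form"], row["head"], row["deprel"], row["pred"], row["apreds"])
```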


PDT is a mildly nonprojective treebank. 8351 of the 437,020 tokens in the CoNLL 2007 version are attached nonprojectively (1.91%).
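A count like the one above can be obtained by testing every attachment for projectivity. The sketch below uses one common definition (an attachment is nonprojective if some token lying between the dependent and its head is not dominated by the head); whether the figure quoted above was computed with exactly this definition is an assumption. The function names are hypothetical, and the input is assumed to be a well-formed tree.

```python
def dominates(ancestor, node, heads):
    """True if `ancestor` lies on the path from `node` to the root (0)."""
    while node != 0:
        node = heads[node - 1]
        if node == ancestor:
            return True
    return False


def count_nonprojective(heads):
    """heads[i-1] is the head of token i; 0 is the artificial root."""
    count = 0
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        # Every token strictly between the two ends must be a descendant
        # of the head, otherwise the attachment crosses another subtree.
        if any(not dominates(head, k, heads) for k in range(low + 1, high)):
            count += 1
    return count


if __name__ == "__main__":
    # HEAD column of the first CoNLL 2006 test sentence shown on this page.
    heads = [5, 3, 5, 3, 0, 5, 8, 6, 11, 12, 5,
             11, 17, 17, 17, 18, 12, 17, 18, 21, 19, 0]
    print(count_nonprojective(heads), "of", len(heads))
```

Under this definition, the first CoNLL 2006 test sentence shown on this page has two nonprojective attachments (tokens 10 and 16).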

There is an online summary of known results in Czech parsing.

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Czech:

Parser (Authors) LAS UAS
MST (McDonald et al.) 80.18 87.30
Basis (O'Neil) 76.60 85.58
Malt (Nivre et al.) 78.42 84.80
Nara (Yuchang Cheng) 76.24 83.40
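The two scores in the table can be sketched as follows: UAS is the percentage of scored tokens with the correct head, and LAS additionally requires the correct dependency label. This is an illustration, not the official evaluation script; the function names are hypothetical, and treating a token as punctuation when all characters of its form belong to a Unicode punctuation category is an assumption about how tokens are excluded.

```python
import unicodedata


def is_punctuation(form):
    """ASSUMPTION: punctuation = every character is in a Unicode P* category."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(gold, predicted, include_punctuation=False):
    """Return (LAS, UAS) in percent.

    gold and predicted are parallel lists of (form, head, deprel) tuples.
    """
    total = uas = las = 0
    for (form, g_head, g_rel), (_, p_head, p_rel) in zip(gold, predicted):
        if not include_punctuation and is_punctuation(form):
            continue   # punctuation is not scored in this mode
        total += 1
        if g_head == p_head:
            uas += 1
            if g_rel == p_rel:
                las += 1
    if total == 0:
        return 0.0, 0.0
    return 100.0 * las / total, 100.0 * uas / total


if __name__ == "__main__":
    gold = [("Proč", 2, "Adv"), ("mají", 0, "Pred"), ("?", 0, "AuxK")]
    pred = [("Proč", 2, "Obj"), ("mají", 0, "Pred"), ("?", 2, "AuxK")]
    print(attachment_scores(gold, pred))
    print(attachment_scores(gold, pred, include_punctuation=True))
```

Setting include_punctuation=True scores every token, punctuation included.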

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Czech:

Parser (Authors) LAS UAS
Nakagawa 80.19 86.28
Carreras 78.60 85.16
Titov et al. 77.94 84.19
Malt (Nilsson et al.) 77.98 83.59
Attardi et al. 77.37 83.40
Malt (Hall et al.) 77.22 82.35

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

The results of the CoNLL 2009 shared task are available online. They have been published in (Hajič et al., 2009). Unlabeled attachment score was not published. These are the best results for Czech:

Parser (Authors) LAS
Merlo (Gesmundo et al.) 80.38
Bohnet 80.11
Che et al. 80.01
