Institute of Formal and Applied Linguistics Wiki


This is an old revision of the document!


English (en)

Penn Treebank

Versions

Obtaining and License

The original Penn Treebank is distributed by the LDC under the catalogue number LDC99T42. It is free for 1999 LDC members; the price for non-members is unknown (contact the LDC). The license in short:

The CoNLL 2007, 2008 and 2009 versions are also licensed by the LDC, and LDC members can keep them after the shared task. Those who did not participate in the shared task may inquire at the LDC about the availability of the datasets. Their republication by the LDC is planned but has not happened yet.

The Penn Treebank was created by members of the Department of Computer and Information Science (CIS), School of Engineering, University of Pennsylvania, Levine Hall, 3330 Walnut Street, Philadelphia, PA 19104-6309, USA. The constituents-to-dependencies CoNLL 2007 conversion of the treebank was prepared by Ryan McDonald.

References

Domain

Financial news from the Wall Street Journal (1989). The constituent-based Treebank-3 also contains parsed versions of ATIS-3 and of the Brown Corpus. Only WSJ texts have been converted to dependencies for the CoNLL shared tasks.

Size

The size of the CoNLL 2007 data was limited because some CoNLL 2006 teams complained that they did not have enough time and resources to train the larger models. Sections 2-11 of the Wall Street Journal part of the treebank were used for training and a subset of section 23 was used for testing.

| Version | Train Sentences | Train Tokens | D-test Sentences | D-test Tokens | E-test Sentences | E-test Tokens | Total Sentences | Total Tokens | Sentence Length |
|---|---|---|---|---|---|---|---|---|---|
| CoNLL 2007 | 18577 | 446,573 | | | 214 | 5003 | 18791 | 451,576 | 24.03 |
| CoNLL 2009 | 39279 | 958,167 | 1334 | 33368 | 2399 | 57676 | 43012 | 1,049,211 | 24.39 |
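
The totals and the average sentence length in the table follow from the per-part counts. The short sketch below recomputes them as a sanity check. The dictionary layout and the names `SIZES` and `totals` are illustrative, and it assumes that the CoNLL 2007 test figures (214 sentences, 5003 tokens) belong in the E-test columns, since that version has no development set listed.

```python
# (sentences, tokens) per data part, copied from the size table.
SIZES = {
    "CoNLL 2007": {"train": (18577, 446573), "etest": (214, 5003)},
    "CoNLL 2009": {
        "train": (39279, 958167),
        "dtest": (1334, 33368),
        "etest": (2399, 57676),
    },
}


def totals(parts):
    """Return (total sentences, total tokens, mean tokens per sentence)."""
    sentences = sum(s for s, _ in parts.values())
    tokens = sum(t for _, t in parts.values())
    return sentences, tokens, round(tokens / sentences, 2)


for version, parts in SIZES.items():
    print(version, totals(parts))
```

Both rows reproduce the published totals and sentence lengths.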

Inside

The original Penn Treebank uses the Penn MRG ("merged") bracketing format. CoNLL 2007 uses the CoNLL-X format; the CoNLL 2008 and 2009 formats differ slightly from it in the number and meaning of columns.

Conversion for CoNLL 2007: Many function tags were removed from the non-terminals in the phrase-structure representation. The phrase structures were converted to dependency structures using the procedure described in (Johansson and Nugues, 2007).

The original Penn Treebank contains non-terminal labels, function tags and part-of-speech tags, all assigned manually. The CoNLL 2009 version contains both manual and automatic disambiguation. See above for documentation of the part-of-speech tags. Use DZ Interset to inspect the tagset. The original treebank and the CoNLL 2007 version do not contain lemmas. The CoNLL 2009 version includes lemmas, but most of the time they are just lowercased word forms, e.g. nouns are not converted to singular. Nevertheless, there is some base-form normalization of verbs.

Sample

The first two sentences of section 00 of the WSJ part of the Treebank-3 in the original format:

( (S
    (NP-SBJ
      (NP (NNP Pierre) (NNP Vinken) )
      (, ,)
      (ADJP
        (NP (CD 61) (NNS years) )
        (JJ old) )
      (, ,) )
    (VP (MD will)
      (VP (VB join)
        (NP (DT the) (NN board) )
        (PP-CLR (IN as)
          (NP (DT a) (JJ nonexecutive) (NN director) ))
        (NP-TMP (NNP Nov.) (CD 29) )))
    (. .) ))
( (S
    (NP-SBJ (NNP Mr.) (NNP Vinken) )
    (VP (VBZ is)
      (NP-PRD
        (NP (NN chairman) )
        (PP (IN of)
          (NP
            (NP (NNP Elsevier) (NNP N.V.) )
            (, ,)
            (NP (DT the) (NNP Dutch) (VBG publishing) (NN group) )))))
    (. .) ))
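
The bracketing above can be read with a small recursive-descent parser. The following is a minimal sketch, not an official reader: the function names `parse_mrg` and `tagged_words` are made up for this illustration, and the code assumes that literal parentheses never occur as tokens in the input and that every bracket is balanced.

```python
import re

SAMPLE = """
( (S
    (NP-SBJ (NNP Mr.) (NNP Vinken) )
    (VP (VBZ is)
      (NP-PRD
        (NP (NN chairman) )
        (PP (IN of)
          (NP
            (NP (NNP Elsevier) (NNP N.V.) )
            (, ,)
            (NP (DT the) (NNP Dutch) (VBG publishing) (NN group) )))))
    (. .) ))
"""


def parse_mrg(text):
    """Parse bracketed trees into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""  # the outermost wrapper bracket carries no label
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])  # a word form
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    trees = []
    while pos < len(tokens):
        trees.append(node())
    return trees


def tagged_words(tree):
    """Collect (word, POS tag) pairs from the preterminal nodes."""
    label, children = tree
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, tuple):
            pairs.extend(tagged_words(child))
    return pairs


trees = parse_mrg(SAMPLE)
print(tagged_words(trees[0]))
```

Function tags such as `-SBJ` or `-PRD` stay attached to the label; splitting them off at the first hyphen would be a separate step.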

The first sentence of the CoNLL 2007 training data:

1 In _ IN IN _ 43 ADV _ _
2 an _ DT DT _ 5 NMOD _ _
3 Oct. _ NN NNP _ 5 TMP _ _
4 19 _ CD CD _ 3 NMOD _ _
5 review _ NN NN _ 1 PMOD _ _
6 of _ IN IN _ 5 NMOD _ _
7 `` _ `` `` _ 9 P _ _
8 The _ DT DT _ 9 NMOD _ _
9 Misanthrope _ NN NN _ 6 PMOD _ _
10 '' _ '' '' _ 9 P _ _
11 at _ IN IN _ 9 NMOD _ _
12 Chicago _ NN NNP _ 15 NMOD _ _
13 's _ PO POS _ 12 NMOD _ _
14 Goodman _ NN NNP _ 15 NMOD _ _
15 Theatre _ NN NNP _ 11 PMOD _ _
16 ( _ ( ( _ 20 P _ _
17 `` _ `` `` _ 20 P _ _
18 Revitalized _ VB VBN _ 19 NMOD _ _
19 Classics _ NN NNS _ 20 SBJ _ _
20 Take _ VB VBP _ 5 PRN _ _
21 the _ DT DT _ 22 NMOD _ _
22 Stage _ NN NN _ 20 OBJ _ _
23 in _ IN IN _ 20 ADV _ _
24 Windy _ NN NNP _ 25 NMOD _ _
25 City _ NN NNP _ 23 PMOD _ _
26 , _ , , _ 20 P _ _
27 '' _ '' '' _ 20 P _ _
28 Leisure _ NN NN _ 29 COORD _ _
29 & _ CC CC _ 20 DEP _ _
30 Arts _ NN NNS _ 29 COORD _ _
31 ) _ ) ) _ 20 P _ _
32 , _ , , _ 43 P _ _
33 the _ DT DT _ 34 NMOD _ _
34 role _ NN NN _ 43 SBJ _ _
35 of _ IN IN _ 34 NMOD _ _
36 Celimene _ NN NNP _ 35 PMOD _ _
37 , _ , , _ 34 P _ _
38 played _ VB VBN _ 34 NMOD _ _
39 by _ IN IN _ 38 LGS _ _
40 Kim _ NN NNP _ 41 NMOD _ _
41 Cattrall _ NN NNP _ 39 PMOD _ _
42 , _ , , _ 34 P _ _
43 was _ VB VBD _ 0 ROOT _ _
44 mistakenly _ RB RB _ 45 ADV _ _
45 attributed _ VB VBN _ 43 VC _ _
46 to _ TO TO _ 45 ADV _ _
47 Christina _ NN NNP _ 48 NMOD _ _
48 Haag _ NN NNP _ 46 PMOD _ _
49 . _ . . _ 43 P _ _

The first sentence of the CoNLL 2007 test data:

1 No _ RB RB _ 4 VMOD _ _
2 , _ , , _ 4 P _ _
3 it _ PR PRP _ 4 SBJ _ _
4 was _ VB VBD _ 0 ROOT _ _
5 n't _ RB RB _ 4 VMOD _ _
6 Black _ NN NNP _ 7 NMOD _ _
7 Monday _ NN NNP _ 4 VMOD _ _
8 . _ . . _ 4 P _ _
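
Each CoNLL-X line has ten columns: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL. A minimal reader might look like the sketch below. The names `Token` and `read_conllx` are illustrative. The code splits on any whitespace so that it works both on the tab-separated files and on the space-separated listing shown on this page, and it assumes that sentences are separated by blank lines.

```python
from typing import NamedTuple

SAMPLE = """\
1 No _ RB RB _ 4 VMOD _ _
2 , _ , , _ 4 P _ _
3 it _ PR PRP _ 4 SBJ _ _
4 was _ VB VBD _ 0 ROOT _ _
5 n't _ RB RB _ 4 VMOD _ _
6 Black _ NN NNP _ 7 NMOD _ _
7 Monday _ NN NNP _ 4 VMOD _ _
8 . _ . . _ 4 P _ _
"""


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: int
    deprel: str
    phead: str
    pdeprel: str


def read_conllx(text):
    """Return a list of sentences, each a list of Token records."""
    sentences, current = [], []
    for line in text.splitlines():
        fields = line.split()
        if not fields:  # a blank line ends the sentence
            if current:
                sentences.append(current)
                current = []
            continue
        if len(fields) != 10:
            raise ValueError(f"expected 10 columns, got {len(fields)}: {line!r}")
        current.append(
            Token(int(fields[0]), *fields[1:6], int(fields[6]), *fields[7:])
        )
    if current:
        sentences.append(current)
    return sentences


sentence = read_conllx(SAMPLE)[0]
root = next(t for t in sentence if t.head == 0)
print(root.form, [t.form for t in sentence if t.head == root.id])
```

The last two lines look up the root token and its direct dependents.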

The first sentence of the CoNLL 2009 training data:

1 In in in IN IN _ _ 43 20 LOC ADV _ _ _ _ _ _ _ _ AM-LOC
2 an an an DT DT _ _ 5 5 NMOD NMOD _ _ _ _ _ _ _ _ _
3 Oct. oct. oct. NNP NNP _ _ 4 4 NMOD NMOD _ _ _ _ _ _ _ _ _
4 19 19 19 CD CD _ _ 5 5 NMOD NMOD _ _ AM-TMP _ _ _ _ _ _
5 review review review NN NN _ _ 1 1 PMOD PMOD Y review.01 _ _ _ _ _ _ _
6 of of of IN IN _ _ 5 5 NMOD NMOD _ _ A1 _ _ _ _ _ _
7 `` `` `` `` `` _ _ 9 6 P P _ _ _ _ _ _ _ _ _
8 The the the DT DT _ _ 9 9 NMOD NMOD _ _ _ _ _ _ _ _ _
9 Misanthrope misanthrope misanthrope NN NN _ _ 6 6 PMOD PMOD _ _ _ _ _ _ _ _ _
10 '' '' '' '' '' _ _ 9 5 P P _ _ _ _ _ _ _ _ _
11 at at at IN IN _ _ 9 5 LOC LOC _ _ _ _ _ _ _ _ _
12 Chicago chicago chicago NNP NNP _ _ 15 15 NMOD NMOD _ _ _ _ _ _ _ _ _
13 's 's 's POS POS _ _ 12 12 SUFFIX SUFFIX _ _ _ _ _ _ _ _ _
14 Goodman goodman goodman NNP NNP _ _ 15 15 NAME NAME _ _ _ _ _ _ _ _ _
15 Theatre theatre theatre NNP NNP _ _ 11 11 PMOD PMOD _ _ _ _ _ _ _ _ _
16 ( -lrb- -lrb- ( ( _ _ 20 20 P P _ _ _ _ _ _ _ _ _
17 `` `` `` `` `` _ _ 20 19 P P _ _ _ _ _ _ _ _ _
18 Revitalized revitalize revitalize VBN VBN _ _ 19 19 NMOD NMOD Y revitalize.01 _ _ _ _ _ _ _
19 Classics classics classics NNS NNS _ _ 20 20 SBJ SBJ _ _ _ A1 A0 A1 _ _ _
20 Take take take VBP VB _ _ 5 43 PRN OBJ Y take.01 _ _ _ _ _ _ _
21 the the the DT DT _ _ 22 22 NMOD NMOD _ _ _ _ _ _ _ _ _
22 Stage stage stage NN NNP _ _ 20 20 OBJ OBJ Y stage.02 _ _ A1 _ _ _ _
23 in in in IN IN _ _ 20 22 LOC LOC _ _ _ _ AM-LOC _ _ _ _
24 Windy windy windy NNP NNP _ _ 25 25 NAME NAME _ _ _ _ _ _ _ _ _
25 City city city NNP NNP _ _ 23 23 PMOD PMOD _ _ _ _ _ _ _ _ _
26 , , , , , _ _ 20 43 P P _ _ _ _ _ _ _ _ _
27 '' '' '' '' '' _ _ 20 43 P P _ _ _ _ _ _ _ _ _
28 Leisure leisure leisure NNP NNP _ _ 30 30 NAME NAME _ _ _ _ _ _ _ _ _
29 & & & CC CC _ _ 30 30 NAME NAME _ _ _ _ _ _ _ _ _
30 Arts arts arts NNS NNS _ _ 20 34 TMP NMOD _ _ _ _ _ _ _ _ _
31 ) -rrb- -rrb- ) ) _ _ 20 30 P P _ _ _ _ _ _ _ _ _
32 , , , , , _ _ 43 34 P P _ _ _ _ _ _ _ _ _
33 the the the DT DT _ _ 34 34 NMOD NMOD _ _ _ _ _ _ _ _ _
34 role role role NN NN _ _ 43 43 SBJ SBJ Y role.01 _ _ _ _ _ A1 A1
35 of of of IN IN _ _ 34 34 NMOD NMOD _ _ _ _ _ _ A1 _ _
36 Celimene celimene celimene NNP NNP _ _ 35 35 PMOD PMOD _ _ _ _ _ _ _ _ _
37 , , , , , _ _ 34 34 P P _ _ _ _ _ _ _ _ _
38 played play play VBN VBN _ _ 34 34 APPO APPO Y play.02 _ _ _ _ _ _ _
39 by by by IN IN _ _ 38 38 LGS LGS _ _ _ _ _ _ _ A0 _
40 Kim kim kim NNP NNP _ _ 41 41 NAME NAME _ _ _ _ _ _ _ _ _
41 Cattrall cattrall cattrall NNP NNP _ _ 39 39 PMOD PMOD _ _ _ _ _ _ A0 _ _
42 , , , , , _ _ 34 34 P P _ _ _ _ _ _ _ _ _
43 was be be VBD VBD _ _ 0 0 ROOT ROOT _ _ _ _ _ _ _ _ _
44 mistakenly mistakenly mistakenly RB RB _ _ 45 45 MNR AMOD _ _ _ _ _ _ _ _ AM-MNR
45 attributed attribute attribute VBN VBN _ _ 43 43 VC PRD Y attribute.01 _ _ _ _ _ _ _
46 to to to TO TO _ _ 45 45 ADV AMOD _ _ _ _ _ _ _ _ A2
47 Christina christina christina NNP NNP _ _ 48 48 NAME NAME _ _ _ _ _ _ _ _ _
48 Haag haag haag NNP NNP _ _ 46 46 PMOD PMOD _ _ _ _ _ _ _ _ _
49 . . . . . _ _ 43 43 P P _ _ _ _ _ _ _ _ _

The first sentence of the CoNLL 2009 development data:

1 The the the DT DT _ _ 2 2 NMOD NMOD _ _ _ _ _ _
2 economy economy economy NN NN _ _ 4 4 NMOD NMOD _ _ A1 _ _ _
3 's 's 's POS POS _ _ 2 2 SUFFIX SUFFIX _ _ _ _ _ _
4 temperature temperature temperature NN NN _ _ 5 5 SBJ SBJ Y temperature.01 A2 A1 _ _
5 will will will MD MD _ _ 0 0 ROOT ROOT _ _ _ AM-MOD _ _
6 be be be VB VB _ _ 5 5 VC VC _ _ _ _ _ _
7 taken take take VBN VBN _ _ 6 6 VC VC Y take.01 _ _ _ _
8 from from from IN IN _ _ 7 7 ADV ADV _ _ _ A2 _ _
9 several several several DT DT _ _ 11 11 NMOD NMOD _ _ _ _ _ _
10 vantage vantage vantage NN NN _ _ 11 11 NMOD NMOD _ _ _ _ A1 _
11 points point point NNS NNS _ _ 8 8 PMOD PMOD Y point.02 _ _ _ _
12 this this this DT DT _ _ 13 13 NMOD NMOD _ _ _ _ _ _
13 week week week NN NN _ _ 7 7 TMP TMP _ _ _ AM-TMP _ _
14 , , , , , _ _ 7 7 P P _ _ _ _ _ _
15 with with with IN IN _ _ 7 7 ADV ADV _ _ _ AM-ADV _ _
16 readings reading reading NNS NNS _ _ 15 15 PMOD PMOD Y reading.01 _ _ _ _
17 on on on IN IN _ _ 16 16 NMOD NMOD _ _ _ _ _ A1
18 trade trade trade NN NN _ _ 17 17 PMOD PMOD _ _ _ _ _ _
19 , , , , , _ _ 18 18 P P _ _ _ _ _ _
20 output output output NN NN _ _ 18 18 COORD COORD _ _ _ _ _ _
21 , , , , , _ _ 20 20 P P _ _ _ _ _ _
22 housing housing housing NN NN _ _ 20 20 COORD COORD _ _ _ _ _ _
23 and and and CC CC _ _ 22 22 COORD COORD _ _ _ _ _ _
24 inflation inflation inflation NN NN _ _ 23 23 CONJ CONJ _ _ _ _ _ _
25 . . . . . _ _ 5 5 P P _ _ _ _ _ _

The first sentence of the CoNLL 2009 test data:

1 No no no DT DT _ _ _ _ _ _ _
2 , , , , , _ _ _ _ _ _ _
3 it it it PRP PRP _ _ _ _ _ _ _
4 was be be VBD VBD _ _ _ _ _ _ _
5 n't not not RB RB _ _ _ _ _ _ _
6 Black black black NNP NNP _ _ _ _ _ _ _
7 Monday monday monday NNP NNP _ _ _ _ _ _ _
8 . . . . . _ _ _ _ _ _ _
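
In the CoNLL 2009 format the first fourteen columns are ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED and PRED. They are followed by one APRED column per predicate in the sentence, in the order in which the predicates occur. The sketch below reads the development sentence shown above and lists the arguments of each predicate. The names `read_sentence` and `predicate_arguments` are illustrative, and the code assumes gold-annotated input (the blind test sample has fewer columns).

```python
CORE = ["id", "form", "lemma", "plemma", "pos", "ppos", "feat", "pfeat",
        "head", "phead", "deprel", "pdeprel", "fillpred", "pred"]

DEV_SENTENCE = """\
1 The the the DT DT _ _ 2 2 NMOD NMOD _ _ _ _ _ _
2 economy economy economy NN NN _ _ 4 4 NMOD NMOD _ _ A1 _ _ _
3 's 's 's POS POS _ _ 2 2 SUFFIX SUFFIX _ _ _ _ _ _
4 temperature temperature temperature NN NN _ _ 5 5 SBJ SBJ Y temperature.01 A2 A1 _ _
5 will will will MD MD _ _ 0 0 ROOT ROOT _ _ _ AM-MOD _ _
6 be be be VB VB _ _ 5 5 VC VC _ _ _ _ _ _
7 taken take take VBN VBN _ _ 6 6 VC VC Y take.01 _ _ _ _
8 from from from IN IN _ _ 7 7 ADV ADV _ _ _ A2 _ _
9 several several several DT DT _ _ 11 11 NMOD NMOD _ _ _ _ _ _
10 vantage vantage vantage NN NN _ _ 11 11 NMOD NMOD _ _ _ _ A1 _
11 points point point NNS NNS _ _ 8 8 PMOD PMOD Y point.02 _ _ _ _
12 this this this DT DT _ _ 13 13 NMOD NMOD _ _ _ _ _ _
13 week week week NN NN _ _ 7 7 TMP TMP _ _ _ AM-TMP _ _
14 , , , , , _ _ 7 7 P P _ _ _ _ _ _
15 with with with IN IN _ _ 7 7 ADV ADV _ _ _ AM-ADV _ _
16 readings reading reading NNS NNS _ _ 15 15 PMOD PMOD Y reading.01 _ _ _ _
17 on on on IN IN _ _ 16 16 NMOD NMOD _ _ _ _ _ A1
18 trade trade trade NN NN _ _ 17 17 PMOD PMOD _ _ _ _ _ _
19 , , , , , _ _ 18 18 P P _ _ _ _ _ _
20 output output output NN NN _ _ 18 18 COORD COORD _ _ _ _ _ _
21 , , , , , _ _ 20 20 P P _ _ _ _ _ _
22 housing housing housing NN NN _ _ 20 20 COORD COORD _ _ _ _ _ _
23 and and and CC CC _ _ 22 22 COORD COORD _ _ _ _ _ _
24 inflation inflation inflation NN NN _ _ 23 23 CONJ CONJ _ _ _ _ _ _
25 . . . . . _ _ 5 5 P P _ _ _ _ _ _
"""


def read_sentence(block):
    """Turn one sentence block into a list of per-token dictionaries."""
    tokens = []
    for line in block.strip().splitlines():
        fields = line.split()
        token = dict(zip(CORE, fields[:14]))
        token["apreds"] = fields[14:]  # one entry per predicate
        tokens.append(token)
    return tokens


def predicate_arguments(tokens):
    """Return [(predicate sense, [(argument form, role), ...]), ...]."""
    predicates = [t["pred"] for t in tokens if t["fillpred"] == "Y"]
    frames = []
    for k, pred in enumerate(predicates):
        args = [(t["form"], t["apreds"][k])
                for t in tokens if t["apreds"][k] != "_"]
        frames.append((pred, args))
    return frames


for pred, args in predicate_arguments(read_sentence(DEV_SENTENCE)):
    print(pred, args)
```

The role labels sit on the syntactic head of each argument, so `from` rather than `points` carries the A2 role of `take.01`.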

Parsing

PDT is a mildly nonprojective treebank. 8351 of the 437,020 tokens in the CoNLL 2007 version are attached nonprojectively (1.91%).
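
A proportion of this kind can be computed from the HEAD column alone. The sketch below uses one common definition: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. Whether the figure quoted above was computed with exactly this definition is an assumption. The function name `nonprojective_tokens` and the second head vector are made up for illustration, and the input is assumed to be a well-formed tree with 0 as the artificial root.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens whose arc to the head is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """

    def dominated_by(node, ancestor):
        # Walk up from node; the artificial root dominates everything.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# HEAD column of the CoNLL 2007 test sentence "No , it was n't Black Monday ."
projective = [4, 4, 4, 0, 4, 7, 4, 4]
# A hypothetical head vector with two crossing arcs (2 -> 5 and 4 -> 8).
crossing = [2, 3, 0, 3, 2, 7, 5, 4]

print(nonprojective_tokens(projective))
print(nonprojective_tokens(crossing))
```

Dividing the number of such tokens by the total token count of a data set gives a percentage comparable to the one stated above, under the definition assumed here.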

There is an online summary of known results in Czech parsing.

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Czech:

| Parser (Authors) | LAS | UAS |
|---|---|---|
| MST (McDonald et al.) | 80.18 | 87.30 |
| Basis (O'Neil) | 76.60 | 85.58 |
| Malt (Nivre et al.) | 78.42 | 84.80 |
| Nara (Yuchang Cheng) | 76.24 | 83.40 |

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). The evaluation procedure was changed to include punctuation tokens. These are the best results for Czech:

| Parser (Authors) | LAS | UAS |
|---|---|---|
| Nakagawa | 80.19 | 86.28 |
| Carreras | 78.60 | 85.16 |
| Titov et al. | 77.94 | 84.19 |
| Malt (Nilsson et al.) | 77.98 | 83.59 |
| Attardi et al. | 77.37 | 83.40 |
| Malt (Hall et al.) | 77.22 | 82.35 |
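
The unlabeled attachment score (UAS) is the percentage of scored tokens that receive the correct head; the labeled attachment score (LAS) additionally requires the correct dependency label. The sketch below shows how excluding or including punctuation changes both numbers. It is not the official scorer: the function names are illustrative, the prediction is invented, and the punctuation test (every character of the form is in a Unicode punctuation category) is an assumption about how punctuation tokens are identified.

```python
import unicodedata


def is_punct(form):
    """Assumed rule: all characters belong to a Unicode P* category."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, pred, forms, include_punct=True):
    """Return (LAS, UAS) in percent for one or more concatenated sentences.

    gold and pred are lists of (head, deprel) pairs, aligned with forms.
    """
    scored = uas_hits = las_hits = 0
    for (g_head, g_rel), (p_head, p_rel), form in zip(gold, pred, forms):
        if not include_punct and is_punct(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        raise ValueError("no tokens to score")
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Gold annotation of the CoNLL 2007 test sentence shown earlier on this page.
FORMS = ["No", ",", "it", "was", "n't", "Black", "Monday", "."]
GOLD = [(4, "VMOD"), (4, "P"), (4, "SBJ"), (0, "ROOT"),
        (4, "VMOD"), (7, "NMOD"), (4, "VMOD"), (4, "P")]
# A hypothetical parser output with one label error and two head errors.
PRED = [(4, "ADV"), (1, "P"), (4, "SBJ"), (0, "ROOT"),
        (4, "VMOD"), (4, "NMOD"), (4, "VMOD"), (4, "P")]

print(attachment_scores(GOLD, PRED, FORMS, include_punct=True))
print(attachment_scores(GOLD, PRED, FORMS, include_punct=False))
```

Because the two settings score different sets of tokens, results obtained with and without punctuation are not directly comparable.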

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

The results of the CoNLL 2009 shared task are available online. They have been published in (Hajič et al., 2009). Unlabeled attachment score was not published. These are the best results for Czech:

| Parser (Authors) | LAS |
|---|---|
| Merlo (Gesmundo et al.) | 80.38 |
| Bohnet | 80.11 |
| Che et al. | 80.01 |
