Institute of Formal and Applied Linguistics Wiki


Catalan (ca)

There is a single Catalan treebank, whose versions have been known under different names at different times.

Versions

The dependency treebank Cat3LB was extracted automatically from an earlier constituent-based annotation (see Montserrat Civit, Ma. Antònia Martí, Núria Bufí: Cat3LB and Cast3LB: From Constituents to Dependencies. In: T. Salakoski et al. (eds.): FinTAL 2006, LNAI 4139, pp. 141–152, 2006, Springer, Berlin / Heidelberg).

Obtaining and License

The AnCora-CA corpus is meant to be freely downloadable from its website, but the download works only for registered users who are signed in. The website lets you create a new account, but registration is not automatic: you have to wait for approval.

Republication of the two CoNLL versions through the LDC is planned but has not happened yet.

The CoNLL 2007 license in short:

AnCora-CA was created by members of the Centre de Llenguatge i Computació (CLiC), Universitat de Barcelona, Gran Via de les Corts Catalanes 585, E-08007 Barcelona, Spain.

References

Domain

Mostly newswire (EFE news, ACN Catalan news, Catalan version of El Periódico, 2000).

Size

The CoNLL 2007 version contains 435,860 tokens in 15,125 sentences, yielding 28.82 tokens per sentence on average (CoNLL 2007 data split: 430,844 tokens / 14,958 sentences training, 5,016 tokens / 167 sentences test).

The CoNLL 2009 version contains 496,672 tokens in 16,786 sentences, yielding 29.59 tokens per sentence on average (CoNLL 2009 data split: 390,302 tokens / 13,200 sentences training, 53,015 tokens / 1,724 sentences development, 53,355 tokens / 1,862 sentences test).
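The totals and averages above can be rechecked from the split sizes. The following is a minimal sketch; the dictionary and function names are made up for this illustration, and only the numbers come from the text.

```python
# Split sizes as reported above: (tokens, sentences).
CONLL2007 = {"train": (430844, 14958), "test": (5016, 167)}
CONLL2009 = {"train": (390302, 13200), "dev": (53015, 1724), "test": (53355, 1862)}

def totals(split_sizes):
    """Sum tokens and sentences over all splits; return (tokens, sentences, average)."""
    tokens = sum(t for t, _ in split_sizes.values())
    sentences = sum(s for _, s in split_sizes.values())
    return tokens, sentences, round(tokens / sentences, 2)

print(totals(CONLL2007))  # (435860, 15125, 28.82)
print(totals(CONLL2009))  # (496672, 16786, 29.59)
```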

Inside

The original morphosyntactic tags (possibly EAGLES-based) have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL 2006/7 format, or into the two columns (POS and FEAT) of the CoNLL 2009 format. Note that the missing CPOS column is not the only difference between the two conversion schemes. Feature names and values in the FEAT column are different, too.
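To illustrate how the FEAT column differs between the two conversions, here is a small sketch that turns a FEAT string into a dictionary. The function name parse_feats is hypothetical; the two feature strings are copied from the auxiliary verb "ha" in the CoNLL 2007 and CoNLL 2009 samples on this page.

```python
def parse_feats(feats):
    """Turn a FEAT string such as "num=s|gen=c" into a dict; "_" means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))

# The auxiliary "ha" as annotated in the two versions.
feats_2007 = parse_feats("num=s|per=3|mod=i|ten=p")
feats_2009 = parse_feats(
    "postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present"
)

print(feats_2007["per"], feats_2007["mod"])       # 3 i
print(feats_2009["person"], feats_2009["mood"])   # 3 indicative
```

The same morphological information is encoded with short names and one-letter values in 2007 (per=3, mod=i) and with long names and values in 2009 (person=3, mood=indicative), so the two feature sets cannot be compared without a mapping.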

The morphosyntactic tags have been disambiguated manually. The CoNLL 2009 version also contains automatically disambiguated tags.

Multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes named entities (e.g. La_Garrotxa, Ajuntament_de_Manresa, dilluns_4_de_juny) and prepositional compounds (pel_que_fa_al, d'_acord_amb, la_seva, a_més_de). Empty (underscore) tokens have been inserted to represent missing subjects (Catalan is a pro-drop language).
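A minimal sketch of how a consumer of the data might undo the collapsing, assuming that every underscore inside a word form is a joining character and that a form consisting of a single underscore is an inserted empty token. The function name split_multiword is invented for this example.

```python
def split_multiword(form):
    """Split a collapsed multi-word token on underscores.

    Assumption: a bare "_" is an inserted empty token (e.g. a dropped
    subject), for which an empty list is returned.
    """
    if form == "_":
        return []
    return [part for part in form.split("_") if part]

print(split_multiword("Ajuntament_de_Manresa"))  # ['Ajuntament', 'de', 'Manresa']
print(split_multiword("d'_acord_amb"))           # ["d'", 'acord', 'amb']
print(split_multiword("_"))                      # []
```

A literal underscore in the source text could not be told apart from a joining character by this approach.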

Sample

The first sentence of the CoNLL 2007 training data:

1 L' el d da num=s|gen=c 2 ESPEC _ _
2 Ajuntament_de_Manresa Ajuntament_de_Manresa n np _ 4 SUJ _ _
3 ha haver v va num=s|per=3|mod=i|ten=p 4 AUX _ _
4 posat_en_funcionament posar_en_funcionament v vm num=s|mod=p|gen=m 0 S _ _
5 tot tot d di num=s|gen=m 7 ESPEC _ _
6 un_seguit_de un_seguit_de d di num=p|gen=c 5 DET _ _
7 mesures mesura n nc num=p|gen=f 4 CD _ _
8 , , F Fc _ 10 PUNC _ _
9 la el d da num=s|gen=f 10 ESPEC _ _
10 majoria majoria n nc num=s|gen=f 7 _ _ _
11 informatives informatiu a aq num=p|gen=f 10 _ _ _
12 , , F Fc _ 10 PUNC _ _
13 que que p pr num=n|gen=c 14 SUJ _ _
14 tenen tenir v vm num=p|per=3|mod=i|ten=p 7 SF _ _
15 com_a com_a s sp for=s 14 CPRED _ _
16 finalitat finalitat n nc num=s|gen=f 15 SN _ _
17 minimitzar minimitzar v vm mod=n 14 CD _ _
18 els el d da num=p|gen=m 19 ESPEC _ _
19 efectes efecte n nc num=p|gen=m 17 SN _ _
20 de de s sp for=s 19 SP _ _
21 la el d da num=s|gen=f 22 ESPEC _ _
22 vaga vaga n nc num=s|gen=f 20 SN _ _
23 . . F Fp _ 4 PUNC _ _

The first sentence of the CoNLL 2007 test data:

1 Tot_i_que tot_i_que c cs _ 5 SUBORD _ _
2 ahir ahir r rg _ 5 CC _ _
3 hi hi p pp num=n|per=3|gen=c 5 MORF _ _
4 va anar v va num=s|per=3|mod=i|ten=p 5 AUX _ _
5 haver haver v va mod=n 15 AO _ _
6 una un d di num=s|gen=f 7 ESPEC _ _
7 reunió reunió n nc num=s|gen=f 5 CD _ _
8 de de s sp for=s 7 SP _ _
9 darrera darrer a ao num=s|gen=f 10 SADJ _ _
10 hora hora n nc num=s|gen=f 8 SN _ _
11 , , F Fc _ 5 PUNC _ _
12 no no r rn _ 15 MOD _ _
13 es es p p0 _ 15 PASS _ _
14 va anar v va num=s|per=3|mod=i|ten=p 15 AUX _ _
15 aconseguir aconseguir v vm mod=n 0 S _ _
16 acostar acostar v vm mod=n 15 SUJ _ _
17 posicions posició n nc num=p|gen=f 16 SN _ _
18 , , F Fc _ 23 PUNC _ _
19 de_manera_que de_manera_que c cs _ 23 SUBORD _ _
20 els el d da num=p|gen=m 21 ESPEC _ _
21 treballadors treballador n nc num=p|gen=m 23 SUJ _ _
22 han haver v va num=p|per=3|mod=i|ten=p 23 AUX _ _
23 decidit decidir v vm num=s|mod=p|gen=m 15 AO _ _
24 anar anar v vm mod=n 23 CD _ _
25 a a s sp for=s 24 CREG _ _
26 la el d da num=s|gen=f 27 ESPEC _ _
27 vaga vaga n nc num=s|gen=f 25 SN _ _
28 . . F Fp _ 15 PUNC _ _
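The CoNLL 2007 samples above can be read with a few lines of code. This is a sketch only: it assumes the standard ten-column CoNLL 2006/7 layout (ID, FORM, LEMMA, CPOS, POS, FEAT, HEAD, DEPREL and two unused columns) and splits on any whitespace, which works both for tab-separated files and for the space-separated samples on this page because multi-word tokens are joined by underscores. The class and function names are hypothetical.

```python
from typing import NamedTuple

class Token2007(NamedTuple):
    id: int
    form: str
    lemma: str
    cpos: str
    pos: str
    feat: str
    head: int
    deprel: str

def parse_token(line):
    """Parse one token line; the last two columns are ignored."""
    c = line.split()
    return Token2007(int(c[0]), c[1], c[2], c[3], c[4], c[5], int(c[6]), c[7])

def read_sentences(lines):
    """Yield sentences (lists of tokens); sentences are separated by blank lines."""
    sentence = []
    for line in lines:
        if line.strip():
            sentence.append(parse_token(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence

# First three tokens of the test sentence above.
sample = """\
1 Tot_i_que tot_i_que c cs _ 5 SUBORD _ _
2 ahir ahir r rg _ 5 CC _ _
3 hi hi p pp num=n|per=3|gen=c 5 MORF _ _
"""
sentences = list(read_sentences(sample.splitlines()))
print(sentences[0][2].form, sentences[0][2].head, sentences[0][2].deprel)  # hi 5 MORF
```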

The first sentence of the CoNLL 2009 training data:

1 El el el d d postype=article|gen=m|num=s postype=article|gen=m|num=s 2 2 spec spec _ _ _ _ _ _
2 Tribunal_Suprem Tribunal_Suprem Tribunal_Suprem n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c 7 7 suj suj _ _ arg0-agt _ _ _
3 ( ( ( f f punct=bracket|punctenclose=open punct=bracket|punctenclose=open 4 4 f f _ _ _ _ _ _
4 TS TS TS n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c 2 2 sn sn _ _ _ _ _ _
5 ) ) ) f f punct=bracket|punctenclose=close punct=bracket|punctenclose=close 4 4 f f _ _ _ _ _ _
6 ha haver haver v v postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present 7 7 v v _ _ _ _ _ _
7 confirmat confirmar confirmar v v postype=main|gen=m|num=s|mood=pastparticiple postype=main|gen=m|num=s|mood=pastparticiple 0 0 sentence sentence Y confirmar.a32 _ _ _ _
8 la el el d d postype=article|gen=f|num=s postype=article|gen=f|num=s 9 9 spec spec _ _ _ _ _ _
9 condemna condemna condemna n n postype=common|gen=f|num=s postype=common|gen=f|num=s 7 7 cd cd _ _ arg1-pat _ _ _
10 a a a s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 9 9 sp sp _ _ _ _ _ _
11 quatre quatre quatre d d postype=numeral|gen=c|num=p postype=numeral|gen=c|num=p 12 12 spec spec _ _ _ _ _ _
12 anys any any n n postype=common|gen=m|num=p postype=common|gen=m|num=p 10 10 sn sn _ _ _ _ _ _
13 d' de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 12 12 sp sp _ _ _ _ _ _
14 inhabilitació inhabilitació inhabilitació n n postype=common|gen=f|num=s postype=common|gen=f|num=s 13 13 sn sn _ _ _ _ _ _
15 especial especial especial a a postype=qualificative|gen=c|num=s postype=qualificative|gen=c|num=s 14 14 s.a s.a _ _ _ _ _ _
16 i i i c c postype=coordinating postype=coordinating 12 9 coord coord _ _ _ _ _ _
17 una un un d d postype=indefinite|gen=f|num=s postype=numeral|gen=f|num=s 18 18 spec spec _ _ _ _ _ _
18 multa multa multa n n postype=common|gen=f|num=s postype=common|gen=f|num=s 12 9 sn sn _ _ _ _ _ _
19 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 18 18 sp sp _ _ _ _ _ _
20 3,6 3.6 3,6 z n _ postype=proper|gen=c|num=c 21 21 spec spec _ _ _ _ _ _
21 milions milió milió n n postype=common|gen=m|num=p postype=common|gen=m|num=p 19 19 sn sn _ _ _ _ _ _
22 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 21 21 sp sp _ _ _ _ _ _
23 pessetes pesseta pesseta z n postype=currency postype=common|gen=f|num=p 22 22 sn sn _ _ _ _ _ _
24 per per per s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 9 9 sp sp _ _ _ _ _ _
25 a a a s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 24 24 sp sp _ _ _ _ _ _
26 quatre quatre quatre d d postype=numeral|gen=c|num=p postype=numeral|gen=c|num=p 27 27 spec spec _ _ _ _ _ _
27 veterinaris veterinari veterinari n n postype=common|gen=m|num=p postype=common|gen=m|num=p 25 25 sn sn _ _ _ _ _ _
28 gironins gironí gironí a a postype=qualificative|gen=m|num=p postype=qualificative|gen=m|num=p 27 27 s.a s.a _ _ _ _ _ _
29 , , , f f punct=comma punct=comma 30 30 f f _ _ _ _ _ _
30 per per per s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 9 7 sp cc _ _ _ _ _ _
31 haver haver haver v n postype=auxiliary|gen=c|num=c|mood=infinitive postype=common|gen=m|num=s 33 33 v v _ _ _ _ _ _
32 -se ell ell p p gen=c|num=c|person=3 gen=c|num=c|person=3 33 33 morfema.pronominal morfema.pronominal _ _ _ _ _ _
33 beneficiat beneficiar beneficiat v a postype=main|gen=m|num=s|mood=pastparticiple postype=qualificative|gen=m|num=s|posfunction=participle 42 30 S S Y beneficiar.a2 _ _ _ _
34 dels del dels s s postype=preposition|gen=m|num=p|contracted=yes postype=preposition|gen=m|num=p|contracted=yes 33 33 creg creg _ _ _ arg1-null _ _
35 càrrecs càrrec càrrec n n postype=common|gen=m|num=p postype=common|gen=m|num=p 34 34 sn sn _ _ _ _ _ _
36 públics públic públic a a postype=qualificative|gen=m|num=p postype=qualificative|gen=m|num=p 35 35 s.a s.a _ _ _ _ _ _
37 que que que p p postype=relative|gen=c|num=c postype=relative|gen=c|num=c 39 39 cd cd _ _ _ _ arg1-pat _
38 _ _ _ p p _ _ 39 39 suj suj _ _ _ _ arg0-agt _
39 desenvolupaven desenvolupar desenvolupar v v postype=main|gen=c|num=p|person=3|mood=indicative|tense=imperfect postype=main|gen=c|num=p|person=3|mood=indicative|tense=imperfect 35 35 S S Y desenvolupar.a2 _ _ _ _
40 i i i c c postype=coordinating postype=coordinating 42 33 coord coord _ _ _ _ _ _
41 la_seva el_seu el_seu d d postype=possessive|gen=f|num=s|person=3 postype=possessive|gen=f|num=s|person=3 42 42 spec spec _ _ _ _ _ _
42 relació relació relació n n postype=common|gen=f|num=s postype=common|gen=f|num=s 30 33 sn cd _ _ _ _ _ _
43 amb amb amb s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 42 42 sp sp _ _ _ _ _ _
44 les el el d d postype=article|gen=f|num=p postype=article|gen=f|num=p 45 45 spec spec _ _ _ _ _ _
45 empreses empresa empresa n n postype=common|gen=f|num=p postype=common|gen=f|num=p 43 43 sn sn _ _ _ _ _ _
46 càrniques càrnic càrnic a a postype=qualificative|gen=f|num=p postype=qualificative|gen=f|num=p 45 45 s.a s.a _ _ _ _ _ _
47 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 45 45 sp sp _ _ _ _ _ _
48 la el el d d postype=article|gen=f|num=s postype=article|gen=f|num=s 49 49 spec spec _ _ _ _ _ _
49 zona zona zona n n postype=common|gen=f|num=s postype=common|gen=f|num=s 47 47 sn sn _ _ _ _ _ _
50 en en en s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 42 42 sp sp _ _ _ _ _ _
51 oferir oferir oferir v v postype=main|gen=c|num=c|mood=infinitive postype=main|gen=c|num=c|mood=infinitive 50 50 S S Y oferir.a32 _ _ _ _
52 -los ell ell p p postype=personal|gen=c|num=p|person=3 postype=personal|gen=c|num=p|person=3 51 51 ci ci _ _ _ _ _ arg2-ben
53 serveis servei servei n n postype=common|gen=m|num=p postype=common|gen=m|num=p 51 51 cd cd _ _ _ _ _ arg1-pat
54 particulars particular particular a a postype=qualificative|gen=c|num=p postype=qualificative|gen=c|num=p 53 53 s.a s.a _ _ _ _ _ _
55 . . . f f punct=period punct=period 7 7 f f _ _ _ _ _ _

The first sentence of the CoNLL 2009 development data:

1 Fundació_Privada_Fira_de_Manresa Fundació_Privada_Fira_de_Manresa Fundació_Privada_Fira_de_Manresa n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c 3 3 suj suj _ _ arg0-agt
2 ha haver haver v v postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present 3 3 v v _ _ _
3 fet fer fer v v postype=main|gen=m|num=s|mood=pastparticiple postype=main|gen=m|num=s|mood=pastparticiple 0 0 sentence sentence Y fer.a2 _
4 un un un d d postype=numeral|gen=m|num=s postype=numeral|gen=m|num=s 5 5 spec spec _ _ _
5 balanç balanç balanç n n postype=common|gen=m|num=s postype=common|gen=m|num=s 3 3 cd cd _ _ arg1-pat
6 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 5 5 sp sp _ _ _
7 l' el el d d postype=article|gen=c|num=s postype=article|gen=c|num=s 8 8 spec spec _ _ _
8 activitat activitat activitat n n postype=common|gen=f|num=s postype=common|gen=f|num=s 6 6 sn sn _ _ _
9 del del del s s postype=preposition|gen=m|num=s|contracted=yes postype=preposition|gen=m|num=s|contracted=yes 8 8 sp sp _ _ _
10 Palau_Firal Palau_Firal Palau_Firal n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c 9 9 sn sn _ _ _
11 durant durant durant s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 8 3 sp cc _ _ _
12 els el el d d postype=article|gen=m|num=p postype=article|gen=m|num=p 15 15 spec spec _ _ _
13 primers primer primer a a postype=ordinal|gen=m|num=p postype=ordinal|gen=m|num=p 12 12 a a _ _ _
14 cinc cinc cinc d d postype=numeral|gen=c|num=p postype=numeral|gen=c|num=p 12 12 d d _ _ _
15 mesos mes mes n n postype=common|gen=m|num=p postype=common|gen=m|num=p 11 11 sn sn _ _ _
16 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c 15 15 sp sp _ _ _
17 l' el el d d postype=article|gen=c|num=s postype=article|gen=c|num=s 18 18 spec spec _ _ _
18 any any any n n postype=common|gen=m|num=s postype=common|gen=m|num=s 16 16 sn sn _ _ _
19 . . . f f punct=period punct=period 3 3 f f _ _ _

The first sentence of the CoNLL 2009 test data:

1 El el el d d postype=article|gen=m|num=s postype=article|gen=m|num=s _ _ _ _ _
2 darrer darrer darrer a a postype=ordinal|gen=m|num=s postype=ordinal|gen=m|num=s _ _ _ _ _
3 número número número n n postype=common|gen=m|num=s postype=common|gen=m|num=s _ _ _ _ _
4 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c _ _ _ _ _
5 l' el el d d postype=article|gen=c|num=s postype=article|gen=c|num=s _ _ _ _ _
6 Observatori_del_Mercat_de_Treball_d'_Osona Observatori_del_Mercat_de_Treball_d'_Osona Observatori_del_Mercat_de_Treball_d'_Osona n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c _ _ _ _ _
7 inclou incloure incloure v v postype=main|gen=c|num=s|person=3|mood=indicative|tense=present postype=main|gen=c|num=s|person=3|mood=indicative|tense=present _ _ _ _ Y
8 un un un d d postype=numeral|gen=m|num=s postype=numeral|gen=m|num=s _ _ _ _ _
9 informe informe informe n n postype=common|gen=m|num=s postype=common|gen=m|num=s _ _ _ _ _
10 especial especial especial a a postype=qualificative|gen=c|num=s postype=qualificative|gen=c|num=s _ _ _ _ _
11 sobre sobre sobre s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c _ _ _ _ _
12 la el el d d postype=article|gen=f|num=s postype=article|gen=f|num=s _ _ _ _ _
13 contractació contractació contractació n n postype=common|gen=f|num=s postype=common|gen=f|num=s _ _ _ _ _
14 a_través_de a_través_de a_través_de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c _ _ _ _ _
15 les el el d d postype=article|gen=f|num=p postype=article|gen=f|num=p _ _ _ _ _
16 empreses empresa empresa n n postype=common|gen=f|num=p postype=common|gen=f|num=p _ _ _ _ _
17 de de de s s postype=preposition|gen=c|num=c postype=preposition|gen=c|num=c _ _ _ _ _
18 treball treball treball n n postype=common|gen=m|num=s postype=common|gen=m|num=s _ _ _ _ _
19 temporal temporal temporal a a postype=qualificative|gen=c|num=s postype=qualificative|gen=c|num=s _ _ _ _ _
20 , , , f f punct=comma punct=comma _ _ _ _ _
21 les el el d d postype=article|gen=f|num=p postype=article|gen=f|num=p _ _ _ _ _
22 ETT ETT ETT n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c _ _ _ _ _
23 . . . f f punct=period punct=period _ _ _ _ _
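In the CoNLL 2009 samples, the columns after the dependency columns carry the semantic annotation. The sketch below assumes the standard CoNLL 2009 column order (ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED, followed by one argument column per predicate in the sentence) and pairs each predicate with its arguments. The function name is made up; the data are the first five tokens of the development sentence above.

```python
def predicate_arguments(rows):
    """rows: one sentence as a list of column lists.

    Returns a list of (predicate_sense, [(argument_form, label), ...]).
    The k-th predicate in the sentence owns the k-th argument column.
    """
    preds = [r[13] for r in rows if len(r) > 13 and r[13] != "_"]
    result = [(p, []) for p in preds]
    for r in rows:
        for k, label in enumerate(r[14:]):
            if label != "_":
                result[k][1].append((r[1], label))
    return result

sample = """\
1 Fundació_Privada_Fira_de_Manresa Fundació_Privada_Fira_de_Manresa Fundació_Privada_Fira_de_Manresa n n postype=proper|gen=c|num=c postype=proper|gen=c|num=c 3 3 suj suj _ _ arg0-agt
2 ha haver haver v v postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present postype=auxiliary|gen=c|num=s|person=3|mood=indicative|tense=present 3 3 v v _ _ _
3 fet fer fer v v postype=main|gen=m|num=s|mood=pastparticiple postype=main|gen=m|num=s|mood=pastparticiple 0 0 sentence sentence Y fer.a2 _
4 un un un d d postype=numeral|gen=m|num=s postype=numeral|gen=m|num=s 5 5 spec spec _ _ _
5 balanç balanç balanç n n postype=common|gen=m|num=s postype=common|gen=m|num=s 3 3 cd cd _ _ arg1-pat
"""
rows = [line.split() for line in sample.splitlines() if line.strip()]
structure = predicate_arguments(rows)
print(structure)
```

Note that the test sample above has fewer columns than the training and development samples: the syntactic and argument columns are blank there, and only the FILLPRED flag is given.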

Parsing

Nonprojectivities in AnCora-CA are very rare. Only 487 of the 435,860 tokens in the CoNLL 2007 version are attached nonprojectively (0.11%). In the CoNLL 2009 version, there are no nonprojectivities at all.
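For reference, one common way to count nonprojectively attached tokens is sketched below: a token is counted if some token lying between it and its head is not a descendant of that head. This is an assumed definition; the exact procedure used to obtain the figure above is not stated on this page. The function name is hypothetical, and the input is assumed to be a well-formed tree (no cycles).

```python
def nonprojective_tokens(heads):
    """heads[i] is the head of token i+1, with 0 for the artificial root.

    Returns the 1-based ids of tokens whose arc to their head spans at least
    one token that the head does not dominate.
    """
    def dominated_by(node, ancestor):
        # Walk up from node towards the root looking for ancestor.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result

# HEAD column of the first CoNLL 2007 training sentence shown above.
train_heads = [2, 4, 4, 0, 7, 5, 4, 10, 10, 7, 10, 10,
               14, 7, 14, 15, 14, 19, 17, 19, 22, 20, 4]
print(nonprojective_tokens(train_heads))  # []

# Toy example: token 1 depends on token 3 across the root token 2.
print(nonprojective_tokens([3, 0, 2]))    # [1]
```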

The results of the CoNLL 2007 shared task are available online. They have been published in (Nivre et al., 2007). Compared to CoNLL 2006, the evaluation procedure was changed to include punctuation tokens. These are the best results for Catalan, sorted by UAS:

Parser (Authors)         LAS     UAS
Titov et al.             87.40   93.40
Sagae                    88.16   93.34
Malt (Nilsson et al.)    88.70   93.12
Nakagawa                 87.90   92.86
Carreras                 87.60   92.46
Malt (Hall et al.)       87.74   92.20
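As a reminder of what the two columns measure, here is a minimal sketch of labeled and unlabeled attachment score computed over all tokens, punctuation included. It is not the official evaluation script; the function name and the toy prediction are invented for illustration, while the gold heads and labels are the first four tokens of the CoNLL 2007 training sample.

```python
def attachment_scores(gold, predicted):
    """gold, predicted: lists of (head, deprel), one pair per token.

    Returns (LAS, UAS) in percent. UAS counts tokens with the correct head;
    LAS additionally requires the correct dependency label.
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")
    total = len(gold)
    uas_hits = sum(g[0] == p[0] for g, p in zip(gold, predicted))
    las_hits = sum(g == p for g, p in zip(gold, predicted))
    return 100.0 * las_hits / total, 100.0 * uas_hits / total

gold = [(2, "ESPEC"), (4, "SUJ"), (4, "AUX"), (0, "S")]
# Hypothetical parser output: one wrong label, one wrong head.
pred = [(2, "ESPEC"), (4, "CD"), (3, "AUX"), (0, "S")]
las, uas = attachment_scores(gold, pred)
print(las, uas)  # 50.0 75.0
```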

The two Malt parser results of 2007 (single malt and blended) are described in (Hall et al., 2007) and the details about the parser configuration are described here.

The results of the CoNLL 2009 shared task are available online. They have been published in (Hajič et al., 2009). Unlabeled attachment score was not published. These are the best results for Catalan:

Parser (Authors)    LAS
Merlo               87.86
Che                 86.56
Bohnet              86.35
Chen                85.88
