Romanian (ro)
Versions
- online Romanian texts annotated using DGA
Obtaining and License
The syntactically annotated Romanian texts are available at http://www.phobos.ro/roric/texts/xml/. The following tcsh script downloads the corpus:
#!/bin/tcsh -f
# Fetch every annotated text (t1..t10 and tp1..tp82) as XML.
foreach i ( t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 tp1 tp2 tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 tp16 tp17 tp18 tp19 tp20 tp21 tp22 tp23 tp24 tp25 tp26 tp27 tp28 tp29 tp30 tp31 tp32 tp33 tp34 tp35 tp36 tp37 tp38 tp39 tp40 tp41 tp42 tp43 tp44 tp45 tp46 tp47 tp48 tp49 tp50 tp51 tp52 tp53 tp54 tp55 tp56 tp57 tp58 tp59 tp60 tp61 tp62 tp63 tp64 tp65 tp66 tp67 tp68 tp69 tp70 tp71 tp72 tp73 tp74 tp75 tp76 tp77 tp78 tp79 tp80 tp81 tp82 )
    wget http://www.phobos.ro/roric/texts/xml/$i.xml
end
# Fetch the DGA document type definition.
wget http://www.phobos.ro/roric/texts/xml/dga.dtd
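For readers without tcsh, the same download can be sketched in Python. This is only an illustration: the function names `corpus_file_names` and `download_corpus` are made up for this example, the file list simply mirrors the script above (t1 to t10, tp1 to tp82, plus dga.dtd), and the server is assumed to still serve the files at the same URLs.

```python
import os
from urllib.request import urlretrieve

BASE_URL = "http://www.phobos.ro/roric/texts/xml/"


def corpus_file_names():
    """Return the file names fetched by the tcsh script: t1..t10, tp1..tp82, dga.dtd."""
    names = [f"t{i}.xml" for i in range(1, 11)]
    names += [f"tp{i}.xml" for i in range(1, 83)]
    names.append("dga.dtd")
    return names


def download_corpus(target_dir="."):
    """Download every corpus file into target_dir (hypothetical helper; needs network access)."""
    os.makedirs(target_dir, exist_ok=True)
    for name in corpus_file_names():
        urlretrieve(BASE_URL + name, os.path.join(target_dir, name))


if __name__ == "__main__":
    # Listing the names is side-effect free; call download_corpus() to actually fetch them.
    print(len(corpus_file_names()), "files to download")
```

Calling `download_corpus("roric")` would place the files in a local `roric` directory.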
Licensing terms are unknown. I tend to interpret their “public availability” in the following way:
- any usage, commercial or not
- modification and redistribution under a free license permitted, provided the original source is mentioned
- citation in publications not required (but it is common decency)
The texts were prepared by members of RORIC-LING, Faculty of Mathematics and Computer Science (Facultatea de Matematica şi Informatica), University of Bucharest (Universitatea din Bucureşti), Str. Academiei nr. 14, sector 1, RO-010014, Bucureşti, Romania.
References
http://www.phobos.ro/roric/
http://www.phobos.ro/roric/Ro/dg.html
http://www.phobos.ro/roric/Ro/DGA/dga.html
http://www.phobos.ro/roric/texts/indexro.html
http://www.phobos.ro/roric/texts/xml/
- Website
  - http://ilk.uvt.nl/conll/free_data.html (CoNLL 2006)
- Data
  - no separate citation
- Principal publications
  - Susana Afonso, Eckhard Bick, Renato Haber, Diana Santos: Floresta sintá(c)tica: um treebank para o português. In: Encontro da associação portuguesa de linguística, XVII, Lisboa, 2001.
  - Cláudia Freitas, Paulo Rocha, Eckhard Bick: Um mundo novo na Floresta Sintá(c)tica - o treebank para Português. Calidoscópio - Revista de Pós Graduação em Lingüística Aplicada da Unisinos, Rio Grande do Sul 6.3 (2008), pp. 142-148.
- Documentation
  - Cláudia Freitas, Susana Afonso: Bíblia Florestal: Um manual lingüístico da Floresta Sintá(c)tica, 2008
  - Glossário de etiquetas florestais (glossary of tags)
Domain
Newspaper. Bosque contains 9368 sentences, mostly from two primary sources: CETENFolha (Corpus de Extractos de Textos Electrónicos NILC/Folha de São Paulo, texts from the Brazilian newspaper Folha de São Paulo, year 1994) and CETEMPúblico (Corpus de Extractos de Textos Electrónicos MCT/Público, texts from the European Portuguese newspaper Público, April 2000).
Size
The CoNLL 2006 version contains 212,545 tokens in 9359 sentences, yielding 22.71 tokens per sentence on average (CoNLL 2006 data split: 206,678 tokens / 9071 sentences training, 5867 tokens / 288 sentences test).
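The figures above can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers quoted in this section; the variable names are chosen for illustration.

```python
# Token and sentence counts of the CoNLL 2006 split, as quoted above.
train_tokens, train_sentences = 206_678, 9071
test_tokens, test_sentences = 5867, 288

total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences
tokens_per_sentence = total_tokens / total_sentences

print(total_tokens, total_sentences, round(tokens_per_sentence, 2))
# prints: 212545 9359 22.71
```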
Inside
The corpus contains texts from Portugal and Brazil. The texts were automatically parsed with the PALAVRAS parser (Bick 2000) and then revised by linguists. The Bosque part, which is the one described here, was fully revised; the other parts of the Floresta sintá(c)tica project were revised only partially or not at all. Bick 2000: Eckhard Bick. The Parsing System “Palavras”: Automatic Grammatical Analysis of Portuguese in a Constraint Grammar Framework. Dr.phil. thesis. Aarhus University. Aarhus, Denmark: Aarhus University Press. November 2000.
Morphological annotation includes lemmas. In the CoNLL version, the original Floresta tags were converted to fit the CPOS, POS and FEAT columns of the CoNLL format. Use DZ Interset to inspect the CoNLL tagset.
Multi-word expressions have been concatenated into one token, using underscore as the joining character (e.g. “7_e_Meio”, “Hillary_Clinton”).
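If the individual words of a multi-word expression are needed, the concatenation can be undone by splitting on the underscore. The helper below is a minimal sketch and `split_multiword` is a name invented here. It assumes that every underscore inside a form is a joining character, which cannot be verified from the data alone.

```python
def split_multiword(form):
    """Split a concatenated multi-word token on underscores.

    A form consisting only of underscores (such as the "_" placeholder)
    is returned unchanged as a single-item list.
    """
    parts = [part for part in form.split("_") if part]
    return parts or [form]


print(split_multiword("7_e_Meio"))         # ['7', 'e', 'Meio']
print(split_multiword("Hillary_Clinton"))  # ['Hillary', 'Clinton']
print(split_multiword("noite"))            # ['noite']
```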
Sample
The first two sentences of the CoNLL 2006 training data:
1 | Um | um | art | art | <arti>|M|S | 2 | >N | _ | _ |
2 | revivalismo | revivalismo | n | n | M|S | 0 | UTT | _ | _ |
3 | refrescante | refrescante | adj | adj | M|S | 2 | N< | _ | _ |
1 | O | o | art | art | <artd>|M|S | 2 | >N | _ | _ |
2 | 7_e_Meio | 7_e_Meio | prop | prop | M|S | 3 | SUBJ | _ | _ |
3 | é | ser | v | v-fin | PR|3S|IND | 0 | STA | _ | _ |
4 | um | um | art | art | <arti>|M|S | 5 | >N | _ | _ |
5 | ex-libris | ex-libris | n | n | M|P | 3 | SC | _ | _ |
6 | de | de | prp | prp | <sam-> | 5 | N< | _ | _ |
7 | a | o | art | art | <-sam>|<artd>|S | 8 | >N | _ | _ |
8 | noite | noite | n | n | F|S | 6 | P< | _ | _ |
9 | algarvia | algarvio | adj | adj | F|S | 8 | N< | _ | _ |
10 | . | . | punc | punc | _ | 3 | PUNC | _ | _ |
The first two sentences of the CoNLL 2006 test data:
1 | É | é | adv | adv | <foc> | 9 | FOC | _ | _ |
2 | por | por | prp | prp | _ | 9 | ADVL | _ | _ |
3 | isso | isso | pron | pron-indp | <dem>|M|S | 2 | P< | _ | _ |
4 | que | que | adv | adv | <foc> | 9 | FOC | _ | _ |
5 | , | , | punc | punc | _ | 6 | PUNC | _ | _ |
6 | explica | explicar | v | v-fin | PR|3S|IND | 0 | STA | _ | _ |
7 | , | , | punc | punc | _ | 6 | PUNC | _ | _ |
8 | não | não | adv | adv | _ | 9 | ADVL | _ | _ |
9 | tem | ter | v | v-fin | PR|3S|IND | 6 | ACC | _ | _ |
10 | pena | pena | n | n | F|S | 9 | ACC | _ | _ |
11 | de | de | prp | prp | _ | 10 | N< | _ | _ |
12 | Hillary_Clinton | Hillary_Clinton | prop | prop | F|S | 11 | P< | _ | _ |
13 | . | . | punc | punc | _ | 6 | PUNC | _ | _ |
1 | « | « | punc | punc | _ | 8 | PUNC | _ | _ |
2 | Eles | ele | pron | pron-pers | M|3P|NOM | 8 | SUBJ | _ | _ |
3 | [ | [ | punc | punc | _ | 8 | PUNC | _ | _ |
4 | Hillary | Hillary | prop | prop | F|S | 9 | APP | _ | _ |
5 | e | e | conj | conj-c | <co-app> | 4 | CO | _ | _ |
6 | Bill_Clinton | Bill_Clinton | prop | prop | M|S | 4 | CJT | _ | _ |
7 | ] | ] | punc | punc | _ | 8 | PUNC | _ | _ |
8 | podem | poder | v | v-fin | PR|3P|IND | 0 | QUE | _ | _ |
9 | ter | ter | v | v-inf | _ | 8 | MV | _ | _ |
10 | alguma | algum | pron | pron-det | <quant>|F|S | 11 | >N | _ | _ |
11 | espécie | espécie | n | n | F|S | 9 | ACC | _ | _ |
12 | de | de | prp | prp | _ | 11 | N< | _ | _ |
13 | acordo | acordo | n | n | M|S | 12 | P< | _ | _ |
14 | e | e | conj | conj-c | <co-vfin>|<co-fmc> | 8 | CO | _ | _ |
15 | quem | quem | pron | pron-indp | <interr>|M/F|P | 16 | SC | _ | _ |
16 | somos | ser | v | v-fin | PR|1P|IND | 8 | CJT | _ | _ |
17 | nós | nós | pron | pron-pers | M/F|1P|NOM | 16 | SUBJ | _ | _ |
18 | para | para | prp | prp | _ | 16 | ADVL | _ | _ |
19 | dizer | dizer | v | v-inf | _ | 18 | P< | _ | _ |
20 | se | se | conj | conj-s | _ | 21 | SUB | _ | _ |
21 | é | ser | v | v-fin | PR|3S|IND | 19 | ACC | _ | _ |
22 | bom | bom | adj | adj | M|S | 21 | SC | _ | _ |
23 | ou | ou | conj | conj-c | <co-sc> | 22 | CO | _ | _ |
24 | mau | mau | adj | adj | M|S | 22 | CJT | _ | _ |
25 | ? | ? | punc | punc | _ | 8 | PUNC | _ | _ |
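The rows above follow the ten-column CoNLL 2006 layout (ID, FORM, LEMMA, CPOS, POS, FEAT, HEAD, DEPREL, PHEAD, PDEPREL). The following sketch shows one way to read such data in Python. It assumes the usual file conventions of the shared task, namely tab-separated columns and a blank line between sentences, rather than the " | " separators used for display on this page; `Token` and `parse_conll` are names chosen for this example.

```python
from collections import namedtuple

Token = namedtuple(
    "Token", "id form lemma cpos pos feats head deprel phead pdeprel"
)


def parse_conll(text):
    """Parse tab-separated CoNLL 2006 text into a list of sentences (lists of Token)."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        # FEAT is a '|'-separated list; '_' means no features.
        feats = [] if cols[5] == "_" else cols[5].split("|")
        current.append(
            Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                  feats, int(cols[6]), cols[7], cols[8], cols[9])
        )
    if current:
        sentences.append(current)
    return sentences


# First training sentence from the sample, rewritten with tab separators.
sample = "\n".join([
    "1\tUm\tum\tart\tart\t<arti>|M|S\t2\t>N\t_\t_",
    "2\trevivalismo\trevivalismo\tn\tn\tM|S\t0\tUTT\t_\t_",
    "3\trefrescante\trefrescante\tadj\tadj\tM|S\t2\tN<\t_\t_",
])

for token in parse_conll(sample)[0]:
    print(token.id, token.form, token.feats, token.head, token.deprel)
```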
Parsing
Bosque is a mildly nonprojective treebank. 2778 of the 212,545 tokens in the CoNLL 2006 version are attached nonprojectively (1.31%).
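To make the notion concrete, the sketch below counts nonprojectively attached tokens in one sentence. It uses a common definition, which is an assumption here because the page does not state how the 2778 figure was computed: a token is attached nonprojectively if some token lying between it and its head is not dominated by that head. The function name is illustrative, and the input is assumed to be a well-formed tree. The example heads are those of the first test sentence shown in the sample.

```python
def nonprojective_tokens(heads):
    """Return the 1-based IDs of tokens whose incoming arc is nonprojective.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    def dominated_by(node, ancestor):
        # Walk up from node towards the root, looking for ancestor.
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        if any(not dominated_by(k, head) for k in range(low + 1, high)):
            result.append(dependent)
    return result


# HEAD column of the first test sentence ("É por isso que , explica , ...").
test_sentence_heads = [9, 9, 2, 9, 6, 0, 6, 9, 6, 9, 10, 11, 6]
print(nonprojective_tokens(test_sentence_heads))  # [1, 2, 4]

# Corpus-level share quoted above.
print(round(100 * 2778 / 212_545, 2))  # 1.31
```

Tokens 1, 2 and 4 attach to token 9 across the parenthetical "explica" (token 6), which token 9 does not dominate.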
The results of the CoNLL 2006 shared task are available online and were published in Buchholz and Marsi (2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Portuguese:
Parser (Authors) | LAS | UAS |
---|---|---|
MST (McDonald et al.) | 86.82 | 91.36 |
Malt (Nivre et al.) | 87.60 | 91.22 |
Nara (Yuchang Cheng) | 85.07 | 90.30 |
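The effect of excluding punctuation from scoring can be sketched as follows. This is not the official evaluation script. It assumes that a token counts as punctuation when all characters of its form belong to a Unicode punctuation category, and the example sentence with its predicted analysis is hypothetical. LAS is the percentage of scored tokens with the correct head and label, UAS the percentage with the correct head.

```python
import unicodedata


def is_punctuation(form):
    """True if every character of form is in a Unicode punctuation category (P*)."""
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(forms, gold, predicted):
    """Return (LAS, UAS) in percent, skipping punctuation tokens.

    gold and predicted are lists of (head, deprel) pairs aligned with forms.
    """
    scored = las_hits = uas_hits = 0
    for form, (g_head, g_rel), (p_head, p_rel) in zip(forms, gold, predicted):
        if is_punctuation(form):
            continue
        scored += 1
        if g_head == p_head:
            uas_hits += 1
            if g_rel == p_rel:
                las_hits += 1
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100.0 * las_hits / scored, 100.0 * uas_hits / scored


# Hypothetical example: the final "." and the predictions are made up.
forms = ["Um", "revivalismo", "refrescante", "."]
gold = [(2, ">N"), (0, "UTT"), (2, "N<"), (2, "PUNC")]
predicted = [(2, ">N"), (0, "UTT"), (2, "ADVL"), (1, "PUNC")]

las, uas = attachment_scores(forms, gold, predicted)
print(f"LAS {las:.2f}  UAS {uas:.2f}")  # LAS 66.67  UAS 100.00
```

The wrongly attached full stop does not lower either score, because it is never counted.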