Romanian (ro)
Versions
- online Romanian texts annotated using DGA
Obtaining and License
The syntactically annotated Romanian texts are available at http://www.phobos.ro/roric/texts/xml/. This is a tcsh script that will download the corpus:
#!/bin/tcsh -f
foreach i ( t1 t2 t3 t4 t5 t6 t7 t8 t9 t10 tp1 tp2 tp3 tp4 tp5 tp6 tp7 tp8 tp9 tp10 tp11 tp12 tp13 tp14 tp15 tp16 tp17 tp18 tp19 tp20 tp21 tp22 tp23 tp24 tp25 tp26 tp27 tp28 tp29 tp30 tp31 tp32 tp33 tp34 tp35 tp36 tp37 tp38 tp39 tp40 tp41 tp42 tp43 tp44 tp45 tp46 tp47 tp48 tp49 tp50 tp51 tp52 tp53 tp54 tp55 tp56 tp57 tp58 tp59 tp60 tp61 tp62 tp63 tp64 tp65 tp66 tp67 tp68 tp69 tp70 tp71 tp72 tp73 tp74 tp75 tp76 tp77 tp78 tp79 tp80 tp81 tp82 )
  wget http://www.phobos.ro/roric/texts/xml/$i.xml
end
wget http://www.phobos.ro/roric/texts/xml/dga.dtd
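The file names in the script follow a regular pattern (t1 to t10 and tp1 to tp82), so the list can be generated instead of typed out. The following is a minimal Python sketch of the same download, assuming the server still serves the files at the same address; the function names `corpus_urls` and `download_corpus` are hypothetical and not part of any existing tool.

```python
import os
import urllib.request

BASE_URL = "http://www.phobos.ro/roric/texts/xml/"


def corpus_urls():
    """Return the URLs of the 92 corpus files followed by the DTD."""
    names = [f"t{i}" for i in range(1, 11)] + [f"tp{i}" for i in range(1, 83)]
    return [f"{BASE_URL}{name}.xml" for name in names] + [BASE_URL + "dga.dtd"]


def download_corpus(target_dir="."):
    """Fetch every file into target_dir (needs network access)."""
    os.makedirs(target_dir, exist_ok=True)
    for url in corpus_urls():
        filename = url.rsplit("/", 1)[1]
        urllib.request.urlretrieve(url, os.path.join(target_dir, filename))
```

Calling `download_corpus("ro")` would place the files in a directory named `ro`; nothing is downloaded until the function is called.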
Licensing terms are unknown. I tend to interpret their “public availability” in the following way:
- any usage, commercial or not
- modification and redistribution under a free license permitted, provided the original source is mentioned
- citation in publications not required (but it is common decency)
The texts were prepared by members of RORIC-LING, Faculty of Mathematics and Computer Science (Facultatea de Matematica şi Informatica), University of Bucharest (Universitatea din Bucureşti), Str. Academiei nr. 14, sector 1, RO-010014, Bucureşti, Romania.
References
- Website
- Data
- no separate citation
- Principal publications
- Florentina Hristea, Marius Popescu: Gramatici de dependenţă şi gramatici WG, pp. 233-246.
- Documentation
Domain
Newspaper.
Size
The corpus contains 36150 tokens in 4042 clauses, yielding 8.94 tokens per clause on average. There is no official training-test data split. We use the files t1.xml to t10.xml (2640 tokens / 266 clauses) for testing and the rest (33510 tokens / 3776 clauses) for training in our HamleDT experiments.
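The figures above can be cross-checked with a few lines of Python. This is only a sketch: the counts are copied from the text, and the helper `split_of`, which assigns a file to the test or training part by its name, is a hypothetical illustration of the split described here rather than the actual HamleDT code.

```python
import re

# Counts as reported in the text.
TOTAL_TOKENS, TOTAL_CLAUSES = 36150, 4042
TEST_TOKENS, TEST_CLAUSES = 2640, 266
TRAIN_TOKENS, TRAIN_CLAUSES = 33510, 3776


def average_clause_length(tokens, clauses):
    """Average number of tokens per clause, rounded to two decimals."""
    return round(tokens / clauses, 2)


def split_of(filename):
    """Return 'test' for t1.xml to t10.xml and 'train' for the tp*.xml files."""
    match = re.fullmatch(r"(t|tp)(\d+)\.xml", filename)
    if match is None:
        raise ValueError(f"not a corpus file: {filename}")
    return "test" if match.group(1) == "t" else "train"


# The two parts add up to the whole corpus.
assert TEST_TOKENS + TRAIN_TOKENS == TOTAL_TOKENS
assert TEST_CLAUSES + TRAIN_CLAUSES == TOTAL_CLAUSES
print(average_clause_length(TOTAL_TOKENS, TOTAL_CLAUSES))
```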
Inside
Sentences have been segmented into clauses and there is a separate tree for each clause. There are no punctuation nodes; punctuation has been removed. The text lacks diacritical marks, i.e. the Romanian letters ă, â, î, ş, ţ have been replaced by a, a, i, s, t respectively.
There are part-of-speech tags but no lemmas and no morphological features (gender, number, case etc.). The part-of-speech tags, like the syntactic structure, were probably assigned manually.
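To compare other Romanian text with this corpus, the same diacritic replacement has to be applied to it. Below is a minimal Python sketch of that mapping. The five lowercase letters come from the description above; the uppercase letters and the comma-below forms ș and ț are an added assumption, and the name `strip_diacritics` is hypothetical.

```python
# Letters named in the description: ă, â, î, ş, ţ -> a, a, i, s, t.
# Written as escapes so the table does not depend on the file's normalization.
STRIP_DIACRITICS = str.maketrans({
    "\u0103": "a", "\u00e2": "a", "\u00ee": "i", "\u015f": "s", "\u0163": "t",
    # Assumption: uppercase counterparts are treated the same way.
    "\u0102": "A", "\u00c2": "A", "\u00ce": "I", "\u015e": "S", "\u0162": "T",
    # Assumption: comma-below s and t, used in current Romanian orthography.
    "\u0219": "s", "\u021b": "t", "\u0218": "S", "\u021a": "T",
})


def strip_diacritics(text):
    """Replace Romanian letters with diacritics by their base letters."""
    return text.translate(STRIP_DIACRITICS)


# "Resita" with cedilla s and t, as the word would be written with diacritics.
print(strip_diacritics("Re\u015fi\u0163a"))
```

The mapping is lossy: the original spelling cannot be restored from the corpus text without a lexicon.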
Sample
The first sentence of our training data:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE DGAdoc SYSTEM "dga.dtd">
<DGAdoc>
<s>
<tok> <orth>Marii</orth> <ordno>1</ordno> <ctag>adjectiv</ctag> <syn> <head>2</head> <reltype>atribut adj.</reltype> </syn> </tok>
<tok> <orth>bancheri</orth> <ordno>2</ordno> <ctag>substantiv</ctag> <syn> <head>6</head> <reltype>subiect</reltype> </syn> </tok>
<tok> <orth>ai</orth> <ordno>3</ordno> <ctag>art. poses.</ctag> <syn> <head>2</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>lumii</orth> <ordno>4</ordno> <ctag>substantiv</ctag> <syn> <head>3</head> <reltype>rel. poses.</reltype> </syn> </tok>
<tok> <orth>au</orth> <ordno>5</ordno> <ctag>verb aux.</ctag> <syn> <head>6</head> <reltype>rel. aux.</reltype> </syn> </tok>
<tok> <orth>reusit</orth> <ordno>6</ordno> <ctag>verb</ctag> <syn> <head>7</head> <reltype>predicat</reltype> </syn> </tok>
</s>
The first sentence of our test data:
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE DGAdoc SYSTEM "dga.dtd">
<DGAdoc>
<s>
<tok> <orth>Judecatorul</orth> <ordno>1</ordno> <ctag>substantiv</ctag> <syn> <head>9</head> <reltype>subiect</reltype> </syn> </tok>
<tok> <orth>sindic</orth> <ordno>2</ordno> <ctag>adjectiv</ctag> <syn> <head>1</head> <reltype>atribut adj.</reltype> </syn> </tok>
<tok> <orth>IonBriac</orth> <ordno>3</ordno> <ctag>substantiv</ctag> <syn> <head>1</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>de</orth> <ordno>4</ordno> <ctag>prepozitie</ctag> <syn> <head>1</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>la</orth> <ordno>5</ordno> <ctag>prepozitie</ctag> <syn> <head>4</head> <reltype>rel. prepoz.</reltype> </syn> </tok>
<tok> <orth>Tribunalul</orth> <ordno>6</ordno> <ctag>substantiv</ctag> <syn> <head>4</head> <reltype>rel. prepoz.</reltype> </syn> </tok>
<tok> <orth>CarasSeverin</orth> <ordno>7</ordno> <ctag>substantiv</ctag> <syn> <head>6</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>a</orth> <ordno>8</ordno> <ctag>verb aux.</ctag> <syn> <head>9</head> <reltype>rel. aux.</reltype> </syn> </tok>
<tok> <orth>decis</orth> <ordno>9</ordno> <ctag>verb</ctag> <syn> <head>19</head> <reltype>predicat</reltype> </syn> </tok>
<tok> <orth>ieri</orth> <ordno>10</ordno> <ctag>adverb</ctag> <syn> <head>9</head> <reltype>complement circumst.</reltype> </syn> </tok>
<tok> <orth>declansarea</orth> <ordno>11</ordno> <ctag>substantiv</ctag> <syn> <head>9</head> <reltype>complement dir.</reltype> </syn> </tok>
<tok> <orth>reorganizarii</orth> <ordno>12</ordno> <ctag>substantiv</ctag> <syn> <head>11</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>judiciare</orth> <ordno>13</ordno> <ctag>adjectiv</ctag> <syn> <head>12</head> <reltype>atribut adj.</reltype> </syn> </tok>
<tok> <orth>a</orth> <ordno>14</ordno> <ctag>art. poses.</ctag> <syn> <head>12</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>Combinatului</orth> <ordno>15</ordno> <ctag>substantiv</ctag> <syn> <head>14</head> <reltype>rel. poses.</reltype> </syn> </tok>
<tok> <orth>Siderurgic</orth> <ordno>16</ordno> <ctag>adjectiv</ctag> <syn> <head>15</head> <reltype>atribut adj.</reltype> </syn> </tok>
<tok> <orth>din</orth> <ordno>17</ordno> <ctag>prepozitie</ctag> <syn> <head>15</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>Resita</orth> <ordno>18</ordno> <ctag>substantiv</ctag> <syn> <head>17</head> <reltype>rel. prepoz.</reltype> </syn> </tok>
</s>
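The DGA format shown in the samples is simple enough to read with a standard XML library. Below is a minimal Python sketch that turns each `<s>` clause into a list of token records. It embeds the first training clause, with the XML declaration and DOCTYPE omitted and a closing `</DGAdoc>` added so that the fragment is well-formed on its own; the function name `read_clauses` and the dictionary keys are hypothetical. Whether the real files parse without the DTD being present has not been verified here.

```python
import xml.etree.ElementTree as ET

# First training clause from the sample, closed so that it parses standalone.
SAMPLE = """<DGAdoc> <s>
<tok> <orth>Marii</orth> <ordno>1</ordno> <ctag>adjectiv</ctag> <syn> <head>2</head> <reltype>atribut adj.</reltype> </syn> </tok>
<tok> <orth>bancheri</orth> <ordno>2</ordno> <ctag>substantiv</ctag> <syn> <head>6</head> <reltype>subiect</reltype> </syn> </tok>
<tok> <orth>ai</orth> <ordno>3</ordno> <ctag>art. poses.</ctag> <syn> <head>2</head> <reltype>atribut subst.</reltype> </syn> </tok>
<tok> <orth>lumii</orth> <ordno>4</ordno> <ctag>substantiv</ctag> <syn> <head>3</head> <reltype>rel. poses.</reltype> </syn> </tok>
<tok> <orth>au</orth> <ordno>5</ordno> <ctag>verb aux.</ctag> <syn> <head>6</head> <reltype>rel. aux.</reltype> </syn> </tok>
<tok> <orth>reusit</orth> <ordno>6</ordno> <ctag>verb</ctag> <syn> <head>7</head> <reltype>predicat</reltype> </syn> </tok>
</s> </DGAdoc>"""


def read_clauses(root):
    """Yield every <s> clause under root as a list of token dictionaries."""
    for clause in root.iter("s"):
        yield [
            {
                "orth": tok.findtext("orth"),
                "ordno": int(tok.findtext("ordno")),
                "ctag": tok.findtext("ctag"),
                "head": int(tok.findtext("syn/head")),
                "reltype": tok.findtext("syn/reltype"),
            }
            for tok in clause.iter("tok")
        ]


clauses = list(read_clauses(ET.fromstring(SAMPLE)))
for tok in clauses[0]:
    print(tok["ordno"], tok["orth"], tok["ctag"], tok["head"], tok["reltype"], sep="\t")
```

For a downloaded file, `ET.parse(path).getroot()` would take the place of `ET.fromstring(SAMPLE)`. In both samples the head of the predicate is one greater than the number of tokens in the clause; the sketch keeps that value unchanged.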
Parsing
The corpus is projective.
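A tree is projective if, for every dependency, all tokens lying between the head and the dependent are descendants of that head. The following Python sketch checks this property for one clause given its list of head indices. It assumes that a head value outside the range 1 to n marks the root of the clause, as in the samples, where the predicate points to n+1; the function name `is_projective` is hypothetical.

```python
def is_projective(heads):
    """heads[i] is the 1-based head of token i+1.

    Assumption: a head outside 1..len(heads) marks the root of the clause.
    """
    n = len(heads)

    def ancestors(token):
        """Collect all ancestors of a token that lie inside the clause."""
        seen = set()
        head = heads[token - 1]
        while 1 <= head <= n and head not in seen:
            seen.add(head)
            head = heads[head - 1]
        return seen

    ancestor_sets = {token: ancestors(token) for token in range(1, n + 1)}
    for dependent in range(1, n + 1):
        head = heads[dependent - 1]
        if not 1 <= head <= n:
            continue  # root token: no arc inside the clause
        low, high = sorted((head, dependent))
        for between in range(low + 1, high):
            if head not in ancestor_sets[between]:
                return False
    return True


# Head indices of the two sample clauses shown above.
train_sample = [2, 6, 2, 3, 6, 7]
test_sample = [9, 1, 1, 1, 4, 4, 6, 9, 19, 9, 9, 11, 12, 12, 14, 15, 15, 17]
print(is_projective(train_sample), is_projective(test_sample))
```

Running such a check over all clauses would be one way to confirm the claim for the whole corpus; the sketch only covers the two samples.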
I am not aware of any published evaluation of parsing accuracy on this data.