
Institute of Formal and Applied Linguistics Wiki






Latin (la)

Latin Dependency Treebank (LDT)

Versions

Obtaining and License

The LDT is freely available for download under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license.

LDT was created by volunteer students and researchers from around the world. It is part of the Perseus Digital Library, a project on classical languages hosted at Tufts University, Medford, MA 02155, USA.

References

Domain

Caesar: Bello Gallico Book 2 selections (50 BC); Cicero: In Catilinam 1.1-2.11 (63 BC); Jerome: Vulgate: Apocalypse (AD 400); Ovid: Metamorphoses: Book I (AD 8); Petronius: Satyricon 26-78 (Cena Trimalchionis) (AD 60); Propertius: Elegies: Book I (25 BC); Sallust: Catilina (63 BC); Vergil: Aeneid (Book 6 selections) (19 BC).

Size

LDT contains 53143 tokens in 3473 non-empty sentences, yielding 15.30 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the medium-sized file called 1999.02.0029.xml (4789 tokens / 316 sentences; Ovid: Metamorphoses) for testing and the rest (48354 tokens / 3157 sentences) for training.
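The figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted in this section; the variable names are ours and not part of any HamleDT tool.

```python
# Figures quoted in the Size section.
total_tokens, total_sentences = 53143, 3473
test_tokens, test_sentences = 4789, 316
train_tokens, train_sentences = 48354, 3157

# The test and training parts should add up to the whole treebank.
assert test_tokens + train_tokens == total_tokens
assert test_sentences + train_sentences == total_sentences

avg_all = total_tokens / total_sentences
test_share = test_tokens / total_tokens

print(f"{avg_all:.2f} tokens per sentence")             # 15.30 tokens per sentence
print(f"{test_share:.1%} of tokens held out for testing")  # 9.0% of tokens held out for testing
```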

Inside

The native file format of the treebank is based on XML. Greek letters are romanized using Beta Code, a romanization scheme widely used both within and beyond the Perseus project. Beta Code maps one-to-one onto the original Greek letters in UTF-8; however, embedded non-Greek words (such as the lemmas “comma” and “other”) cannot be identified automatically (and these should not be decoded).

Morphological annotation consists of a lemma and a nine-character positional morphosyntactic tag. Disambiguation has been done manually (gold standard).
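A minimal sketch of how the nine-character tag can be unpacked, assuming the position names used as feature labels in the CoNLL samples further down this page (pos, per, num, ten, mod, voi, gen, cas, deg). The function names `decode_postag` and `to_feats` are hypothetical helpers for illustration, not part of the treebank distribution.

```python
# Position names taken from the feature labels in the CoNLL samples on this page.
POSITIONS = ("pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg")


def decode_postag(postag):
    """Map each character of a nine-character positional tag to its position name."""
    if len(postag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {postag!r}")
    return dict(zip(POSITIONS, postag))


def to_feats(postag):
    """Render the tag as a name=value string joined by '|'."""
    return "|".join(f"{name}={value}" for name, value in decode_postag(postag).items())


print(decode_postag("n-s---mn-")["cas"])  # n
print(to_feats("v3sisa---"))
# pos=v|per=3|num=s|ten=i|mod=s|voi=a|gen=-|cas=-|deg=-
```

A hyphen in any position means that the category does not apply to the word or is left unspecified.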

The syntactic annotation style is very similar to that of the Prague Dependency Treebank (PDT). The syntactic tags (analytical functions) are almost identical, too. However, LDT permits some combined values that are not valid in PDT, e.g. ATR_AP_ExD0_APOS.

Sample

The first sentence of the corpus in its native XML format:

<?xml version="1.0"?>
<treebank version="1.5"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5"
	xsi:schemaLocation="http://nlp.perseus.tufts.edu/syntax/treebank/1.5 treebank-1.5.xsd">
	<sentence id="1" document_id="Perseus:text:1999.02.0002" subdoc="Book=2:chapter=1" span="Cum0:dare0">
		<word id="1" form="Cum" lemma="cum1" postag="c--------" head="20" relation="AuxC" />
		<word id="2" form="esset" lemma="sum1" postag="v3sisa---" head="1" relation="ADV" />
		<word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ" />
		<word id="4" form="in" lemma="in1" postag="r--------" head="2" relation="AuxP" />
		<word id="5" form="citeriore" lemma="citer1" postag="a-s---fbc" head="6" relation="ATR" />
		<word id="6" form="Gallia" lemma="Gallia1" postag="n-s---fb-" head="4" relation="ADV" />
		<word id="7" form="in" lemma="in1" postag="r--------" head="2" relation="AuxP" />
		<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV" />
		<word id="9" form="," lemma="comma1" postag="u--------" head="13" relation="AuxX" />
		<word id="10" form="ita" lemma="ita1" postag="d--------" head="2" relation="AuxY" />
		<word id="11" form="uti" lemma="uti1" postag="c--------" head="10" relation="AuxC" />
		<word id="12" form="supra" lemma="supra1" postag="d--------" head="13" relation="ADV" />
		<word id="13" form="demonstravimus" lemma="demonstro1" postag="v1pria---" head="11" relation="ADV" />
		<word id="14" form="," lemma="comma1" postag="u--------" head="13" relation="AuxX" />
		<word id="15" form="crebri" lemma="creber1" postag="a-p---mn-" head="18" relation="ATR" />
		<word id="16" form="ad" lemma="ad1" postag="r--------" head="19" relation="AuxP" />
		<word id="17" form="eum" lemma="is1" postag="p-s---ma-" head="16" relation="OBJ" />
		<word id="18" form="rumores" lemma="rumor1" postag="n-p---mn-" head="19" relation="SBJ" />
		<word id="19" form="adferebantur" lemma="affero1" postag="v3piip---" head="20" relation="PRED_CO" />
		<word id="20" form="que" lemma="que1" postag="c--------" head="0" relation="COORD" />
		<word id="21" form="litteris" lemma="littera1" postag="n-p---fb-" head="25" relation="ADV" />
		<word id="22" form="item" lemma="item1" postag="d--------" head="21" relation="AuxZ" />
		<word id="23" form="Labieni" lemma="Labienus1" postag="n-s---mg-" head="21" relation="ATR" />
		<word id="24" form="certior" lemma="certus1" postag="a-s---mnc" head="25" relation="PNOM" />
		<word id="25" form="fiebat" lemma="fio1" postag="v3s-ia---" head="20" relation="PRED_CO" />
		<word id="26" form="omnes" lemma="omnis1" postag="a-p---ma-" head="27" relation="ATR" />
		<word id="27" form="Belgas" lemma="Belgae1" postag="n-p---ma-" head="40" relation="SBJ" />
		<word id="28" form="," lemma="comma1" postag="u--------" head="34" relation="AuxX" />
		<word id="29" form="quam" lemma="qui1" postag="p-s---fa-" head="31" relation="SBJ" />
		<word id="30" form="tertiam" lemma="tertius1" postag="a-s---fa-" head="33" relation="ATR" />
		<word id="31" form="esse" lemma="sum1" postag="v--pna---" head="34" relation="OBJ" />
		<word id="32" form="Galliae" lemma="Gallia1" postag="n-s---fg-" head="33" relation="ATR" />
		<word id="33" form="partem" lemma="pars1" postag="n-s---fa-" head="31" relation="PNOM" />
		<word id="34" form="dixeramus" lemma="dico2" postag="v1plia---" head="27" relation="ATR" />
		<word id="35" form="," lemma="comma1" postag="u--------" head="34" relation="AuxX" />
		<word id="36" form="contra" lemma="contra1" postag="r--------" head="39" relation="AuxP" />
		<word id="37" form="populum" lemma="populus1" postag="n-s---ma-" head="36" relation="ADV" />
		<word id="38" form="Romanum" lemma="Romanus1" postag="a-s---ma-" head="37" relation="ATR" />
		<word id="39" form="coniurare" lemma="conjuro1" postag="v--pna---" head="40" relation="OBJ_CO" />
		<word id="40" form="que" lemma="que1" postag="c--------" head="24" relation="COORD" />
		<word id="41" form="obsides" lemma="obses1" postag="n-p---ma-" head="44" relation="OBJ" />
		<word id="42" form="inter" lemma="inter1" postag="r--------" head="44" relation="AuxP" />
		<word id="43" form="se" lemma="sui1" postag="p-p---ma-" head="42" relation="OBJ" />
		<word id="44" form="dare" lemma="do1" postag="v--pna---" head="40" relation="OBJ_CO" />
	</sentence>

The first sentence of the corpus converted to the CoNLL format:

1 Cum cum1 c c pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 20 AuxC _ _
2 esset sum1 v v pos=v|per=3|num=s|ten=i|mod=s|voi=a|gen=-|cas=-|deg=- 1 ADV _ _
3 Caesar Caesar1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=n|deg=- 2 SBJ _ _
4 in in1 r r pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 2 AuxP _ _
5 citeriore citer1 a a pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=b|deg=c 6 ATR _ _
6 Gallia Gallia1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=b|deg=- 4 ADV _ _
7 in in1 r r pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 2 AuxP _ _
8 hibernis hibernus1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=b|deg=- 7 ADV _ _
9 , comma1 u u pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 13 AuxX _ _
10 ita ita1 d d pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 2 AuxY _ _
11 uti uti1 c c pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 10 AuxC _ _
12 supra supra1 d d pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 13 ADV _ _
13 demonstravimus demonstro1 v v pos=v|per=1|num=p|ten=r|mod=i|voi=a|gen=-|cas=-|deg=- 11 ADV _ _
14 , comma1 u u pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 13 AuxX _ _
15 crebri creber1 a a pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=n|deg=- 18 ATR _ _
16 ad ad1 r r pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 19 AuxP _ _
17 eum is1 p p pos=p|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 16 OBJ _ _
18 rumores rumor1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=n|deg=- 19 SBJ _ _
19 adferebantur affero1 v v pos=v|per=3|num=p|ten=i|mod=i|voi=p|gen=-|cas=-|deg=- 20 PRED_CO _ _
20 que que1 c c pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 0 COORD _ _
21 litteris littera1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=f|cas=b|deg=- 25 ADV _ _
22 item item1 d d pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 21 AuxZ _ _
23 Labieni Labienus1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=g|deg=- 21 ATR _ _
24 certior certus1 a a pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=n|deg=c 25 PNOM _ _
25 fiebat fio1 v v pos=v|per=3|num=s|ten=-|mod=i|voi=a|gen=-|cas=-|deg=- 20 PRED_CO _ _
26 omnes omnis1 a a pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 27 ATR _ _
27 Belgas Belgae1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 40 SBJ _ _
28 , comma1 u u pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 34 AuxX _ _
29 quam qui1 p p pos=p|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=- 31 SBJ _ _
30 tertiam tertius1 a a pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=- 33 ATR _ _
31 esse sum1 v v pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=- 34 OBJ _ _
32 Galliae Gallia1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=g|deg=- 33 ATR _ _
33 partem pars1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=- 31 PNOM _ _
34 dixeramus dico2 v v pos=v|per=1|num=p|ten=l|mod=i|voi=a|gen=-|cas=-|deg=- 27 ATR _ _
35 , comma1 u u pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 34 AuxX _ _
36 contra contra1 r r pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 39 AuxP _ _
37 populum populus1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 36 ADV _ _
38 Romanum Romanus1 a a pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 37 ATR _ _
39 coniurare conjuro1 v v pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=- 40 OBJ_CO _ _
40 que que1 c c pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 24 COORD _ _
41 obsides obses1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 44 OBJ _ _
42 inter inter1 r r pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 44 AuxP _ _
43 se sui1 p p pos=p|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=- 42 OBJ _ _
44 dare do1 v v pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=- 40 OBJ_CO _ _
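The mapping between the two samples above can be sketched with Python's standard `xml.etree.ElementTree` module. This is an illustration inferred from comparing the two samples, not the actual HamleDT conversion script: the function names are hypothetical, the embedded XML is a shortened copy of the first sentence, and tab-separated output is an assumption.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the first sentence, wrapped in a closed <treebank> element.
SAMPLE = """<treebank version="1.5">
  <sentence id="1" document_id="Perseus:text:1999.02.0002" subdoc="Book=2:chapter=1" span="Cum0:dare0">
    <word id="1" form="Cum" lemma="cum1" postag="c--------" head="20" relation="AuxC" />
    <word id="2" form="esset" lemma="sum1" postag="v3sisa---" head="1" relation="ADV" />
    <word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ" />
  </sentence>
</treebank>"""

# Feature labels as they appear in the CoNLL sample.
FEATURE_NAMES = ("pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg")


def word_to_conll(word):
    """Turn one <word> element into a ten-column CoNLL row (tab-separated)."""
    tag = word.get("postag")
    feats = "|".join(f"{name}={value}" for name, value in zip(FEATURE_NAMES, tag))
    columns = [
        word.get("id"), word.get("form"), word.get("lemma"),
        tag[0], tag[0],              # coarse and fine POS: first tag character
        feats,
        word.get("head"), word.get("relation"),
        "_", "_",                    # last two columns left unspecified
    ]
    return "\t".join(columns)


def sentences_to_conll(xml_text):
    """Yield one list of CoNLL rows per <sentence> element."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        yield [word_to_conll(word) for word in sentence.iter("word")]


for rows in sentences_to_conll(SAMPLE):
    print("\n".join(rows))
    print()  # blank line between sentences
```

The sketch assumes that every word carries a non-empty `postag` attribute; real data may need extra handling for missing attributes.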

The first sentence of the HamleDT test data in the CoNLL format:

1 In in1 r r pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- 5 AuxP _ _
2 nova novus1 a a pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=a|deg=- 8 ATR _ _
3 fert fero1 v v pos=v|per=3|num=s|ten=p|mod=i|voi=a|gen=-|cas=-|deg=- 0 PRED _ _
4 animus animus1 n n pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=n|deg=- 3 SBJ _ _
5 mutatas muto1 t t pos=t|per=-|num=p|ten=r|mod=p|voi=p|gen=f|cas=a|deg=- 7 ATR _ _
6 dicere dico2 v v pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=- 3 OBJ _ _
7 formas forma1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=f|cas=a|deg=- 6 OBJ _ _
8 corpora corpus1 n n pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=a|deg=- 1 OBJ _ _

Parsing

AGDT is an extremely nonprojective treebank, exceeding the nonprojectivity level found in other treebanks by an order of magnitude. 60469 out of the total 308882 tokens are attached nonprojectively (19.58%).
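To make the notion concrete, here is a minimal sketch of how nonprojective attachments can be counted. It assumes one common definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. The page does not state which definition produced the figure above, so this choice and the function name `nonprojective_tokens` are our assumptions. The head indices come from the first sentence of the HamleDT test data shown above.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens attached nonprojectively.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    """
    n = len(heads)

    def is_descendant(node, ancestor):
        # Walk up from node; every token counts as a descendant of the root (0).
        steps = 0
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
            steps += 1
            if steps > n:
                raise ValueError("cycle in head indices")
        return ancestor == 0

    result = []
    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # The arc is nonprojective if any token inside its span escapes the head's subtree.
        if any(not is_descendant(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# In nova fert animus mutatas dicere formas corpora
heads = [5, 8, 0, 3, 7, 3, 6, 1]
crossing = nonprojective_tokens(heads)
print(crossing)                                  # [1, 2, 5, 8]
print(f"{len(crossing) / len(heads):.2%}")       # 50.00%
```

Under this definition, four of the eight tokens in this one verse are attached nonprojectively, which illustrates how free word order in poetry drives the figure up.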

I am not aware of any published evaluation of Ancient Greek parsing accuracy.

