user:zeman:treebanks:la [ufal wiki]

Latin (la)

Versions

LDT 1.5

Obtaining and License

The LDT is freely downloadable from the treebank website (see References) under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license. The license in short:

- non-commercial usage
- modification and redistribution permitted under the same license
- linking to the treebank website and citing the principal publication required in publications

LDT was created by volunteer students and researchers from around the world. It is part of the Perseus Digital Library, a project on classical languages hosted at Tufts University, Medford, MA 02155, USA.

References

Website
- http://nlp.perseus.tufts.edu/syntax/treebank/
Data
- no separate citation
Principal publications
- David Bamman, Gregory Crane: The Ancient Greek and Latin Dependency Treebanks. In: Caroline Sporleder, Antal van den Bosch, Kalliopi Zervanou (eds.): Language Technology for Cultural Heritage, ser. Foundations of Human Language Processing and Technology. Springer, Berlin / Heidelberg, Germany, 2011.
Documentation
- XML schema for the file format (also included in the distribution)
- David Bamman, Marco Passarotti, Gregory Crane, Savina Raynaud: Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3). Tufts University, Medford, Massachusetts, USA, 2007.
- Morphosyntactic tags are described in the README file.

Domain

- Caesar: Bello Gallico Book 2 selections (50 BC)
- Cicero: In Catilinam 1.1-2.11 (63 BC)
- Jerome: Vulgate: Apocalypse (AD 400)
- Ovid: Metamorphoses: Book I (AD 8)
- Petronius: Satyricon 26-78 (Cena Trimalchionis) (AD 60)
- Propertius: Elegies: Book I (25 BC)
- Sallust: Catilina (63 BC)
- Vergil: Aeneid (Book 6 selections) (19 BC)

Size

LDT contains 53143 tokens in 3473 non-empty sentences, yielding 15.30 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the medium-sized file called 1999.02.0029.xml (4789 tokens / 316 sentences; Ovid: Metamorphoses) for testing and the rest (48354 tokens / 3157 sentences) for training.
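The counts above can be reproduced with a few lines of Python. The following is a minimal sketch, assuming the element and attribute names visible in the XML sample on this page (`sentence`, `word`); the function name and the inline sample document are hypothetical and only illustrate the counting rule (empty sentences are skipped).

```python
import xml.etree.ElementTree as ET


def count_sentences_and_tokens(xml_text):
    """Return (non-empty sentence count, token count) for one treebank XML document."""
    root = ET.fromstring(xml_text)
    sentences = 0
    tokens = 0
    for sentence in root.iter("sentence"):
        words = sentence.findall("word")
        if words:  # sentences without any <word> are not counted
            sentences += 1
            tokens += len(words)
    return sentences, tokens


# Hypothetical two-sentence document; the second sentence is empty.
SAMPLE = """<treebank version="1.5">
  <sentence id="1">
    <word id="1" form="fert" lemma="fero1" postag="v3spia---" head="0" relation="PRED" />
    <word id="2" form="animus" lemma="animus1" postag="n-s---mn-" head="1" relation="SBJ" />
  </sentence>
  <sentence id="2" />
</treebank>"""

sentences, tokens = count_sentences_and_tokens(SAMPLE)
print(sentences, tokens, tokens / sentences)  # 1 2 2.0
```

Summing the results over all files of the distribution should give the totals quoted above; the test and training portions add up to them (4789 + 48354 = 53143 tokens, 316 + 3157 = 3473 sentences).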

Inside

The native file format of the treebank is based on XML. Greek letters, where they occur, are romanized using Beta Code, a romanization scheme widely used in the Perseus project and beyond. Beta Code can be mapped one-to-one onto the original Greek letters in UTF-8; however, embedded non-Greek words (such as the lemmas “comma” and “other”) cannot be identified automatically, and we do not want to decode them.

Morphological annotation consists of a lemma and a nine-character positional morphosyntactic tag. Disambiguation has been done manually (gold standard).
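A positional tag can be unpacked character by character. The sketch below assumes the nine position names that appear in the CoNLL samples on this page (pos, per, num, ten, mod, voi, gen, cas, deg); the README in the distribution remains the authoritative description, and the function name is hypothetical.

```python
# Position names as they appear in the CoNLL samples on this page (assumption:
# they follow the order of the nine characters in the tag).
POSITIONS = ("pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg")


def decode_postag(postag):
    """Map a nine-character positional tag to a dict, dropping unspecified ('-') positions."""
    if len(postag) != len(POSITIONS):
        raise ValueError(f"expected {len(POSITIONS)} characters, got {len(postag)}: {postag!r}")
    return {name: value for name, value in zip(POSITIONS, postag) if value != "-"}


print(decode_postag("v3sisa---"))
# {'pos': 'v', 'per': '3', 'num': 's', 'ten': 'i', 'mod': 's', 'voi': 'a'}
```

Dropping the `-` positions keeps only the features that are actually specified for a given word, e.g. `decode_postag("n-s---mn-")` yields part of speech, number, gender and case.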

The syntactic annotation style is very similar to that of the Prague Dependency Treebank (PDT). The syntactic tags (analytical functions) are almost identical, too. However, some combined values are permitted that are not valid in PDT, e.g. ATR_AP_ExD0_APOS.

Sample

The first sentence of the corpus in its native XML format:

<?xml version="1.0"?>
<treebank version="1.5"
	xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
	xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5"
	xsi:schemaLocation="http://nlp.perseus.tufts.edu/syntax/treebank/1.5 treebank-1.5.xsd">
	<sentence id="1" document_id="Perseus:text:1999.02.0002" subdoc="Book=2:chapter=1" span="Cum0:dare0">
		<word id="1" form="Cum" lemma="cum1" postag="c--------" head="20" relation="AuxC" />
		<word id="2" form="esset" lemma="sum1" postag="v3sisa---" head="1" relation="ADV" />
		<word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ" />
		<word id="4" form="in" lemma="in1" postag="r--------" head="2" relation="AuxP" />
		<word id="5" form="citeriore" lemma="citer1" postag="a-s---fbc" head="6" relation="ATR" />
		<word id="6" form="Gallia" lemma="Gallia1" postag="n-s---fb-" head="4" relation="ADV" />
		<word id="7" form="in" lemma="in1" postag="r--------" head="2" relation="AuxP" />
		<word id="8" form="hibernis" lemma="hibernus1" postag="n-p---nb-" head="7" relation="ADV" />
		<word id="9" form="," lemma="comma1" postag="u--------" head="13" relation="AuxX" />
		<word id="10" form="ita" lemma="ita1" postag="d--------" head="2" relation="AuxY" />
		<word id="11" form="uti" lemma="uti1" postag="c--------" head="10" relation="AuxC" />
		<word id="12" form="supra" lemma="supra1" postag="d--------" head="13" relation="ADV" />
		<word id="13" form="demonstravimus" lemma="demonstro1" postag="v1pria---" head="11" relation="ADV" />
		<word id="14" form="," lemma="comma1" postag="u--------" head="13" relation="AuxX" />
		<word id="15" form="crebri" lemma="creber1" postag="a-p---mn-" head="18" relation="ATR" />
		<word id="16" form="ad" lemma="ad1" postag="r--------" head="19" relation="AuxP" />
		<word id="17" form="eum" lemma="is1" postag="p-s---ma-" head="16" relation="OBJ" />
		<word id="18" form="rumores" lemma="rumor1" postag="n-p---mn-" head="19" relation="SBJ" />
		<word id="19" form="adferebantur" lemma="affero1" postag="v3piip---" head="20" relation="PRED_CO" />
		<word id="20" form="que" lemma="que1" postag="c--------" head="0" relation="COORD" />
		<word id="21" form="litteris" lemma="littera1" postag="n-p---fb-" head="25" relation="ADV" />
		<word id="22" form="item" lemma="item1" postag="d--------" head="21" relation="AuxZ" />
		<word id="23" form="Labieni" lemma="Labienus1" postag="n-s---mg-" head="21" relation="ATR" />
		<word id="24" form="certior" lemma="certus1" postag="a-s---mnc" head="25" relation="PNOM" />
		<word id="25" form="fiebat" lemma="fio1" postag="v3s-ia---" head="20" relation="PRED_CO" />
		<word id="26" form="omnes" lemma="omnis1" postag="a-p---ma-" head="27" relation="ATR" />
		<word id="27" form="Belgas" lemma="Belgae1" postag="n-p---ma-" head="40" relation="SBJ" />
		<word id="28" form="," lemma="comma1" postag="u--------" head="34" relation="AuxX" />
		<word id="29" form="quam" lemma="qui1" postag="p-s---fa-" head="31" relation="SBJ" />
		<word id="30" form="tertiam" lemma="tertius1" postag="a-s---fa-" head="33" relation="ATR" />
		<word id="31" form="esse" lemma="sum1" postag="v--pna---" head="34" relation="OBJ" />
		<word id="32" form="Galliae" lemma="Gallia1" postag="n-s---fg-" head="33" relation="ATR" />
		<word id="33" form="partem" lemma="pars1" postag="n-s---fa-" head="31" relation="PNOM" />
		<word id="34" form="dixeramus" lemma="dico2" postag="v1plia---" head="27" relation="ATR" />
		<word id="35" form="," lemma="comma1" postag="u--------" head="34" relation="AuxX" />
		<word id="36" form="contra" lemma="contra1" postag="r--------" head="39" relation="AuxP" />
		<word id="37" form="populum" lemma="populus1" postag="n-s---ma-" head="36" relation="ADV" />
		<word id="38" form="Romanum" lemma="Romanus1" postag="a-s---ma-" head="37" relation="ATR" />
		<word id="39" form="coniurare" lemma="conjuro1" postag="v--pna---" head="40" relation="OBJ_CO" />
		<word id="40" form="que" lemma="que1" postag="c--------" head="24" relation="COORD" />
		<word id="41" form="obsides" lemma="obses1" postag="n-p---ma-" head="44" relation="OBJ" />
		<word id="42" form="inter" lemma="inter1" postag="r--------" head="44" relation="AuxP" />
		<word id="43" form="se" lemma="sui1" postag="p-p---ma-" head="42" relation="OBJ" />
		<word id="44" form="dare" lemma="do1" postag="v--pna---" head="40" relation="OBJ_CO" />
	</sentence>

The first sentence of the corpus converted to the CoNLL format:

1	Cum	cum1	c	c	pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	20	AuxC	_	_
2	esset	sum1	v	v	pos=v|per=3|num=s|ten=i|mod=s|voi=a|gen=-|cas=-|deg=-	1	ADV	_	_
3	Caesar	Caesar1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=n|deg=-	2	SBJ	_	_
4	in	in1	r	r	pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	2	AuxP	_	_
5	citeriore	citer1	a	a	pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=b|deg=c	6	ATR	_	_
6	Gallia	Gallia1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=b|deg=-	4	ADV	_	_
7	in	in1	r	r	pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	2	AuxP	_	_
8	hibernis	hibernus1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=b|deg=-	7	ADV	_	_
9	,	comma1	u	u	pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	13	AuxX	_	_
10	ita	ita1	d	d	pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	2	AuxY	_	_
11	uti	uti1	c	c	pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	10	AuxC	_	_
12	supra	supra1	d	d	pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	13	ADV	_	_
13	demonstravimus	demonstro1	v	v	pos=v|per=1|num=p|ten=r|mod=i|voi=a|gen=-|cas=-|deg=-	11	ADV	_	_
14	,	comma1	u	u	pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	13	AuxX	_	_
15	crebri	creber1	a	a	pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=n|deg=-	18	ATR	_	_
16	ad	ad1	r	r	pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	19	AuxP	_	_
17	eum	is1	p	p	pos=p|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	16	OBJ	_	_
18	rumores	rumor1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=n|deg=-	19	SBJ	_	_
19	adferebantur	affero1	v	v	pos=v|per=3|num=p|ten=i|mod=i|voi=p|gen=-|cas=-|deg=-	20	PRED_CO	_	_
20	que	que1	c	c	pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	0	COORD	_	_
21	litteris	littera1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=f|cas=b|deg=-	25	ADV	_	_
22	item	item1	d	d	pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	21	AuxZ	_	_
23	Labieni	Labienus1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=g|deg=-	21	ATR	_	_
24	certior	certus1	a	a	pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=n|deg=c	25	PNOM	_	_
25	fiebat	fio1	v	v	pos=v|per=3|num=s|ten=-|mod=i|voi=a|gen=-|cas=-|deg=-	20	PRED_CO	_	_
26	omnes	omnis1	a	a	pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	27	ATR	_	_
27	Belgas	Belgae1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	40	SBJ	_	_
28	,	comma1	u	u	pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	34	AuxX	_	_
29	quam	qui1	p	p	pos=p|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=-	31	SBJ	_	_
30	tertiam	tertius1	a	a	pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=-	33	ATR	_	_
31	esse	sum1	v	v	pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=-	34	OBJ	_	_
32	Galliae	Gallia1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=g|deg=-	33	ATR	_	_
33	partem	pars1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=-	31	PNOM	_	_
34	dixeramus	dico2	v	v	pos=v|per=1|num=p|ten=l|mod=i|voi=a|gen=-|cas=-|deg=-	27	ATR	_	_
35	,	comma1	u	u	pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	34	AuxX	_	_
36	contra	contra1	r	r	pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	39	AuxP	_	_
37	populum	populus1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	36	ADV	_	_
38	Romanum	Romanus1	a	a	pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	37	ATR	_	_
39	coniurare	conjuro1	v	v	pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=-	40	OBJ_CO	_	_
40	que	que1	c	c	pos=c|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	24	COORD	_	_
41	obsides	obses1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	44	OBJ	_	_
42	inter	inter1	r	r	pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	44	AuxP	_	_
43	se	sui1	p	p	pos=p|per=-|num=p|ten=-|mod=-|voi=-|gen=m|cas=a|deg=-	42	OBJ	_	_
44	dare	do1	v	v	pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=-	40	OBJ_CO	_	_
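
The correspondence between the two samples above can be sketched in Python. This is an illustration under stated assumptions, not the conversion script actually used for HamleDT: the column layout (ID, form, lemma, coarse tag, tag, features, head, relation, two unused columns) and the feature names are read off the samples on this page, features are joined with a plain `|`, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Feature names in the order of the nine tag positions, as seen in the CoNLL sample.
FEATURE_NAMES = ("pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg")


def word_to_conll(word):
    """Turn one <word> element into a tab-separated ten-column CoNLL row."""
    postag = word.get("postag")
    feats = "|".join(f"{name}={value}" for name, value in zip(FEATURE_NAMES, postag))
    columns = [
        word.get("id"),
        word.get("form"),
        word.get("lemma"),
        postag[0],  # coarse-grained tag: first character of the positional tag
        postag[0],  # fine-grained tag: identical in the sample above
        feats,
        word.get("head"),
        word.get("relation"),
        "_",
        "_",
    ]
    return "\t".join(columns)


def sentence_to_conll(sentence_xml):
    """Convert the XML text of one <sentence> element into a list of CoNLL rows."""
    sentence = ET.fromstring(sentence_xml)
    return [word_to_conll(word) for word in sentence.iter("word")]


SAMPLE = """<sentence id="1">
  <word id="3" form="Caesar" lemma="Caesar1" postag="n-s---mn-" head="2" relation="SBJ" />
</sentence>"""

print("\n".join(sentence_to_conll(SAMPLE)))
```

For the single word in `SAMPLE`, the printed row has the same ten columns as row 3 of the converted sentence.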

The first sentence of the HamleDT test data in the CoNLL format:

1	In	in1	r	r	pos=r|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=-	5	AuxP	_	_
2	nova	novus1	a	a	pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=a|deg=-	8	ATR	_	_
3	fert	fero1	v	v	pos=v|per=3|num=s|ten=p|mod=i|voi=a|gen=-|cas=-|deg=-	0	PRED	_	_
4	animus	animus1	n	n	pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=m|cas=n|deg=-	3	SBJ	_	_
5	mutatas	muto1	t	t	pos=t|per=-|num=p|ten=r|mod=p|voi=p|gen=f|cas=a|deg=-	7	ATR	_	_
6	dicere	dico2	v	v	pos=v|per=-|num=-|ten=p|mod=n|voi=a|gen=-|cas=-|deg=-	3	OBJ	_	_
7	formas	forma1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=f|cas=a|deg=-	6	OBJ	_	_
8	corpora	corpus1	n	n	pos=n|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=a|deg=-	1	OBJ	_	_

Parsing

AGDT, the Ancient Greek Dependency Treebank, is an extremely nonprojective treebank, exceeding the nonprojectivity level found in other treebanks by an order of magnitude: 60469 of its 308882 tokens (19.58%) are attached nonprojectively.
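
A common way to decide whether a token is attached nonprojectively is to check that every token lying between it and its head is dominated by that head. The sketch below applies this definition to a list of head indices; it is not necessarily the procedure behind the figures above, the function name is hypothetical, and it assumes a well-formed tree (no cycles) with 0 as the artificial root. The example input is the head column of the first HamleDT test sentence shown in the Sample section.

```python
def nonprojective_tokens(heads):
    """Return the 1-based IDs of tokens attached nonprojectively.

    heads[i - 1] is the head of token i; 0 denotes the artificial root.
    Assumes the heads describe a well-formed tree (no cycles).
    """

    def dominated_by(node, ancestor):
        # Walk up from node towards the root and look for ancestor on the way.
        while node != 0:
            node = heads[node - 1]
            if node == ancestor:
                return True
        return False

    result = []
    for token, head in enumerate(heads, start=1):
        left, right = min(token, head), max(token, head)
        # The attachment is nonprojective if some token in the gap is outside the head's subtree.
        if any(not dominated_by(between, head) for between in range(left + 1, right)):
            result.append(token)
    return result


# Heads of "In nova fert animus mutatas dicere formas corpora" (tokens 1 to 8).
heads = [5, 8, 0, 3, 7, 3, 6, 1]
print(nonprojective_tokens(heads))  # [1, 2, 5, 8]
```

Under this definition, four of the eight tokens of that sentence are attached nonprojectively, which gives a feel for how verse word order drives the overall rate.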

I am not aware of any published evaluation of Ancient Greek parsing accuracy.
