Ancient Greek (grc)
Versions
- AGDT 1.2
Obtaining and License
The AGDT is freely downloadable from the treebank website under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license. The license in short:
- non-commercial usage
- modification and redistribution permitted under the same license
- a link to the treebank website and a citation of the principal publication are required in publications
AGDT was created by volunteer students and researchers from around the world. It is part of the Perseus Digital Library, a project on classical languages hosted at Tufts University, Medford, MA 02155, USA.
References
- Website
- Data
- no separate citation
- Principal publications
- David Bamman, Gregory Crane: The Ancient Greek and Latin Dependency Treebanks. In: Caroline Sporleder, Antal van den Bosch, Kalliopi Zervanou (eds.): Language Technology for Cultural Heritage, ser. Foundations of Human Language Processing and Technology. Springer, Berlin / Heidelberg, Germany, 2011.
- Documentation
- XML schema for the file format (also included in the distribution)
- David Bamman, Gregory Crane: Guidelines for the Syntactic Annotation of the Ancient Greek Dependency Treebank (1.1). Tufts University, Medford, Massachusetts, USA, 2008.
- Morphosyntactic tags are described in the README file.
Domain
Homer: Iliad (750 BC), Odyssey (700 BC); Hesiod (650 BC); Aeschylus (500 BC); Sophocles: Ajax (445 BC).
Size
AGDT contains 308,882 tokens in 21,160 non-empty sentences, an average of 14.60 tokens per sentence. No official training/test data split is defined. For our HamleDT experiments, we took the smallest file, 1999.01.0015.xml (5,925 tokens / 528 sentences; Aeschylus: Suppliants), for testing and the rest (302,957 tokens / 20,632 sentences) for training.
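The figures above can be cross-checked with a few lines of Python. This is only a sanity check on the numbers quoted in this section; the variable names are ours and nothing here reads the actual treebank files.

```python
# Counts quoted in the Size section (no treebank files are read here).
total_tokens, total_sentences = 308_882, 21_160
test_tokens, test_sentences = 5_925, 528
train_tokens, train_sentences = 302_957, 20_632

# The test and training parts should add up to the whole treebank.
assert test_tokens + train_tokens == total_tokens
assert test_sentences + train_sentences == total_sentences

# Average sentence length, rounded to two decimals as in the text.
average_length = round(total_tokens / total_sentences, 2)
print(f"{average_length:.2f} tokens per sentence")
```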
Inside
The native file format of the treebank is based on XML. Greek letters are romanized using Beta Code, a romanization scheme that is widely used both within and outside the Perseus project. Beta Code can be mapped one-to-one onto the original Greek letters in UTF-8; however, embedded non-Greek words (such as the lemmas “comma” and “other”) cannot be identified automatically (and we do not want to decode them).
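To illustrate the mapping, here is a minimal sketch of a Beta Code decoder in Python. It is not the converter used for HamleDT; the function name `decode_beta` is hypothetical, and the sketch covers only lowercase letters, capitals marked with `*`, seven common diacritics and final sigma. It also shows why embedded non-Greek words are a problem: the decoder cannot tell them apart from Greek.

```python
import re
import unicodedata

# Beta Code letter -> Greek letter (lowercase only; '*' marks a capital).
LETTERS = dict(zip("abgdezhqiklmncoprstufxyw", "αβγδεζηθικλμνξοπρστυφχψω"))

# Beta Code diacritic -> Unicode combining character.
DIACRITICS = {
    ")": "\u0313",   # smooth breathing
    "(": "\u0314",   # rough breathing
    "/": "\u0301",   # acute
    "\\": "\u0300",  # grave
    "=": "\u0342",   # circumflex
    "|": "\u0345",   # iota subscript
    "+": "\u0308",   # diaeresis
}


def decode_beta(text):
    """Decode a Beta Code string into precomposed (NFC) Greek."""
    out = []
    upper = False   # True between '*' and the capital letter it announces
    pending = []    # diacritics written between '*' and that letter
    for ch in text:
        if ch == "*":
            upper = True
        elif ch in LETTERS:
            greek = LETTERS[ch]
            out.append(greek.upper() if upper else greek)
            out.extend(pending)
            upper, pending = False, []
        elif ch in DIACRITICS:
            (pending if upper else out).append(DIACRITICS[ch])
        else:
            out.append(ch)  # punctuation, digits, brackets stay as they are
    decoded = unicodedata.normalize("NFC", "".join(out))
    # A sigma that is not followed by another letter becomes final sigma.
    return re.sub(r"σ(?![^\W\d_])", "ς", decoded)


print(decode_beta(r"qeou\s"))       # θεοὺς
print(decode_beta("*)atreidw=n"))   # Ἀτρειδῶν
print(decode_beta("comma"))         # ξομμα (a non-Greek lemma gets mangled)
```

The last call shows the limitation mentioned above: without an explicit list of exceptions, "comma" is decoded as if it were a Greek word.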
Morphological annotation consists of a lemma and a nine-character positional morphosyntactic tag. Disambiguation was done manually (gold standard).
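The positional tag can be expanded into named features, one per character. The sketch below uses the nine feature names that appear in the converted CoNLL sample on this page (pos, per, num, ten, mod, voi, gen, cas, deg); the function name `expand_postag` is ours, and the README remains the authoritative description of the values.

```python
# Feature names in positional order, as they appear in the CoNLL sample.
FEATURES = ("pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg")


def expand_postag(tag):
    """Turn a nine-character positional tag into 'name=value|...' form."""
    if len(tag) != len(FEATURES):
        raise ValueError(f"expected {len(FEATURES)} characters, got {len(tag)}: {tag!r}")
    return "|".join(f"{name}={value}" for name, value in zip(FEATURES, tag))


print(expand_postag("v1spia---"))
# pos=v|per=1|num=s|ten=p|mod=i|voi=a|gen=-|cas=-|deg=-
```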
The syntactic annotation style is very similar to that of the Prague Dependency Treebank (PDT), and the syntactic tags (analytical functions) are almost identical. However, AGDT permits some combined values that are not valid in PDT, e.g. ATR_AP_ExD0_APOS.
Sample
The first sentence of the corpus in its native XML format:
<?xml version="1.0"?>
<treebank version="1.2" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" xsi:schemaLocation="http://nlp.perseus.tufts.edu/syntax/treebank/1.5 treebank-1.5.xsd" xml:lang="grc">
  <date>Wed Sep 29 12:03:38 EDT 2010</date>
  <annotator>
    <short>FrancescoM</short>
    <name>Francesco Mambrini</name>
    <address>Tufts University, Medford, MA, USA</address>
  </annotator>
  <sentence id="2185285" document_id="Perseus:text:1999.01.0003" subdoc="card=1" span="qeou\s0:.0">
    <annotator>FrancescoM</annotator>
    <word id="1" cid="32749174" form="qeou\s" lemma="qeo/s1" postag="n-p---ma-" head="3" relation="OBJ" />
    <word id="2" cid="32749175" form="me\n" lemma="me/n1" postag="g--------" head="3" relation="AuxY" />
    <word id="3" cid="32749176" form="ai)tw=" lemma="ai)te/w1" postag="v1spia---" head="0" relation="PRED" />
    <word id="4" cid="32749177" form="tw=nd'" lemma="o(/de1" postag="p-p---mg-" head="6" relation="ATR" />
    <word id="5" cid="32749178" form="a)pallagh\n" lemma="a)pallagh/1" postag="n-s---fa-" head="3" relation="OBJ" />
    <word id="6" cid="32749179" form="po/nwn" lemma="po/nos1" postag="n-p---mg-" head="5" relation="ATR_AP_ExD0_APOS" />
    <word id="7" cid="32749180" form="froura=s" lemma="froura/1" postag="n-s---fg-" head="5" relation="ATR_AP_ExD0_APOS" />
    <word id="8" cid="32749181" form="e)tei/as" lemma="e)/teios1" postag="a-s---fg-" head="7" relation="ATR" />
    <word id="9" cid="32749182" form="mh=kos" lemma="mh=kos1" postag="n-s---na-" head="8" relation="ATR" />
    <word id="10" cid="32749183" form="," lemma="comma1" postag="u--------" head="21" relation="AuxX" />
    <word id="11" cid="32749184" form="h(\n" lemma="o(/s1" postag="p-s---fa-" head="12" relation="OBJ" />
    <word id="12" cid="32749185" form="koimw/menos" lemma="koima/w1" postag="t-sppemn-" head="21" relation="ADV" />
    <word id="13" cid="32749186" form="ste/gais" lemma="ste/gh1" postag="n-p---fd-" head="12" relation="ADV" />
    <word id="14" cid="32749187" form="*)atreidw=n" lemma="*)atrei/dhs1" postag="n-p---mg-" head="13" relation="ATR" />
    <word id="15" cid="32749188" form="a)/gkaqen" lemma="a)/gkaqen1" postag="d--------" head="16" relation="ADV_AP" />
    <word id="16" cid="32749189" form="," lemma="comma1" postag="u--------" head="12" relation="APOS" />
    <word id="17" cid="32749190" form="kuno\s" lemma="ku/wn1" postag="n-s---mg-" head="18" relation="ATR" />
    <word id="18" cid="32749191" form="di/khn" lemma="di/kh1" postag="n-s---fa-" head="16" relation="ADV_AP" />
    <word id="19" cid="32749192" form="," lemma="comma1" postag="u--------" head="16" relation="AuxX" />
    <word id="20" cid="32749193" form="a)/strwn" lemma="a)/stron1" postag="n-p---ng-" head="23" relation="ATR" />
    <word id="21" cid="32749194" form="ka/toida" lemma="ka/toida1" postag="v1sria---" head="7" relation="ATR" />
    <word id="22" cid="32749195" form="nukte/rwn" lemma="nu/kteros1" postag="a-p---ng-" head="20" relation="ATR" />
    <word id="23" cid="32749196" form="o(mh/gurin" lemma="o(mh/guris1" postag="n-s---fa-" head="25" relation="OBJ_AP_CO" />
    <word id="24" cid="32749197" form="," lemma="comma1" postag="u--------" head="25" relation="AuxX" />
    <word id="25" cid="32749198" form="kai\" lemma="kai/1" postag="c--------" head="38" relation="COORD" />
    <word id="26" cid="32749199" form="tou\s" lemma="o(1" postag="l-p---ma-" head="33" relation="ATR" />
    <word id="27" cid="32749200" form="fe/rontas" lemma="fe/rw1" postag="t-pppama-" head="33" relation="ATR" />
    <word id="28" cid="32749201" form="xei=ma" lemma="xei=ma1" postag="n-s---na-" head="29" relation="OBJ_CO" />
    <word id="29" cid="32749202" form="kai\" lemma="kai/1" postag="c--------" head="27" relation="COORD" />
    <word id="30" cid="32749203" form="qe/ros" lemma="qe/ros1" postag="n-s---na-" head="29" relation="OBJ_CO" />
    <word id="31" cid="32749204" form="brotoi=s" lemma="broto/s1" postag="n-p---md-" head="27" relation="OBJ" />
    <word id="32" cid="32749205" form="lamprou\s" lemma="lampro/s1" postag="a-p---ma-" head="33" relation="ATR" />
    <word id="33" cid="32749206" form="duna/stas" lemma="duna/sths1" postag="n-p---ma-" head="34" relation="OBJ_AP_CO" />
    <word id="34" cid="32749207" form="," lemma="comma1" postag="---------" head="25" relation="APOS" />
    <word id="35" cid="32749208" form="e)mpre/pontas" lemma="e)mpre/pw1" postag="t-pppama-" head="37" relation="ATR" />
    <word id="36" cid="32749209" form="ai)qe/ri" lemma="ai)qh/r1" postag="n-s---md-" head="35" relation="OBJ" />
    <word id="37" cid="32749210" form="[a)ste/ras" lemma="a)sth/r1" postag="n-p---ma-" head="34" relation="OBJ_AP_CO" />
    <word id="38" cid="32749211" form="," lemma="comma1" postag="---------" head="21" relation="APOS" />
    <word id="39" cid="32749212" form="o(/tan" lemma="o(/tan1" postag="c--------" head="43" relation="AuxC" />
    <word id="40" cid="32749213" form="fqi/nwsin" lemma="fqi/w1" postag="v3ppsa---" head="39" relation="OBJ_AP_CO" />
    <word id="41" cid="32749214" form="," lemma="comma1" postag="---------" head="43" relation="AuxX" />
    <word id="42" cid="32749215" form="a)ntola/s" lemma="a)natolh/1" postag="n-p---fa-" head="43" relation="OBJ_AP_CO" />
    <word id="43" cid="32749216" form="te" lemma="te1" postag="g--------" head="38" relation="COORD" />
    <word id="44" cid="32749217" form="tw=n]" lemma="o(" postag="p-p---mg-" head="42" relation="ATR" />
    <word id="45" cid="32749218" form="." lemma="other" postag="---------" head="0" relation="AuxK" />
  </sentence>
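Reading this format requires nothing beyond a standard XML parser. The following sketch uses Python's `xml.etree.ElementTree` on a shortened copy of the sentence above (three words, `cid` attributes and namespace declarations dropped for brevity); the function name `read_sentences` is ours and is not part of any Perseus or HamleDT tool.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample sentence; a raw string keeps the Beta Code
# backslashes (e.g. in "me\n") from being read as Python escapes.
SAMPLE = r'''<treebank version="1.2" xml:lang="grc">
  <sentence id="2185285" document_id="Perseus:text:1999.01.0003" subdoc="card=1">
    <word id="1" form="qeou\s" lemma="qeo/s1" postag="n-p---ma-" head="3" relation="OBJ" />
    <word id="2" form="me\n" lemma="me/n1" postag="g--------" head="3" relation="AuxY" />
    <word id="3" form="ai)tw=" lemma="ai)te/w1" postag="v1spia---" head="0" relation="PRED" />
  </sentence>
</treebank>'''


def read_sentences(xml_text):
    """Yield (sentence id, [(word id, form, head, relation), ...]) pairs."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        words = [
            (int(w.get("id")), w.get("form"), int(w.get("head")), w.get("relation"))
            for w in sentence.iter("word")
        ]
        yield sentence.get("id"), words


for sentence_id, words in read_sentences(SAMPLE):
    print(sentence_id)
    for word in words:
        print(word)
```

For a whole file, `ET.parse(path).getroot()` would take the place of `ET.fromstring`; the head value 0 marks a word attached to the artificial root.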
The first sentence of the corpus converted to the CoNLL format, with Greek letters decoded (note that this is not the same sentence as above because the conversion script reorders sentences according to their sentence id):
1 | ἄσημα | ἄσημος | a | a | pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=a|deg=- | 6 | OBJ | _ | _ |
2 | δ’ | δέ1 | g | g | pos=g|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- | 7 | AuxY | _ | _ |
3 | αὐτῶν | αὐτός | a | a | pos=a|per=-|num=p|ten=-|mod=-|voi=-|gen=n|cas=g|deg=- | 1 | ATR | _ | _ |
4 | αὐτίκ’ | αὐτίκα1 | d | d | pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- | 7 | ADV | _ | _ |
5 | ἀγνοίᾳ | ἄγνοια1 | n | n | pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=d|deg=- | 6 | ADV | _ | _ |
6 | λαβὼν | λαμβάνω1 | t | t | pos=t|per=-|num=s|ten=a|mod=p|voi=a|gen=m|cas=n|deg=- | 7 | ADV | _ | _ |
7 | ἔσθει | ἔσθω1 | v | v | pos=v|per=3|num=s|ten=p|mod=i|voi=a|gen=-|cas=-|deg=- | 0 | PRED | _ | _ |
8 | βορὰν | βορά1 | n | n | pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=- | 7 | OBJ | _ | _ |
9 | ἄσωτον | ἄσωτος | a | a | pos=a|per=-|num=s|ten=-|mod=-|voi=-|gen=f|cas=a|deg=- | 8 | ATR | _ | _ |
10 | , | comma1 | u | u | pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- | 11 | AuxX | _ | _ |
11 | ὡς | ὡς | d | d | pos=d|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- | 9 | AuxC | _ | _ |
12 | ὁρᾷς | ὁράω1 | v | v | pos=v|per=2|num=s|ten=p|mod=i|voi=a|gen=-|cas=-|deg=- | 11 | ADV | _ | _ |
13 | , | comma1 | u | u | pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- | 11 | AuxX | _ | _ |
14 | γένει | γένος | n | n | pos=n|per=-|num=s|ten=-|mod=-|voi=-|gen=n|cas=d|deg=- | 9 | ADV | _ | _ |
15 | . | period1 | u | u | pos=u|per=-|num=-|ten=-|mod=-|voi=-|gen=-|cas=-|deg=- | 0 | AuxK | _ | _ |
Parsing
AGDT is an extremely nonprojective treebank, exceeding the nonprojectivity level found in other treebanks by an order of magnitude. Of the total 308,882 tokens, 60,469 (19.58%) are attached nonprojectively.
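For readers who want to reproduce such counts, here is a minimal sketch of one common definition: a token is attached nonprojectively if some token lying between it and its head is not a descendant of that head. Whether the figure above was computed with exactly this definition is an assumption on our part, and the function name `nonprojective_tokens` is hypothetical. The example uses the heads of the CoNLL sample sentence shown in the Sample section.

```python
def nonprojective_tokens(heads):
    """Return the 1-based ids of tokens attached nonprojectively.

    heads[i] is the head of token i+1; 0 stands for the artificial root.
    The input is assumed to be a well-formed tree (no cycles).
    """
    def is_descendant(node, ancestor):
        # Climb the head chain from node until we meet ancestor or the root.
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node - 1]
        return ancestor == 0  # every token descends from the artificial root

    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        # The edge is nonprojective if a token in the gap is not under head.
        if any(not is_descendant(k, head) for k in range(low + 1, high)):
            result.append(dependent)
    return result


# Heads of the 15 tokens in the CoNLL sample sentence on this page.
sample_heads = [6, 7, 1, 7, 6, 7, 0, 7, 8, 11, 9, 11, 11, 9, 0]
print(nonprojective_tokens(sample_heads))  # [1, 3]

# The share quoted above: 60,469 of 308,882 tokens.
print(f"{100 * 60_469 / 308_882:.2f}%")  # 19.58%
```

In the sample, tokens 1 and 3 are nonprojective because token 2 sits between them and their heads but is attached to the main verb (token 7).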
I am not aware of any published evaluation of Ancient Greek parsing accuracy.