user:zeman:treebanks:grc [2011/12/05 23:03] zeman References. |
user:zeman:treebanks:grc [2011/12/06 15:02] zeman |
==== Domain ==== | ==== Domain ==== |
| |
Mixed (Wikipedia, Wikinews, university web-magazine and blogs). | Homer: Iliad (750 BC), Odyssey (700 BC); Hesiod (650 BC), Aeschylus (500 BC); Sophocles: Ajax (445 BC). |
| |
==== Size ==== | ==== Size ==== |
| |
TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing. | AGDT contains 309092 tokens in 21165 sentences, yielding 14.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the smallest file called ''1999.01.0015.xml'' (5949 tokens / 529 sentences; Aeschylus: //Suppliants//) for testing and the rest (303143 tokens / 20636 sentences) for training. |
| |
==== Inside ==== | ==== Inside ==== |
| |
The native file format of the treebank is based on XML. Besides that, TDT is also distributed in the [[:format-conll|CoNLL-X format]]. The part-of-speech tag and the morphosyntactic features are joined in one feature string, which is copied to both the CPOS and the POS columns of the CoNLL format. The FEAT column is empty (i.e. it contains the underscore character). Lemmas are available, too. Morphological annotation and disambiguation are automatic; they are not gold standard. The native XML format shows all morphological readings of every word based on the lexicon, and the disambiguation is left to the user. | The native file format of the treebank is based on XML. Greek letters are romanized using [[http://www.tlg.uci.edu/encoding/quickbeta.pdf|Beta Code]], a romanization scheme that is widely used, and not only in the Perseus project. It can be mapped one-to-one onto the original Greek letters in UTF-8; however, embedded non-Greek words (such as the lemmas “comma” and “other”) cannot be identified automatically (and we do not want to decode them). |
| |
| Morphological annotation consists of a lemma and a nine-character positional morphosyntactic tag. Disambiguation has been done manually (gold standard). |
| |
| The syntactic annotation style is very similar to that of the Prague Dependency Treebank. The syntactic tags (analytical functions) are almost identical, too. However, in AGDT some combined values are permitted that are not valid in PDT, e.g. ''ATR_AP_ExD0_APOS''. |
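The nine-character positional tag can be split into named features, one character per position. The following is only a minimal sketch: the feature names are taken from the FEAT column of the CoNLL sample on this page, and the helper names ''decode_postag'' and ''to_feat_string'' are hypothetical, not part of any AGDT or HamleDT tool.

```python
# Feature names follow the FEAT column shown in the CoNLL sample on this page.
# The function names below are hypothetical helpers for illustration only.
FEATURE_NAMES = ("pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg")


def decode_postag(postag):
    """Split a nine-character positional tag into a name -> value mapping."""
    if len(postag) != len(FEATURE_NAMES):
        raise ValueError(f"expected {len(FEATURE_NAMES)} characters, got {postag!r}")
    return dict(zip(FEATURE_NAMES, postag))


def to_feat_string(postag):
    """Render the tag as 'pos=..|per=..|...', keeping '-' for unset positions."""
    return "|".join(f"{name}={value}" for name, value in decode_postag(postag).items())


print(to_feat_string("v1spia---"))
```

Unset positions are kept as ''-'' rather than dropped, which mirrors how the FEAT column appears in the converted sample.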
| |
==== Sample ==== | ==== Sample ==== |
| |
The first two sentences of the corpus in its native XML format: | The first sentence of the corpus in its native XML format: |
| |
<code xml><treeset name="http://ranneliike.net/blogi.php?nick=Aboa Kirjoitettu: 02.02.2010, 15:41:06"> | <code xml><?xml version="1.0"?> |
<sentence txt="Kävelyreitti III"> | <treebank version="1.2" |
<token charOff="0-12"> | xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" |
<posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG <up>" /> | xmlns:treebank="http://nlp.perseus.tufts.edu/syntax/treebank/1.5" |
</token> | xsi:schemaLocation="http://nlp.perseus.tufts.edu/syntax/treebank/1.5 treebank-1.5.xsd" |
<token charOff="13-16"> | xml:lang="grc"> |
<posreading CG="true" baseform="III" rawtags="<roman> ABBR NOM SG <up>" /> | <date>Wed Sep 29 12:03:38 EDT 2010</date> |
<posreading CG="true" baseform="iii" rawtags="ABBR <up>" /> | <annotator> |
<posreading CG="true" baseform="iii" rawtags="<roman> ABBR NOM SG <up>" /> | <short>FrancescoM</short> |
</token> | <name>Francesco Mambrini</name> |
<dep dep="1" gov="0" type="num" /> | <address>Tufts University, Medford, MA, USA</address> |
</sentence> | </annotator> |
<sentence txt="Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin."> | <sentence id="2185285" document_id="Perseus:text:1999.01.0003" subdoc="card=1" span="qeou\s0:.0"> |
<token charOff="0-6"> | <annotator>FrancescoM</annotator> |
<posreading CG="true" baseform="jää" rawtags="N ADE SG <up>" /> | <word id="1" cid="32749174" form="qeou\s" lemma="qeo/s1" postag="n-p---ma-" head="3" relation="OBJ" /> |
</token> | <word id="2" cid="32749175" form="me\n" lemma="me/n1" postag="g--------" head="3" relation="AuxY" /> |
<token charOff="7-13"> | <word id="3" cid="32749176" form="ai)tw=" lemma="ai)te/w1" postag="v1spia---" head="0" relation="PRED" /> |
<posreading CG="true" baseform="kävely" rawtags="DV-U N NOM SG" /> | <word id="4" cid="32749177" form="tw=nd'" lemma="o(/de1" postag="p-p---mg-" head="6" relation="ATR" /> |
</token> | <word id="5" cid="32749178" form="a)pallagh\n" lemma="a)pallagh/1" postag="n-s---fa-" head="3" relation="OBJ" /> |
<token charOff="14-18"> | <word id="6" cid="32749179" form="po/nwn" lemma="po/nos1" postag="n-p---mg-" head="5" relation="ATR_AP_ExD0_APOS" /> |
<posreading CG="true" baseform="avata" rawtags="V PRES ACT SG3" /> | <word id="7" cid="32749180" form="froura=s" lemma="froura/1" postag="n-s---fg-" head="5" relation="ATR_AP_ExD0_APOS" /> |
<posreading CG="false" baseform="avata" rawtags="V PRES ACT NEG" /> | <word id="8" cid="32749181" form="e)tei/as" lemma="e)/teios1" postag="a-s---fg-" head="7" relation="ATR" /> |
<posreading CG="false" baseform="avata" rawtags="V IMPV ACT SG2" /> | <word id="9" cid="32749182" form="mh=kos" lemma="mh=kos1" postag="n-s---na-" head="8" relation="ATR" /> |
<posreading CG="false" baseform="avata" rawtags="V IMPV ACT NEG" /> | <word id="10" cid="32749183" form="," lemma="comma1" postag="u--------" head="21" relation="AuxX" /> |
</token> | <word id="11" cid="32749184" form="h(\n" lemma="o(/s1" postag="p-s---fa-" head="12" relation="OBJ" /> |
<token charOff="19-23"> | <word id="12" cid="32749185" form="koimw/menos" lemma="koima/w1" postag="t-sppemn-" head="21" relation="ADV" /> |
<posreading CG="true" baseform="aina" rawtags="ADV" /> | <word id="13" cid="32749186" form="ste/gais" lemma="ste/gh1" postag="n-p---fd-" head="12" relation="ADV" /> |
</token> | <word id="14" cid="32749187" form="*)atreidw=n" lemma="*)atrei/dhs1" postag="n-p---mg-" head="13" relation="ATR" /> |
<token charOff="24-32"> | <word id="15" cid="32749188" form="a)/gkaqen" lemma="a)/gkaqen1" postag="d--------" head="16" relation="ADV_AP" /> |
<posreading CG="true" baseform="hauska" rawtags="A POS PTV PL" /> | <word id="16" cid="32749189" form="," lemma="comma1" postag="u--------" head="12" relation="APOS" /> |
</token> | <word id="17" cid="32749190" form="kuno\s" lemma="ku/wn1" postag="n-s---mg-" head="18" relation="ATR" /> |
<token charOff="33-35"> | <word id="18" cid="32749191" form="di/khn" lemma="di/kh1" postag="n-s---fa-" head="16" relation="ADV_AP" /> |
<posreading CG="true" baseform="ja" rawtags="COORD C" /> | <word id="19" cid="32749192" form="," lemma="comma1" postag="u--------" head="16" relation="AuxX" /> |
</token> | <word id="20" cid="32749193" form="a)/strwn" lemma="a)/stron1" postag="n-p---ng-" head="23" relation="ATR" /> |
<token charOff="36-45"> | <word id="21" cid="32749194" form="ka/toida" lemma="ka/toida1" postag="v1sria---" head="7" relation="ATR" /> |
<posreading CG="true" baseform="erikoinen" rawtags="A POS PTV PL" /> | <word id="22" cid="32749195" form="nukte/rwn" lemma="nu/kteros1" postag="a-p---ng-" head="20" relation="ATR" /> |
</token> | <word id="23" cid="32749196" form="o(mh/gurin" lemma="o(mh/guris1" postag="n-s---fa-" head="25" relation="OBJ_AP_CO" /> |
<token charOff="46-56"> | <word id="24" cid="32749197" form="," lemma="comma1" postag="u--------" head="25" relation="AuxX" /> |
<posreading CG="true" baseform="näkö#kulma" rawtags="N PTV PL" /> | <word id="25" cid="32749198" form="kai\" lemma="kai/1" postag="c--------" head="38" relation="COORD" /> |
</token> | <word id="26" cid="32749199" form="tou\s" lemma="o(1" postag="l-p---ma-" head="33" relation="ATR" /> |
<token charOff="57-67"> | <word id="27" cid="32749200" form="fe/rontas" lemma="fe/rw1" postag="t-pppama-" head="33" relation="ATR" /> |
<posreading CG="true" baseform="kaupunki" rawtags="N ILL SG" /> | <word id="28" cid="32749201" form="xei=ma" lemma="xei=ma1" postag="n-s---na-" head="29" relation="OBJ_CO" /> |
</token> | <word id="29" cid="32749202" form="kai\" lemma="kai/1" postag="c--------" head="27" relation="COORD" /> |
<token charOff="67-68"> | <word id="30" cid="32749203" form="qe/ros" lemma="qe/ros1" postag="n-s---na-" head="29" relation="OBJ_CO" /> |
<posreading CG="true" baseform="." rawtags="PUNCT" /> | <word id="31" cid="32749204" form="brotoi=s" lemma="broto/s1" postag="n-p---md-" head="27" relation="OBJ" /> |
</token> | <word id="32" cid="32749205" form="lamprou\s" lemma="lampro/s1" postag="a-p---ma-" head="33" relation="ATR" /> |
<dep dep="0" gov="1" type="nommod" /> | <word id="33" cid="32749206" form="duna/stas" lemma="duna/sths1" postag="n-p---ma-" head="34" relation="OBJ_AP_CO" /> |
<dep dep="1" gov="2" type="nsubj" /> | <word id="34" cid="32749207" form="," lemma="comma1" postag="---------" head="25" relation="APOS" /> |
<dep dep="3" gov="2" type="advmod" /> | <word id="35" cid="32749208" form="e)mpre/pontas" lemma="e)mpre/pw1" postag="t-pppama-" head="37" relation="ATR" /> |
<dep dep="7" gov="2" type="dobj" /> | <word id="36" cid="32749209" form="ai)qe/ri" lemma="ai)qh/r1" postag="n-s---md-" head="35" relation="OBJ" /> |
<dep dep="9" gov="2" type="punct" /> | <word id="37" cid="32749210" form="[a)ste/ras" lemma="a)sth/r1" postag="n-p---ma-" head="34" relation="OBJ_AP_CO" /> |
<dep dep="5" gov="4" type="cc" /> | <word id="38" cid="32749211" form="," lemma="comma1" postag="---------" head="21" relation="APOS" /> |
<dep dep="6" gov="4" type="conj" /> | <word id="39" cid="32749212" form="o(/tan" lemma="o(/tan1" postag="c--------" head="43" relation="AuxC" /> |
<dep dep="4" gov="7" type="amod" /> | <word id="40" cid="32749213" form="fqi/nwsin" lemma="fqi/w1" postag="v3ppsa---" head="39" relation="OBJ_AP_CO" /> |
<dep dep="8" gov="7" type="nommod" /> | <word id="41" cid="32749214" form="," lemma="comma1" postag="---------" head="43" relation="AuxX" /> |
</sentence></code> | <word id="42" cid="32749215" form="a)ntola/s" lemma="a)natolh/1" postag="n-p---fa-" head="43" relation="OBJ_AP_CO" /> |
| <word id="43" cid="32749216" form="te" lemma="te1" postag="g--------" head="38" relation="COORD" /> |
| <word id="44" cid="32749217" form="tw=n]" lemma="o(" postag="p-p---mg-" head="42" relation="ATR" /> |
| <word id="45" cid="32749218" form="." lemma="other" postag="---------" head="0" relation="AuxK" /> |
| </sentence></code> |
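As a rough illustration of how the ''word'' elements above could be read, here is a sketch using Python's standard ''xml.etree.ElementTree''. The embedded snippet is a shortened copy of the sample (three words, fewer attributes), and ''read_sentences'' is a hypothetical helper name; forms and lemmas are left in Beta Code.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the sample above; a raw string keeps the Beta Code backslashes.
SAMPLE = r"""<treebank version="1.2">
  <sentence id="2185285" document_id="Perseus:text:1999.01.0003">
    <word id="1" form="qeou\s" lemma="qeo/s1" postag="n-p---ma-" head="3" relation="OBJ" />
    <word id="2" form="me\n" lemma="me/n1" postag="g--------" head="3" relation="AuxY" />
    <word id="3" form="ai)tw=" lemma="ai)te/w1" postag="v1spia---" head="0" relation="PRED" />
  </sentence>
</treebank>"""


def read_sentences(xml_text):
    """Yield one list of (id, form, lemma, postag, head, relation) per sentence."""
    root = ET.fromstring(xml_text)
    for sentence in root.iter("sentence"):
        yield [
            (
                int(word.get("id")),
                word.get("form"),
                word.get("lemma"),
                word.get("postag"),
                int(word.get("head")),
                word.get("relation"),
            )
            for word in sentence.iter("word")
        ]


for sentence in read_sentences(SAMPLE):
    for row in sentence:
        print(row)
```

A head value of 0 marks a word attached to the artificial root, as with the ''PRED'' word in the sample.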
| |
The same two sentences in the CoNLL format: | The same sentence converted to the CoNLL format, with Greek letters decoded: |
| |
| # b101.d.xml/1 |||||||||| | | 1 | ἄσημα | ἄσημος | a | a | pos=a<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=p<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=n<nowiki>|</nowiki>cas=a<nowiki>|</nowiki>deg=- | 6 | OBJ | _ | _ | |
| 1 | Kävelyreitti | kävely<nowiki>|</nowiki>reitti | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 0 | ROOT | _ | _ | | | 2 | δ’ | δέ1 | g | g | pos=g<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=-<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 7 | AuxY | _ | _ | |
| 2 | III | III | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | _ | 1 | num | _ | _ | | | 3 | αὐτῶν | αὐτός | a | a | pos=a<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=p<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=n<nowiki>|</nowiki>cas=g<nowiki>|</nowiki>deg=- | 1 | ATR | _ | _ | |
| |||||||||| | | 4 | αὐτίκ’ | αὐτίκα1 | d | d | pos=d<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=-<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 7 | ADV | _ | _ | |
| # b101.d.xml/2 |||||||||| | | 5 | ἀγνοίᾳ | ἄγνοια1 | n | n | pos=n<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=f<nowiki>|</nowiki>cas=d<nowiki>|</nowiki>deg=- | 6 | ADV | _ | _ | |
| 1 | Jäällä | jää | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | _ | 2 | nommod | _ | _ | | | 6 | λαβὼν | λαμβάνω1 | t | t | pos=t<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=a<nowiki>|</nowiki>mod=p<nowiki>|</nowiki>voi=a<nowiki>|</nowiki>gen=m<nowiki>|</nowiki>cas=n<nowiki>|</nowiki>deg=- | 7 | ADV | _ | _ | |
| 2 | kävely | kävely | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 3 | nsubj | _ | _ | | | 7 | ἔσθει | ἔσθω1 | v | v | pos=v<nowiki>|</nowiki>per=3<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=p<nowiki>|</nowiki>mod=i<nowiki>|</nowiki>voi=a<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 0 | PRED | _ | _ | |
| 3 | avaa | avata | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | _ | 0 | ROOT | _ | _ | | | 8 | βορὰν | βορά1 | n | n | pos=n<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=f<nowiki>|</nowiki>cas=a<nowiki>|</nowiki>deg=- | 7 | OBJ | _ | _ | |
| 4 | aina | aina | ADV | ADV | _ | 3 | advmod | _ | _ | | | 9 | ἄσωτον | ἄσωτος | a | a | pos=a<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=f<nowiki>|</nowiki>cas=a<nowiki>|</nowiki>deg=- | 8 | ATR | _ | _ | |
| 5 | hauskoja | hauska | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 8 | amod | _ | _ | | | 10 | , | comma1 | u | u | pos=u<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=-<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 11 | AuxX | _ | _ | |
| 6 | ja | ja | C<nowiki>|</nowiki>COORD | C<nowiki>|</nowiki>COORD | _ | 5 | cc | _ | _ | | | 11 | ὡς | ὡς | d | d | pos=d<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=-<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 9 | AuxC | _ | _ | |
| 7 | erikoisia | erikoinen | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 5 | conj | _ | _ | | | 12 | ὁρᾷς | ὁράω1 | v | v | pos=v<nowiki>|</nowiki>per=2<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=p<nowiki>|</nowiki>mod=i<nowiki>|</nowiki>voi=a<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 11 | ADV | _ | _ | |
| 8 | näkökulmia | näkö<nowiki>|</nowiki>kulma | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | _ | 3 | dobj | _ | _ | | | 13 | , | comma1 | u | u | pos=u<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=-<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 11 | AuxX | _ | _ | |
| 9 | kaupunkiin | kaupunki | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 8 | nommod | _ | _ | | | 14 | γένει | γένος | n | n | pos=n<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=s<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=n<nowiki>|</nowiki>cas=d<nowiki>|</nowiki>deg=- | 9 | ADV | _ | _ | |
| 10 | . | . | PUNCT | PUNCT | _ | 3 | punct | _ | _ | | | 15 | . | period1 | u | u | pos=u<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=-<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=-<nowiki>|</nowiki>cas=-<nowiki>|</nowiki>deg=- | 0 | AuxK | _ | _ | |
| |
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in TDT are rare. Only 299 out of the 58576 tokens are attached nonprojectively (0.51%). | AGDT is an extremely nonprojective treebank, exceeding the nonprojectivity level found in other treebanks by an order of magnitude. 60469 out of the total 309092 tokens are attached nonprojectively (19.56%). |
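The text does not say exactly how nonprojective attachments were counted. The sketch below assumes one common definition: a token is attached nonprojectively if some token lying strictly between it and its head is not dominated by that head. The function names and the toy trees are illustrative only.

```python
def dominates(ancestor, node, heads):
    """True if `ancestor` is `node` itself or lies on the path from `node` to the root."""
    if ancestor == 0:  # the artificial root dominates every token
        return True
    seen = set()
    while node != 0 and node not in seen:  # `seen` guards against cycles
        if node == ancestor:
            return True
        seen.add(node)
        node = heads[node]
    return False


def nonprojective_tokens(heads):
    """Return ids of tokens whose attachment to their head is nonprojective.

    `heads` maps each token id (1-based) to its head id; 0 is the root.
    """
    result = []
    for dep, head in heads.items():
        low, high = sorted((dep, head))
        if any(not dominates(head, k, heads) for k in range(low + 1, high)):
            result.append(dep)
    return sorted(result)


# Toy tree: token 1 depends on token 3, but token 2 (between them) hangs on the root.
print(nonprojective_tokens({1: 3, 2: 0, 3: 2}))

# The percentage is the share of such tokens, using the counts quoted for AGDT.
print(round(100 * 60469 / 309092, 2))
```

Tokens attached directly to the root are never counted under this definition, because the root dominates everything between position 0 and the token.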
| |
I am not aware of any published evaluation of Finnish parsing accuracy. | I am not aware of any published evaluation of Ancient Greek parsing accuracy. |
| |