user:zeman:treebanks:grc [ufal wiki]

Ancient Greek (grc)

Ancient Greek Dependency Treebank (AGDT)

Versions

AGDT 1.2

Obtaining and License

The AGDT is freely downloadable under the Creative Commons Attribution-NonCommercial-ShareAlike 2.5 license. The license in short:

- non-commercial use only
- modification and redistribution are permitted under the same license
- publications must link to the treebank website and cite the principal publication

AGDT was created by volunteer students and researchers from across the world. It is part of the Perseus Digital Library, a project on classical languages hosted at Tufts University, Medford, MA 02155, USA.

References

Website
- http://nlp.perseus.tufts.edu/syntax/treebank/
Data
- no separate citation
Principal publications
- David Bamman, Gregory Crane: The Ancient Greek and Latin Dependency Treebanks. In: Caroline Sporleder, Antal van den Bosch, Kalliopi Zervanou (eds.): Language Technology for Cultural Heritage, ser. Foundations of Human Language Processing and Technology. Springer, Berlin / Heidelberg, Germany, 2011.
Documentation
- XML schema for the file format (also included in the distribution)
- David Bamman, Gregory Crane: Guidelines for the Syntactic Annotation of the Ancient Greek Dependency Treebank (1.1). Tufts University, Medford, Massachusetts, USA, 2008.
- Morphosyntactic tags are described in the README file.

Domain

Mixed (Wikipedia, Wikinews, university web-magazine and blogs).

Size

TDT contains 58576 tokens in 4307 sentences, an average of 13.60 tokens per sentence. No official training-test data split is defined. For our HamleDT experiments, we took the first 90% (53151 tokens / 3877 sentences) for training and the remaining 10% (5425 tokens / 430 sentences) for testing.
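The figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the constants are copied from the paragraph, and the function name `summarize` is made up for illustration.

```python
# Figures quoted in the paragraph above.
TOTAL_TOKENS = 58576
TOTAL_SENTENCES = 4307
TRAIN = (53151, 3877)  # (tokens, sentences)
TEST = (5425, 430)     # (tokens, sentences)


def summarize():
    """Recompute the average sentence length and check the split totals."""
    return {
        "avg_tokens_per_sentence": round(TOTAL_TOKENS / TOTAL_SENTENCES, 2),
        "tokens_add_up": TRAIN[0] + TEST[0] == TOTAL_TOKENS,
        "sentences_add_up": TRAIN[1] + TEST[1] == TOTAL_SENTENCES,
        "train_sentence_share": round(TRAIN[1] / TOTAL_SENTENCES, 3),
    }


if __name__ == "__main__":
    for key, value in summarize().items():
        print(key, value)
```

Because the split is taken over whole sentences, the token counts of the two parts only approximate a 90/10 ratio.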

Inside

The native file format of the treebank is based on XML. TDT is also distributed in the CoNLL-X format. The part-of-speech tag and the morphosyntactic features are joined into a single feature string, which is copied into both the CPOSTAG and the POSTAG columns of the CoNLL format. The FEATS column is empty (i.e. it contains the underscore character). Lemmas are available, too. Morphological annotation and disambiguation are automatic, not gold standard. The native XML format lists all morphological readings of every word according to the lexicon and leaves the disambiguation to the user.

Sample

The first two sentences of the corpus in its native XML format:

<treeset name="http://ranneliike.net/blogi.php?nick=Aboa Kirjoitettu: 02.02.2010, 15:41:06">
  <sentence txt="Kävelyreitti III">
    <token charOff="0-12">
      <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
    </token>
    <token charOff="13-16">
      <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
      <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" />
      <posreading CG="true" baseform="iii" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
    </token>
    <dep dep="1" gov="0" type="num" />
  </sentence>
  <sentence txt="Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin.">
    <token charOff="0-6">
      <posreading CG="true" baseform="jää" rawtags="N ADE SG &lt;up&gt;" />
    </token>
    <token charOff="7-13">
      <posreading CG="true" baseform="kävely" rawtags="DV-U N NOM SG" />
    </token>
    <token charOff="14-18">
      <posreading CG="true" baseform="avata" rawtags="V PRES ACT SG3" />
      <posreading CG="false" baseform="avata" rawtags="V PRES ACT NEG" />
      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT SG2" />
      <posreading CG="false" baseform="avata" rawtags="V IMPV ACT NEG" />
    </token>
    <token charOff="19-23">
      <posreading CG="true" baseform="aina" rawtags="ADV" />
    </token>
    <token charOff="24-32">
      <posreading CG="true" baseform="hauska" rawtags="A POS PTV PL" />
    </token>
    <token charOff="33-35">
      <posreading CG="true" baseform="ja" rawtags="COORD C" />
    </token>
    <token charOff="36-45">
      <posreading CG="true" baseform="erikoinen" rawtags="A POS PTV PL" />
    </token>
    <token charOff="46-56">
      <posreading CG="true" baseform="näkö#kulma" rawtags="N PTV PL" />
    </token>
    <token charOff="57-67">
      <posreading CG="true" baseform="kaupunki" rawtags="N ILL SG" />
    </token>
    <token charOff="67-68">
      <posreading CG="true" baseform="." rawtags="PUNCT" />
    </token>
    <dep dep="0" gov="1" type="nommod" />
    <dep dep="1" gov="2" type="nsubj" />
    <dep dep="3" gov="2" type="advmod" />
    <dep dep="7" gov="2" type="dobj" />
    <dep dep="9" gov="2" type="punct" />
    <dep dep="5" gov="4" type="cc" />
    <dep dep="6" gov="4" type="conj" />
    <dep dep="4" gov="7" type="amod" />
    <dep dep="8" gov="7" type="nommod" />
  </sentence>
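
The sample above can be read with the Python standard library. The sketch below rests on assumptions drawn from the sample alone, not from the schema: `charOff` is taken to be a start-end character range into `txt` with an exclusive end, `dep` and `gov` are taken to be 0-based token indices, a token without a `dep` element is treated as the root, and `CG="true"` is assumed to mark the readings kept by the automatic disambiguation. The function name `read_sentence` is hypothetical, and the embedded sentence is shortened to two of the three readings of the second token.

```python
import xml.etree.ElementTree as ET

# First sentence of the sample, shortened for the example.
SENTENCE_XML = """
<sentence txt="Kävelyreitti III">
  <token charOff="0-12">
    <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
  </token>
  <token charOff="13-16">
    <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
    <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" />
  </token>
  <dep dep="1" gov="0" type="num" />
</sentence>
"""


def read_sentence(xml_text):
    """Return one dict per token with form, readings, 1-based head and relation."""
    sent = ET.fromstring(xml_text.strip())
    text = sent.get("txt")
    tokens = []
    for tok in sent.findall("token"):
        start, end = (int(x) for x in tok.get("charOff").split("-"))
        # Keep only readings flagged CG="true" (assumed meaning: retained reading).
        readings = [
            (r.get("baseform"), r.get("rawtags"))
            for r in tok.findall("posreading")
            if r.get("CG") == "true"
        ]
        # Default to root; overwritten below if a <dep> element names this token.
        tokens.append({"form": text[start:end], "readings": readings,
                       "head": 0, "deprel": "ROOT"})
    for dep in sent.findall("dep"):
        child = int(dep.get("dep"))
        tokens[child]["head"] = int(dep.get("gov")) + 1  # 0-based -> 1-based
        tokens[child]["deprel"] = dep.get("type")
    return tokens


if __name__ == "__main__":
    for i, tok in enumerate(read_sentence(SENTENCE_XML), start=1):
        print(i, tok["form"], tok["head"], tok["deprel"], tok["readings"])
```

Under these assumptions the heads and relations agree with the CoNLL rendering of the same sentence.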

The same two sentences in the CoNLL format:

# b101.d.xml/1
1	Kävelyreitti	kävely|reitti	NOM|up|SG|N	NOM|up|SG|N	_	0	ROOT	_	_
2	III	III	roman|NOM|up|SG|ABBR	roman|NOM|up|SG|ABBR	_	1	num	_	_

# b101.d.xml/2
1	Jäällä	jää	ADE|SG|up|N	ADE|SG|up|N	_	2	nommod	_	_
2	kävely	kävely	DV-U|NOM|SG|N	DV-U|NOM|SG|N	_	3	nsubj	_	_
3	avaa	avata	SG3|ACT|PRES|V	SG3|ACT|PRES|V	_	0	ROOT	_	_
4	aina	aina	ADV	ADV	_	3	advmod	_	_
5	hauskoja	hauska	A|PTV|POS|PL	A|PTV|POS|PL	_	8	amod	_	_
6	ja	ja	C|COORD	C|COORD	_	5	cc	_	_
7	erikoisia	erikoinen	A|PTV|POS|PL	A|PTV|POS|PL	_	5	conj	_	_
8	näkökulmia	näkö|kulma	PTV|PL|N	PTV|PL|N	_	3	dobj	_	_
9	kaupunkiin	kaupunki	ILL|SG|N	ILL|SG|N	_	8	nommod	_	_
10	.	.	PUNCT	PUNCT	_	3	punct	_	_
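
A minimal sketch of reading one such line, assuming ten tab-separated CoNLL-X fields and a vertical bar between the items of the joined tag string. The column names follow the CoNLL-X shared task convention. The function name `parse_conll_line` and the extra `tags` key are made up for illustration.

```python
# Standard CoNLL-X column names, in order.
CONLL_COLUMNS = ("id", "form", "lemma", "cpostag", "postag",
                 "feats", "head", "deprel", "phead", "pdeprel")


def parse_conll_line(line):
    """Split one CoNLL-X token line into a dict and unpack the joined tag string."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(CONLL_COLUMNS):
        raise ValueError(f"expected {len(CONLL_COLUMNS)} fields, got {len(fields)}")
    row = dict(zip(CONLL_COLUMNS, fields))
    row["id"] = int(row["id"])
    row["head"] = int(row["head"])
    # POS tag and morphosyntactic features share one string; split it into items.
    row["tags"] = row["postag"].split("|")
    return row


if __name__ == "__main__":
    sample = "3\tavaa\tavata\tSG3|ACT|PRES|V\tSG3|ACT|PRES|V\t_\t0\tROOT\t_\t_"
    row = parse_conll_line(sample)
    print(row["form"], row["lemma"], row["tags"], row["head"], row["deprel"])
```

The sketch does not try to pick the part of speech out of the joined string, because its position among the items is not fixed in the sample.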

Parsing

Nonprojectivities in TDT are rare. Only 299 of the 58576 tokens (0.51%) are attached nonprojectively.
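
For illustration, one common definition counts a token as nonprojectively attached when some token lying strictly between it and its head is not a descendant of that head. The sketch below applies that definition. Whether the count of 299 was obtained in exactly this way is an assumption. The function names are hypothetical, and the four-token crossing tree in the usage part is an invented example, not treebank data.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` is reached by following head links upward from `node`.

    `heads` is 1-based: heads[i] is the head of token i, and 0 is the
    artificial root. The loop bound guards against malformed (cyclic) input.
    """
    for _ in range(len(heads)):
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node]
    return False


def nonprojective_tokens(head_list):
    """Return the 1-based ids of tokens whose attachment is nonprojective."""
    heads = [0] + list(head_list)  # pad so that token ids index directly
    result = []
    for dep in range(1, len(heads)):
        head = heads[dep]
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must descend from head.
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


if __name__ == "__main__":
    # Heads of the second sample sentence (CoNLL HEAD column).
    print(nonprojective_tokens([2, 3, 0, 3, 8, 5, 5, 3, 8, 3]))
    # Invented crossing tree: token 2 attaches to token 4 across the root token 3.
    print(nonprojective_tokens([3, 4, 0, 3]))
    # Share of nonprojective attachments quoted in the text.
    print(round(100 * 299 / 58576, 2))
```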

I am not aware of any published evaluation of Finnish parsing accuracy.

Institute of Formal and Applied Linguistics Wiki
