user:zeman:treebanks:grc [2011/12/06 15:00] zeman Inside, sample and parsing. |
user:zeman:treebanks:grc [2011/12/06 16:02] zeman Conversion run newly, empty sentences discarded, both sentence and token counts differ. |
==== Size ==== | ==== Size ==== |
| |
AGDT contains 309,092 tokens in 21165 sentences, yielding 14.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the smallest file called ''1999.01.0015.xml'' (5949 tokens / 529 sentences; Aeschylus: //Suppliants//) for testing and the rest (303,143 tokens / 20636 sentences) for training. | AGDT contains 308,882 tokens in 21160 non-empty sentences, yielding 14.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the smallest file called ''1999.01.0015.xml'' (5925 tokens / 528 sentences; Aeschylus: //Suppliants//) for testing and the rest (302,957 tokens / 20632 sentences) for training. |
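The figures of the newer revision can be cross-checked with a few lines of arithmetic. The sketch below only re-adds the test and training counts quoted above and recomputes the average; the variable names are illustrative and not part of any HamleDT script.

```python
# Counts quoted in the newer revision (after empty sentences were discarded).
test_tokens, test_sentences = 5925, 528          # 1999.01.0015.xml
train_tokens, train_sentences = 302957, 20632    # all remaining files

# The two parts should add up to the totals reported for the whole treebank.
total_tokens = test_tokens + train_tokens
total_sentences = test_sentences + train_sentences

# Average sentence length, rounded to two decimals as in the text.
average = total_tokens / total_sentences

print(total_tokens, total_sentences, f"{average:.2f}")
```

The sums reproduce the reported totals of 308,882 tokens and 21160 sentences.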
| |
==== Inside ==== | ==== Inside ==== |
The native file format of the treebank is based on XML. Greek letters are romanized using [[http://www.tlg.uci.edu/encoding/quickbeta.pdf|Beta Code]], a romanization scheme used widely not only in the Perseus project. It can be mapped 1-1 on the original Greek letters in UTF-8; however, embedded non-Greek words (such as the lemmas “comma” and “other”) cannot be identified automatically (and we do not want to decode them). | The native file format of the treebank is based on XML. Greek letters are romanized using [[http://www.tlg.uci.edu/encoding/quickbeta.pdf|Beta Code]], a romanization scheme used widely not only in the Perseus project. It can be mapped 1-1 on the original Greek letters in UTF-8; however, embedded non-Greek words (such as the lemmas “comma” and “other”) cannot be identified automatically (and we do not want to decode them). |
| |
Morphological annotation consists of lemma and nine-character positional morphosyntactic tags. Disambiguation has been done manually (gold standard). | Morphological annotation consists of lemma and nine-character positional morphosyntactic tag. Disambiguation has been done manually (gold standard). |
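As a minimal sketch of how a nine-character positional tag relates to the feature string in the converted CoNLL data, the snippet below splits a tag into one feature per position. The field names and their order are taken from the CoNLL example on this page; the tag string ''a-p---na-'' is reconstructed from that example rather than quoted from the treebank, and the function names are hypothetical.

```python
# Feature names in positional order, as they appear in the CoNLL FEATS column
# of the example row (pos, per, num, ten, mod, voi, gen, cas, deg).
FIELDS = ["pos", "per", "num", "ten", "mod", "voi", "gen", "cas", "deg"]


def tag_to_feats(tag: str) -> str:
    """Expand a nine-character positional tag into name=value pairs."""
    if len(tag) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} characters, got {len(tag)}")
    return "|".join(f"{name}={value}" for name, value in zip(FIELDS, tag))


def feats_to_tag(feats: str) -> str:
    """Collapse a name=value feature string back into a positional tag."""
    values = dict(item.split("=", 1) for item in feats.split("|"))
    return "".join(values[name] for name in FIELDS)


# Assumed tag for the adjective form shown in the CoNLL example row.
feats = tag_to_feats("a-p---na-")
print(feats)
print(feats_to_tag(feats))
```

Keeping the field order in a single list means the two directions of the conversion cannot drift apart.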
| |
The syntactic annotation style is very similar to that of the Prague Dependency Treebank. The syntactic tags (analytical functions) are almost identical, too. However, in AGDT some combined values are permitted that are not valid in PDT, e.g. ''ATR_AP_ExD0_APOS''. | The syntactic annotation style is very similar to that of the Prague Dependency Treebank. The syntactic tags (analytical functions) are almost identical, too. However, in AGDT some combined values are permitted that are not valid in PDT, e.g. ''ATR_AP_ExD0_APOS''. |
</sentence></code> | </sentence></code> |
| |
The same sentence converted to the CoNLL format, with Greek letters decoded: | The first sentence of the corpus converted to the CoNLL format, with Greek letters decoded (note that this is not the same sentence as above because the conversion script reorders sentences according to their sentence id): |
| |
| 1 | ἄσημα | ἄσημος | a | a | pos=a<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=p<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=n<nowiki>|</nowiki>cas=a<nowiki>|</nowiki>deg=- | 6 | OBJ | _ | _ | | | 1 | ἄσημα | ἄσημος | a | a | pos=a<nowiki>|</nowiki>per=-<nowiki>|</nowiki>num=p<nowiki>|</nowiki>ten=-<nowiki>|</nowiki>mod=-<nowiki>|</nowiki>voi=-<nowiki>|</nowiki>gen=n<nowiki>|</nowiki>cas=a<nowiki>|</nowiki>deg=- | 6 | OBJ | _ | _ | |