user:zeman:treebanks:fi [2011/12/05 14:11] zeman References. |
user:zeman:treebanks:fi [2011/12/05 15:37] (current) zeman Inside and parsing. |
==== Domain ==== | ==== Domain ==== |
| |
Mixed: | Mixed (Wikipedia, Wikinews, university web-magazine and blogs). |
* 388 tailored sentences with movement verbs | |
* 732 sentences with movement verbs from the Estonian FrameNet corpus | |
* 175 sentences from the Arborest corpus | |
* 20 sentences of spoken language | |
| |
==== Size ==== | ==== Size ==== |
| |
All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Because the treebank is small and its domains are extraordinarily diverse, a good test set should sample from all four parts of the treebank. Our HamleDT experimental data split, shown in the last two rows of the table, does so. | TDT contains 58576 tokens in 4307 sentences, yielding 13.60 tokens per sentence on average. No official training-test data split is defined. For our HamleDT experiments, we took the first 90 % (53151 tokens / 3877 sentences) for training and the remaining 10 % (5425 tokens / 430 sentences) for testing. |
| |
^ File ^ Sentences ^ Terminals ^ Average t/s ^ | |
| arborest.xml | 175 | 2451 | 14.01 | | |
| piialaused.xml | 732 | 4505 | 6.15 | | |
| ratsepalaused.xml | 388 | 2348 | 6.05 | | |
| sul.xml | 20 | 187 | 9.35 | | |
| **total** | **1315** | **9491** | **7.22** | | |
| training | 1184 | 8535 | 7.21 | | |
| test | 131 | 956 | 7.30 | | |
| |
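The size figures above can be re-derived with a few lines of Python. This is only a sketch: the helper names are hypothetical, and rounding the 90 % boundary up is an assumption that happens to reproduce the reported 3877 training sentences; the page does not state which rounding rule was actually used.

```python
import math


def tokens_per_sentence(tokens, sentences):
    """Average sentence length, rounded to two decimals as in the text."""
    return round(tokens / sentences, 2)


def sequential_split(n_sentences, train_percent=90):
    """Sizes of a 'first N % for training, rest for testing' split.

    Assumption: the boundary is rounded up (not stated in the source).
    """
    n_train = math.ceil(n_sentences * train_percent / 100)
    return n_train, n_sentences - n_train


# Figures quoted in the Size section.
old_avg = tokens_per_sentence(9491, 1315)    # four-part treebank
tdt_avg = tokens_per_sentence(58576, 4307)   # TDT
tdt_train, tdt_test = sequential_split(4307)
```

Note that a purely sequential split of this kind suits TDT only; for the four-part treebank the text asks for a test set sampled from every part.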
==== Inside ==== | ==== Inside ==== |
| |
The treebank is part of the [[http://corp.hum.sdu.dk/tgrepeye_est.html|Arborest]] project and [[http://beta.visl.sdu.dk/|VISL]] (Visual Interactive Syntax Learning). As such, it is based on Constraint Grammar (Fred Karlsson et al., 1995: Constraint Grammar – A Language-Independent System for Parsing Unrestricted Text. Mouton de Gruyter). All four parts are available in the [[http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERSearch/doc/html/TigerXML.html|TIGER-XML]] format. Two of them are also available in the [[http://beta.visl.sdu.dk/treebanks.html#The_source_format|VISL]] format. | The native file format of the treebank is based on XML. Besides that, TDT is also distributed in the [[:format-conll|CoNLL-X format]]. The part-of-speech tag and the morphosyntactic features are joined in one feature string, which is copied into both the CPOS and the POS columns of the CoNLL format. The FEAT column is empty (i.e. it contains the underscore character). Lemmas are available, too. Morphological annotation and disambiguation are automatic; they are not gold standard. The native XML format shows all lexicon-based morphological readings of every word, and the disambiguation is left to the user. |
| |
The annotation contains lemmas, part-of-speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually. | ==== Sample ==== |
| |
Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities. | The first two sentences of the corpus in its native XML format: |
| |
==== Sample ==== | |
| |
The first sentence of the corpus in the TIGER-XML format: | <code xml><treeset name="http://ranneliike.net/blogi.php?nick=Aboa Kirjoitettu: 02.02.2010, 15:41:06"> |
| <sentence txt="Kävelyreitti III"> |
| <token charOff="0-12"> |
| <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG <up>" /> |
| </token> |
| <token charOff="13-16"> |
| <posreading CG="true" baseform="III" rawtags="<roman> ABBR NOM SG <up>" /> |
| <posreading CG="true" baseform="iii" rawtags="ABBR <up>" /> |
| <posreading CG="true" baseform="iii" rawtags="<roman> ABBR NOM SG <up>" /> |
| </token> |
| <dep dep="1" gov="0" type="num" /> |
| </sentence> |
| <sentence txt="Jäällä kävely avaa aina hauskoja ja erikoisia näkökulmia kaupunkiin."> |
| <token charOff="0-6"> |
| <posreading CG="true" baseform="jää" rawtags="N ADE SG <up>" /> |
| </token> |
| <token charOff="7-13"> |
| <posreading CG="true" baseform="kävely" rawtags="DV-U N NOM SG" /> |
| </token> |
| <token charOff="14-18"> |
| <posreading CG="true" baseform="avata" rawtags="V PRES ACT SG3" /> |
| <posreading CG="false" baseform="avata" rawtags="V PRES ACT NEG" /> |
| <posreading CG="false" baseform="avata" rawtags="V IMPV ACT SG2" /> |
| <posreading CG="false" baseform="avata" rawtags="V IMPV ACT NEG" /> |
| </token> |
| <token charOff="19-23"> |
| <posreading CG="true" baseform="aina" rawtags="ADV" /> |
| </token> |
| <token charOff="24-32"> |
| <posreading CG="true" baseform="hauska" rawtags="A POS PTV PL" /> |
| </token> |
| <token charOff="33-35"> |
| <posreading CG="true" baseform="ja" rawtags="COORD C" /> |
| </token> |
| <token charOff="36-45"> |
| <posreading CG="true" baseform="erikoinen" rawtags="A POS PTV PL" /> |
| </token> |
| <token charOff="46-56"> |
| <posreading CG="true" baseform="näkö#kulma" rawtags="N PTV PL" /> |
| </token> |
| <token charOff="57-67"> |
| <posreading CG="true" baseform="kaupunki" rawtags="N ILL SG" /> |
| </token> |
| <token charOff="67-68"> |
| <posreading CG="true" baseform="." rawtags="PUNCT" /> |
| </token> |
| <dep dep="0" gov="1" type="nommod" /> |
| <dep dep="1" gov="2" type="nsubj" /> |
| <dep dep="3" gov="2" type="advmod" /> |
| <dep dep="7" gov="2" type="dobj" /> |
| <dep dep="9" gov="2" type="punct" /> |
| <dep dep="5" gov="4" type="cc" /> |
| <dep dep="6" gov="4" type="conj" /> |
| <dep dep="4" gov="7" type="amod" /> |
| <dep dep="8" gov="7" type="nommod" /> |
| </sentence></code> |
| |
<code xml><s id="ratsep-13" ref="ratsep-1" source="id=ratsep-1" forest="1/1" text="Peeter aerutas üle väina saarele puhkama"> | The same two sentences in the CoNLL format: |
<graph root="ratsep-13_501"> | |
<terminals> | |
<t id="ratsep-13_1" word="Peeter" lemma="Peeter+0" pos="prop" morph="prop,sg,nom,.cap"/> | |
<t id="ratsep-13_2" word="aerutas" lemma="aeruta+s" pos="v-fin" morph="main,indic,impf,ps3,sg,ps,af,.FinV"/> | |
<t id="ratsep-13_3" word="üle" lemma="üle+0" pos="prp" morph="pre,.gen"/> | |
<t id="ratsep-13_4" word="väina" lemma="väin+0" pos="n" morph="com,sg,gen"/> | |
<t id="ratsep-13_5" word="saarele" lemma="saar+le" pos="n" morph="com,sg,all"/> | |
<t id="ratsep-13_6" word="puhkama" lemma="puhka+ma" pos="v-inf" morph="main,sup,ps,ill,.Part"/> | |
<t id="ratsep-13_7" word="." lemma="." pos="punc" morph="Fst"/> | |
</terminals> | |
| |
<nonterminals> | | # b101.d.xml/1 |||||||||| |
<nt id="ratsep-13_501" cat="VROOT"> | | 1 | Kävelyreitti | kävely<nowiki>|</nowiki>reitti | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 0 | ROOT | _ | _ | |
<edge label="STA" idref="ratsep-13_502"/> | | 2 | III | III | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | roman<nowiki>|</nowiki>NOM<nowiki>|</nowiki>up<nowiki>|</nowiki>SG<nowiki>|</nowiki>ABBR | _ | 1 | num | _ | _ | |
</nt> | | |||||||||| |
<nt id="ratsep-13_502" cat="fcl"> | | # b101.d.xml/2 |||||||||| |
<edge label="S" idref="ratsep-13_1"/> | | 1 | Jäällä | jää | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | ADE<nowiki>|</nowiki>SG<nowiki>|</nowiki>up<nowiki>|</nowiki>N | _ | 2 | nommod | _ | _ | |
<edge label="P" idref="ratsep-13_2"/> | | 2 | kävely | kävely | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | DV-U<nowiki>|</nowiki>NOM<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 3 | nsubj | _ | _ | |
<edge label="A" idref="ratsep-13_503"/> | | 3 | avaa | avata | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | SG3<nowiki>|</nowiki>ACT<nowiki>|</nowiki>PRES<nowiki>|</nowiki>V | _ | 0 | ROOT | _ | _ | |
<edge label="A" idref="ratsep-13_5"/> | | 4 | aina | aina | ADV | ADV | _ | 3 | advmod | _ | _ | |
<edge label="A" idref="ratsep-13_6"/> | | 5 | hauskoja | hauska | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 8 | amod | _ | _ | |
<edge label="FST" idref="ratsep-13_7"/> | | 6 | ja | ja | C<nowiki>|</nowiki>COORD | C<nowiki>|</nowiki>COORD | _ | 5 | cc | _ | _ | |
</nt> | | 7 | erikoisia | erikoinen | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | A<nowiki>|</nowiki>PTV<nowiki>|</nowiki>POS<nowiki>|</nowiki>PL | _ | 5 | conj | _ | _ | |
<nt id="ratsep-13_503" cat="pp"> | | 8 | näkökulmia | näkö<nowiki>|</nowiki>kulma | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | PTV<nowiki>|</nowiki>PL<nowiki>|</nowiki>N | _ | 3 | dobj | _ | _ | |
<edge label="H" idref="ratsep-13_3"/> | | 9 | kaupunkiin | kaupunki | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | ILL<nowiki>|</nowiki>SG<nowiki>|</nowiki>N | _ | 8 | nommod | _ | _ | |
<edge label="D" idref="ratsep-13_4"/> | | 10 | . | . | PUNCT | PUNCT | _ | 3 | punct | _ | _ | |
</nt> | |
</nonterminals> | |
</graph> | |
</s></code> | |
| |
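The relation between the native TDT XML sample and its CoNLL rendering can be sketched in Python as follows. The sketch rests on assumptions read off the sample rather than on any documented converter: ''dep''/''gov'' are taken to be 0-based token indices, ''charOff'' an end-exclusive character range into ''txt'', a token without an incoming ''dep'' becomes the root, and the first reading marked ''CG="true"'' is the one kept. The function name ''sentence_to_rows'' is hypothetical, and the angle brackets in ''rawtags'' are written as entities because a literal ''<'' is not allowed inside an XML attribute value.

```python
import xml.etree.ElementTree as ET

# First sentence of the sample, with '<' and '>' in rawtags escaped.
SAMPLE = """<sentence txt="Kävelyreitti III">
  <token charOff="0-12">
    <posreading CG="true" baseform="kävely#reitti" rawtags="N NOM SG &lt;up&gt;" />
  </token>
  <token charOff="13-16">
    <posreading CG="true" baseform="III" rawtags="&lt;roman&gt; ABBR NOM SG &lt;up&gt;" />
    <posreading CG="true" baseform="iii" rawtags="ABBR &lt;up&gt;" />
  </token>
  <dep dep="1" gov="0" type="num" />
</sentence>"""


def sentence_to_rows(sentence):
    """Turn one <sentence> element into CoNLL-X-like 10-column tuples."""
    text = sentence.get("txt")
    # Map 0-based dependent index -> (1-based head, relation label).
    heads = {
        int(d.get("dep")): (int(d.get("gov")) + 1, d.get("type"))
        for d in sentence.iter("dep")
    }
    rows = []
    for i, token in enumerate(sentence.iter("token")):
        start, end = (int(x) for x in token.get("charOff").split("-"))
        readings = token.findall("posreading")
        # Assumption: keep the first reading flagged CG="true".
        chosen = next((r for r in readings if r.get("CG") == "true"), readings[0])
        lemma = chosen.get("baseform").replace("#", "|")
        tags = "|".join(t.strip("<>") for t in chosen.get("rawtags").split())
        head, deprel = heads.get(i, (0, "ROOT"))
        rows.append((i + 1, text[start:end], lemma, tags, tags, "_", head, deprel, "_", "_"))
    return rows


rows = sentence_to_rows(ET.fromstring(SAMPLE))
```

The tag order produced here follows ''rawtags''; the distributed CoNLL files list the same tags in a different order, which this sketch does not try to reproduce.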
==== Parsing ==== | ==== Parsing ==== |
| |
Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%). | Nonprojectivities in TDT are rare. Only 299 out of the 58576 tokens are attached nonprojectively (0.51%). |
| |
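A count of this kind can be computed from the HEAD column alone. The following Python sketch uses one common definition, which is an assumption because the page does not say which definition was applied: a token is attached nonprojectively if some token lying strictly between it and its head is not a descendant of that head. The function names are hypothetical, and the input is assumed to be a well-formed tree (no cycles).

```python
def is_descendant(node, ancestor, heads):
    """True if `ancestor` dominates `node`; heads[i-1] is the head of token i."""
    if ancestor == 0:
        return True  # the artificial root dominates every token
    while node != 0:
        node = heads[node - 1]
        if node == ancestor:
            return True
    return False


def count_nonprojective(heads):
    """Number of tokens whose attachment to their head is nonprojective."""
    count = 0
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((dep, head))
        # Every token in the gap must be dominated by the head.
        if any(not is_descendant(k, head, heads) for k in range(lo + 1, hi)):
            count += 1
    return count


# HEAD column of the second sample sentence (CoNLL rows 1-10).
sample_heads = [2, 3, 0, 3, 8, 5, 5, 3, 8, 3]
# Constructed example: token 2 attaches to token 4 across token 3.
crossing_heads = [3, 4, 0, 3]

sample_count = count_nonprojective(sample_heads)
crossing_count = count_nonprojective(crossing_heads)
tdt_rate = round(100 * 299 / 58576, 2)
```

Summing the per-sentence counts over a whole treebank and dividing by the token total gives a percentage comparable to the ones quoted above.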
There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. Moreover, I cannot rule out that the treebank described here is simply the output of that parser. | I am not aware of any published evaluation of Finnish parsing accuracy. |
| |