Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:et [2011/11/21 13:44]
zeman Inside, sample and parsing.
user:zeman:treebanks:et [2011/11/28 09:48]
zeman Size and nonprojectivity.
Line 37: Line 37:
 ==== Size ====
  
-All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined.
+All four parts of the treebank together contain 9491 tokens in 1315 sentences, yielding 7.22 tokens per sentence on average. No official training-test data split is defined. Due to the small size of the treebank and extraordinary domain diversity, a good test set should sample from all four parts of the treebank.
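
The recommendation to sample the test set from all four parts can be sketched as a per-file (stratified) split. This is only an illustration: the function name `stratified_split`, the `test_every` parameter and the choice of every tenth sentence are assumptions, not a split defined by the treebank or by this page.

```python
def stratified_split(sentences_by_file, test_every=10):
    """Send every `test_every`-th sentence of each file to the test set.

    `sentences_by_file` maps a file name to its list of sentences.
    Splitting inside each file keeps all parts represented in both sets.
    """
    train, test = [], []
    for name in sorted(sentences_by_file):
        for index, sentence in enumerate(sentences_by_file[name]):
            # Hypothetical rule: the 10th, 20th, ... sentence goes to test.
            if index % test_every == test_every - 1:
                test.append((name, sentence))
            else:
                train.append((name, sentence))
    return train, test


# Placeholder sentences (plain integers) sized like the four treebank files.
sizes = {"arborest.xml": 175, "piialaused.xml": 732,
         "ratsepalaused.xml": 388, "sul.xml": 20}
data = {name: list(range(count)) for name, count in sizes.items()}
train, test = stratified_split(data)
print(len(train), len(test))
```

With a file as small as sul.xml (20 sentences) a fixed stride yields only two test sentences, so a different ratio per file may be preferable in practice.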
 + 
 +^ File ^ Sentences ^ Terminals ^ Average t/s ^ 
 +| arborest.xml |  175 |  2451 |  14.01 | 
 +| piialaused.xml |  732 |  4505 |  6.15 | 
 +| ratsepalaused.xml |  388 |  2348 |  6.05 | 
 +| sul.xml |  20 |  187 |  9.35 | 
 +| **total** |  1315 |  9491 |  7.22 |
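
The totals and averages in the table can be re-derived from the per-file counts. The sketch below copies the counts from the table; the names `COUNTS` and `summarize` are illustrative only.

```python
# Per-file counts copied from the table: (sentences, terminals).
COUNTS = {
    "arborest.xml": (175, 2451),
    "piialaused.xml": (732, 4505),
    "ratsepalaused.xml": (388, 2348),
    "sul.xml": (20, 187),
}


def summarize(counts):
    """Return {name: (sentences, terminals, average)} plus a 'total' row."""
    rows = {}
    for name, (sentences, terminals) in counts.items():
        rows[name] = (sentences, terminals, round(terminals / sentences, 2))
    total_sentences = sum(s for s, _ in counts.values())
    total_terminals = sum(t for _, t in counts.values())
    rows["total"] = (
        total_sentences,
        total_terminals,
        round(total_terminals / total_sentences, 2),
    )
    return rows


for name, (sentences, terminals, average) in summarize(COUNTS).items():
    print(f"{name:20} {sentences:5} {terminals:5} {average:6.2f}")
```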
  
 ==== Inside ====
Line 44: Line 51:
  
 The annotation contains lemmas, part of speech tags, morphosyntactic features, nonterminal labels and phrase structure. It is not clear whether (and to what degree) the annotation was performed or checked manually.
 +
 +Note that the TIGER-XML format, despite being phrase-based, stores word order separately from structure and thus allows for nonprojectivities.
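
To illustrate how TIGER-XML separates word order from structure, the sketch below finds nonterminals whose terminals are not contiguous. The XML fragment is invented for illustration (it is not taken from EKP), and the helper name `discontinuous_nodes` is an assumption. Word order is read from the order of the `<t>` elements, structure from the `<edge idref="...">` links.

```python
import xml.etree.ElementTree as ET

# Hypothetical TIGER-XML sentence, invented for this example.
SAMPLE = """
<s id="s1">
  <graph root="s1_501">
    <terminals>
      <t id="s1_1" word="A" pos="X"/>
      <t id="s1_2" word="B" pos="X"/>
      <t id="s1_3" word="C" pos="X"/>
    </terminals>
    <nonterminals>
      <nt id="s1_500" cat="XP">
        <edge label="--" idref="s1_1"/>
        <edge label="--" idref="s1_3"/>
      </nt>
      <nt id="s1_501" cat="S">
        <edge label="--" idref="s1_500"/>
        <edge label="--" idref="s1_2"/>
      </nt>
    </nonterminals>
  </graph>
</s>
"""


def discontinuous_nodes(sentence):
    """Return ids of nonterminals whose terminal yield has a gap."""
    # Word order: position of each terminal in the <terminals> list.
    position = {t.get("id"): i for i, t in enumerate(sentence.iter("t"))}
    # Structure: children of each nonterminal, taken from its edges.
    children = {
        nt.get("id"): [edge.get("idref") for edge in nt.findall("edge")]
        for nt in sentence.iter("nt")
    }

    def yield_of(node_id):
        if node_id in position:
            return {position[node_id]}
        covered = set()
        for child in children.get(node_id, []):
            covered |= yield_of(child)
        return covered

    result = []
    for nt_id in children:
        span = yield_of(nt_id)
        # A contiguous yield covers exactly max - min + 1 positions.
        if span and max(span) - min(span) + 1 != len(span):
            result.append(nt_id)
    return result


sentence = ET.fromstring(SAMPLE.strip())
print(discontinuous_nodes(sentence))  # ['s1_500']
```

Here the node `s1_500` covers the first and third word but not the second, which a bracketing-based format could not express directly.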
  
 ==== Sample ====
Line 83: Line 92:
 ==== Parsing ====
  
-The phrase structure is projective by definition.
+Nonprojectivities in EKP are very rare. Only 7 out of the 9491 tokens are attached nonprojectively (0.074%).
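
The page does not say how nonprojective attachments were counted. A minimal sketch, assuming the treebank has been converted to dependencies and using the common definition (an attachment is nonprojective if some token between the dependent and its head is not a descendant of that head), could look as follows. The function name and the `heads` encoding are assumptions.

```python
def nonprojective_tokens(heads):
    """Return 1-based indices of nonprojectively attached tokens.

    heads[i] is the 1-based head of token i + 1; 0 marks the root.
    Assumes the heads form a tree (no cycles).
    """
    flagged = []
    for dependent in range(1, len(heads) + 1):
        head = heads[dependent - 1]
        if head == 0:
            continue  # every token descends from the artificial root
        low, high = sorted((dependent, head))
        for between in range(low + 1, high):
            # Climb from the intervening token towards the root.
            node = between
            while node != 0 and node != head:
                node = heads[node - 1]
            if node != head:  # not dominated by the head of this arc
                flagged.append(dependent)
                break
    return flagged


# Toy tree: token 2 depends on token 4, but token 3 (the root) lies between.
print(nonprojective_tokens([3, 4, 0, 3]))  # [2]

# The rate quoted in the text: 7 of 9491 tokens.
print(f"{100 * 7 / 9491:.3f}%")  # 0.074%
```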
  
 There is a constraint grammar parser for Estonian by Kaili Müürisep. I am not aware of any published evaluation of its parsing accuracy. Moreover, I cannot rule out that the treebank described here is simply the output of that parser.
  