Institute of Formal and Applied Linguistics Wiki


Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:tr [2012/03/22 20:43]
zeman ODTÜ-Sabancı Türkçe Ağaç Yapılı Derlemi
user:zeman:treebanks:tr [2012/03/22 20:52]
zeman Size.
Line 35:
 ==== Domain ====
  
-Mixed:
-  * Fiction
-  * Short essays by 14 to 16 year-old students
-  * Newspapers (Népszabadság, Népszava, Magyar Hírlap, HVG)
-  * Texts related to computer science
-  * Legal texts
-  * Economic and financial short news
+Post-1990 written Turkish, sampled from various genres.
  
 ==== Size ====
  
-According to their website, SzTB 2.0 contains 1.2 million words plus 250 thousand punctuation tokens in 82000 sentences. Only a fragment was converted to dependencies in the CoNLL 2007 version: 139,143 tokens in 6424 sentences, yielding 21.66 tokens per sentence on average (131,799 tokens / 6034 sentences training, 7344 tokens / 390 sentences test).
+According to their website, the treebank contains 7262 sentences. The CoNLL 2007 version contains 69695 tokens in 5935 sentences, yielding 11.74 tokens per sentence on average (65182 tokens / 5635 sentences training, 4513 tokens / 300 sentences test).
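
The totals and the average in the new Size paragraph can be rechecked from the training/test split. The following is only a minimal sketch: the variable names are illustrative, and the counts are copied from the paragraph above rather than read from the CoNLL 2007 files.

```python
# Counts copied from the Size paragraph (CoNLL 2007 version of the Turkish treebank).
train_tokens, train_sentences = 65182, 5635
test_tokens, test_sentences = 4513, 300

# The totals should match the figures quoted in the paragraph.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens, rounded to two decimals as on the page.
avg_tokens_per_sentence = round(total_tokens / total_sentences, 2)

print(total_tokens, total_sentences, avg_tokens_per_sentence)
```

The same check can be applied to the figures for any other treebank page by swapping in its training and test counts.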
  
 ==== Inside ====