
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ja [2012/01/04 09:34]
zeman Sample.
user:zeman:treebanks:ja [2014/04/22 16:49] (current)
zeman Updated link.
Line 1: Line 1:
 ===== Japanese (ja) ===== ===== Japanese (ja) =====
  
-[[http://www.sfs.uni-tuebingen.de/en/tuebajs.shtml|Tübingen Treebank of Spoken Japanese]] (TüBa-J/S, Verbmobil project)
+[[http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-js.html|Tübingen Treebank of Spoken Japanese]] (TüBa-J/S, Verbmobil project)
  
 ==== Versions ==== ==== Versions ====
Line 42: Line 42:
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset.
+The text has been romanized and the original characters (kanji + kana) are not available. There should be a 1-1 mapping between the romanized text (rōmaji) and the Japanese script of hiragana. There is no indication though where katakana or kanji are preferred over hiragana.
  
-The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually.
-
-Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.
+The morphological analysis does not include lemmas. The part-of-speech tags have been assigned (probably) manually. Few morphosyntactic features are used.
  
 ==== Sample ==== ==== Sample ====
Line 104: Line 102:
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).
+Nonprojectivities in TüBa-J/S are not frequent. Only 1736 of the 157,172 tokens in the CoNLL 2006 version are attached nonprojectively (1.1%).
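
The nonprojectivity counts above can be reproduced in principle with a short script. The following is a minimal sketch, not the procedure actually used for this page: it assumes each sentence is given as a list of 1-based head indices (0 for the artificial root) and treats a token as nonprojectively attached when some token lying between it and its head is not dominated by that head. The function name ''nonprojective_tokens'' and the example sentence are hypothetical.

```python
def nonprojective_tokens(heads):
    """Return 1-based indices of tokens attached nonprojectively.

    heads[i] is the head of token i+1; 0 denotes the artificial root.
    Assumes the heads form a tree (no cycles).
    """
    def dominated_by(node, ancestor):
        # Walk up the tree from node until we reach the ancestor or the root.
        while node != 0 and node != ancestor:
            node = heads[node - 1]
        return node == ancestor

    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if any token strictly between head and
        # dependent is not a descendant of the head.
        if any(not dominated_by(k, head) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Hypothetical 4-token sentence: token 2 depends on token 4,
# but token 3 (the root) lies in between and is not dominated by token 4.
example = [3, 4, 0, 3]
print(nonprojective_tokens(example))

# Share of nonprojective tokens, using the counts quoted for TüBa-J/S.
print(round(100 * 1736 / 157172, 1))
```

Summing the lengths of the returned lists over all sentences and dividing by the total token count gives the percentage quoted for the treebank.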
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:
+The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Japanese:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^ ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 84.79 | 90.58 |
+| Basis (John O'Neil) | 90.57 | 93.16 |
-| Malt (Nivre et al.) | 84.77 | 89.80 |
+| Nara (Yuchang Cheng) | 89.91 | 93.12 |
-| Riedel et al. | 83.63 | 89.66 |
+| Malt (Nivre et al.) | 91.65 | 93.10 |
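
To illustrate what excluding punctuation tokens means for the LAS and UAS columns, here is a minimal sketch. It is not the official shared-task scorer. It assumes that a token counts as punctuation when all of its characters belong to a Unicode punctuation category, and that each token is a tuple ''(form, gold_head, gold_label, predicted_head, predicted_label)''. The function names, the dependency labels and the example sentence are hypothetical.

```python
import unicodedata


def is_punct(form):
    # Assumption: a token is punctuation if every character is in a
    # Unicode "P*" (punctuation) category.
    return bool(form) and all(
        unicodedata.category(ch).startswith("P") for ch in form
    )


def attachment_scores(tokens):
    """Return (LAS, UAS) in percent, ignoring punctuation tokens."""
    scored = [t for t in tokens if not is_punct(t[0])]
    if not scored:
        return 0.0, 0.0
    # UAS: head is correct. LAS: head and dependency label are both correct.
    uas_hits = sum(1 for _, gh, _, ph, _ in scored if gh == ph)
    las_hits = sum(
        1 for _, gh, gl, ph, pl in scored if gh == ph and gl == pl
    )
    return 100 * las_hits / len(scored), 100 * uas_hits / len(scored)


# Hypothetical romanized sentence with made-up labels.
sentence = [
    ("kore", 2, "SBJ", 2, "SBJ"),
    ("desu", 0, "ROOT", 0, "ROOT"),
    ("ka", 2, "PRT", 2, "ADJ"),      # correct head, wrong label
    (".", 2, "PUNCT", 1, "PUNCT"),   # wrong head, but not scored
]
las, uas = attachment_scores(sentence)
print(round(las, 2), round(uas, 2))
```

In this example the misattached full stop does not lower either score, which is the effect of the non-standard evaluation described above.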
  
