
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ja [2012/01/04 09:34]
zeman Sample.
user:zeman:treebanks:ja [2012/01/04 09:54]
zeman Parsing.
Line 42: Line 42:
 ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset.
+The text has been romanized and the original characters (kanji + kana) are not available. There should be a 1-1 mapping between the romanized text (rōmaji) and the Japanese script of hiragana. There is no indication though where katakana or kanji are preferred over hiragana.
  
-The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually.
-
-Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.
+The morphological analysis does not include lemmas. The part-of-speech tags have been assigned (probably) manually. Few morphosyntactic features are used.
  
 ==== Sample ====
Line 104: Line 102:
 ==== Parsing ====
  
-Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).
+Nonprojectivities in TüBa-J/S are not frequent. Only 1736 of the 157,172 tokens in the CoNLL 2006 version are attached nonprojectively (1.1%).
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:
+The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Japanese:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 84.79 | 90.58 |
-| Malt (Nivre et al.) | 84.77 | 89.80 |
-| Riedel et al. | 83.63 | 89.66 |
+| Basis (John O'Neil) | 90.57 | 93.16 |
+| Nara (Yuchang Cheng) | 89.91 | 93.12 |
+| Malt (Nivre et al.) | 91.65 | 93.10 |
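
The nonprojectivity figures quoted in the Parsing section can in principle be recomputed from a CoNLL 2006 (CoNLL-X) file. The following is a minimal sketch, not the procedure actually used for this page. It assumes the standard ten-column, tab-separated CoNLL-X layout (ID in column 1, HEAD in column 7, blank line between sentences) and the common definition that a token is attached nonprojectively if some token between it and its head is not dominated by that head. The function names are hypothetical, and whether the page's counts were obtained with exactly this definition is an assumption.

```python
def is_dominated(node, ancestor, heads):
    """Return True if `ancestor` is reached by climbing from `node`.

    heads[i - 1] is the head of token i (1-based); 0 is the artificial root.
    """
    steps = 0
    while steps <= len(heads):          # guard against cyclic input
        if node == ancestor:
            return True
        if node == 0:
            return False
        node = heads[node - 1]
        steps += 1
    return False


def nonprojective_tokens(heads):
    """List the 1-based IDs of tokens whose incoming arc is nonprojective."""
    result = []
    for dep in range(1, len(heads) + 1):
        head = heads[dep - 1]
        lo, hi = sorted((head, dep))
        # The arc is nonprojective if a token strictly between head and
        # dependent is not a (transitive) descendant of the head.
        if any(not is_dominated(k, head, heads) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


def count_nonprojective(lines):
    """Return (nonprojective tokens, all tokens) for CoNLL-X formatted lines."""
    nonproj = total = 0
    heads = []
    for line in list(lines) + [""]:     # trailing "" flushes the last sentence
        line = line.rstrip("\n")
        if line.strip():
            heads.append(int(line.split("\t")[6]))   # column 7 = HEAD
        elif heads:
            nonproj += len(nonprojective_tokens(heads))
            total += len(heads)
            heads = []
    return nonproj, total


def conll_row(token_id, form, head):
    """Build one CoNLL-X line with placeholder values in the unused columns."""
    return "\t".join([str(token_id), form, "_", "_", "_", "_", str(head), "_", "_", "_"])


# Toy input: in the first sentence token 2 depends on token 4 across the root
# (token 3), so its attachment is nonprojective; the second sentence is trivial.
sample = [conll_row(1, "a", 3), conll_row(2, "b", 4), conll_row(3, "c", 0),
          conll_row(4, "d", 3), "", conll_row(1, "e", 0)]
nonproj, total = count_nonprojective(sample)
print(f"{nonproj} of {total} tokens ({100 * nonproj / total:.2f}%)")
```

On a real treebank the lines would come from the data file, e.g. `count_nonprojective(open(path, encoding="utf-8"))`, where `path` is whatever location the CoNLL 2006 file has locally. The percentages on the page follow directly from the quoted counts: 988 / 100,238 is about 0.99% and 1736 / 157,172 is about 1.1%.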
  
