
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ja [2012/01/03 23:42]
zeman References.
user:zeman:treebanks:ja [2014/04/22 16:49] (current)
zeman Updated link.
Line 1: Line 1:
 ===== Japanese (ja) ===== ===== Japanese (ja) =====
  
-[[http://www.sfs.uni-tuebingen.de/en/tuebajs.shtml|Tübingen Treebank of Spoken Japanese]] (TüBa-J/S, Verbmobil project)+[[http://www.sfs.uni-tuebingen.de/en/ascl/resources/corpora/tueba-js.html|Tübingen Treebank of Spoken Japanese]] (TüBa-J/S, Verbmobil project)
  
 ==== Versions ==== ==== Versions ====
Line 34: Line 34:
 ==== Domain ==== ==== Domain ====
  
-Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from wide range of randomly selected linguistically representative Danish texts from 1983-1992.”)+Spoken dialogues, negotiations about time and place of business meetings. That is why many sentences are relatively short (a frequent single-word sentence is //hai// = “yes”).
  
 ==== Size ==== ==== Size ====
  
-The CoNLL 2006 version contains 100,238 tokens in 5512 sentences, yielding 18.19 tokens per sentence on average (CoNLL 2006 data split: 94386 tokens / 5190 sentences training, 5852 tokens / 322 sentences test).+The CoNLL 2006 version contains 157,172 tokens in 17753 sentences, yielding 8.85 tokens per sentence on average (CoNLL 2006 data split: 151,461 tokens / 17044 sentences training, 5711 tokens / 709 sentences test).
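The size figures of the newer (Japanese) revision can be cross-checked with a few lines of Python. This is only a sketch: the variable names are illustrative, and the counts are the ones quoted in the paragraph above.

```python
# Counts quoted above for the CoNLL 2006 version of TüBa-J/S.
train_tokens, train_sentences = 151_461, 17_044
test_tokens, test_sentences = 5_711, 709

# The training and test parts should add up to the totals given for the whole treebank.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens.
average_length = total_tokens / total_sentences

print(total_tokens, total_sentences, round(average_length, 2))
```

The short average sentence length is consistent with the spoken-dialogue domain described under Domain.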
  
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset.+The text has been romanized and the original characters (kanji + kana) are not available. There should be a 1-1 mapping between the romanized text (rōmaji) and the Japanese script of hiragana. There is no indication though where katakana or kanji are preferred over hiragana.
  
-The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually. +The morphological analysis does not include lemmas. The part-of-speech tags have been assigned (probably) manually. Few morphosyntactic features are used.
- +
-Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.+
  
 ==== Sample ==== ==== Sample ====
  
-The first sentence of DDT 1.0 in the DTAG format: +The first three sentences of the CoNLL 2006 training data:
- +
-<code xml><tei.2> +
-  <teiHeader type=text> +
-    <fileDesc> +
-      <titleStmt> +
-        <title>Tagged sample of: 'Jeltsins skæbnetime'</title> +
-      </titleStmt> +
-      <extent words=158>158 running words</extent> +
-      <publicationStmt> +
-         <distributor>PAROLE-DK</distributor> +
-         <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> +
-         <date>1998-06-02</date> +
-         <availability status=restricted><p>by agreement with distributor</availability> +
-      </publicationStmt> +
-      <sourceDesc> +
-        <biblStruct> +
-          <analytic> +
-            <title>Jeltsins skæbnetime</title> +
-            <author gender=m born=1925>Nikulin, Leon</author> +
-          </analytic> +
-          <monogr> +
-            <imprint><pubPlace>Denmark</pubPlace> +
-              <publisher>Det Fri Aktuelt</publisher> +
-              <date>1992-12-01</date> +
-            </imprint> +
-          </monogr> +
-        </biblStruct> +
-      </sourceDesc> +
-    </fileDesc> +
-    <profileDesc> +
-      <creation>1992-12-01</creation> +
-      <langUsage><language>Danish</langUsage> +
-      <textClass> +
-        <catRef target="P.M2"> +
-        <catRef target="P.G4.8"> +
-        <catRef target="P.T9.3"> +
-      </textClass> +
-    </profileDesc> +
-  </teiHeader> +
-<text id=AJK> +
-<body> +
-<div1 type=main> +
-<p> +
-<s> +
-<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> +
-<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> +
-<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> +
-<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> +
-<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> +
-<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> +
-<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> +
-<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> +
-<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> +
-<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> +
-<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> +
-<W lemma="," msd="XP" in="-2:pnct" out="">,</W> +
-<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> +
-<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> +
-<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> +
-<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> +
-<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> +
-<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> +
-<W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W> +
-<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> +
-<W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W> +
-<W lemma="." msd="XP" in="-12:pnct" out="">.</W> +
-</s></code> +
- +
-The first sentence of the CoNLL 2006 training data:+
  
-| 1 | Samme _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | +| 1 | kasahara | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 2 | HD | 2 | HD | 
-cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef nobj +| 2 | arisa | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-XP pnct +| 3 | desu | <nowiki>_</nowiki> | PV | PVfin | u | 0 | ROOT | ROOT 
-de _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked subj +<nowiki>.</nowiki> <nowiki>_</nowiki> | <nowiki>.</nowiki> | <nowiki>.</nowiki> | <nowiki>_</nowiki>PUNCT PUNCT 
-5 | norske | _ | A AN degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked mod | +| |||||||||| 
-piger NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef 4 | nobj | _ | _ | +kadowaki | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki>HD HD 
-| 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active rel +masakazu | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-med SP SP pobj +| 3 | desu | <nowiki>_</nowiki>PV PVfin ROOT ROOT | 
-i_lørdags | _ | RG RG degree=unmarked mod +| <nowiki>.</nowiki> <nowiki>_</nowiki> <nowiki>.</nowiki> | <nowiki>.</nowiki> <nowiki>_</nowiki>PUNCT PUNCT 
-10 mod | _ | SP SP | _ | pobj +| |||||||||| 
-11 VMs | _ | N | NP case=gen 10 nobj +kadowaki <nowiki>_</nowiki> NAME NAMEper <nowiki>_</nowiki> COMP COMP 
-12 værtsnation | _ | NC gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ | +saN <nowiki>_</nowiki> PNsf <nowiki>_</nowiki> ROOT ROOT 
-| 13 | . | _ | XP 1 | pnct | _ | _ |+omatase <nowiki>_</nowiki> | N | VN <nowiki>_</nowiki> COMP COMP 
 +shimashita <nowiki>_</nowiki> VS VSfin | ta | 0 | ROOT | 0 | ROOT | 
 +| 5 | <nowiki>.</nowiki> <nowiki>_</nowiki> <nowiki>.</nowiki><nowiki>.</nowiki> <nowiki>_</nowiki> PUNCT PUNCT |
  
-The first sentence of the CoNLL 2006 test data:+The first three sentences of the CoNLL 2006 test data:
  
-| 1 | To | _ | AC case=unmarked 10 subj +| 1 | tashiro <nowiki>_</nowiki> NAME NAMEper <nowiki>_</nowiki> HD HD 
-| 2 | kendte | _ | AN degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked mod +| 2 | yasuko <nowiki>_</nowiki> NAME NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-russiske AN degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked mod | +| 3 | desu | <nowiki>_</nowiki> | PV | PVfin | u | 0 | ROOT | 0 | ROOT | 
-historikere NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef nobj +| 4 | <nowiki>.</nowiki> <nowiki>_</nowiki> | <nowiki>.</nowiki> | <nowiki>.</nowiki> | <nowiki>_</nowiki>PUNCT PUNCT 
-Andronik NP case=unmarked namef +| |||||||||| 
-Mirganjan | _ | NP case=unmarked appr +| 1 | hayakawa | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 2 | HD | 2 | HD | 
-og | _ | CC | _ | 6 | coord +| 2 | ryou | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-Igor | _ | N | NP case=unmarked namef +| 3 | desu | <nowiki>_</nowiki>PV PVfin ROOT ROOT | 
-Klamkin | _ | NP case=unmarked conj +| <nowiki>.</nowiki> <nowiki>_</nowiki> | <nowiki>.</nowiki> <nowiki>.</nowiki> | <nowiki>_</nowiki>PUNCT PUNCT 
-10 tror | _ | VA mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active ROOT | _ | +| |||||||||| 
-11 ikke | _ | RG RG degree=unmarked 10 mod +hayakawa <nowiki>_</nowiki> NAME NAMEper <nowiki>_</nowiki> COMP COMP 
-12 | _ | XP | _ | 10 pnct +saN <nowiki>_</nowiki> PNsf <nowiki>_</nowiki> | 6 | ADJ ADJ 
-13 at | _ | CS | _ | 10 dobj +ima <nowiki>_</nowiki> | N | Ntmp <nowiki>_</nowiki> ADJ ADJ 
-14 Rusland | _ | NP case=unmarked 15 subj +chotto <nowiki>_</nowiki> ADV ADV <nowiki>_</nowiki> ADJ ADJ 
-15 kan | _ | VA mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj +ojikaN <nowiki>_</nowiki> NN | <nowiki>_</nowiki> | 6 | SBJ | 6 | SBJ | 
-16 udvikles | _ | VA mood=infin<nowiki>|</nowiki>voice=passive 15 vobj +| 6 | yoroshii | <nowiki>_</nowiki>ADJ ADJifin <nowiki>_</nowiki> 7 | COMP | 7 | COMP 
-17 uden | _ | SP SP | _ | 15 | mod +desu <nowiki>_</nowiki> PV PVfin ROOT ROOT 
-18 en | _ | P | PI gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj +ka <nowiki>_</nowiki> PS PSE <nowiki>_</nowiki> MRK MRK 
-19 | _ | XP | _ | 20 pnct +chotto <nowiki>_</nowiki> ADV ADV <nowiki>_</nowiki> 16 ADJ 16 ADJ 
-20 jernnæve _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef 18 nobj +10 doitsu <nowiki>_</nowiki> NAME NAMEloc <nowiki>_</nowiki> 11 COMP 11 COMP 
-21 | _ | XP | _ | 20 pnct +11 no <nowiki>_</nowiki> Pgen | <nowiki>_</nowiki> | 12 | ADJ | 12 | ADJ | 
-22 | . | _ | XP | _ | 10 pnct |+| 12 | shucchou | <nowiki>_</nowiki> | N | VN | <nowiki>_</nowiki> | 13 | COMP 13 COMP 
 +13 no <nowiki>_</nowiki> Pgen | <nowiki>_</nowiki>14 COMP 14 COMP 
 +14 koto <nowiki>_</nowiki> NF <nowiki>_</nowiki> | 15 | COMP 15 COMP 
 +15 de <nowiki>_</nowiki> | P | | <nowiki>_</nowiki> | 16 | ADJ | 16 | ADJ | 
 +| 16 | gosoudaN | <nowiki>_</nowiki> | N | VN | <nowiki>_</nowiki> | 17 | COMP 17 COMP 
 +17 shitai <nowiki>_</nowiki> VADJ VADJi <nowiki>_</nowiki> 18 COMP 18 COMP 
 +18 no | <nowiki>_</nowiki> | N | NF | <nowiki>_</nowiki> | 19 | COMP | 19 | COMP | 
 +| 19 | desu | <nowiki>_</nowiki>PV PVfin 20 | COMP | 20 | COMP 
 +20 ga <nowiki>_</nowiki> PS PSSb <nowiki>_</nowiki> ROOT ROOT 
 +21 <nowiki>.</nowiki> <nowiki>_</nowiki> <nowiki>.</nowiki> <nowiki>.</nowiki> <nowiki>_</nowiki> 20 PUNCT 20 PUNCT |
  
 ==== Parsing ==== ==== Parsing ====
  
-Nonprojectivities in DDT are not frequent. Only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).+Nonprojectivities in TüBa-J/S are not frequent. Only 1736 of the 157,172 tokens in the CoNLL 2006 version are attached nonprojectively (1.1%).
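A minimal sketch of how nonprojectively attached tokens can be counted, assuming the common definition: an arc is projective if its head dominates every token lying between the head and the dependent. Whether the counts on this page were obtained with exactly this definition is an assumption, and the function names and the two toy head lists are hypothetical. The input is assumed to be a well-formed tree (no cycles), with `0` standing for the artificial root.

```python
def dominates(heads, ancestor, node):
    """Return True if `ancestor` is `node` itself or lies on its path to the root (0)."""
    while node != 0:
        if node == ancestor:
            return True
        node = heads[node - 1]  # heads[i - 1] is the head of token i
    return ancestor == 0  # the artificial root dominates every token


def nonprojective_tokens(heads):
    """Return the set of 1-based token ids whose incoming arc is nonprojective."""
    found = set()
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((dependent, head))
        # Every token strictly between head and dependent must be dominated by the head.
        for between in range(low + 1, high):
            if not dominates(heads, head, between):
                found.add(dependent)
                break
    return found


projective = [2, 3, 0, 3]  # toy chain: 1 -> 2 -> 3 -> root, 4 -> 3
crossing = [3, 4, 0, 3]    # toy tree in which the arc 4 -> 2 skips over token 3

print(nonprojective_tokens(projective))
print(nonprojective_tokens(crossing))

# Share of nonprojectively attached tokens quoted for TüBa-J/S.
print(round(100 * 1736 / 157172, 1))
```

Applied to every sentence of a treebank, the sizes of the returned sets add up to the number of nonprojectively attached tokens.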
  
-The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Danish:+The results of the CoNLL 2006 shared task are [[http://ilk.uvt.nl/conll/results.html|available online]]. They have been published in [[http://aclweb.org/anthology-new/W/W06/W06-2920.pdf|(Buchholz and Marsi, 2006)]]. The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Japanese:
  
 ^ Parser (Authors) ^ LAS ^ UAS ^ ^ Parser (Authors) ^ LAS ^ UAS ^
-| MST (McDonald et al.) | 84.79 | 90.58 | +| Basis (John O'Neil) | 90.57 | 93.16 | 
-| Malt (Nivre et al.) | 84.77 | 89.80 | +| Nara (Yuchang Cheng) | 89.91 | 93.12 | 
-| Riedel et al. | 83.63 | 89.66 | +| Malt (Nivre et al.) | 91.65 | 93.10 |
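The two scores in the table can be illustrated with a short sketch that skips punctuation tokens, as the evaluation described above did. It assumes that a token counts as punctuation when all characters of its word form belong to a Unicode punctuation category; the function names and the toy data (including the head assigned to the final period) are hypothetical and are not taken from the official evaluation script.

```python
import unicodedata


def is_punctuation(form):
    """Assumption: a token is punctuation if every character is Unicode punctuation."""
    return all(unicodedata.category(ch).startswith("P") for ch in form)


def attachment_scores(gold, predicted):
    """Return (LAS, UAS) in percent over non-punctuation tokens.

    Both arguments are equally long lists of (form, head, deprel) triples.
    """
    scored = uas_hits = las_hits = 0
    for (form, gold_head, gold_rel), (_, pred_head, pred_rel) in zip(gold, predicted):
        if is_punctuation(form):
            continue  # excluded from scoring
        scored += 1
        if gold_head == pred_head:
            uas_hits += 1  # correct head
            if gold_rel == pred_rel:
                las_hits += 1  # correct head and correct label
    if scored == 0:
        raise ValueError("no scorable tokens")
    return 100 * las_hits / scored, 100 * uas_hits / scored


# Toy data modelled on the first training sentence.
gold = [("kasahara", 2, "HD"), ("arisa", 3, "COMP"), ("desu", 0, "ROOT"), (".", 3, "PUNCT")]
predicted = [("kasahara", 2, "HD"), ("arisa", 3, "ADJ"), ("desu", 0, "ROOT"), (".", 1, "PUNCT")]

las, uas = attachment_scores(gold, predicted)
print(round(las, 2), round(uas, 2))
```

In the toy prediction the wrongly attached period does not affect either score, while the wrong label on //arisa// lowers LAS but not UAS.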
  
