
Institute of Formal and Applied Linguistics Wiki



Differences

This shows you the differences between two versions of the page.

user:zeman:treebanks:ja [2012/01/03 23:42]
zeman References.
user:zeman:treebanks:ja [2012/01/04 09:45]
zeman Inside.
Line 34: Line 34:
 ==== Domain ==== ==== Domain ====
  
-Unknown (the underlying PAROLE corpus “consists of quotations of 150-250 words from wide range of randomly selected linguistically representative Danish texts from 1983-1992.”)+Spoken dialogues, negotiations about time and place of business meetings. That is why many sentences are relatively short (a frequent single-word sentence is //hai// = “yes”).
  
 ==== Size ==== ==== Size ====
  
-The CoNLL 2006 version contains 100,238 tokens in 5512 sentences, yielding 18.19 tokens per sentence on average (CoNLL 2006 data split: 94386 tokens / 5190 sentences training, 5852 tokens / 322 sentences test).+The CoNLL 2006 version contains 157,172 tokens in 17753 sentences, yielding 8.85 tokens per sentence on average (CoNLL 2006 data split: 151,461 tokens / 17044 sentences training, 5711 tokens / 709 sentences test).
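The size figures in the new revision can be cross-checked with a few lines of arithmetic. This is a minimal sketch that uses only the counts quoted above; the variable names are illustrative and not part of any corpus tooling.

```python
# Token and sentence counts quoted for the CoNLL 2006 data split.
train_tokens, train_sentences = 151_461, 17_044
test_tokens, test_sentences = 5_711, 709

# Totals should match the figures given for the whole treebank.
total_tokens = train_tokens + test_tokens
total_sentences = train_sentences + test_sentences

# Average sentence length in tokens.
avg_tokens_per_sentence = total_tokens / total_sentences

print(total_tokens, total_sentences, round(avg_tokens_per_sentence, 2))
```

The training and test counts sum to the stated totals, and the average rounds to the stated 8.85 tokens per sentence.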
  
 ==== Inside ==== ==== Inside ====
  
-The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There //should// be a 1-1 mapping between the [[http://www.buch-kromann.dk/matthias/treebank/PAROLE-manual.pdf|DDT positional tags]] and the CoNLL 2006 annotation. Use [[http://quest.ms.mff.cuni.cz/cgi-bin/interset/index.pl?tagset=da::conll|DZ Interset]] to inspect the CoNLL tagset.+The text has been romanized and the original characters (kanji + kana) are not available. There should be a 1-1 mapping between the romanized text (rōmaji) and the Japanese script of hiragana. There is no indication though where katakana or kanji are preferred over hiragana.
  
-The morphological analysis in the CoNLL 2006 version does not include lemmas (the original DTAG version does contain them). The morphosyntactic tags have been assigned (probably) manually. +The morphological analysis does not include lemmas. The part-of-speech tags have been assigned (probably) manually. Few morphosyntactic features are used.
- +
-Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = on Saturdays) but not named entities.+
  
 ==== Sample ==== ==== Sample ====
  
-The first sentence of DDT 1.0 in the DTAG format: +The first three sentences of the CoNLL 2006 training data:
- +
-<code xml><tei.2> +
-  <teiHeader type=text> +
-    <fileDesc> +
-      <titleStmt> +
-        <title>Tagged sample of: 'Jeltsins skæbnetime'</title> +
-      </titleStmt> +
-      <extent words=158>158 running words</extent> +
-      <publicationStmt> +
-         <distributor>PAROLE-DK</distributor> +
-         <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address> +
-         <date>1998-06-02</date> +
-         <availability status=restricted><p>by agreement with distributor</availability> +
-      </publicationStmt> +
-      <sourceDesc> +
-        <biblStruct> +
-          <analytic> +
-            <title>Jeltsins skæbnetime</title> +
-            <author gender=m born=1925>Nikulin, Leon</author> +
-          </analytic> +
-          <monogr> +
-            <imprint><pubPlace>Denmark</pubPlace> +
-              <publisher>Det Fri Aktuelt</publisher> +
-              <date>1992-12-01</date> +
-            </imprint> +
-          </monogr> +
-        </biblStruct> +
-      </sourceDesc> +
-    </fileDesc> +
-    <profileDesc> +
-      <creation>1992-12-01</creation> +
-      <langUsage><language>Danish</langUsage> +
-      <textClass> +
-        <catRef target="P.M2"> +
-        <catRef target="P.G4.8"> +
-        <catRef target="P.T9.3"> +
-      </textClass> +
-    </profileDesc> +
-  </teiHeader> +
-<text id=AJK> +
-<body> +
-<div1 type=main> +
-<p> +
-<s> +
-<W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W> +
-<W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W> +
-<W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W> +
-<W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W> +
-<W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W> +
-<W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W> +
-<W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W> +
-<W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W> +
-<W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W> +
-<W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W> +
-<W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W> +
-<W lemma="," msd="XP" in="-2:pnct" out="">,</W> +
-<W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W> +
-<W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W> +
-<W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W> +
-<W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W> +
-<W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W> +
-<W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W> +
-<W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W> +
-<W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W> +
-<W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W> +
-<W lemma="." msd="XP" in="-12:pnct" out="">.</W> +
-</s></code> +
- +
-The first sentence of the CoNLL 2006 training data:+
  
-| 1 | Samme _ | A | AN | degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=sing/plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked | 0 | ROOT | +| 1 | kasahara | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 2 | HD | 2 | HD | 
-cifre | _ | N | NC | gender=neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef nobj +| 2 | arisa | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-XP pnct +| 3 | desu | <nowiki>_</nowiki> | PV | PVfin | u | 0 | ROOT | ROOT 
-de _ | P | PD | gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked subj +<nowiki>.</nowiki> <nowiki>_</nowiki> | <nowiki>.</nowiki> | <nowiki>.</nowiki> | <nowiki>_</nowiki>PUNCT PUNCT 
-5 | norske | _ | A AN degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked mod | +| |||||||||| 
-piger NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef 4 | nobj | _ | _ | +kadowaki | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki>HD HD 
-| 7 | tabte | _ | V | VA | mood=indic<nowiki>|</nowiki>tense=past<nowiki>|</nowiki>voice=active rel +masakazu | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-med SP SP pobj +| 3 | desu | <nowiki>_</nowiki>PV PVfin ROOT ROOT | 
-i_lørdags | _ | RG RG degree=unmarked mod +| <nowiki>.</nowiki> <nowiki>_</nowiki> <nowiki>.</nowiki> | <nowiki>.</nowiki> <nowiki>_</nowiki>PUNCT PUNCT 
-10 mod | _ | SP SP | _ | pobj +| |||||||||| 
-11 VMs | _ | N | NP case=gen 10 nobj +kadowaki <nowiki>_</nowiki> NAME NAMEper <nowiki>_</nowiki> COMP COMP 
-12 værtsnation | _ | NC gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef | 11 | possd | _ | _ | +saN <nowiki>_</nowiki> PNsf <nowiki>_</nowiki> ROOT ROOT 
-| 13 | . | _ | XP 1 | pnct | _ | _ |+omatase <nowiki>_</nowiki> | N | VN <nowiki>_</nowiki> COMP COMP 
 +shimashita <nowiki>_</nowiki> VS VSfin | ta | 0 | ROOT | 0 | ROOT | 
 +| 5 | <nowiki>.</nowiki> <nowiki>_</nowiki> <nowiki>.</nowiki><nowiki>.</nowiki> <nowiki>_</nowiki> PUNCT PUNCT |
  
-The first sentence of the CoNLL 2006 test data:+The first three sentences of the CoNLL 2006 test data:
  
-| 1 | To | _ | AC case=unmarked 10 subj +| 1 | tashiro <nowiki>_</nowiki> NAME NAMEper <nowiki>_</nowiki> HD HD 
-| 2 | kendte | _ | AN degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked mod +| 2 | yasuko <nowiki>_</nowiki> NAME NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-russiske AN degree=pos<nowiki>|</nowiki>gender=common/neuter<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=def/indef<nowiki>|</nowiki>transcat=unmarked mod | +| 3 | desu | <nowiki>_</nowiki> | PV | PVfin | u | 0 | ROOT | 0 | ROOT | 
-historikere NC | gender=common<nowiki>|</nowiki>number=plur<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef nobj +| 4 | <nowiki>.</nowiki> <nowiki>_</nowiki> | <nowiki>.</nowiki> | <nowiki>.</nowiki> | <nowiki>_</nowiki>PUNCT PUNCT 
-Andronik NP case=unmarked namef +| |||||||||| 
-Mirganjan | _ | NP case=unmarked appr +| 1 | hayakawa | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 2 | HD | 2 | HD | 
-og | _ | CC | _ | 6 | coord +| 2 | ryou | <nowiki>_</nowiki> | NAME | NAMEper | <nowiki>_</nowiki> | 3 | COMP | 3 | COMP | 
-Igor | _ | N | NP case=unmarked namef +| 3 | desu | <nowiki>_</nowiki>PV PVfin ROOT ROOT | 
-Klamkin | _ | NP case=unmarked conj +| <nowiki>.</nowiki> <nowiki>_</nowiki> | <nowiki>.</nowiki> <nowiki>.</nowiki> | <nowiki>_</nowiki>PUNCT PUNCT 
-10 tror | _ | VA mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active ROOT | _ | +| |||||||||| 
-11 ikke | _ | RG RG degree=unmarked 10 mod +hayakawa <nowiki>_</nowiki> NAME NAMEper <nowiki>_</nowiki> COMP COMP 
-12 | _ | XP | _ | 10 pnct +saN <nowiki>_</nowiki> PNsf <nowiki>_</nowiki> | 6 | ADJ ADJ 
-13 at | _ | CS | _ | 10 dobj +ima <nowiki>_</nowiki> | N | Ntmp <nowiki>_</nowiki> ADJ ADJ 
-14 Rusland | _ | NP case=unmarked 15 subj +chotto <nowiki>_</nowiki> ADV ADV <nowiki>_</nowiki> ADJ ADJ 
-15 kan | _ | VA mood=indic<nowiki>|</nowiki>tense=present<nowiki>|</nowiki>voice=active | 13 | vobj +ojikaN <nowiki>_</nowiki> NN | <nowiki>_</nowiki> | 6 | SBJ | 6 | SBJ | 
-16 udvikles | _ | VA mood=infin<nowiki>|</nowiki>voice=passive 15 vobj +| 6 | yoroshii | <nowiki>_</nowiki>ADJ ADJifin <nowiki>_</nowiki> 7 | COMP | 7 | COMP 
-17 uden | _ | SP SP | _ | 15 | mod +desu <nowiki>_</nowiki> PV PVfin ROOT ROOT 
-18 en | _ | P | PI gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>register=unmarked | 17 | nobj +ka <nowiki>_</nowiki> PS PSE <nowiki>_</nowiki> MRK MRK 
-19 | _ | XP | _ | 20 pnct +chotto <nowiki>_</nowiki> ADV ADV <nowiki>_</nowiki> 16 ADJ 16 ADJ 
-20 jernnæve _ | N | NC | gender=common<nowiki>|</nowiki>number=sing<nowiki>|</nowiki>case=unmarked<nowiki>|</nowiki>def=indef 18 nobj +10 doitsu <nowiki>_</nowiki> NAME NAMEloc <nowiki>_</nowiki> 11 COMP 11 COMP 
-21 | _ | XP | _ | 20 pnct +11 no <nowiki>_</nowiki> Pgen | <nowiki>_</nowiki> | 12 | ADJ | 12 | ADJ | 
-22 | . | _ | XP | _ | 10 pnct |+| 12 | shucchou | <nowiki>_</nowiki> | N | VN | <nowiki>_</nowiki> | 13 | COMP 13 COMP 
 +13 no <nowiki>_</nowiki> Pgen | <nowiki>_</nowiki>14 COMP 14 COMP 
 +14 koto <nowiki>_</nowiki> NF <nowiki>_</nowiki> | 15 | COMP 15 COMP 
 +15 de <nowiki>_</nowiki> | P | | <nowiki>_</nowiki> | 16 | ADJ | 16 | ADJ | 
 +| 16 | gosoudaN | <nowiki>_</nowiki> | N | VN | <nowiki>_</nowiki> | 17 | COMP 17 COMP 
 +17 shitai <nowiki>_</nowiki> VADJ VADJi <nowiki>_</nowiki> 18 COMP 18 COMP 
 +18 no | <nowiki>_</nowiki> | N | NF | <nowiki>_</nowiki> | 19 | COMP | 19 | COMP | 
 +| 19 | desu | <nowiki>_</nowiki>PV PVfin 20 | COMP | 20 | COMP 
 +20 ga <nowiki>_</nowiki> PS PSSb <nowiki>_</nowiki> ROOT ROOT 
 +21 <nowiki>.</nowiki> <nowiki>_</nowiki> <nowiki>.</nowiki> <nowiki>.</nowiki> <nowiki>_</nowiki> 20 PUNCT 20 PUNCT |
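The sample rows follow the ten-column CoNLL 2006 (CoNLL-X) layout: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL. The sketch below shows one way such data could be read. It assumes the distributed files are tab-separated with a blank line between sentences (the pipes above are wiki table markup). The names `COLUMNS`, `read_conll` and `SAMPLE` are hypothetical, and the sample contains only three rows that are fully legible in the tables above, so the punctuation row is left out.

```python
from typing import Dict, List

# Column names of the CoNLL-X format, in file order.
COLUMNS = ["id", "form", "lemma", "cpostag", "postag",
           "feats", "head", "deprel", "phead", "pdeprel"]


def read_conll(text: str) -> List[List[Dict[str, str]]]:
    """Split CoNLL-X text into sentences, each a list of token dicts."""
    sentences: List[List[Dict[str, str]]] = []
    current: List[Dict[str, str]] = []
    for line in text.splitlines():
        if not line.strip():
            # A blank line closes the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        fields = line.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError(f"expected {len(COLUMNS)} columns, got {len(fields)}")
        current.append(dict(zip(COLUMNS, fields)))
    if current:
        sentences.append(current)
    return sentences


# Illustrative input: three rows copied from the sample, tab-separated.
SAMPLE = (
    "1\tkasahara\t_\tNAME\tNAMEper\t_\t2\tHD\t2\tHD\n"
    "2\tarisa\t_\tNAME\tNAMEper\t_\t3\tCOMP\t3\tCOMP\n"
    "3\tdesu\t_\tPV\tPVfin\tu\t0\tROOT\t0\tROOT\n"
)

sentences = read_conll(SAMPLE)
for token in sentences[0]:
    print(token["form"], token["postag"], token["head"], token["deprel"])
```

Keeping every field as a string avoids guessing at value types; HEAD can be converted with `int()` where a numeric index is needed.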
  
 ==== Parsing ==== ==== Parsing ====
