Japanese (ja)
Tübingen Treebank of Spoken Japanese (TüBa-J/S, Verbmobil project)
Versions
- Original TüBa-J/S in the Negra export format
- CoNLL 2006
The original TüBa-J/S is HPSG-oriented: the annotation distinguishes a lexical level, a phrasal level and a clausal level, with dependency edges between nodes. The CoNLL version contains only the dependency relations.
Obtaining and License
To obtain the treebank, download the license agreement, print it, fill it out, sign it, scan it and send it back to Kathrin Beck (kbeck (at) sfs (dot) uni-tuebingen (dot) de). She will send you the password for the download page. The license in short:
- academic research usage
- no redistribution
- citation in publications is not required (but it is common decency)
TüBa-J/S was created in the Verbmobil project by members of the Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen, Wilhelmstrasse 19, D-72074 Tübingen, Germany.
References
- Website
- Data
  - no separate citation
- Principal publications
  - Yasuhiro Kawata, Julia Bartels: Stylebook for the Japanese Treebank in Verbmobil. Report 240, September 29, 2000.
  - Sabine Buchholz, Erwin Marsi: CoNLL-X shared task on Multilingual Dependency Parsing. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X), pp. 149–164, New York, USA, 2006.
- Documentation
  - Yasuhiro Kawata, Julia Bartels: Stylebook for the Japanese Treebank in Verbmobil, Report 240, has been distributed together with the CoNLL 2006 version of the treebank (file doc/report-240-00.ps).
Domain
Spoken dialogues: negotiations about the time and place of business meetings. That is why many sentences are relatively short (a frequent single-word sentence is hai = “yes”).
Size
The CoNLL 2006 version contains 157,172 tokens in 17,753 sentences, yielding 8.85 tokens per sentence on average (CoNLL 2006 data split: 151,461 tokens / 17,044 sentences training, 5,711 tokens / 709 sentences test).
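These figures are easy to recompute. Below is a minimal sketch in Python that counts tokens and sentences in a CoNLL-X file (one token per line, sentences separated by blank lines); the file name ja_train.conll is an assumption, substitute whatever name your copy of the training split uses.

def count_conll(path):
    """Count token lines and blank-line-separated sentences in a CoNLL-X file."""
    tokens = sentences = 0
    in_sentence = False
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():
                tokens += 1
                in_sentence = True
            elif in_sentence:
                sentences += 1
                in_sentence = False
    if in_sentence:          # the file may lack a trailing blank line
        sentences += 1
    return tokens, sentences

tokens, sentences = count_conll("ja_train.conll")   # hypothetical file name
print(tokens, sentences, round(tokens / sentences, 2))
# For the official training split this should print 151461 17044 8.89.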
Inside
The original morphosyntactic tags have been converted to fit into the three columns (CPOS, POS and FEAT) of the CoNLL format. There should be a 1-1 mapping between the original positional tags and the CoNLL 2006 annotation. Use DZ Interset to inspect the CoNLL tagset.
The morphological analysis in the CoNLL 2006 version does not include lemmas (the original version does contain them). The morphosyntactic tags have been assigned (probably) manually.
Some multi-word expressions have been collapsed into one token, using underscore as the joining character. This includes adverbially used prepositional phrases (e.g. i_lørdags = last Saturday) but not named entities.
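If you just want to eyeball the converted tagset, you do not need a full Interset installation; a few lines of Python over the CPOS, POS and FEATS columns suffice. A minimal sketch (the file name is again an assumption):

from collections import Counter

tagset = Counter()
with open("ja_train.conll", encoding="utf-8") as f:   # hypothetical file name
    for line in f:
        cols = line.rstrip("\n").split("\t")
        if len(cols) >= 6:                 # a token line, not a sentence separator
            cpos, pos, feats = cols[3], cols[4], cols[5]
            tagset[(cpos, pos, feats)] += 1

# Print the 20 most frequent (CPOS, POS, FEATS) combinations.
for (cpos, pos, feats), n in tagset.most_common(20):
    print(f"{n:7d}  {cpos}\t{pos}\t{feats}")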
Sample
The first sentence of DDT 1.0 in the DTAG format:
<tei.2>
 <teiHeader type=text>
  <fileDesc>
   <titleStmt> <title>Tagged sample of: 'Jeltsins skæbnetime'</title> </titleStmt>
   <extent words=158>158 running words</extent>
   <publicationStmt>
    <distributor>PAROLE-DK</distributor>
    <address><addrline>Christians Brygge 1,1., DK-1219 Copenhagen K.</address>
    <date>1998-06-02</date>
    <availability status=restricted><p>by agreement with distributor</availability>
   </publicationStmt>
   <sourceDesc>
    <biblStruct>
     <analytic>
      <title>Jeltsins skæbnetime</title>
      <author gender=m born=1925>Nikulin, Leon</author>
     </analytic>
     <monogr>
      <imprint><pubPlace>Denmark</pubPlace> <publisher>Det Fri Aktuelt</publisher> <date>1992-12-01</date> </imprint>
     </monogr>
    </biblStruct>
   </sourceDesc>
  </fileDesc>
  <profileDesc>
   <creation>1992-12-01</creation>
   <langUsage><language>Danish</langUsage>
   <textClass>
    <catRef target="P.M2">
    <catRef target="P.G4.8">
    <catRef target="P.T9.3">
   </textClass>
  </profileDesc>
 </teiHeader>
 <text id=AJK>
  <body>
   <div1 type=main>
    <p>
     <s>
      <W lemma="to" msd="AC---U=--" in="9:subj" out="1:mod|2:mod|3:nobj|5:appr">To</W>
      <W lemma="kendt" msd="ANP[CN]PU=[DI]U" in="-1:mod" out="">kendte</W>
      <W lemma="russisk" msd="ANP[CN]PU=[DI]U" in="-2:mod" out="">russiske</W>
      <W lemma="historiker" msd="NCCPU==I" in="-3:nobj" out="">historikere</W>
      <W lemma="Andronik" msd="NP--U==-" in="1:namef" out="">Andronik</W>
      <W lemma="Mirganjan" msd="NP--U==-" in="-5:appr" out="-1:namef|1:coord">Mirganjan</W>
      <W lemma="og" msd="CC" in="-1:coord" out="2:conj">og</W>
      <W lemma="Igor" msd="NP--U==-" in="1:namef" out="">Igor</W>
      <W lemma="Klamkin" msd="NP--U==-" in="-2:conj" out="-1:namef">Klamkin</W>
      <W lemma="tro" msd="VADR=----A-" in="" out="-9:subj|1:mod|2:pnct|3:dobj|12:pnct">tror</W>
      <W lemma="ikke" msd="RGU" in="-1:mod" out="">ikke</W>
      <W lemma="," msd="XP" in="-2:pnct" out="">,</W>
      <W lemma="at" msd="CS" in="-3:dobj" out="2:vobj">at</W>
      <W lemma="Rusland" msd="NP--U==-" in="1:subj|2:[subj]" out="">Rusland</W>
      <W lemma="kunne" msd="VADR=----A-" in="-2:vobj" out="-1:subj|1:vobj|2:mod">kan</W>
      <W lemma="udvikle" msd="VAF-=----P-" in="-1:vobj" out="-2:[subj]">udvikles</W>
      <W lemma="uden" msd="SP" in="-2:mod" out="1:nobj">uden</W>
      <W lemma="en" msd="PI-CSU--U" in="-1:nobj" out="2:nobj">en</W>
      <W lemma="&quot;" msd="XP" in="1:pnct" out="">"</W>
      <W lemma="jernnæve" msd="NCCSU==I" in="-2:nobj" out="-1:pnct|1:pnct">jernnæve</W>
      <W lemma="&quot;" msd="XP" in="-1:pnct" out="">"</W>
      <W lemma="." msd="XP" in="-12:pnct" out="">.</W>
     </s>
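The in and out attributes of the <W> elements encode the dependency graph as relative offsets: in="9:subj" on the first token (To) says that its head is the token 9 positions to the right (tror), which correspondingly carries out="-9:subj". Below is a minimal sketch of decoding in values into absolute head indices; it assumes the attribute values have already been extracted from one <s> element, and it keeps bracketed labels such as [subj] (secondary edges) as-is.

def dtag_edges(in_attrs):
    """in_attrs[i-1] is the in="..." value of token i ("" for the root)."""
    edges = []
    for i, attr in enumerate(in_attrs, start=1):
        for rel in attr.split("|"):
            if rel:                          # the root token has an empty in=""
                offset, label = rel.split(":")
                edges.append((i, i + int(offset), label))
    return edges

# Tokens 1-3 of the sample sentence:
print(dtag_edges(["9:subj", "-1:mod", "-2:mod"]))
# -> [(1, 10, 'subj'), (2, 1, 'mod'), (3, 1, 'mod')]

The decoded edges match the CoNLL 2006 test sample below (To is attached to token 10 as subj, kendte and russiske to token 1 as mod).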
The first sentence of the CoNLL 2006 training data:
1 | Samme | _ | A | AN | degree=pos|gender=common/neuter|number=sing/plur|case=unmarked|def=def/indef|transcat=unmarked | 0 | ROOT | _ | _ |
2 | cifre | _ | N | NC | gender=neuter|number=plur|case=unmarked|def=indef | 1 | nobj | _ | _ |
3 | , | _ | X | XP | _ | 1 | pnct | _ | _ |
4 | de | _ | P | PD | gender=common/neuter|number=plur|case=unmarked|register=unmarked | 7 | subj | _ | _ |
5 | norske | _ | A | AN | degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked | 4 | mod | _ | _ |
6 | piger | _ | N | NC | gender=common|number=plur|case=unmarked|def=indef | 4 | nobj | _ | _ |
7 | tabte | _ | V | VA | mood=indic|tense=past|voice=active | 1 | rel | _ | _ |
8 | med | _ | SP | SP | _ | 7 | pobj | _ | _ |
9 | i_lørdags | _ | RG | RG | degree=unmarked | 7 | mod | _ | _ |
10 | mod | _ | SP | SP | _ | 7 | pobj | _ | _ |
11 | VMs | _ | N | NP | case=gen | 10 | nobj | _ | _ |
12 | værtsnation | _ | N | NC | gender=common|number=sing|case=unmarked|def=indef | 11 | possd | _ | _ |
13 | . | _ | X | XP | _ | 1 | pnct | _ | _ |
The first sentence of the CoNLL 2006 test data:
1 | To | _ | A | AC | case=unmarked | 10 | subj | _ | _ |
2 | kendte | _ | A | AN | degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked | 1 | mod | _ | _ |
3 | russiske | _ | A | AN | degree=pos|gender=common/neuter|number=plur|case=unmarked|def=def/indef|transcat=unmarked | 1 | mod | _ | _ |
4 | historikere | _ | N | NC | gender=common|number=plur|case=unmarked|def=indef | 1 | nobj | _ | _ |
5 | Andronik | _ | N | NP | case=unmarked | 6 | namef | _ | _ |
6 | Mirganjan | _ | N | NP | case=unmarked | 1 | appr | _ | _ |
7 | og | _ | C | CC | _ | 6 | coord | _ | _ |
8 | Igor | _ | N | NP | case=unmarked | 9 | namef | _ | _ |
9 | Klamkin | _ | N | NP | case=unmarked | 7 | conj | _ | _ |
10 | tror | _ | V | VA | mood=indic|tense=present|voice=active | 0 | ROOT | _ | _ |
11 | ikke | _ | RG | RG | degree=unmarked | 10 | mod | _ | _ |
12 | , | _ | X | XP | _ | 10 | pnct | _ | _ |
13 | at | _ | C | CS | _ | 10 | dobj | _ | _ |
14 | Rusland | _ | N | NP | case=unmarked | 15 | subj | _ | _ |
15 | kan | _ | V | VA | mood=indic|tense=present|voice=active | 13 | vobj | _ | _ |
16 | udvikles | _ | V | VA | mood=infin|voice=passive | 15 | vobj | _ | _ |
17 | uden | _ | SP | SP | _ | 15 | mod | _ | _ |
18 | en | _ | P | PI | gender=common|number=sing|case=unmarked|register=unmarked | 17 | nobj | _ | _ |
19 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
20 | jernnæve | _ | N | NC | gender=common|number=sing|case=unmarked|def=indef | 18 | nobj | _ | _ |
21 | " | _ | X | XP | _ | 20 | pnct | _ | _ |
22 | . | _ | X | XP | _ | 10 | pnct | _ | _ |
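The ten columns of the CoNLL-X format are, in order: ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD and PDEPREL; in the actual data files they are separated by tabs (the pipes above are only for display). A sketch of a reader that extracts the dependency structure of each sentence, with the file name again an assumption:

def read_sentences(path):
    """Yield sentences as lists of (id, form, head, deprel) tuples."""
    sentence = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            cols = line.rstrip("\n").split("\t")
            if len(cols) >= 8:               # a token line
                sentence.append((int(cols[0]), cols[1], int(cols[6]), cols[7]))
            elif sentence:                    # blank line ends the sentence
                yield sentence
                sentence = []
    if sentence:
        yield sentence

for sentence in read_sentences("ja_test.conll"):  # hypothetical file name
    heads = {tid: head for tid, form, head, deprel in sentence}
    print(heads)
    break  # just the first sentence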
Parsing
Nonprojectivities in DDT are not frequent: only 988 of the 100,238 tokens in the CoNLL 2006 version are attached nonprojectively (0.99%).
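Here a token counts as attached nonprojectively if some token between it and its head is not dominated by that head. A sketch of the check over a head map like the one produced by the reader above; the hard-coded map is the sample test sentence, which is fully projective:

def nonprojective_tokens(heads):
    """heads maps token id to head id (0 = artificial root)."""
    def dominated_by(node, ancestor):
        while node != 0:
            if node == ancestor:
                return True
            node = heads[node]
        return ancestor == 0                 # everything is dominated by the root

    return [i for i, h in heads.items()
            if any(not dominated_by(j, h)
                   for j in range(min(i, h) + 1, max(i, h)))]

heads = {1: 10, 2: 1, 3: 1, 4: 1, 5: 6, 6: 1, 7: 6, 8: 9, 9: 7,
         10: 0, 11: 10, 12: 10, 13: 10, 14: 15, 15: 13, 16: 15,
         17: 15, 18: 17, 19: 20, 20: 18, 21: 20, 22: 10}
print(nonprojective_tokens(heads))  # -> [] (the sentence is projective)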
The results of the CoNLL 2006 shared task are available online and were published in Buchholz and Marsi (2006). The evaluation procedure was non-standard in that it excluded punctuation tokens. These are the best results for Danish:
| Parser (Authors) | LAS (%) | UAS (%) |
| --- | --- | --- |
| MST (McDonald et al.) | 84.79 | 90.58 |
| Malt (Nivre et al.) | 84.77 | 89.80 |
| Riedel et al. | 83.63 | 89.66 |