user:zeman:treebanks:ja

Japanese (ja)
- Versions
- Obtaining and License
- References
- Domain
- Size
- Inside
- Sample
- Parsing

Japanese (ja)

Tübingen Treebank of Spoken Japanese (TüBa-J/S, Verbmobil project)

Versions

Original TüBa-J/S in the Negra export format
CoNLL 2006

The original TüBa-J/S is HPSG-oriented, there is the lexical level, the phrasal level, the clausal level, and dependency edges between nodes. The CoNLL version contains only the dependency relations.

Obtaining and License

To obtain the treebank, download the license agreement, print it, fill it out and sign it, scan and send it back to Kathrin Beck (kbeck (at) sfs (dot) uni-tuebingen (dot) de). She will send you the password for the download page. The license in short:

academic research usage
no redistribution
citation in publications not required (but it is common decency)

TüBa-J/S was created in the Verbmobil project by members of the Seminar für Sprachwissenschaft, Eberhard Karls Universität Tübingen, Wilhelmstrasse 19, D-72074 Tübingen, Germany.

References

Website
- http://www.sfs.uni-tuebingen.de/en/tuebajs.shtml
Data
- no separate citation
Principal publications
- Yasuhiro Kawata, Julia Bartels: Stylebook for the Japanese Treebank in Verbmobil, Report 240, September 29, 2000.
- Sabine Buchholz, Erwin Marsi: CoNLL-X shared task on Multilingual Dependency Parsing. In: Proceedings of the 10th Conference on Computational Natural Language Learning (CoNLL-X), pp. 149-164, New York, USA, 2006.
Documentation
- Yasuhiro Kawata, Julia Bartels: Stylebook for the Japanese Treebank in Verbmobil, Report 240, has been distributed together with the CoNLL 2006 version of the treebank (file doc/report-240-00.ps).

Domain

Spoken dialogues, negotiations about time and place of business meetings. That is why many sentences are relatively short (a frequent single-word sentence is hai = “yes”).

Size

The CoNLL 2006 version contains 157,172 tokens in 17753 sentences, yielding 8.85 tokens per sentence on average (CoNLL 2006 data split: 151,461 tokens / 17044 sentences training, 5711 tokens / 709 sentences test).

Inside

The text has been romanized and the original characters (kanji + kana) are not available. There should be a 1-1 mapping between the romanized text (rōmaji) and the Japanese script of hiragana. There is no indication though where katakana or kanji are preferred over hiragana.

The morphological analysis does not include lemmas. The part-of-speech tags have been assigned (probably) manually. Few morphosyntactic features are used.

Sample

The first three sentences of the CoNLL 2006 training data:

1	kasahara	_	NAME	NAMEper	_	2	HD	2	HD
2	arisa	_	NAME	NAMEper	_	3	COMP	3	COMP
3	desu	_	PV	PVfin	u	0	ROOT	0	ROOT
4	.	_	.	.	_	3	PUNCT	3	PUNCT

1	kadowaki	_	NAME	NAMEper	_	2	HD	2	HD
2	masakazu	_	NAME	NAMEper	_	3	COMP	3	COMP
3	desu	_	PV	PVfin	u	0	ROOT	0	ROOT
4	.	_	.	.	_	3	PUNCT	3	PUNCT

1	kadowaki	_	NAME	NAMEper	_	2	COMP	2	COMP
2	saN	_	P	PNsf	_	0	ROOT	0	ROOT
3	omatase	_	N	VN	_	4	COMP	4	COMP
4	shimashita	_	VS	VSfin	ta	0	ROOT	0	ROOT
5	.	_	.	.	_	4	PUNCT	4	PUNCT

The first three sentences of the CoNLL 2006 test data:

1	tashiro	_	NAME	NAMEper	_	2	HD	2	HD
2	yasuko	_	NAME	NAMEper	_	3	COMP	3	COMP
3	desu	_	PV	PVfin	u	0	ROOT	0	ROOT
4	.	_	.	.	_	3	PUNCT	3	PUNCT

1	hayakawa	_	NAME	NAMEper	_	2	HD	2	HD
2	ryou	_	NAME	NAMEper	_	3	COMP	3	COMP
3	desu	_	PV	PVfin	u	0	ROOT	0	ROOT
4	.	_	.	.	_	3	PUNCT	3	PUNCT

1	hayakawa	_	NAME	NAMEper	_	2	COMP	2	COMP
2	saN	_	P	PNsf	_	6	ADJ	6	ADJ
3	ima	_	N	Ntmp	_	6	ADJ	6	ADJ
4	chotto	_	ADV	ADV	_	6	ADJ	6	ADJ
5	ojikaN	_	N	NN	_	6	SBJ	6	SBJ
6	yoroshii	_	ADJ	ADJifin	_	7	COMP	7	COMP
7	desu	_	PV	PVfin	u	0	ROOT	0	ROOT
8	ka	_	PS	PSE	_	7	MRK	7	MRK
9	chotto	_	ADV	ADV	_	16	ADJ	16	ADJ
10	doitsu	_	NAME	NAMEloc	_	11	COMP	11	COMP
11	no	_	P	Pgen	_	12	ADJ	12	ADJ
12	shucchou	_	N	VN	_	13	COMP	13	COMP
13	no	_	P	Pgen	_	14	COMP	14	COMP
14	koto	_	N	NF	_	15	COMP	15	COMP
15	de	_	P	P	_	16	ADJ	16	ADJ
16	gosoudaN	_	N	VN	_	17	COMP	17	COMP
17	shitai	_	VADJ	VADJi	_	18	COMP	18	COMP
18	no	_	N	NF	_	19	COMP	19	COMP
19	desu	_	PV	PVfin	u	20	COMP	20	COMP
20	ga	_	PS	PSSb	_	0	ROOT	0	ROOT
21	.	_	.	.	_	20	PUNCT	20	PUNCT

Parsing

Nonprojectivities in TüBa-J/S are not frequent. Only 1736 of the 157,172 tokens in the CoNLL 2006 version are attached nonprojectively (1.1%).

The results of the CoNLL 2006 shared task are available online. They have been published in (Buchholz and Marsi, 2006). The evaluation procedure was non-standard because it excluded punctuation tokens. These are the best results for Japanese:

Parser (Authors)	LAS	UAS
Basis (John O'Neil)	90.57	93.16
Nara (Yuchang Cheng)	89.91	93.12
Malt (Nivre et al.)	91.65	93.10

Table of Contents