Institute of Formal and Applied Linguistics Wiki



Nianwen (Bert) Xue, Assistant Professor, Brandeis University, USA
Explicit and implicit discourse relations from a cross-lingual perspective – from experience in working on Chinese discourse annotation

Abstract

In the field of computational linguistics or natural language
processing, progress in discourse analysis has been relatively slow,
as compared with syntactic parsing or semantic analysis (e.g., word
sense disambiguation, semantic role labeling). In this age when
statistical, data-driven approaches dominate the field, having a
common linguistic resource that is widely accepted by the community is
key to advancing the state of the art in this area. To create
consistently annotated data for discourse analysis is particularly
challenging because one has to deal with larger linguistic structures
and there are few linguistic rules to follow. The key to successful
discourse annotation is to identify a well-grounded linguistic theory
that can be easily operationalized. In the Penn Discourse Treebank
(Prasad et al. 2008; Webber and Joshi 1998), the field may have found
such a theory. In the PDTB conception, discourse relations revolve
around discourse connectives, where each discourse connective is a
predicate that takes two arguments. In this way, discourse annotations
are anchored by discourse connectives and are thus lexicalized. In our
view, lexicalization has been crucial to the success of the PDTB as an
annotation project, a large-scale effort characterized by high
inter-annotator agreement, a standard metric for annotation
consistency. Lexicalization grounds highly abstract discourse relations
in a specific lexical item. In doing so, it localizes the
ambiguity in discourse relations to discourse connectives: a
lexical item can have either a discourse connective use or a
non-discourse connective use (e.g., "when"), and one discourse
connective can be ambiguous between different discourse relations
(e.g., "since"). As a result, it reduces the cognitive load of the
annotation task because each annotator can focus on only one discourse
connective at a time instead of scores of discourse relations. This in
turn enlarges the annotator pool, because more annotators can
perform the task without extensive training. The long
list of annotators who worked on the PDTB annotation attests to this
observation. A larger annotator pool and a shorter learning curve
translate into scalability for such an approach.
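
The predicate-argument view described above can be sketched as a small data structure. This is only an illustration: the class name, field names, sense labels, and the toy sense inventory are assumptions made for the example and are not the PDTB's actual file format or sense hierarchy.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class DiscourseRelation:
    """One lexicalized relation: a connective acting as a two-place predicate."""
    connective: str  # the anchoring lexical item, e.g. "since"
    arg1: str        # text span of the first argument
    arg2: str        # text span of the second argument
    sense: str       # relation label chosen by the annotator


# Hypothetical toy inventory: the senses an annotator would have to
# choose between for each connective (illustrative labels only).
CANDIDATE_SENSES = {
    "since": ("Temporal", "Causal"),
    "when": ("Temporal", "Conditional"),
}


def senses_to_consider(connective: str) -> Tuple[str, ...]:
    """Return the short list of senses for one connective.

    The annotator decides among these few options rather than among
    every relation in the full inventory.
    """
    return CANDIDATE_SENSES.get(connective.lower(), ())


relation = DiscourseRelation(
    connective="since",
    arg1="She has been happier",
    arg2="she moved to the coast",
    sense="Temporal",
)
print(relation.connective, senses_to_consider(relation.connective))
```

Keying the choice of sense on the connective is what keeps the decision local: the annotator disambiguates one lexical item at a time.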

If lexicalization is so important to discourse annotation, what about
discourse relations that are not anchored by an explicit discourse
connective? The PDTB addresses this by assuming there is an
implicit discourse connective that connects its two arguments, which
are typically (parts of) adjacent sentences. This is operationalized
by identifying punctuation marks (e.g., periods) that serve as
boundaries of two adjacent sentences as anchors of implicit discourse
relations. The specific discourse relation is determined by testing
which discourse connective can be plausibly inserted between these two
adjacent sentences. In doing so, the PDTB assumes that (1) the range
of possible discourse relations anchored by implicit discourse
connectives is basically the same as that anchored by explicit
discourse connectives, and (2) discourse relations anchored by implicit
discourse connectives are mostly local. The first assumption is
largely borne out in the PDTB. Either a discourse connective can be
inserted between two adjacent sentences, or they are related by the
fact that they talk about the same entities, or there is no relation
between them. The last possibility has a direct bearing on the second
assumption: if there is no relation between two adjacent sentences, does
that mean that these sentences have no discourse relations at all with
the rest of the text, or that they are related to other discourse
segments that are non-local? It is reasonable to assume that all
discourse segments are related in a coherent piece of text, and a large
number of such "no-relations" would call for a significant expansion
of the PDTB approach.
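
The insertion test for implicit relations described above can be sketched roughly as follows. This is a minimal sketch under stated assumptions: the sentence splitter is deliberately naive, the candidate connective list is hypothetical, and the judgment of which insertion is plausible is left to a human annotator, as in the procedure described.

```python
import re
from typing import List, Tuple

# Hypothetical short list of connectives an annotator might try to insert.
CANDIDATE_CONNECTIVES = ("because", "but", "then")


def split_sentences(text: str) -> List[str]:
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def adjacent_pairs(text: str) -> List[Tuple[str, str]]:
    """Each sentence boundary anchors one candidate implicit relation."""
    sentences = split_sentences(text)
    return list(zip(sentences, sentences[1:]))


def insertion_prompts(arg1: str, arg2: str,
                      connectives=CANDIDATE_CONNECTIVES) -> List[str]:
    """Build one prompt per connective for the annotator to judge."""
    return [f"{arg1} [{c}] {arg2}" for c in connectives]


text = "It rained all day. The match was cancelled. Fans went home."
for arg1, arg2 in adjacent_pairs(text):
    for prompt in insertion_prompts(arg1, arg2):
        print(prompt)
```

Because only adjacent sentences are paired, the sketch builds in the locality assumption; non-local relations would need a different way of generating candidates.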

While it might not be too much to expect that the same high-level
discourse relations hold across languages, it is almost certainly too
much to expect that discourse relations are lexicalized in the same
way across languages. The question is whether a lexicalized approach
to discourse analysis can still be maintained in languages where
discourse relations are lexicalized in ways that are significantly
different from English. Our experience in a pilot PDTB-style Chinese
discourse annotation project shows that the lexicalized approach can
be effectively adopted, although significant adaptations have to be
made. Chinese has the same types of discourse connectives (subordinate
conjunctions, coordinate conjunctions, and discourse adverbials) as
English, but they occur much less frequently because they can often be
dropped. The ratio of implicit to explicit connectives is about
80/20 (Zhou and Xue, 2012) rather than the roughly 50/50 split
reported for PDTB (Prasad et al. 2008). However, by identifying
punctuation marks as boundaries of discourse segments and testing whether
lexicalized discourse relations hold between adjacent comma-separated
discourse segments, we are able to show that Chinese discourse
annotation can be performed with very good consistency. More evidence
has to be gathered from the experience of other languages to test the
feasibility of lexicalized approaches to discourse annotation in a
multi-lingual setting, and such evidence will come soon now that such
an approach has been adopted in a number of discourse annotation
projects for a variety of different languages.
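
The comma-based segmentation used for Chinese can be illustrated with a short sketch. The choice of punctuation marks, the restriction to pairs within a single sentence, and the function name are assumptions made for this example; they are not taken from the annotation guidelines of the project.

```python
import re
from typing import List, Tuple


def comma_segment_pairs(text: str) -> List[Tuple[str, str]]:
    """Pair adjacent comma-separated segments inside each sentence.

    Assumption: sentences end at 。！？ and segments are split at ，；
    (full-width Chinese punctuation).
    """
    pairs: List[Tuple[str, str]] = []
    for sentence in re.split(r"[。！？]", text):
        segments = [s for s in re.split(r"[，；]", sentence) if s.strip()]
        pairs.extend(zip(segments, segments[1:]))
    return pairs


# "He was ill, (he) did not come to work." No connective is written,
# so the pair is a candidate for an implicit relation.
example = "他病了，没有来上班。"
print(comma_segment_pairs(example))
```

Each returned pair is a candidate site where an annotator would test whether a lexicalized discourse relation holds between the two segments.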

Bonnie Webber and Aravind Joshi. 1998. Anchoring a Lexicalized
Tree-Adjoining Grammar for Discourse. In Proceedings of ACL/COLING
Workshop on Discourse Relations and Discourse Markers, Montreal,
Canada. August 1998.

Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio
Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse
Treebank 2.0. In Proceedings of the 6th International Conference on
Language Resources and Evaluation (LREC 2008). Marrakech, Morocco.
June 2008.

Yuping Zhou and Nianwen Xue. 2012. PDTB-style discourse annotation of
Chinese text. In Proceedings of ACL-2012. Jeju Island, Korea.
