Next revision | Previous revision | ||
nianwen_abstract [2012/09/11 13:09] ufal created |
— (current) | ||
---|---|---|---|
Line 1: | Line 1: | ||
- | In the field of computational linguistics or natural language | ||
- | processing, progress in discourse analysis has been relatively slow, | ||
- | as compared with syntactic parsing or semantic analysis (e.g., word | ||
- | sense disambiguation, | ||
- | statistical, | ||
- | common linguistic resource that is widely accepted by the community is | ||
- | key to advancing the state of the art in this area. To create | ||
- | consistently annotated data for discourse analysis is particularly | ||
- | challenging because one has to deal with larger linguistic structures | ||
- | and there are few linguistic rules to follow. The key to successful | ||
- | discourse annotation is to identify a well-grounded linguistic theory | ||
- | that can be easily operationalized. In the Penn Discourse Treebank | ||
- | (Prasad et al., 2008; Webber and Joshi, 1998) the field may have found | ||
- | such a theory. In the PDTB conception, discourse relations revolve | ||
- | around discourse connectives, | ||
- | predicate that takes two arguments. In this way, discourse annotations | ||
- | are anchored by discourse connectives and are thus lexicalized. In our | ||
- | view, lexicalization has been crucial to the success of the PDTB as an | ||
- | annotation project, a large-scale effort characterized by high | ||
- | inter-annotator agreement, a standard metric for annotation | ||
- | consistency. Lexicalization grounds highly abstract discourse relations | ||
- | in a specific lexical item. In doing so, it localizes the | ||
- | ambiguity in discourse relations to discourse connectives, | ||
- | lexical item can have either a discourse connective use or a | ||
- | non-discourse connective use (e.g., ``when" | ||
- | connective can be ambiguous between different discourse relations | ||
- | (e.g., ``since" | ||
- | annotation task because each annotator can focus on only one discourse | ||
- | connective at a time instead of scores of discourse relations. This in | ||
- | turn enlarges the annotator pool and more annotators will be able to | ||
- | perform the task without extensive training. The long | ||
- | list of annotators who worked on the PDTB annotation attests to this | ||
- | observation. A larger annotator pool and a shorter learning curve | ||
- | translate into the scalability of such an approach. | ||
- | |||
- | If lexicalization is so important to discourse annotation, what about | ||
- | discourse relations that are not anchored by an explicit discourse | ||
- | connective? The PDTB addresses this by assuming there is an | ||
- | implicit discourse connective that connects its two arguments, which | ||
- | are typically (parts of) adjacent sentences. This is operationalized | ||
- | by identifying punctuation marks (e.g., periods) that serve as | ||
- | boundaries of two adjacent sentences as anchors of implicit discourse | ||
- | relations. | ||
- | which discourse connective can be plausibly inserted between these two | ||
- | adjacent sentences. In doing so, the PDTB assumes that (1) the range | ||
- | of possible discourse relations anchored by implicit discourse | ||
- | connectives is basically the same as that anchored by explicit | ||
- | discourse connectives, and (2) discourse relations anchored by implicit | ||
- | discourse connectives are mostly local. The first assumption is | ||
- | largely borne out in the PDTB. Either a discourse connective can be | ||
- | inserted between two adjacent sentences, | ||
- | fact that they talk about the same entities, or there is no relation | ||
- | between them. The last possibility has a direct bearing on the second | ||
- | question: if there is no relation between two adjacent sentences, does | ||
- | that mean that these sentences have no discourse relations at all with | ||
- | the rest of the text, or that they are related to other discourse | ||
- | segments that are non-local? | ||
- | discourse segments are related in a coherent piece of text, and a large | ||
- | number of such ``no-relations" | ||
- | to the PDTB approach. | ||
- | |||
- | While it might not be too much to expect that the same high-level | ||
- | discourse relations hold across languages, it is almost certainly too | ||
- | much to expect that discourse relations are lexicalized in the same | ||
- | way across languages. The question is whether a lexicalized approach | ||
- | to discourse analysis can still be maintained in languages where | ||
- | discourse relations are lexicalized in ways that are significantly | ||
- | different from English. Our experience in a pilot PDTB-style Chinese | ||
- | discourse annotation project shows that the lexicalized approach can | ||
- | be effectively adopted, although significant adaptations have to be | ||
- | made. Chinese has the same types of discourse connectives (subordinate | ||
- | conjunctions, | ||
- | English, but they occur much less frequently because they can often be | ||
- | | ||
- | 80/20 (Zhou and Xue, 2012) rather than the roughly 50/50 split | ||
- | reported for PDTB (Prasad et al., 2008). However, by identifying | ||
- | punctuation marks as boundaries of discourse segments and testing whether | ||
- | lexicalized discourse relations hold between adjacent comma-separated | ||
- | discourse segments, we are able to show that Chinese discourse | ||
- | annotation can be performed with very good consistency. More evidence | ||
- | has to be gathered from the experience of other languages to test the | ||
- | feasibility of lexicalized approaches to discourse annotation in a | ||
- | multi-lingual setting, and such evidence will come soon now that such | ||
- | an approach has been adopted in a number of discourse annotation | ||
- | projects for a variety of different languages. | ||
- | |||
- | |||
- | |||
- | Bonnie Webber and Aravind Joshi. 1998. Anchoring a Lexicalized | ||
- | Tree-Adjoining Grammar for Discourse. In Proceedings of ACL/COLING | ||
- | Workshop on Discourse Relations and Discourse Markers, Montreal, | ||
- | Canada. August 1998. | ||
- | |||
- | Rashmi Prasad, Nikhil Dinesh, Alan Lee, Eleni Miltsakaki, Livio | ||
- | Robaldo, Aravind Joshi, and Bonnie Webber. 2008. The Penn Discourse | ||
- | Treebank 2.0. | ||
- | In Proceedings of the 6th International Conference on Language | ||
- | Resources and Evaluation (LREC 2008). Marrakech, Morocco. June 2008. | ||
- | |||
- | Yuping Zhou and Nianwen Xue. 2012. PDTB-style discourse annotation of | ||
- | Chinese text. In Proceedings of ACL-2012. Jeju Island, Korea. | ||
- | |||