Discourse Annotation: The case of TED Multilingual Discourse Bank (TED-MDB)
(Deniz Zeyrek)
Corpora annotated at the discourse level can support theoretical advances and ultimately serve as input to various language technology tasks.
The Penn Discourse TreeBank (PDTB) is a richly annotated resource for discourse relations in English (Prasad et al., 2014). Its annotation style has already
been applied reliably to discourse in various languages such as Hindi (Oza et al., 2009), Chinese (Zhou & Xue, 2015),
and Turkish (Zeyrek et al., 2013). This talk will introduce a new multilingual discourse annotation effort, an initiative undertaken
by a group of scholars within Textlink to annotate discourse in the PDTB style. Unlike these monolingual corpora, the TED Multilingual
Discourse Bank (TED-MDB) involves the parallel annotation of a subset of TED talks in six languages: English, Turkish,
European Portuguese, Polish, German, and Russian. It annotates both explicit and implicit discourse relations at the inter-sentential level,
while focusing on explicit relations at the intra-sentential level. I will describe our ongoing work on the corpus and discuss the benefits
and challenges involved in creating it.
References
Oza, U. et al. (2009). The Hindi Discourse Relation Bank. Proceedings of the Third Linguistic Annotation Workshop (pp. 158-161). Association for Computational Linguistics.
Prasad, R. et al. (2014). Reflections on the Penn Discourse TreeBank, comparable corpora, and complementary annotation. Computational Linguistics.
Zhou, Y. & Xue, N. (2015). The Chinese Discourse TreeBank: A Chinese corpus annotated with discourse relations. Language Resources and Evaluation, 49(2), 397-431.
Zeyrek, D. et al. (2013). Turkish Discourse Bank: Porting a discourse annotation style to a morphologically rich language. Dialogue and Discourse, 4(3), 174-184.