Discourse Annotation: The case of TED Multilingual Discourse Bank (TED-MDB)

(Deniz Zeyrek)

Corpora annotated at the discourse level can support theoretical advances and ultimately serve as input to various language technology tasks.
The Penn Discourse TreeBank (PDTB) is a richly annotated resource for discourse relations in English (Prasad et al., 2014). Its annotation framework has already
been applied reliably to discourse in various languages, such as Hindi (Oza et al., 2009), Chinese (Zhou & Xue, 2015),
and Turkish (Zeyrek et al., 2013). This talk will introduce a new multilingual discourse annotation effort, an initiative undertaken
by a group of scholars within Textlink, annotating discourse in the PDTB style. Unlike these monolingual corpora, the TED Multilingual
Discourse Bank (TED-MDB) involves the parallel annotation of a subset of TED talks in six languages: English, Turkish,
European Portuguese, Polish, German and Russian. It annotates both explicit and implicit discourse relations at the inter-sentential level,
but focuses on explicit relations at the intra-sentential level. I will describe our ongoing work on the corpus and discuss the benefits
and challenges involved in creating it.


