cost-training-school-2017:synopsis

Statistics in linguistics - basics and case examples

(Silvie Cinková)

The tutorial aims to give students a basic understanding of data analysis by applying a few common statistical methods to a particular linguistic data set and to a set of working hypotheses about the association between genre and discourse structure.

The data set contains annotations of discourse connectives extracted from the Prague Dependency Treebank 3.0. Each occurrence of a discourse connective is annotated with two different label sets (“discourse type” and “discourse class”). In addition, each occurrence comes with its sentence ID and with the genre and size of the document it appears in. This data set will be used to exemplify how to:

1. describe and summarize the data set, as well as prepare it for further statistical analysis (key words: “tidy data” and “data wrangling”)
2. explore possible associations of the discourse types/classes with different genres (i.e., do the distributions of discourse types/classes differ between genres?).
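
As a rough illustration of these two steps, the following base R sketch summarizes a tiny table of connective occurrences and then tests whether discourse class and genre look independent. It is not the tutorial's own code: the data frame, its column names (genre, discourse_class) and all values are invented for this example, and the choice of a chi-squared test with a simulated p-value is an assumption about one reasonable method, not a statement of what the tutorial uses.

```r
# Toy data, INVENTED for illustration only: one row per connective occurrence
# (a "tidy" layout). Column names and values are hypothetical and are not
# taken from the tutorial data set.
connectives <- data.frame(
  genre = c(rep("news", 6), rep("essay", 6)),
  discourse_class = c(
    "contrast", "contrast", "contingency", "expansion", "expansion", "expansion",
    "contrast", "contingency", "contingency", "contingency", "expansion", "temporal"
  ),
  stringsAsFactors = TRUE
)

# Step 1: describe and summarize the data.
str(connectives)

# Contingency table: genres in rows, discourse classes in columns.
counts <- table(connectives$genre, connectives$discourse_class)
print(counts)

# Row-wise proportions: the distribution of classes within each genre.
proportions <- prop.table(counts, margin = 1)
print(round(proportions, 2))

# Step 2: explore the association between genre and discourse class.
# The expected counts are very small here, so the p-value is obtained by
# Monte Carlo simulation instead of the chi-squared approximation.
set.seed(1)  # make the simulated p-value reproducible
result <- chisq.test(counts, simulate.p.value = TRUE, B = 2000)
print(result)
```

With real data, a small p-value would suggest that the class distributions differ between genres; with a toy table of twelve rows the result carries no meaning and only shows the mechanics.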

In the initial dry-run lesson, the students will become familiar with the hypotheses, the data set, and a visualization grammar for creating useful diagrams. In the hands-on sessions, they will be guided through pre-written R code whose partial results they will have seen in the preceding lecture, so that they can take home a piece of reasonably well-understood working code.

This tutorial is meant to serve as a starting point for individual study of quantitative methods in linguistics aided by R. No attempt is made to explain the mathematical background of the statistical concepts and methods used.

References:

Poláková, Lucie; Jínová, Pavlína; Mírovský, Jiří: Genres in the Prague Discourse Treebank. In: Proceedings of the 9th International Conference on Language Resources and Evaluation (LREC 2014), European Language Resources Association, Reykjavík, Iceland, 2014, pp. 1320-1326. ISBN 978-2-9517408-8-4.