GAUK – Tomasz Limisiewicz

If you have any questions, please contact me via e-mail: limisiewicz@ufal.mff.cuni.cz

Basic information about project No. 338521

Czech title of project: Zkoumání mnohojazyčných reprezentací jazykových jednotek v neuronových sítích
English title of project: Exploring Multilingual Representations of Language Units in Neural Networks
Current principal researcher: Ing. Tomasz Limisiewicz (t.limisiewicz@gmail.com)
First applicant: Tomasz Limisiewicz
Study programme: Matematicko-fyzikální fakulta
Programme: Computational linguistics
Field: Computational linguistics
Type of study programme: PhD programme
Academic department / institute: Ústav formální a aplikované lingvistiky
Year of project inception: 2021
Duration of project: 3 years
Interdisciplinarity: no
Workplace: Institute of Formal and Applied Linguistics
Area board section: Social sciences - Computer science (INF)

Research team

Ing. Tomasz Limisiewicz – Scholarship: 80/80
RNDr. David Mareček, Ph.D. – Work Assignment Agreement: 20/10

Description of research team - year 2021:

The implementation team consists of one Ph.D. candidate and his supervisor.

Tomasz Limisiewicz is the prospective principal investigator of the project. He works on representations of language units in a shared multilingual space. He will focus on developing transformations of pre-trained embeddings and on analyzing multilingual embeddings. Tomasz has experience with industry cooperation and with organizing the work of a small team, gained in his previous positions in startups and an R&D department.

David Mareček is an assistant professor at the Institute of Formal and Applied Linguistics, MFF UK. His research focuses mainly on the interpretation of deep neural networks, specifically on how words and sentences are represented in neural networks solving different NLP problems and which language features are important to them. He also works on unsupervised machine learning, machine translation, and dependency parsing across languages. He is the principal investigator of the GAČR project “Linguistic Structure Representation in Neural Networks” and also participates in other grant projects (TAČR, OP VVV). He will provide methodological guidance for this project and assist with the preparation of publications and presentations of results.

Financial requirements

Item – Year 2021 (thousands of CZK)
Other non-investment costs 6/6
Travel costs 30/30
Indirect costs 20/18
Personnel costs (salaries) and stipends 100/90
Total 156/144

Structure of financial requirements - year 2021

The amounts for scholarships and salaries are proposed in accordance with the requirements of the Charles University Grant Agency.

Travel costs 30 000 CZK
One of the following conferences: TSD 2021, NAACL 2021, EACL 2021, ACL 2021, EMNLP 2021.
Participation in one on-site conference (total 30 000 CZK): 7 000 CZK (registration fee), 8 000 CZK (travel expenses), 7 000 CZK (accommodation), 8 000 CZK (per diem). If the conference is held online, the costs will be reduced.

Other non-investment costs: 6 000 CZK
Printing posters: 1 000 CZK
Books: 5 000 CZK

Financial outlook for following years

Year 2022: 160 (thousands of CZK)
Year 2023: 160 (thousands of CZK)

Additional information

Summary:

The latent representations of neural networks trained on large corpora to perform language modeling or machine translation (also known as language embeddings) have been shown to encode various linguistic features. The vectors computed by models such as ELMo [1] or BERT [2] are now routinely used as input to neural models solving specific down-stream language tasks and often achieve state-of-the-art results.
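For illustration, the following is a minimal sketch of how such pre-trained contextual embeddings could be extracted. The use of the HuggingFace transformers library and the multilingual BERT checkpoint is our illustrative choice here, not a commitment of the proposal.

```python
# A minimal sketch (illustrative, not part of the proposal) of obtaining
# contextual embeddings from a pre-trained multilingual model.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
model = AutoModel.from_pretrained("bert-base-multilingual-cased")
model.eval()

sentence = "Representations of language units live in a shared space."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# One vector per sub-word token; these vectors can be fed to a
# down-stream model (tagger, parser, ...) instead of raw word forms.
embeddings = outputs.last_hidden_state.squeeze(0)  # shape: (tokens, 768)
print(embeddings.shape)
```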

Nevertheless, the aforementioned methods of computing reusable representations have drawbacks. The models act as black boxes: only the input and the output have an obvious linguistic interpretation, while the processing performed inside remains opaque. A related problem is flawed generalization: once a model is trained on a certain data distribution, it is hard to reuse the learned information when the distribution (or the task itself) changes.

The project tackles the issues of limited explainability and poor generalization by developing new ways of a) providing insight into the pre-trained representations; b) transforming the representations so that the information encoded within is more accessible to both human interpreters and other neural models; c) improving embeddings by providing an additional linguistic signal during training. Our analysis will use representations of models trained on many languages, which lie in a shared cross-lingual space.

Current state of knowledge:

Language can be represented in a metric space, which allows us to express linguistic features (e.g., synonymy, syntactic relations) mathematically. Interestingly, embedding multiple languages in a common space does not hinder performance in the individual languages. Furthermore, it opens the door to applications in cross-lingual transfer, including machine translation. The field has developed intensively in recent years, starting with the introduction of various approaches that align monolingual vector distributions by a spatial transformation based on a seed dictionary [3, 4]. The supervision requirement was later dropped in [5] and [6], whose authors proposed self-learning and adversarial training to obtain both the word translations and the embeddings. Recent sequence-to-sequence neural systems are trained to jointly encode and decode many languages; their latent representations are cross-lingual and additionally encode the context of a language unit [7, 8].
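As an illustration of the seed-dictionary alignment idea in [3, 4], the following sketch learns an orthogonal map between two embedding spaces with the Procrustes solution; the toy data and variable names are ours, not taken from the cited work.

```python
# Sketch of supervised cross-lingual alignment: learn an orthogonal map W
# that rotates source-language vectors onto their translations from a seed
# dictionary (Procrustes solution).
import numpy as np

def fit_orthogonal_map(src_vecs: np.ndarray, tgt_vecs: np.ndarray) -> np.ndarray:
    """src_vecs[i] and tgt_vecs[i] embed a seed-dictionary word pair."""
    # W = argmin ||src @ W - tgt||_F  subject to  W^T W = I
    u, _, vt = np.linalg.svd(src_vecs.T @ tgt_vecs)
    return u @ vt

# Toy example: a 5-word seed dictionary in a 4-dimensional embedding space.
rng = np.random.default_rng(0)
src = rng.normal(size=(5, 4))
true_rotation = np.linalg.qr(rng.normal(size=(4, 4)))[0]
tgt = src @ true_rotation

W = fit_orthogonal_map(src, tgt)
print(np.allclose(src @ W, tgt))  # True: the mapping recovers the rotation
```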

The remarkable results obtained by applying pre-trained representations to various NLP tasks provoked questions about the embeddings' interpretation. Researchers have focused on analyzing how well the embeddings capture particular linguistic phenomena, e.g., syntax or lexical information [11, 12]. A current open question in the field is whether we can identify particular dimensions of the embeddings responsible for specific language features. We plan to extend such analysis to vectors in a space shared by multiple languages. Some promising results showed that cross-lingual embeddings capture typological differences between languages [9, 10]. We want to investigate this phenomenon further and determine whether high-resource multilingual data can lead to better interpretations.
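A hypothetical sketch of such a per-dimension analysis (our own illustrative setup, not a method prescribed by the cited work): score each embedding dimension by how well it separates two values of a linguistic feature.

```python
# Illustrative per-dimension analysis: which dimensions best separate two
# values of a linguistic feature (e.g., singular vs. plural nouns)?
import numpy as np

def per_dimension_scores(embeddings: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Per dimension: absolute difference of class means divided by the
    pooled standard deviation (a simple separability score)."""
    pos, neg = embeddings[labels == 1], embeddings[labels == 0]
    diff = np.abs(pos.mean(axis=0) - neg.mean(axis=0))
    pooled_std = np.sqrt((pos.var(axis=0) + neg.var(axis=0)) / 2) + 1e-8
    return diff / pooled_std

# With real data, `embeddings` would be multilingual BERT vectors of nouns and
# `labels` their number feature; random placeholders stand in for them here.
emb = np.random.default_rng(1).normal(size=(200, 768))
lab = np.random.default_rng(2).integers(0, 2, size=200)
scores = per_dimension_scores(emb, lab)
print("Most feature-sensitive dimensions:", np.argsort(scores)[-5:])
```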

In parallel, researchers have focused on providing an additional linguistic signal during the embeddings' training to improve their performance, i.e., embedding enrichment [11, 13]. As far as we know, there have been only a few attempts to apply this technique to multilingual embeddings [14].

[1] Deep Contextualized Word Representations, Peters et al., 2018
[2] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, Devlin et al., 2019
[3] Exploiting Similarities among Languages for Machine Translation, Mikolov et al., 2013
[4] Offline Bilingual Word Vectors, Orthogonal Transformations and the Inverted Softmax, Smith et al., 2017
[5] A Robust Self-Learning Method for Fully Unsupervised Cross-Lingual Mappings of Word Embeddings, Artetxe et al., 2018
[6] Word Translation Without Parallel Data, Conneau et al., 2018
[7] Massively Multilingual Sentence Embeddings for Zero-Shot Cross-Lingual Transfer and Beyond, Artetxe and Schwenk, 2019
[8] Google's Multilingual Neural Machine Translation System, Johnson et al., 2017
[9] How Multilingual is Multilingual BERT?, Pires et al., 2019
[10] How Language-Neutral is Multilingual BERT?, Libovický et al., 2019
[11] A Structural Probe for Finding Syntax in Word Representations, Hewitt and Manning, 2019
[12] Linguistic Knowledge and Transferability of Contextual Representations, Liu et al., 2019
[13] Incorporating Syntactic and Semantic Information in Word Embeddings using Graph Convolutional Networks, Vashishth et al., 2019
[14] Cross-Lingual BERT Transformation for Zero-Shot Dependency Parsing, Wang et al., 2019
[15] Cross-lingual Language Model Pretraining, Lample and Conneau, 2019

Explanation of relations with other projects addressed by the Supervisor or Principal Researcher:

The prospective principal investigator has applied for a START project entitled “Learning Structured and Explainable Representations of Language Units in Neural Networks” with a broader scope than this proposal. In case the START project application is successful, this proposal will be withdrawn.
The supervisor is the principal investigator of the GAČR project “Linguistic Structure Representation in Neural Networks”, which will finish at the end of 2020. The proposed GAUK project builds on it and aims to investigate research questions that have not yet been answered. The supervisor has also applied for another GAČR grant, “Visualization of Sentence Structure learned in Transformer Neural Networks”, whose research goal is rather orthogonal to this proposal, as it aims to visualize the structures hidden in the representations.

Facilities at the project's disposal:

The project will be carried out at the Institute of Formal and Applied Linguistics (ÚFAL), Charles University, which is a workplace of both implementation team members.

The department is currently well-equipped in the area of information and communication technology. The Linguistic Research Cluster (LRC) used for computation is continuously extended; it currently provides 1700 CPU cores, 100 GPUs, and 9 TB of RAM (aggregated over the LRC). Storage space is extendable, and every new project may add its own reserved disk space. The basic operating systems are Linux and Windows 7/10.

Project's research objectives:

The main goal of the project is to develop methods to produce explainable multilingual representations of language data. We hypothesize that more explainable pre-trained representations result in more effective and faster training of models solving down-stream tasks. Explainability is a well-established topic in the field of natural language processing (NLP). Besides the theoretical interest (i.e., do neural networks discover some linguistic categories?), it is also motivated by many applications where the decisions of the models need to be justified (e.g., health care, finance). The project's novelty lies in the hypothesis about the connection between explainability and subsequent model performance and in proposing new techniques for obtaining transparent cross-lingual embeddings.

We will explore representations of neural networks in a shared cross-lingual space. Our motivation is to show that the analysis generalizes well to many languages. We will evaluate low-resource languages that were generally out of the scope of previous studies. We hypothesize that cross-lingual transfer can be especially beneficial for this purpose. Therefore, our novel explainable embeddings would benefit further research on languages for which only small corpora are available.

We will create new general-purpose representations that allow a more straightforward interpretation. To this end, we will apply specific transformations to pre-trained embeddings. We hypothesize that such representations are not only more interpretable but also lead to faster training of models on down-stream tasks, since they make the learned information more readily accessible.

We will examine the possibilities of learning multilingual representations enriched with linguistic information. We will evaluate whether this technique can improve the performance of the embeddings.

Methods of research:

In our research, we will analyze the embeddings produced by NLP systems trained on large corpora, e.g., for language modeling or translation. Such representations can be reused to solve other linguistic down-stream tasks, e.g., morphological tagging, syntactic parsing, or coreference resolution. The results on these auxiliary objectives indicate how universal the representations are and which linguistic features are encoded in them.

We will implement transformations of existing embeddings to extract and visualize information related to specific linguistic phenomena. For this purpose, we will use probing [12]: we will fix the model's parameters and transform the model's embeddings with a small neural network optimized for a down-stream task, as sketched below. We will focus on dividing this transformed representation into dimensions capturing particular linguistic phenomena: syntax, lexical information, coreferential links, etc. Probing is a supervised method, i.e., it requires additional training data to fit the transformation. The transformed embeddings will therefore also be used to evaluate the performance of enriched representations. Alternatively, we can introduce auxiliary supervision, e.g., syntactic structure, during the training of the base neural model.
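The sketch below illustrates the probing setup under simplifying assumptions: a frozen encoder provides the embeddings, and only a small linear transformation is trained on a down-stream task (POS tagging here); the dimensions, task, and names are illustrative, not fixed choices of the project.

```python
# A minimal probing sketch in PyTorch: the pre-trained encoder is frozen and
# only a small linear transformation is trained on a down-stream task.
import torch
import torch.nn as nn

class LinearProbe(nn.Module):
    def __init__(self, emb_dim: int = 768, num_tags: int = 17):
        super().__init__()
        self.proj = nn.Linear(emb_dim, num_tags)

    def forward(self, frozen_embeddings: torch.Tensor) -> torch.Tensor:
        return self.proj(frozen_embeddings)

probe = LinearProbe()
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

# `embeddings` would come from the frozen pre-trained model (detached, no
# gradient) and `gold_tags` would be POS labels; random tensors stand in here.
embeddings = torch.randn(32, 768)
gold_tags = torch.randint(0, 17, (32,))

logits = probe(embeddings)
loss = loss_fn(logits, gold_tags)
loss.backward()
optimizer.step()
# High probe accuracy indicates that the frozen embeddings encode the feature.
```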

Multilingual representations will be obtained by training neural models on many languages together [15]. We will focus on enhancing the language independence of the embeddings and on evaluating cross-lingual transfer, especially to low-resource languages. Furthermore, we will extend our analysis of encoded features to typology [9, 10].
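One possible evaluation protocol for such cross-lingual transfer, sketched under our own assumptions: a probe fitted on embeddings of a high-resource language is tested, without further training, on a low-resource language embedded in the same multilingual space.

```python
# Sketch of zero-shot cross-lingual transfer evaluation: train a probe on one
# language's embeddings and test it directly on another language's embeddings.
import numpy as np
from sklearn.linear_model import LogisticRegression

def zero_shot_transfer_accuracy(src_emb, src_labels, tgt_emb, tgt_labels) -> float:
    probe = LogisticRegression(max_iter=1000).fit(src_emb, src_labels)
    return probe.score(tgt_emb, tgt_labels)

# With real data these would be multilingual BERT vectors with, e.g., POS
# labels for a high-resource training language and a low-resource test
# language; random placeholders stand in for them here.
rng = np.random.default_rng(3)
src_emb, tgt_emb = rng.normal(size=(500, 768)), rng.normal(size=(100, 768))
src_lab, tgt_lab = rng.integers(0, 5, 500), rng.integers(0, 5, 100)
print(zero_shot_transfer_accuracy(src_emb, src_lab, tgt_emb, tgt_lab))
```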

Presentation of results:

We will deliver our project's outcomes gradually over the course of three years.

In the first year, we will develop programming tools and prepare data and model instances for analysis. During this time, we will aim to publish at least one paper at a conference or workshop (TSD 2021, BlackboxNLP 2021). We will release our code publicly under an open-source license.

In the second year, we will publish at least one paper in a venue or journal of similar prominence. Furthermore, we will release the final version of the code and ensure that our results are reproducible. We want to disseminate pre-trained instances of our models in the LINDAT repository so that others can use our embeddings in their research.

In the third year, we expect to publish at least one paper at one of the most renowned conferences (ACL, NAACL, EACL, EMNLP) or at a relevant accompanying workshop. The work will be summarized in the principal investigator's Ph.D. thesis.

Attachments