
GAUK – Mateusz Krubiński

Basic information about project No. 291923

Czech title of project: Aritmetické Vlastnosti v prostoru výzev Jazykového Modelu
English title of project: Arithmetic Properties in the space of Language Model Prompts
Current principal researcher: Mgr. Mateusz Krubiński
First applicant: Mateusz Krubiński
Study programme: Matematicko-fyzikální fakulta
Programme: Computational linguistics
Field: Computational linguistics
Type of study programme: PhD programme
Academic department / institute: Ústav formální a aplikované lingvistiky
Year of project inception: 2022
Duration of project: 1 year
Interdisciplinarity: no
Workplace: Institute of Formal and Applied Linguistics
Area board section: Social sciences - Computer science (INF)

Research team

Mgr. Mateusz Krubiński, Scholarships: 80/80
doc. RNDr. Pavel Pecina, Ph.D., Work Assignment Agreement: 13/10

Description of research team - year 2023:

The research team will consist of a Ph.D. student, Mateusz Krubiński, and his supervisor, doc. RNDr. Pavel Pecina, Ph.D.

Mateusz Krubiński is the main investigator and a third-year Ph.D. student in Computational Linguistics at the Institute of Formal and Applied Linguistics.
His dissertation is in the area of multi-modal approaches to Natural Language Processing.
Prior to joining UFAL, he worked for two years as an NLP Engineer in an R&D department in Poland. Before that, he graduated with a master’s degree in Mathematics from the Warsaw University of Technology. He is the first author of 4 peer-reviewed publications (with a 5th one under review) in the areas of Machine Translation and Summarization, published at relevant conferences. During his studies, he worked as an Applied Scientist Intern at the Amazon Development Center in Bucharest, Romania. He will be responsible for the implementation of the planned experiments.
His CV is attached.

Doc. RNDr. Pavel Pecina, Ph.D., is an associate professor working in the area of Computational Linguistics at the Institute of Formal and Applied Linguistics at the Faculty of Mathematics and Physics, Charles University. His research interests include machine translation, information retrieval, and multimodal data interpretation. He has international experience as a post-doc in the Machine Translation group of the Centre for Next Generation Localisation, Dublin City University, Ireland, as an intern with the Natural Language Processing group at Microsoft Research, Redmond, USA, and as a visiting student at the Center for Language and Speech Processing at Johns Hopkins University, Baltimore, USA. He is the author of more than 100 peer-reviewed conference papers and journal articles. Scopus indexes 81 of his papers, with 793 citations (h-index = 17). He was Co-PI of the Center of Excellence project CEMI (2012-2018), funded by the Czech Science Foundation and focused on multimodal data interpretation, and Co-PI of the EU projects KConnect (H2020, 2015-2017), Welcome (H2020, 2020-2023), MEMORISE (Horizon Europe, 2022-2026), and RES-Q (Horizon Europe, 2022-2026). He will supervise the research conducted within the proposed project. He will also assist with conference presentations and the management of the project. His CV, including a list of selected papers, is attached.

Financial requirements

Item                                       Year 2023
Other non-investment costs                 2/2
Travel costs                               37/37
Indirect costs                             19/19
Personnel costs (salaries) and stipends    93/90
Total                                      151/148

Structure of financial requirements - year 2023

The salaries and stipends are in line with the requirements of GAUK and the university salary rules. The principal researcher (Mgr. Mateusz Krubiński) will receive funding of CZK 80,000 for the work on the project: analysis, implementation, measurements, and the final presentation of results at conferences. The project supervisor (doc. RNDr. Pavel Pecina, Ph.D.) will receive a salary of CZK 13,000 for his professional consultations and assessment of prepared papers. Other non-investment expenses (CZK 2,000 annually) cover the purchase of books, stationery, and other small office materials.
Travel and presentation expenses (37,000 Kč) will be used for in-person attendance at one of the relevant conferences, such as:

- ACL 2023 (Toronto, Canada), estimates based on previous years:
  Conference fee: 6,000 Kč ($250)
  Travel costs:
    airplane ticket: 16,000 Kč (round-trip Prague/Toronto with 1 transfer)
    accommodation: 10,000 Kč (5 nights, roughly $80 per night)
    per diem: 5,000 Kč (5 days, $40 per day)
    travel insurance: 800 Kč

- EMNLP 2023 (Singapore), estimates based on previous years:
  Conference fee: 8,000 Kč ($325)
  Travel costs:
    airplane ticket: 21,000 Kč (round-trip Prague/Singapore with 1 transfer)
    accommodation: 3,500 Kč (5 nights, roughly $30 per night)
    per diem: 3,500 Kč (5 days, $30 per day)
    travel insurance: 800 Kč

Financial outlook for following years

N/A

Additional information

Summary:

Large, pre-trained neural language models (LMs) that can effectively utilize enormous amounts of unlabeled textual data have recently changed the whole field of Natural Language Processing (NLP).

One can categorize them into three classes: autoregressive language models (e.g. GPT/GPT-2/GPT-3), masked language models (e.g. BERT), and encoder-decoder models (e.g. T5/BART).

All of them are trained on sequences of tokens sampled from textual data, seeing hundreds of billions of tokens during training. Due to the unsupervised nature of this learning process, it is referred to as “pre-training”, reserving the word “training” for the supervised, downstream tasks to which the models are applied.

In this project, we would like to focus on the autoregressive language models. During pre-training, they are tasked to predict the next token x_i, given the previous tokens x_0, x_1, …, x_{i-1}. This is realized with the training objective of minimizing the negative log-likelihood of each token, conditioned on the previous tokens and the model parameters.
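With model parameters θ, this objective can be written as

    loss(θ) = − Σ_i log p(x_i | x_0, x_1, …, x_{i-1}; θ)

i.e. the model is trained to assign high probability to each observed token given its left context.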
It was observed that, thanks to the variety of textual data seen during training (books, news articles, scientific papers, etc.), these models can perform a variety of NLP tasks when primed with only a handful of samples - no training in the classical sense (updating model weights) is required. For example, assuming that we have access to a set of sentence pairs {s_i, t_i}, with s_i being English sentences and t_i their translations into French, when prompted with the sequence “s_1 in French means t_1 \n s_2 in French means t_2 \n … s_k in French means ”, the autoregressive models are capable of producing (in an autoregressive manner) the correct French translation of the English sentence s_k. It was shown that other classical textual tasks, such as Summarization (“The summary of {} is {}”) or Question Answering (“Question: {} Answer: {}”), are also solvable with the correct prompt. Surprisingly, it has been reported that with the correct prompt the results are competitive with fully-supervised models trained on labeled data. It was also shown that, given the correct prompt, LMs can do basic numerical reasoning: when prompted with a sequence of simple additions, e.g. “How much is 1+2? Answer: 3 \n How much is 4+5? Answer: 9 \n … How much is 6+7? Answer: ”, the model is able to predict the correct value of 13.
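As an illustration, such a few-shot prompt can be assembled programmatically (a minimal sketch in Python; the template and the in-context examples are arbitrary placeholders):

    # Build a few-shot arithmetic prompt; the last line is left
    # unanswered for the model to complete autoregressively.
    def build_prompt(solved, query):
        lines = [f"How much is {a}+{b}? Answer: {a + b}" for a, b in solved]
        lines.append(f"How much is {query[0]}+{query[1]}? Answer:")
        return "\n".join(lines)

    prompt = build_prompt(solved=[(1, 2), (4, 5)], query=(6, 7))
    # "How much is 1+2? Answer: 3\nHow much is 4+5? Answer: 9\nHow much is 6+7? Answer:"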

Our project is inspired by the recently published paper “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity” by Lu et al. 2022. In this paper, the authors show that the order in which the samples appear in the prompt can make the difference between a random guess and near state-of-the-art performance. Using the Machine Translation example that we introduced before, they claim that the quality of the French translation can vary significantly when the same in-context examples are presented in a different order, e.g. when we prompt the model with “s_2 in French means t_2 \n s_1 in French means t_1 \n … s_k in French means ” instead of “s_1 in French means t_1 \n s_2 in French means t_2 \n … s_k in French means ”. In their experiments, they focus on classification tasks, such as sentiment classification or textual entailment.

The question we ask is whether this phenomenon also applies to mathematical expressions, i.e. whether arithmetic operations in the space of language model prompts retain their basic properties, such as the commutative property. One would expect that a system capable of conducting numerical reasoning would behave the same when prompted with “1+2+3” vs. “2+3+1”. In addition, we plan to conduct a detailed analysis of the failed cases, trying to determine their cause.
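The set of prompts that are equivalent under commutativity can be enumerated directly (a minimal sketch; the operand values are arbitrary):

    from itertools import permutations

    # All orderings of the operands yield mathematically equivalent sums;
    # a model that respects commutativity should answer them identically.
    operands = [1, 2, 3]
    variants = ["+".join(map(str, p)) for p in permutations(operands)]
    # ['1+2+3', '1+3+2', '2+1+3', '2+3+1', '3+1+2', '3+2+1']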

Current state of knowledge:

The GPT-3 model (Brown et al. 2020) was tested for its ability to perform simple arithmetic operations. It was challenged to perform 2-, 3-, 4- and 5-digit addition, 2-, 3-, 4- and 5-digit subtraction, and 2-digit multiplication. Several experiments were reported, with operands sampled uniformly from the [0, 10^k) interval, for several values of k. The performance of the model was heavily dependent on the model size (number of trainable parameters) and the number of examples in the prompt. The largest variant, with 175B parameters, achieves 100% accuracy on 2-digit addition, 80% on 3-digit addition, and only 25% on 4-digit addition. Reducing the number of examples in the prompt from the 50 used in most experiments to 10 degraded the performance, especially on the more challenging tasks (4-digit addition: 25% → 14%). When smaller models are used, the accuracy drops significantly (~13% on 2-digit addition for a model with 6.7B parameters).

Besides reporting accuracy, no other fine-grained analysis of the failed cases was conducted. The only detailed analysis performed was a check for the presence of “<NUM1> + <NUM2> =” and “<NUM1> plus <NUM2>” expressions in the training data, with the finding that less than 1% of the equations used during testing could be found in the corpus, indicating that the performance is not due to the model memorizing the answers.

While previous works approached the problem of mathematical reasoning (Patel et al. 2021, Zhou et al. 2020), they focused on more advanced tasks such as unit conversion or elementary-school-level math quizzes. To the best of our knowledge, only a single work investigated in detail the performance of autoregressive language models on arithmetic tasks. Razeghi et al. (2022) analyzed the publicly available Pile dataset (Gao et al., 2020), a large-scale language modeling dataset consisting of English documents (800GB of text data), and the open-sourced GPT-J (6B parameters) and GPT-Neo (2 versions, 1.3B and 2.7B parameters) language models trained on it. Their findings indicate that the model performance on the 2-digit addition task is heavily correlated with the frequency of both operands in the training data. They, however, do not consider the influence of other factors, such as digit order, and report only the accuracy of correct predictions.

. . .

[1] Tom Brown, Benjamin Mann, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

[2] Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 2020. Temporal Common Sense Acquisition with Minimal Supervision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7579–7589, Online. Association for Computational Linguistics.

[3] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP Models really able to Solve Simple Math Word Problems?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

[4] Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of Pretraining Term Frequencies on Few-Shot Reasoning. arXiv preprint. https://arxiv.org/pdf/2202.07206v1.pdf

[5] https://huggingface.co/EleutherAI

Explanation of relations with other projects addressed by the Supervisor or Principal Researcher:

The Institute of Formal and Applied Linguistics has a long tradition of research in the field of computational linguistics and processing of text in natural languages.

Recently, the THEaiTRE project (TAČR, TL03000348, 2020-2022) explored the usage of autoregressive language models for the automatic generation of theater play scripts.
The supervisor is currently an investigator of the Horizon Europe projects MEMORISE (HE, 101061016, 2022-2026) and RES-Q+ (HE, 101057603, 2022-2026), as well as the Horizon 2020 project WELCOME (H2020, 870930, 2020-2023), all of which fall within the broader domain of NLP applications. The specific topic of the proposed project has not been explored so far; it will fit very well into the research directions of the department and will contribute to its expertise.

Facilities at the project's disposal:

With a cluster of about 1,700 CPU cores and 100 GPUs, UFAL provides rich computational resources for performing research experiments. Additionally, UFAL provides rich linguistic databases (audio-visual corpora, treebanks, etc.), as well as various tools (Automatic Speech Recognition, Machine Translation) for conducting and supporting the research.

Project's research objectives:

The first objective of this project is to experimentally determine whether the basic properties of arithmetic operations, such as the Commutative Law, hold in the space of large neural language model prompts. We plan to measure to what extent the ability of the model to conduct numerical reasoning changes (by measuring, e.g., Accuracy or the Mean Square Error of the prediction) when some additional structure (that yields an equivalent equation from the purely mathematical point of view) is introduced into the input.
The second objective is to conduct a quantitative and qualitative analysis of the failed predictions; previous works reported only the binary accuracy of perfect predictions. To properly understand the model behavior, and to make it more robust to input perturbations, methods from Explainable AI (XAI) will be used. These enable, e.g., highlighting the sub-sequence of input tokens that had the greatest influence on the model prediction (one such gradient-based approach is sketched below).
The third objective is to explore whether the findings hold across different datasets used for pre-training, and whether increasing the number of trainable parameters in the Language Model makes the model more robust to input perturbations (equivalently, whether the additional structure introduced in the input helps with predicting the correct result). We plan on doing that by exploring publicly available Language Models such as GPT-J or GPT-Neo.
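As an illustration of the XAI direction mentioned in the second objective, a simple gradient-based saliency score can be computed for each prompt token (a sketch, assuming the HuggingFace transformers and PyTorch libraries; the checkpoint name is illustrative and other attribution methods may be used instead):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/gpt-neo-1.3B"  # illustrative; GPT-J is handled the same way
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    ids = tok("How much is 6+7? Answer:", return_tensors="pt")["input_ids"]
    # Run the model on explicit input embeddings so gradients can flow to them.
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]
    logits.max().backward()  # gradient of the top-scoring next-token logit
    saliency = embeds.grad.norm(dim=-1)[0]  # one influence score per input token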

Methods of research:

There is no dedicated dataset for arithmetic operation modeling. Due to the nature of the task, all previous works generate data on the fly, by sampling operands from a given distribution (the left side of the equation) and simply performing the operation (addition, subtraction, multiplication) to obtain the right side. We will obtain the data following the same procedure.
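A minimal sketch of this generation procedure (the sampling range and the operation are parameters of the experiment):

    import random

    def make_equation(k=2, op="+"):
        # Sample both operands uniformly from [0, 10^k), as in previous work.
        a, b = random.randrange(10**k), random.randrange(10**k)
        result = {"+": a + b, "-": a - b, "*": a * b}[op]
        return f"{a} {op} {b} =", result  # prompt left side, expected right side

    left, target = make_equation(k=3, op="+")  # e.g. ("123 + 456 =", 579)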

Concerning the autoregressive language models that we can use in our experiments, we plan to use the open-source and publicly available GPT-J and GPT-Neo models. This allows us to test the effect of the model size (1.3B, 2.7B, or 6B parameters) on the performance.
For comparison (after fine-tuning the inference parameters using the open-sourced models), we also plan to use the GPT-3 models available via the API hosted by OpenAI. At a price of roughly 2 USD per 1M tokens, inference on a reasonably sized dataset (2k equations) is a negligible cost.
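Querying the open-sourced models follows the standard HuggingFace transformers interface (a sketch; the checkpoint name and the generation settings are illustrative):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/gpt-neo-1.3B"  # GPT-Neo 2.7B and GPT-J 6B load the same way
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = "How much is 1+2? Answer: 3\nHow much is 4+5? Answer: 9\nHow much is 6+7? Answer:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=4, do_sample=False)  # greedy decoding
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:])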

We plan to examine how the ordering of digits within the equations in the prompt influences the models' ability to conduct arithmetic operations. We also plan to examine different kinds of ordering, e.g. numerical vs. lexicographic. While it is natural for a human to sort strings that represent numbers numerically, it might be the case that Language Models are more affected by lexicographic ordering. Furthermore, we plan to report different metrics besides accuracy, namely the Mean Square Error (MSE) and the correlation coefficient. This will allow for a more fine-grained analysis; previous results suggest that smaller LMs cannot handle 2-digit multiplication very well, but we do not really know what kinds of errors they are making.
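These metrics can be computed directly from the numeric predictions (a sketch, assuming unparseable model outputs are filtered out beforehand):

    import numpy as np
    from scipy.stats import pearsonr

    def evaluate(predictions, targets):
        p = np.asarray(predictions, dtype=float)
        t = np.asarray(targets, dtype=float)
        accuracy = float(np.mean(p == t))   # exact-match accuracy
        mse = float(np.mean((p - t) ** 2))  # Mean Square Error
        corr = pearsonr(p, t)[0]            # Pearson correlation coefficient
        return accuracy, mse, corr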

Finally, we plan to examine the effect of digit frequency not only in the training data but also in the vocabulary used to tokenize the input. If the tokenizer does not split every number into a sequence of digits, as one might expect (123 + 1 → 1 <s> 2 <s> 3 <s> + <s> 1), this can have a non-trivial influence. When a 3-digit multiplication is performed, the output can have up to 6 digits, and considering the autoregressive nature of generation, producing 6 tokens (1 token per digit) would be much more difficult than producing e.g. 2 tokens (3 digits per token).
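Whether this happens can be inspected directly on the tokenizers of the models in question (a sketch; GPT-Neo and GPT-J reuse the GPT-2 BPE vocabulary, which in general does not tokenize digit by digit):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    print(tok.tokenize("123 + 1"))
    # The BPE may keep multi-digit chunks together,
    # e.g. ['123', 'Ġ+', 'Ġ1'] rather than one token per digit.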

Presentation of results:

The results of the research described in this proposal will be submitted and presented at “The 2nd Workshop on Mathematical Natural Language Processing”, to be hosted in 2023 at one of the ACL conferences (ACL/EMNLP/AACL-IJCNLP).
Other possible venues include “The 7th Workshop on Structured Prediction for NLP” and the “Workshop on Challenges & Perspectives in Creating Large Language Models”, both hosted in 2023 at ACL conferences.

Attachments

