
GAUK – Mateusz Krubiński

Basic information about project No. 291923

Czech title of project: Aritmetické Vlastnosti v prostoru výzev Jazykového Modelu
English title of project: Arithmetic Properties in the space of Language Model Prompts
Current principal researcher: Mgr. Mateusz Krubiński
First applicant: Mateusz Krubiński
Study programme: Matematicko-fyzikální fakulta
Programme: Computational linguistics
Field: Computational linguistics
Type of study programme: PhD programme
Academic department / institute: Ústav formální a aplikované lingvistiky
Year of project inception: 2022
Duration of project: 1 year
Interdisciplinarity: no
Workplace: Institute of Formal and Applied Linguistics
Area board section: Social sciences - Computer science (INF)

Research team

Mgr. Mateusz Krubiński, Scholarships: 80/80
doc. RNDr. Pavel Pecina, Ph.D., Work Assignment Agreement: 13/10

Description of research team - year 2023:

The research team will consist of a Ph.D. student, Mateusz Krubiński, and his supervisor, doc. RNDr. Pavel Pecina, Ph.D.

Mateusz Krubiński is the main investigator and a third-year Ph.D. student in Computational Linguistics at the Institute of Formal and Applied Linguistics.
His dissertation is in the area of multi-modal approaches to Natural Language Processing.
Prior to joining UFAL, he worked for two years as an NLP Engineer in an R&D department in Poland. Before that, he graduated with a master’s degree in Mathematics from the Warsaw University of Technology. He is the first author of 4 peer-reviewed publications (with a 5th one under review) in the areas of Machine Translation and Summarization, published at relevant conferences. During his studies, he worked as an Applied Scientist Intern at the Amazon Development Center in Bucharest, Romania. He will be responsible for the implementation of the planned experiments.
His CV is attached.

Doc. RNDr. Pavel Pecina, Ph.D., is an associate professor working in the area of Computational Linguistics at the Institute of Formal and Applied Linguistics at the Faculty of Mathematics and Physics, Charles University. His research interests include machine translation, information retrieval, and multimodal data interpretation. He has international experience as a post-doc in the Machine Translation group of the Centre for Next Generation Localisation, Dublin City University, Ireland, as an intern with the Natural Language Processing group at Microsoft Research, Redmond, USA, and as a visiting student at the Center for Language and Speech Processing at Johns Hopkins University, Baltimore, USA. He is the author of more than 100 peer-reviewed conference papers and journal articles. Scopus indexes 81 of his papers, with 793 citations (h-index = 17). He was Co-PI of the Center of Excellence project CEMI (2012-2018), funded by the Czech Science Foundation and focused on multimodal data interpretation, and Co-PI of the EU projects KConnect (H2020, 2015-2017), Welcome (H2020, 2020-2023), MEMORISE (Horizon Europe, 2022-2026), and RES-Q (Horizon Europe, 2022-2026). He will supervise the research conducted within the proposed project. He will also assist with conference presentations and the management of the project. His CV, including a list of selected papers, is attached.

Financial requirements

Item                                       Year 2023
Other non-investment costs                 2/2
Travel costs                               37/37
Indirect costs                             19/19
Personnel costs (salaries) and stipends    93/90
Total                                      151/148

Structure of financial requirements - year 2023

The salaries and stipends are in line with the requirements of GAUK and the university salary rules. The principal researcher (Mgr. Mateusz Krubiński) will receive funding of CZK 80,000 for the work on the project: analysis, implementation, measurements, and the final presentation of results at conferences. The project supervisor (doc. RNDr. Pavel Pecina, Ph.D.) will receive a salary of CZK 13,000 for his professional consultations and assessment of prepared papers. Other non-investment expenses (CZK 2,000 annually) cover the purchase of books, stationery, and other small office materials.
Travel and presentation expenses (37,000 Kč) will be used for in-person attendance at one of the relevant conferences, such as:

- ACL 2023 (Toronto, Canada), estimates based on previous years:
  Conference fee: 6,000 Kč ($250)
  Travel costs:
    airplane ticket: 16,000 Kč (round-trip Prague/Toronto with 1 transfer)
    accommodation: 10,000 Kč (5 nights, roughly $80 per night)
    per diem: 5,000 Kč (5 days, $40 per day)
    travel insurance: 800 Kč

- EMNLP 2023 (Singapore), estimates based on previous years:
  Conference fee: 8,000 Kč ($325)
  Travel costs:
    airplane ticket: 21,000 Kč (round-trip Prague/Singapore with 1 transfer)
    accommodation: 3,500 Kč (5 nights, roughly $30 per night)
    per diem: 3,500 Kč (5 days, $30 per day)
    travel insurance: 800 Kč

Financial outlook for following years

N/A

Additional information

Summary:

Large, pre-trained neural language models (LMs) that can effectively utilize enormous amounts of unlabeled textual data have recently changed the whole field of Natural Language Processing (NLP).

One can categorize them into three classes: autoregressive language models (e.g. GPT/GPT-2/GPT-3), masked language models (e.g. BERT), and encoder-decoder models (e.g. T5/BART).

All of them are trained on sequences of tokens sampled from textual data, seeing hundreds of billions of tokens during training. Due to the unsupervised nature of this learning process, it is referred to as “pre-training”, reserving the word “training” for the supervised, downstream tasks to which the models are applied.

In this project, we would like to focus on the autoregressive language models. During pre-training, they are tasked to predict the next token x_i, given the previous tokens x_0, x_1, …, x_{i-1}. This is realized with the training objective of minimizing the negative log-likelihood of each token, conditioned on the previous tokens and the model parameters.
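With model parameters θ, this objective can be written as

    loss(θ) = − Σ_i log p(x_i | x_0, x_1, …, x_{i-1}; θ)

i.e. the model is trained to assign high probability to each observed token given its left context.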
It was observed that, thanks to the variety of textual data seen during training (books, news articles, scientific papers, etc.), these models can perform a variety of NLP tasks when primed with only a handful of samples - no training in the classical sense (updating model weights) is required. For example, assuming that we have access to a set of sentence pairs {s_i, t_i}, with s_i being English sentences and t_i their translations into French, when prompted with the sequence “s_1 in French means t_1 \n s_2 in French means t_2 \n … s_k in French means ”, the autoregressive models are capable of producing (in an autoregressive manner) the correct French translation of the English sentence s_k. It was shown that other classical textual tasks, such as Summarization (“The summary of {} is {}”) or Question Answering (“Question: {} Answer: {}”), are also solvable with the correct prompt. Surprisingly, it has been reported that with the correct prompt the results are competitive with fully-supervised models trained on labeled data. It was also shown that, given the correct prompt, LMs can do basic numerical reasoning: when prompted with a sequence of simple additions, e.g. “How much is 1+2? Answer: 3 \n How much is 4+5? Answer: 9 \n … How much is 6+7? Answer: ”, the model is able to predict the correct value of 13.
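As an illustration, such a few-shot prompt can be assembled programmatically (a minimal sketch in Python; the template and the in-context examples are arbitrary placeholders):

    # Build a few-shot arithmetic prompt; the last line is left
    # unanswered for the model to complete autoregressively.
    def build_prompt(solved, query):
        lines = [f"How much is {a}+{b}? Answer: {a + b}" for a, b in solved]
        lines.append(f"How much is {query[0]}+{query[1]}? Answer:")
        return "\n".join(lines)

    prompt = build_prompt(solved=[(1, 2), (4, 5)], query=(6, 7))
    # "How much is 1+2? Answer: 3\nHow much is 4+5? Answer: 9\nHow much is 6+7? Answer:"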

Our project is inspired by the recently published paper “Fantastically Ordered Prompts and Where to Find Them: Overcoming Few-Shot Prompt Order Sensitivity” by Lu et al. 2022. In this paper, the authors show that the order in which the samples appear in the prompt can make the difference between a random guess and near state-of-the-art performance. Using the Machine Translation example that we introduced before, they claim that the quality of the French translation can vary significantly when the same in-context examples are presented in a different order, e.g. when we prompt the model with “s_2 in French means t_2 \n s_1 in French means t_1 \n … s_k in French means ” instead of “s_1 in French means t_1 \n s_2 in French means t_2 \n … s_k in French means ”. In their experiments, they focus on classification tasks, such as sentiment classification or textual entailment.

The question we ask is whether this phenomenon also applies to mathematical expressions, i.e. whether arithmetic operations in the space of language model prompts retain their basic properties, such as the commutative property. One would expect that a system capable of conducting numerical reasoning would behave the same when prompted with “1+2+3” vs. “2+3+1”. In addition, we plan to conduct a detailed analysis of the failed cases, trying to determine their cause.
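The set of prompts that are equivalent under commutativity can be enumerated directly (a minimal sketch; the operand values are arbitrary):

    from itertools import permutations

    # All orderings of the operands yield mathematically equivalent sums;
    # a model that respects commutativity should answer them identically.
    operands = [1, 2, 3]
    variants = ["+".join(map(str, p)) for p in permutations(operands)]
    # ['1+2+3', '1+3+2', '2+1+3', '2+3+1', '3+1+2', '3+2+1']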

Current state of knowledge:

The GPT-3 model (Brown et al. 2020) was tested for its ability to perform simple arithmetic operations. It was challenged to perform 2-, 3-, 4- and 5-digit addition, 2-, 3-, 4- and 5-digit subtraction, and 2-digit multiplication. Several experiments were reported, with operands sampled uniformly from the [0, 10^k) interval, for several values of k. The performance of the model was heavily dependent on the model size (number of trainable parameters) and the number of examples in the prompt. The largest variant, with 175B parameters, achieves 100% accuracy on 2-digit addition, 80% on 3-digit addition, and only 25% on 4-digit addition. Reducing the number of examples in the prompt from the 50 used in most experiments to 10 degraded the performance, especially on the more challenging tasks (4-digit addition: 25% → 14%). When smaller models are used, the accuracy drops significantly (~13% on 2-digit addition for a model with 6.7B parameters).

Besides reporting accuracy, no other fine-grained analysis of the failed cases was conducted. The only detailed analysis performed was a check for the presence of “<NUM1> + <NUM2> =” and “<NUM1> plus <NUM2>” expressions in the training data, with the finding that less than 1% of the equations used during testing could be found in the corpus, indicating that the performance is not due to the model memorizing the answers.

While previous works approached the problem of mathematical reasoning (Patel et al. 2021, Zhou et al. 2020), they focused on more advanced tasks such as unit conversion or elementary-school-level math quizzes. To the best of our knowledge, only a single work investigated in detail the performance of autoregressive language models on arithmetic tasks. Razeghi et al. (2022) analyzed the publicly available Pile dataset (Gao et al., 2020), a large-scale language modeling dataset consisting of English documents (800GB of text data), and the open-sourced GPT-J (6B parameters) and GPT-Neo (2 versions, 1.3B and 2.7B parameters) language models trained on it. Their findings indicate that the model performance on the 2-digit addition task is heavily correlated with the frequency of both operands in the training data. They, however, do not consider the influence of other factors, such as digit order, and report only the accuracy of correct predictions.

. . .

[1] Tom Brown, Benjamin Mann, et al. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems 33 (NeurIPS 2020). https://proceedings.neurips.cc/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf

[2] Ben Zhou, Qiang Ning, Daniel Khashabi, and Dan Roth. 2020. Temporal Common Sense Acquisition with Minimal Supervision. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 7579–7589, Online. Association for Computational Linguistics.

[3] Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. Are NLP Models really able to Solve Simple Math Word Problems?. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080–2094, Online. Association for Computational Linguistics.

[4] Yasaman Razeghi, Robert L. Logan IV, Matt Gardner, and Sameer Singh. 2022. Impact of Pretraining Term Frequencies on Few-Shot Reasoning. arXiv preprint. https://arxiv.org/pdf/2202.07206v1.pdf

[5] https://huggingface.co/EleutherAI

Explanation of relations with other projects addressed by the Supervisor or Principal Researcher:

The Institute of Formal and Applied Linguistics has a long tradition of research in the field of computational linguistics and processing of text in natural languages.

Recently, the THEaiTRE project (TAČR, TL03000348, 2020-2022) explored the usage of autoregressive language models for the automatic generation of theater play scripts.
The supervisor is currently an investigator of the Horizon Europe projects MEMORISE (HE, 101061016, 2022-2026) and RES-Q+ (HE, 101057603, 2022-2026), as well as the Horizon 2020 project WELCOME (H2020, 870930, 2020-2023), all of which fall within the broader domain of NLP applications. The specific topic of the proposed project has not been explored so far; it will fit very well into the research directions of the department and will contribute to its expertise.

Facilities at the project's disposal:

With a cluster of about 1,700 CPU cores and 100 GPUs, UFAL provides rich computational resources for performing research experiments. Additionally, UFAL provides rich linguistic databases (audio-visual corpora, treebanks, etc.), as well as various tools (Automatic Speech Recognition, Machine Translation) for conducting and supporting the research.

Project's research objectives:

The first objective of this project is to experimentally determine whether the basic properties of arithmetic operations, such as the Commutative Law, hold in the space of large neural language model prompts. We plan to measure to what extent the ability of the model to conduct numerical reasoning changes (by measuring, e.g., Accuracy or the Mean Square Error of the prediction) when some additional structure (that yields an equivalent equation from the purely mathematical point of view) is introduced into the input.
The second objective is to conduct a quantitative and qualitative analysis of the failed predictions; previous works reported only the binary accuracy of perfect predictions. To properly understand the model behavior, and to make it more robust to input perturbations, methods from Explainable AI (XAI) will be used. These enable, e.g., highlighting the sub-sequence of input tokens that had the greatest influence on the model prediction (one such gradient-based approach is sketched below).
The third objective is to explore whether the findings hold across different datasets used for pre-training, and whether increasing the number of trainable parameters in the Language Model makes the model more robust to input perturbations (equivalently, whether the additional structure introduced in the input helps with predicting the correct result). We plan on doing that by exploring publicly available Language Models such as GPT-J or GPT-Neo.
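As an illustration of the XAI direction mentioned in the second objective, a simple gradient-based saliency score can be computed for each prompt token (a sketch, assuming the HuggingFace transformers and PyTorch libraries; the checkpoint name is illustrative and other attribution methods may be used instead):

    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/gpt-neo-1.3B"  # illustrative; GPT-J is handled the same way
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    ids = tok("How much is 6+7? Answer:", return_tensors="pt")["input_ids"]
    # Run the model on explicit input embeddings so gradients can flow to them.
    embeds = model.get_input_embeddings()(ids).detach().requires_grad_(True)
    logits = model(inputs_embeds=embeds).logits[0, -1]
    logits.max().backward()  # gradient of the top-scoring next-token logit
    saliency = embeds.grad.norm(dim=-1)[0]  # one influence score per input token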

Methods of research:

There is no dedicated dataset for arithmetic operation modeling. Due to the nature of the task, all previous works generate data on the fly, by sampling operands from a given distribution (the left side of the equation) and simply performing the operation (addition, subtraction, multiplication) to obtain the right side. We will obtain the data following the same procedure.
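A minimal sketch of this generation procedure (the sampling range and the operation are parameters of the experiment):

    import random

    def make_equation(k=2, op="+"):
        # Sample both operands uniformly from [0, 10^k), as in previous work.
        a, b = random.randrange(10**k), random.randrange(10**k)
        result = {"+": a + b, "-": a - b, "*": a * b}[op]
        return f"{a} {op} {b} =", result  # prompt left side, expected right side

    left, target = make_equation(k=3, op="+")  # e.g. ("123 + 456 =", 579)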

Concerning the autoregressive language models that we can use in our experiments, we plan to use the open-source and publicly available GPT-J and GPT-Neo models. This allows us to test the effect of the model size (1.3B, 2.7B, or 6B parameters) on the performance.
For comparison (after fine-tuning the inference parameters using the open-sourced models), we also plan to use the GPT-3 models available via the API hosted by OpenAI. At a price of roughly 2 USD per 1M tokens, inference on a reasonably sized dataset (2k equations) is a negligible cost.
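Querying the open-sourced models follows the standard HuggingFace transformers interface (a sketch; the checkpoint name and the generation settings are illustrative):

    from transformers import AutoModelForCausalLM, AutoTokenizer

    name = "EleutherAI/gpt-neo-1.3B"  # GPT-Neo 2.7B and GPT-J 6B load the same way
    tok = AutoTokenizer.from_pretrained(name)
    model = AutoModelForCausalLM.from_pretrained(name)

    prompt = "How much is 1+2? Answer: 3\nHow much is 4+5? Answer: 9\nHow much is 6+7? Answer:"
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=4, do_sample=False)  # greedy decoding
    completion = tok.decode(out[0][inputs["input_ids"].shape[1]:])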

We plan to examine how the ordering of digits within the equations in the prompt influences the models' ability to conduct arithmetic operations. We also plan to examine different kinds of ordering, e.g. numerical vs. lexicographic. While it is natural for a human to sort strings that represent numbers numerically, it might be the case that Language Models are more affected by lexicographic ordering. Furthermore, we plan to report different metrics besides accuracy, namely the Mean Square Error (MSE) and the correlation coefficient. This will allow for a more fine-grained analysis; previous results suggest that smaller LMs cannot handle 2-digit multiplication very well, but we do not really know what kinds of errors they are making.
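These metrics can be computed directly from the numeric predictions (a sketch, assuming unparseable model outputs are filtered out beforehand):

    import numpy as np
    from scipy.stats import pearsonr

    def evaluate(predictions, targets):
        p = np.asarray(predictions, dtype=float)
        t = np.asarray(targets, dtype=float)
        accuracy = float(np.mean(p == t))   # exact-match accuracy
        mse = float(np.mean((p - t) ** 2))  # Mean Square Error
        corr = pearsonr(p, t)[0]            # Pearson correlation coefficient
        return accuracy, mse, corr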

Finally, we plan to examine the effect of digit frequency not only in the training data but also in the vocabulary used to tokenize the input. If the tokenizer does not split every number into a sequence of digits, as one might expect (123 + 1 → 1 <s> 2 <s> 3 <s> + <s> 1), this can have a non-trivial influence. When a 3-digit multiplication is performed, the output can have up to 6 digits, and considering the autoregressive nature of generation, producing 6 tokens (1 token per digit) would be much more difficult than producing e.g. 2 tokens (3 digits per token).
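Whether this happens can be inspected directly on the tokenizers of the models in question (a sketch; GPT-Neo and GPT-J reuse the GPT-2 BPE vocabulary, which in general does not tokenize digit by digit):

    from transformers import AutoTokenizer

    tok = AutoTokenizer.from_pretrained("EleutherAI/gpt-neo-1.3B")
    print(tok.tokenize("123 + 1"))
    # The BPE may keep multi-digit chunks together,
    # e.g. ['123', 'Ġ+', 'Ġ1'] rather than one token per digit.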

Presentation of results:

The results of the research described in this proposal will be submitted and presented at “The 2nd Workshop on Mathematical Natural Language Processing”, to be hosted in 2023 at one of the ACL conferences (ACL/EMNLP/AACL-IJCNLP).
Other possible venues include “The 7th Workshop on Structured Prediction for NLP” and the “Workshop on Challenges & Perspectives in Creating Large Language Models”, both hosted in 2023 at ACL conferences.

Attachments

