* Pavel Pecina CV.pdf (Supervisor's CV)
* Mateusz Krubiński CV.pdf (Principal researcher's CV)

----

===== THE FINAL REPORT ON GAUK =====

----

==== Report on research progress for the last year ====
During the last year (which was also the first year of the project), we conducted a number of experiments to experimentally determine whether basic arithmetic properties hold in the space of language model prompts - e.g., does asking the LLM to compute 13 + 37 vs. 37 + 13 result, on average, in the same outcome? The outcome of those experiments is a research paper published at a peer-reviewed workshop, "MATH-AI: The 3rd Workshop on Mathematical Reasoning and AI", co-located with the NeurIPS 2023 conference in New Orleans, USA.

At the beginning of 2023, we prepared code for generating the arithmetic prompts and developed scripts for inference with both open-source LLMs (on the UFAL cluster) and proprietary ones (the OpenAI API). In the following months, we performed a number of experiments (using smaller LLMs from the GPT-J/GPT-Neo family, to enable faster iteration) to determine which properties of the prompt affect the predictions of LLMs and which do not. We identified, for example, that the symbol used to indicate addition ("+" vs. "plus") does not affect the outcome in a substantial manner, but the distribution of digits in the prompt does. Once we had built some intuition, we repeated the experiments with larger open-source LLMs (the OPT family) and with proprietary ones accessible via a paid API (the cost was covered by the institute from a dedicated grant).

In the second half of 2023, we refined the findings and worked on the manuscript. The paper was submitted in September. Once we received the notification of acceptance in late October, we spent two weeks in November working on the camera-ready version - polishing the text and adding experiments (LLaMA models and fine-tuning experiments), as requested by the reviewers. In December, the work was presented as a poster, in person, at the workshop co-located with NeurIPS 2023.
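As an illustration of the first step, here is a minimal sketch of how such order-swapped prompt pairs can be generated. The template wording, function names, and sampling scheme are illustrative assumptions, not the exact scripts used in the project:

<code python>
import random

# Hypothetical sketch of the commutativity probe: for each operand pair (a, b),
# build two prompts that differ only in operand order; the model's average
# outcomes on both sides can then be compared. The prompt wording below is
# illustrative, not the exact template used in the paper.
TEMPLATE = "Q: What is {a} + {b}?\nA:"

def make_commutative_pair(a: int, b: int) -> tuple[str, str]:
    """Return the same query with the operands in both orders."""
    return TEMPLATE.format(a=a, b=b), TEMPLATE.format(a=b, b=a)

def sample_pairs(n: int, digits: int = 2, seed: int = 0):
    """Sample n operand pairs with a fixed number of digits."""
    rng = random.Random(seed)
    lo, hi = 10 ** (digits - 1), 10 ** digits - 1
    for _ in range(n):
        a, b = rng.randint(lo, hi), rng.randint(lo, hi)
        yield (a, b), make_commutative_pair(a, b)

if __name__ == "__main__":
    for (a, b), (p1, p2) in sample_pairs(3):
        print(repr(p1), "|", repr(p2), "| gold:", a + b)
</code>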

==== Fulfillment of the project's objectives ====
To introduce the terminology: we will refer to the first part of the prompt (the one with the examples) as the template and to the second part (the one we wish to complete) as the query.

The first project objective was to experimentally determine whether the basic properties of arithmetic operations hold in the space of large neural language model prompts. In our experiments, we explored the commutative, associative, and distributive properties by prompting several LLMs (from 2.7 billion to 175 billion parameters) from several families (OPT [Zhang et al., 2022], GPT [OpenAI, 2023], and LLaMA/LLaMA 2 [Touvron et al., 2023]). We measured how changing the template to a more/less structured one affects the average outcomes for certain queries. Our results were inconclusive - while number ordering does not influence the largest GPT model, including parentheses may influence the tokenization and, thus, performance. Since the smaller models performed much worse on the tasks we considered (2-digit addition, 2-digit multiplication, and 3-digit addition), we could not formulate decisive findings.

The second project objective was to conduct a quantitative and qualitative analysis of the failed predictions. As part of the quantitative analysis, we not only measured Accuracy (the fraction of correct predictions) but also looked at the (average) squared error and the correlation between the correct numerical results and the model outputs. While those metrics mostly agreed (higher Accuracy = smaller prediction error), we identified some cases where this was not true. We approached the qualitative analysis by examining the number of tokens required to encode a particular math expression. We realized that the tokenizers used by current models can produce different outputs if the positions of a digit and a parenthesis are switched, e.g., "(27 " vs. "27)"; a sketch of this check follows below.

The third project objective was to explore whether the findings hold across different datasets used for pre-training and whether increasing the number of trainable parameters in the language model makes it more robust to input perturbations. Since we explored off-the-shelf LLMs, we could not explicitly determine the effect of the dataset used for pre-training. We did, however, approach this by fine-tuning two open-source LLMs on an artificial dataset of math expressions and experimentally showed that, by doing so, we could improve the Accuracy by a large margin. In our experiments, we explored several models (from 2.7 billion to 175 billion parameters) and plotted the performance against size, showing that larger models are indeed more robust. By exploring two consecutive versions of the same model (LLaMA vs. LLaMA 2), we also showed that some improvements could indeed be attributed to the training data - those models have the same number of trainable parameters and use the same tokenizer.
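As an illustration of the tokenization effect mentioned above, a minimal sketch (assuming the Hugging Face transformers GPT-2 tokenizer as a stand-in for the tokenizers we examined; other model families ship different vocabularies, so the exact splits may vary):

<code python>
from transformers import AutoTokenizer

# Compare how a GPT-2-style BPE tokenizer splits the same digits when the
# parenthesis sits before vs. after them. The tokenizer choice is an
# illustrative assumption; the strings come from the example above.
tokenizer = AutoTokenizer.from_pretrained("gpt2")

for text in ["(27 ", "27)"]:
    tokens = tokenizer.tokenize(text)
    print(f"{text!r} -> {tokens} ({len(tokens)} tokens)")
</code>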

==== Final report ====
We fulfilled all of the project's initial objectives, as described in the dedicated section of this report ("Fulfillment of the project's objectives"). An outcome of the project is a peer-reviewed publication, published at a relevant workshop, that acknowledges support from GAUK. Since the project topic is not connected to the main research topic of the principal researcher, the other papers that we published in 2023 do not acknowledge GAUK. We spent the allocated budget according to the initial plan, with one minor adjustment (2,000 CZK were moved from "Other non-investment costs" to "Travel costs"). Since the project was granted for a single year, the whole research progress is described in detail in the "Report on research progress for the last year" section. We would like to express our gratitude to the Charles University Grant Agency for funding this project.

==== Commentary regarding used-up funds ====
Funds allocated to "Indirect costs", "Scholarships", and "Salaries" were spent according to the financial requirements, i.e., on a stipend for the principal researcher and a salary for the project supervisor. The budget dedicated to "Other non-investment costs" was reallocated to increase the funds for travel - the original 37k CZK was not enough to cover all of the costs related to conference participation. In December 2023, I presented (as a poster) the paper "Basic Arithmetic Properties in the Space of Language Model Prompts" at the "MATH-AI: The 3rd Workshop on Mathematical Reasoning and AI" workshop co-located with the NeurIPS conference (CORE rank: A+) in New Orleans, USA. The exact expenses were as follows: conference registration fee: 475 USD = 11,290.11 CZK; plane ticket Miami-Prague: 12,274.31 CZK (I paid for the ticket from New Orleans to Miami myself, as I spent some days off in Miami); and accommodation in New Orleans: 15,435.58 CZK, which adds up to 39,000 CZK. The flight to New Orleans and the travel/meal allowance costs were covered by CELSA project no. 19/018, from which I was co-funded.