arxiv:2409.17391

Scaling Behavior for Large Language Models regarding Numeral Systems: An Example using Pythia

Published on Sep 25

Abstract

Though Large Language Models (LLMs) have shown remarkable abilities in mathematical reasoning, they still struggle to perform numeric operations such as addition and multiplication accurately. Different LLMs tokenize numbers in different ways, and this choice affects numeric-operation performance. Two schemes are currently representative: 1) tokenizing each digit separately, and 2) tokenizing chunks of 1–3 digits. The difference is roughly equivalent to using different numeral systems (namely base 10 or base 10^3). In light of this, we study the scaling behavior of different numeral systems in the context of transformer-based large language models. We empirically show that a base 10 system is consistently more data-efficient than a base 10^2 or base 10^3 system across training-data scales and model sizes under from-scratch training, while the different numeral systems reach very similar fine-tuning performance. We attribute this to the higher token frequencies of a base 10 system. Additionally, we reveal extrapolation behavior patterns on addition and multiplication, and identify that base 100 and base 1000 systems struggle with token-level discernment and token-level operations. We also shed light on the mechanism the models learn.
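To make the base-10 vs. base-1000 framing concrete, here is a minimal illustrative sketch (not the authors' code, and not any particular tokenizer's actual behavior) of how the same integer yields different token sequences under single-digit versus up-to-3-digit chunking; real BPE tokenizers such as Pythia's may place chunk boundaries differently.

```python
# Illustrative sketch: one number, two tokenization schemes.
# Scheme 1 corresponds to a base-10 numeral system (one token per digit);
# scheme 2 corresponds roughly to base 10^3 (one token per group of up to 3 digits).

def tokenize_base10(number: str) -> list[str]:
    """One token per digit."""
    return list(number)

def tokenize_base1000(number: str) -> list[str]:
    """Groups of up to 3 digits, grouped from the right as in standard
    thousands notation. (Assumption for illustration: actual LLM tokenizers
    may chunk digits differently.)"""
    chunks = []
    while number:
        chunks.append(number[-3:])
        number = number[:-3]
    return chunks[::-1]

n = "1234567"
print(tokenize_base10(n))    # ['1', '2', '3', '4', '5', '6', '7']  -> 7 tokens
print(tokenize_base1000(n))  # ['1', '234', '567']                  -> 3 tokens
```

The base-1000 scheme produces shorter sequences, but each token must cover up to 1000 distinct values, so individual tokens appear far less frequently in training data; this is the token-frequency effect the abstract points to when explaining why the base-10 system is more data-efficient.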
