Just asked o1 this question...
9.11 and 9.9, which number is larger?
Very interesting response from o1-mini.
It self-corrected midway.
While the first half of o1-mini's answer was wrong, it (unusually) self-corrected midway through its first response in this brand-new blank conversation. It was correct at the end, and it also discreetly
brought up version-number context (an unexpected self-excuse by o1-mini) -- in a software version-number context, 9.11 is bigger than 9.9, since version 9.11.0 is newer than 9.9.0. So that's probably a bit more "astute" than other GPTs.
MacOS X went from 10.9 to 10.10 to 10.11 after all.
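The two readings are easy to make concrete. A quick Python sketch (a toy illustration of the ambiguity, not a claim about o1's internals):

```python
# Toy illustration of the two readings of "9.11 vs 9.9".
# As decimal numbers, 9.9 is larger; as version numbers, 9.11 is larger.

def decimal_compare(a: str, b: str) -> str:
    """Compare as ordinary decimal numbers."""
    return a if float(a) > float(b) else b

def version_compare(a: str, b: str) -> str:
    """Compare as dotted version numbers, component by component."""
    ta = tuple(int(part) for part in a.split("."))
    tb = tuple(int(part) for part in b.split("."))
    return a if ta > tb else b

print(decimal_compare("9.11", "9.9"))   # -> 9.9   (math reading)
print(version_compare("9.11", "9.9"))   # -> 9.11  (versioning reading)
print(version_compare("10.10", "10.9")) # -> 10.10 (the MacOS X case)
```

Same two strings, opposite answers, depending purely on which convention you assume -- which is exactly the ambiguity the model stumbled into.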
YMMV (answers vary every time you ask; I'm not going to waste o1 credits, since it's only 30 queries/week for o1 and 50 queries/week for o1-mini).
But its training set was probably internally contaminated by tokenization behaviors (good point, ibad) AND version-vs-math contexts (good point, o1-mini).
Conversely, I would surmise that LLM training priorities were also more programming-heavy than math-heavy, amplifying weights toward programmer version-numbering behaviors that compound on the tokenization behaviors to create this interesting little mis-recall.
Unlike the 10+ different numeral systems (including non-Arabic ones), there's only one Python language (with minor variances across its several versions). So LLM weights will be unusually amplified by the consistency of programming languages compared to the grammar quirks of English -- and even compared to mathematics in a base-10 system invented from human finger counting, plus systems covered more superficially (e.g. Roman numerals). Possibly over-strong weights on programming may have leaked version-numbering-priority behavior over math-priority behavior in numeric comparison.
Humans who are trained on Arabic numerals, when suddenly asked questions in a different numeric language, often make similar errors too.
To an LLM, common Arabic numerals are probably more "foreign" to its architecture than Roman numerals are to the average Ars commentariat.
Imagine having to do Roman numerals, quiz-league style (3-second timer), forced to quickly recall whether one specific Roman numeral is greater than another.
It seems we are asking an LLM a question even harder than "Is MMLXVI smaller or bigger than MCMLXXXVIII?" when it comes to the foreignness of an arbitrary human number system that crosswires with programming weights and other-language weights -- in a situation where you never knew Roman numerals were going to be part of the time-limited pop quiz. Decimal Arabic numerals aren't LLMs' "native" language, after all -- and the training set has to weight toward the corpus of the entirety of humankind's knowledge, under pressure to make LLMs "smarter" with the fewest parameter counts.
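For the record, that quiz question does have a definite answer. A minimal Roman-numeral parser (assuming standard subtractive notation) settles it:

```python
# Minimal Roman-numeral parser, just to check the quiz question above.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s: str) -> int:
    total = 0
    for i, ch in enumerate(s):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one
        # is subtracted (e.g. CM = 900, IX = 9).
        if i + 1 < len(s) and ROMAN[s[i + 1]] > value:
            total -= value
        else:
            total += value
    return total

print(roman_to_int("MMLXVI"))       # -> 2066
print(roman_to_int("MCMLXXXVIII"))  # -> 1988
```

So MMLXVI (2066) is the bigger one -- and notice that even checking this by hand takes most of us longer than 3 seconds.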
The optimized variants of the GPT-4 series still have fewer parameters than the synapse equivalents of a house mouse (~1T), so necessarily, this is a gigantic optimization job to maximally simulate "intelligence" in the minimum possible silicon (parameter count, tokenization system, bits per weight, etc.). It's astounding we got "this" much "intelligence" at less than 2T parameters total (across all of its multimodal models), and as a distilled/optimized descendant of GPT-4, the o1 series probably has approximately 1/10th of that (~200B), in order to make its "thinking" loop more efficient.
When I think this through, it makes sense that it's "trying" to do its best with the "compromise" of its LLM weights -- weights being a de facto crosswiring between programmer version numbers, between different languages of numbers, and between tokenization quirks. To an LLM, Arabic numerals are just one of 10+ vaguely familiar numeric systems to a best-effort polyglot that may screw up on some languages.
Even within the same numeric language (Arabic numerals), it is not inconceivable that programmers may cross-wire their version-numbering rote memory with their mathematical rote memory when task-switching rapidly between mathematical-greaters and versioning-greaters multiple times per hour. I've even b0rk3d up a version-number-comparison function ("is version X.Y greater than X.Z?") in C/C++ by accident because I was programming out of my rote memory (Level 1 behavior), but realized my mistake quickly once I saw the debug console output.
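That pitfall is easy to reproduce. A toy Python sketch of the same class of bug (a hypothetical illustration, not my original C/C++ code): comparing version strings directly falls back on lexicographic ordering, which gets "9.11" vs "9.9" wrong.

```python
def is_newer_buggy(a: str, b: str) -> bool:
    # Rote-memory mistake: plain string comparison is lexicographic,
    # so "9.11" < "9.9" because '1' sorts before '9'.
    return a > b

def is_newer_fixed(a: str, b: str) -> bool:
    # Correct: compare dotted versions component by component as integers.
    return tuple(map(int, a.split("."))) > tuple(map(int, b.split(".")))

print(is_newer_buggy("9.11", "9.9"))  # False -- the bug
print(is_newer_fixed("9.11", "9.9"))  # True
```

The buggy version looks plausible at a glance, which is exactly why it survives until the debug console says otherwise.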
At least, that particular hallucination is more like a "mis-recall" that is (mostly) explainable from two very clear fronts -- (1) tokenization and (2) math-vs-versioning -- plus a third probable contributory front: (3) Arabic numerals aren't LLMs' native language (just as Roman numerals generally aren't ours anymore).