Just asked o1 this question...
9.11 and 9.9, which number is larger?
Very interesting response from o1-mini.
It self-corrected midway.
While the first half of o1-mini's answer was wrong, it (unusually) self-corrected midway through its first response in this brand-new blank conversation. It was correct at the end, and it also discreetly
brought up version-number context (an unexpected self-excuse by o1-mini) -- in a software version-number context, 9.11 is bigger than 9.9, since version 9.11.0 is newer than 9.9.0. So that's probably a bit more "astute" than other GPTs.
MacOS X went from 10.9 to 10.10 to 10.11 after all.
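The two readings are easy to make concrete. A quick Python sketch (a toy illustration of the ambiguity, not a claim about o1's internals):

```python
# Toy illustration of the two readings of "9.11 vs 9.9".
# As decimal numbers, 9.9 is larger; as version numbers, 9.11 is larger.

def decimal_compare(a: str, b: str) -> str:
    """Compare as ordinary decimal numbers."""
    return a if float(a) > float(b) else b

def version_compare(a: str, b: str) -> str:
    """Compare as dotted version numbers, component by component."""
    ta = tuple(int(part) for part in a.split("."))
    tb = tuple(int(part) for part in b.split("."))
    return a if ta > tb else b

print(decimal_compare("9.11", "9.9"))   # -> 9.9   (math reading)
print(version_compare("9.11", "9.9"))   # -> 9.11  (versioning reading)
print(version_compare("10.10", "10.9")) # -> 10.10 (the MacOS X case)
```

Same two strings, opposite answers, depending purely on which convention you assume -- which is exactly the ambiguity the model stumbled into.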
YMMV (answers vary every time you ask; I'm not going to waste o1 credits, since it's only 30 queries/week for o1 and 50 queries/week for o1-mini).
But its training set was probably internally contaminated by tokenization behaviors (good point, ibad) AND version-vs-math contexts (good point, o1-mini).
Conversely, I would surmise that LLM training priorities were also more programming-heavy than math-heavy, amplifying weights toward programmer version-numbering behaviors that compound on the tokenization behaviors to create this interesting little mis-recall.
Unlike the 10+ different numeral systems (including non-Arabic ones), there's only one Python language (with minor variances across its several versions). So LLM weights will be unusually amplified by the consistency of programming languages compared to the grammar quirks of English -- and even compared to mathematics in a base-10 system invented from human finger counting, plus systems covered more superficially (e.g. Roman numerals). Possibly over-strong weights on programming may have leaked version-numbering-priority behavior over math-priority behavior in numeric comparison.
Humans who are trained on Arabic numerals, when suddenly asked questions in a different numeric language, often make similar errors too.
To an LLM, common Arabic numerals are probably more "foreign" to its architecture than Roman numerals are to the average Ars commentariat.
Imagine having to do Roman numerals, quiz-league style (3-second timer), forced to quickly recall whether one specific Roman numeral is greater than another.
It seems we are asking an LLM a question even harder than "Is MMLXVI smaller or bigger than MCMLXXXVIII?" when it comes to the foreignness of an arbitrary human number system that crosswires with programming weights and other-language weights -- in a situation where you never knew Roman numerals were going to be part of the time-limited pop quiz. Decimal Arabic numerals aren't LLMs' "native" language, after all -- and the training set has to weight toward the corpus of the entirety of humankind's knowledge, under pressure to make LLMs "smarter" with the fewest parameter counts.
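For the record, that quiz question does have a definite answer. A minimal Roman-numeral parser (assuming standard subtractive notation) settles it:

```python
# Minimal Roman-numeral parser, just to check the quiz question above.
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}

def roman_to_int(s: str) -> int:
    total = 0
    for i, ch in enumerate(s):
        value = ROMAN[ch]
        # Subtractive notation: a smaller value before a larger one
        # is subtracted (e.g. CM = 900, IX = 9).
        if i + 1 < len(s) and ROMAN[s[i + 1]] > value:
            total -= value
        else:
            total += value
    return total

print(roman_to_int("MMLXVI"))       # -> 2066
print(roman_to_int("MCMLXXXVIII"))  # -> 1988
```

So MMLXVI (2066) is the bigger one -- and notice that even checking this by hand takes most of us longer than 3 seconds.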
The optimized variants of the GPT-4 series still have fewer parameters than the synapse equivalents of a house mouse (~1T), so necessarily, this is a gigantic optimization job to maximally simulate "intelligence" in the minimum possible silicon (parameter count, tokenization system, bits per weight, etc.). It's astounding we got "this" much "intelligence" at less than 2T parameters total (across all of its multimodal models), and as a distilled/optimized descendant of GPT-4, the o1 series probably has approximately 1/10th of that (~200B), in order to make its "thinking" loop more efficient.
When I think this through, it makes sense that it's "trying" to do its best with the "compromise" of its LLM weights -- weights being a de facto crosswiring between programmer version numbers, between different languages of numbers, and between tokenization quirks. To an LLM, Arabic numerals are just one of 10+ vaguely familiar numeric systems to a best-effort polyglot that may screw up on some languages.
Even within the same numeric language (Arabic numerals), it is not inconceivable that programmers may cross-wire their version-numbering rote memory with their mathematical rote memory when task-switching rapidly between mathematical-greaters and versioning-greaters multiple times per hour. I've even b0rk3d up a version-number-comparison function ("is version X.Y greater than X.Z?") in C/C++ by accident because I was programming out of my rote memory (Level 1 behavior), but realized my mistake quickly once I saw the debug console output.
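That pitfall is easy to reproduce. A toy Python sketch of the same class of bug (a hypothetical illustration, not my original C/C++ code): comparing version strings directly falls back on lexicographic ordering, which gets "9.11" vs "9.9" wrong.

```python
def is_newer_buggy(a: str, b: str) -> bool:
    # Rote-memory mistake: plain string comparison is lexicographic,
    # so "9.11" < "9.9" because '1' sorts before '9'.
    return a > b

def is_newer_fixed(a: str, b: str) -> bool:
    # Correct: compare dotted versions component by component as integers.
    return tuple(map(int, a.split("."))) > tuple(map(int, b.split(".")))

print(is_newer_buggy("9.11", "9.9"))  # False -- the bug
print(is_newer_fixed("9.11", "9.9"))  # True
```

The buggy version looks plausible at a glance, which is exactly why it survives until the debug console says otherwise.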
At least, that particular hallucination is more like a "mis-recall" that is (mostly) explainable from two very clear fronts -- (1) tokenization and (2) math-vs-versioning -- plus a third probable contributory front: (3) Arabic numerals aren't LLMs' native language (just as Roman numerals generally aren't ours anymore).