OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

.airstrike

Wise, Aged Ars Veteran
113
"I still have trouble defining 'reasoning' in terms of LLM capabilities. I’d be interested in finding a prompt which fails on current models but succeeds on strawberry that helps demonstrate the meaning of that term."

That is literally in their announcement post under "Chain-of-Thought"

https://openai.com/index/learning-to-reason-with-llms/
 
Upvote
25 (27 / -2)

coopster

Ars Centurion
238
Subscriptor
A pattern-matching, probabilistic output machine can't "reason" or "think", and it sounds suspiciously like they're pretty much just automating the prompt-engineering garbage that people go through to get an improved response.

"Spends time thinking" means running it again but saying "this time work harder"

150 Billion....
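
If that characterization is accurate, the whole trick is not much more than a loop like this sketch (purely illustrative; `complete` is a hypothetical stand-in for a single model call, not any documented OpenAI API):

```python
# Hypothetical sketch of "spends time thinking" as naive re-prompting.
# `complete(prompt)` stands in for one model call; it is not a real API.
def think_harder(complete, question: str, rounds: int = 3) -> str:
    answer = complete(question)
    for _ in range(rounds):
        answer = complete(
            f"Question: {question}\n"
            f"Your previous answer: {answer}\n"
            "That may be wrong. Think step by step and work harder this time."
        )
    return answer
```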
 
Upvote
46 (88 / -42)

Little-Zen

Ars Praefectus
3,168
Subscriptor
man. That was such an easy example to give people when they asked me why they shouldn’t just “trust what the computer says”. I could go on and on about hallucinations, how they’d make up court cases and explain weights and probabilities and training data sets, but it wouldn’t click.

But this worked, every time:

“Look man, you can count the letters in this word for yourself. If it can’t even get that right, why would you trust it to give you correct reference information?”

If it can actually do it now I’m gonna need to find a different example.
 
Upvote
114 (121 / -7)
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work.

I found it works to tell it to check its tokenization. This is from three weeks ago. Well, it sort of worked...
 
Upvote
113 (114 / -1)

Snark218

Ars Legatus Legionis
32,581
Subscriptor
Simon Willison tweeted in response to a Bloomberg story about Strawberry, "I still have trouble defining 'reasoning' in terms of LLM capabilities. I’d be interested in finding a prompt which fails on current models but succeeds on strawberry that helps demonstrate the meaning of that term."
I suspect he will be searching for a very long time.
 
Upvote
-1 (7 / -8)

.airstrike

Wise, Aged Ars Veteran
113
Not impressed. Anyone can build an AutoGPT-like solution that brute forces the model into progressively producing better outputs. I know because I built the same thing with off-the-shelf Llama 2 to stop hallucinations when indexing and reading ebooks. All you need is a GPU of your own to run inference, and you can hit it all day long for free.

https://www.hackster.io/mrmlcdelgado/pytldr-317c1d

This reads like that famous Dropbox comment...

"I have a few qualms with this app:

1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.

3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?"
 
Upvote
76 (79 / -3)

agt499

Ars Tribunus Militum
2,029
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work.

When I showed this to a colleague last week, ChatGPT got strawberry = 3 r's right off the bat. I suspect they had 'tuned' the model directly after the recent coverage (aka hard-coded the answer), as raspberry was still returning a count between 1 and 4.
 
Upvote
79 (80 / -1)

GaggiX

Smack-Fu Master, in training
44
They should get their ass in gear and update DALL-E. It makes the same "strawberry mistakes" as it did a year ago, while having twice as many "content restrictions" when generating images. Thanks Disney. :-/
Just switch to Ideogram; it works really well. Not everything should come from OpenAI.
 
Upvote
-3 (4 / -7)

agt499

Ars Tribunus Militum
2,029
man. That was such an easy example to give people when they asked me why they shouldn’t just “trust what the computer says”. I could go on and on about hallucinations, how they’d make up court cases and explain weights and probabilities and training data sets, but it wouldn’t click.

But this worked, every time:

“Look man, you can count the letters in this word for yourself. If it can’t even get that right, why would you trust it to give you correct reference information?”

If it can actually do it now I’m gonna need to find a different example.
Just try any different word at all - if it works for strawberry now, they probably hard-coded it.
 
Upvote
33 (38 / -5)

poltroon

Ars Tribunus Militum
1,673
Subscriptor
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
 
Upvote
34 (44 / -10)

Dachannien

Ars Scholae Palatinae
995
Subscriptor
The problem now, though, is that the well on the "strawberry" sample problem is tainted. There are numerous posts on Reddit and elsewhere, likely incorporated into more recent training sets, where people make fun of previous ChatGPTs for getting this question wrong. New models don't need to do any kind of reasoning in "letter space", because they will already have plenty of information in "language space" where people have associated the answer 2 with negative connotations and 3 with positive ones. With that data now being part of the training set, relying on that question as evidence of improvement is a classic example of letting your training set leak into your validation or test sets.

So the real question is: given an arbitrary sequence of characters - or better yet, an arbitrary sequence of multi-letter tokens - can it still correctly count the occurrences of a given letter in general?
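
For anyone who wants to poke at the "letter space" vs. "language space" split directly, here's a minimal sketch (assuming the tiktoken package is installed) that prints the multi-letter chunks a GPT-3.5/GPT-4-era tokenizer actually hands to the model, next to the trivial character-level count:

```python
# Minimal sketch: compare what the model "sees" (multi-letter tokens) with a
# plain character-level count. Assumes the tiktoken package is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4-era models
word = "strawberry"

token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)           # the word arrives as a few multi-letter chunks, not letters
print(word.count("r"))  # counting in letter space is a one-liner: 3
```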
 
Upvote
58 (61 / -3)
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
We won't get an LLM like that because that's not how LLMs operate - it has no concept of tries or probability of accuracy.
 
Upvote
22 (31 / -9)

noxylophone

Wise, Aged Ars Veteran
122
Can someone with GPT plus ask Strawberry
"how many rs are in the red fruit that is commonly used for daiquiris"

The red fruit commonly used for daiquiris is the strawberry. The word "strawberry" contains two letter "r"s. Therefore, there are 2 "r"s in the word "strawberry".

I asked it a second time and it responded:

The red fruit commonly used for daiquiris is strawberry. Counting the number of 'r's in "strawberry":

s t r a w b e r r y

There are three 'r's in "strawberry".

Answer: 3
 
Upvote
44 (44 / 0)
Not impressed. Anyone can build an AutoGPT-like solution that brute forces the model into progressively producing better outputs. I know because I built the same thing with off-the-shelf Llama 2 to stop hallucinations when indexing and reading ebooks. All you need is a GPU of your own to run inference, and you can hit it all day long for free.

https://www.hackster.io/mrmlcdelgado/pytldr-317c1d
An LLM chat bot that doesn't hallucinate? Sounds too good to be true.
 
Upvote
6 (9 / -3)

Mustachioed Copy Cat

Ars Praefectus
4,791
Subscriptor++
We won't get an LLM like that because that's not how LLMs operate - it has no concept of tries or probability of accuracy.
Error correcting through repetition / reproducibility should be part of the service, even if it isn’t a native aspect of the LLM part of that service.
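
Something along these lines, as a rough sketch of repetition as error correction (where `ask_model` is a hypothetical stand-in for whatever call the service makes, not a real client library):

```python
# Rough sketch of repetition as error correction: ask the same question several
# times and keep the most common answer. `ask_model` is a hypothetical callable.
from collections import Counter

def majority_answer(ask_model, prompt: str, n: int = 5) -> str:
    answers = [ask_model(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Majority voting across runs won't fix a systematic bias, but it would catch the coin-flip style errors people keep screenshotting.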
 
Upvote
4 (4 / 0)

flyingtoastr

Smack-Fu Master, in training
80
I won't be surprised if we find out there's a "data center" in India that's actually 200 guys Mechanical Turking it

I work in the electricity sector, and with how much power these guys are gobbling up daily for their data centers that they're just plopping down willy-nilly, they're definitely doing something.

What that something is, I have no idea. Seems wild how many gigawatts of electricity they're jacking just to run fancy predictive text.
 
Upvote
44 (49 / -5)

forkspoon

Ars Scholae Palatinae
687
Subscriptor++
Just after the OpenAI o1 announcement, Hugging Face CEO Clement Delangue wrote, "Once again, an AI system is not 'thinking', it's 'processing', 'running predictions',...

I appreciate what he’s getting at, but are humans not doing such things? We can catch a ball fairly reliably before learning calculus explicitly.

The crux seems to be that it’s difficult to comprehensively itemize differences between human thinking and state of the art AI… processes(?).

IMO that difficulty goes a long way to excusing non-experts who call it “thinking.”
 
Upvote
-4 (17 / -21)

agt499

Ars Tribunus Militum
2,029
Three r's in bookkeeper. It's sure.
Given that it's quite good at generating code that almost works, I do wonder about asking it to "write a python function to count the number of letters in a word", and then asking it to evaluate the output (or is the latter not feasible?).
Of course the python code will look right but not actually work...
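
For what it's worth, a hand-written version of the function being described really is a one-liner, which makes the "looks right but doesn't work" failure mode easy to check against:

```python
# The function being described, written by hand: count how many times a given
# letter appears in a word (case-insensitive). Nothing model-generated here.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
print(count_letter("bookkeeper", "r"))  # 1
```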
 
Upvote
10 (10 / 0)

Ninja Puffin

Ars Centurion
266
Subscriptor
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
The root cause is that it isn't thinking or reasoning at all. The language model is just a really big statistical analysis of which words get used together. It determines which words are most likely to appear next, then picks randomly using the probabilities generated by the language model. Also, it doesn't look like they've even patched that type of question; they just hard-coded an answer for that particular one.
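
To make that concrete, here's a toy sketch of the sampling step being described - the numbers are completely made up for illustration, not taken from any real model:

```python
# Toy sketch of picking the next token at random from a probability distribution.
# The distribution below is invented for illustration only.
import random

next_token_probs = {"two": 0.55, "three": 0.35, "four": 0.10}
tokens, weights = zip(*next_token_probs.items())
print(random.choices(tokens, weights=weights, k=1)[0])  # usually "two", sometimes not
```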
 
Upvote
22 (32 / -10)
I appreciate what he’s getting at, but are humans not doing such things? We can catch a ball fairly reliably before learning calculus explicitly.

The crux seems to be that it’s difficult to comprehensively itemize differences between human thinking and state of the art AI… processes(?).

IMO that difficulty goes a long way to excusing non-experts who call it “thinking.”
In order for a human to learn not to do something, they don't need to re-learn literally everything they have ever experienced in life. These algorithms do. Algorithms do not think or do anything close to thinking. They just predict things, the same way you can draw a line across data points and predict values that aren't there. The dataset and the complexity of what they're doing have just increased.
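
The "draw a line across data points" analogy, in code (a tiny sketch with made-up numbers, assuming NumPy is available):

```python
# Tiny sketch of the curve-fitting analogy: fit a line to a few observed points
# and predict a value that was never in the data. The numbers are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])      # roughly y = 2x
slope, intercept = np.polyfit(x, y, 1)  # least-squares line fit
print(slope * 5.0 + intercept)          # prediction at x = 5, close to 10
```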
 
Upvote
4 (15 / -11)

kinpin

Ars Tribunus Militum
1,633
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work.

This is interesting: I see ChatGPT gets it wrong, but both Copilot and Gemini get it right.
 

Upvote
0 (2 / -2)
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
Given that they train the models on social media content, and that this has been so widely discussed lately, any new models going forward are probably going to recognize the question and respond, "Three, and I see what you're doing there."
 
Upvote
30 (30 / 0)