OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

.airstrike

Wise, Aged Ars Veteran
113
"I still have trouble defining 'reasoning' in terms of LLM capabilities. I’d be interested in finding a prompt which fails on current models but succeeds on strawberry that helps demonstrate the meaning of that term."

That is literally in their announcement post under "Chain-of-Thought"

https://openai.com/index/learning-to-reason-with-llms/
 
Upvote
25 (27 / -2)

coopster

Ars Centurion
238
Subscriptor
A pattern-matching, probabilistic output machine can't "reason" or "think", and it sounds suspiciously like they're pretty much just automating the prompt-engineering garbage that people go through to get an improved response.

"Spends time thinking" means running it again but saying "this time work harder"

150 Billion....
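
If that characterization is accurate, the whole trick is not much more than a loop like this sketch (purely illustrative; `complete` is a hypothetical stand-in for a single model call, not any documented OpenAI API):

```python
# Hypothetical sketch of "spends time thinking" as naive re-prompting.
# `complete(prompt)` stands in for one model call; it is not a real API.
def think_harder(complete, question: str, rounds: int = 3) -> str:
    answer = complete(question)
    for _ in range(rounds):
        answer = complete(
            f"Question: {question}\n"
            f"Your previous answer: {answer}\n"
            "That may be wrong. Think step by step and work harder this time."
        )
    return answer
```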
 
Upvote
46 (88 / -42)

Little-Zen

Ars Praefectus
3,168
Subscriptor
man. That was such an easy example to give people when they asked me why they shouldn’t just “trust what the computer says”. I could go on and on about hallucinations, how they’d make up court cases and explain weights and probabilities and training data sets, but it wouldn’t click.

But this worked, every time:

“Look man, you can count the letters in this word for yourself. If it can’t even get that right, why would you trust it to give you correct reference information?”

If it can actually do it now I’m gonna need to find a different example.
 
Upvote
114 (121 / -7)
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work.

I found it works to tell it to check its tokenization. This is from three weeks ago. Well, it sort of worked...
 
Upvote
113 (114 / -1)

Snark218

Ars Legatus Legionis
32,581
Subscriptor
Simon Willison tweeted in response to a Bloomberg story about Strawberry, "I still have trouble defining 'reasoning' in terms of LLM capabilities. I’d be interested in finding a prompt which fails on current models but succeeds on strawberry that helps demonstrate the meaning of that term."
I suspect he will be searching for a very long time.
 
Upvote
-1 (7 / -8)

.airstrike

Wise, Aged Ars Veteran
113
Not impressed. Anyone can build an AutoGPT-like solution that brute forces the model into progressively producing better outputs. I know because I built the same thing with off-the-shelf Llama 2 to stop hallucinations when indexing and reading ebooks. All you need is a GPU of your own to run inference, and you can hit it all day long for free.

https://www.hackster.io/mrmlcdelgado/pytldr-317c1d

This reads like that famous Dropbox comment...

"I have a few qualms with this app:

1. For a Linux user, you can already build such a system yourself quite trivially by getting an FTP account, mounting it locally with curlftpfs, and then using SVN or CVS on the mounted filesystem. From Windows or Mac, this FTP account could be accessed through built-in software.

2. It doesn't actually replace a USB drive. Most people I know e-mail files to themselves or host them somewhere online to be able to perform presentations, but they still carry a USB drive in case there are connectivity problems. This does not solve the connectivity issue.

3. It does not seem very "viral" or income-generating. I know this is premature at this point, but without charging users for the service, is it reasonable to expect to make money off of this?"
 
Upvote
76 (79 / -3)

agt499

Ars Tribunus Militum
2,029
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work.

When I showed this to a colleague last week, ChatGPT got strawberry = 3 r's right off the bat. I suspect they had 'tuned' the model directly after the recent coverage (aka hard-coded the answer), as raspberry was still returning a count between 1 and 4.
 
Upvote
79 (80 / -1)

GaggiX

Smack-Fu Master, in training
44
They should get their ass in gear and update DALL-E. It makes the same "strawberry mistakes" as it did a year ago, while having twice as many "content restrictions" when generating images. Thanks Disney. :-/
Just switch to Ideogram; it works really well. Not everything should come from OpenAI.
 
Upvote
-3 (4 / -7)

agt499

Ars Tribunus Militum
2,029
man. That was such an easy example to give people when they asked me why they shouldn’t just “trust what the computer says”. I could go on and on about hallucinations, how they’d make up court cases and explain weights and probabilities and training data sets, but it wouldn’t click.

But this worked, every time:

“Look man, you can count the letters in this word for yourself. If it can’t even get that right, why would you trust it to give you correct reference information?”

If it can actually do it now I’m gonna need to find a different example.
Just try any different word at all - if it works for strawberry now, they probably hard-coded it.
 
Upvote
33 (38 / -5)

poltroon

Ars Tribunus Militum
1,673
Subscriptor
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
 
Upvote
34 (44 / -10)

Dachannien

Ars Scholae Palatinae
995
Subscriptor
The problem now, though, is that the well on the "strawberry" sample problem is tainted. There are numerous posts on Reddit and elsewhere, likely incorporated into more recent training sets, where people make fun of previous ChatGPTs for getting this question wrong. New models don't need to do any kind of reasoning in "letter space", because they will already have plenty of information in "language space" where people have associated the answer 2 with negative connotations and 3 with positive ones. With that data now being part of the training set, relying on that question as evidence of improvement is a classic example of letting your training set leak into your validation or test sets.

So the real question is: given an arbitrary sequence of characters - or better yet, an arbitrary sequence of multi-letter tokens - can it still correctly count the occurrences of a given letter in general?
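
For anyone who wants to poke at the "letter space" vs. "language space" split directly, here's a minimal sketch (assuming the tiktoken package is installed) that prints the multi-letter chunks a GPT-3.5/GPT-4-era tokenizer actually hands to the model, next to the trivial character-level count:

```python
# Minimal sketch: compare what the model "sees" (multi-letter tokens) with a
# plain character-level count. Assumes the tiktoken package is installed.
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # tokenizer used by GPT-3.5/GPT-4-era models
word = "strawberry"

token_ids = enc.encode(word)
pieces = [enc.decode([t]) for t in token_ids]
print(pieces)           # the word arrives as a few multi-letter chunks, not letters
print(word.count("r"))  # counting in letter space is a one-liner: 3
```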
 
Upvote
58 (61 / -3)
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
We won't get an LLM like that because that's not how LLMs operate - it has no concept of tries or probability of accuracy.
 
Upvote
22 (31 / -9)

noxylophone

Wise, Aged Ars Veteran
122
Can someone with GPT plus ask Strawberry
"how many rs are in the red fruit that is commonly used for daiquiris"

The red fruit commonly used for daiquiris is the strawberry. The word "strawberry" contains two letter "r"s. Therefore, there are 2 "r"s in the word "strawberry".

I asked it a second time and it responded:

The red fruit commonly used for daiquiris is strawberry. Counting the number of 'r's in "strawberry":

s t r a w b e r r y

There are three 'r's in "strawberry".

Answer: 3
 
Upvote
44 (44 / 0)
Not impressed. Anyone can build an AutoGPT-like solution that brute forces the model into progressively producing better outputs. I know because I built the same thing with off-the-shelf Llama 2 to stop hallucinations when indexing and reading ebooks. All you need is a GPU of your own to run inference, and you can hit it all day long for free.

https://www.hackster.io/mrmlcdelgado/pytldr-317c1d
An LLM chat bot that doesn't hallucinate? Sounds too good to be true.
 
Upvote
6 (9 / -3)

Mustachioed Copy Cat

Ars Praefectus
4,791
Subscriptor++
We won't get an LLM like that because that's not how LLMs operate - it has no concept of tries or probability of accuracy.
Error correcting through repetition / reproducibility should be part of the service, even if it isn’t a native aspect of the LLM part of that service.
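
Something along these lines, as a rough sketch of repetition as error correction (where `ask_model` is a hypothetical stand-in for whatever call the service makes, not a real client library):

```python
# Rough sketch of repetition as error correction: ask the same question several
# times and keep the most common answer. `ask_model` is a hypothetical callable.
from collections import Counter

def majority_answer(ask_model, prompt: str, n: int = 5) -> str:
    answers = [ask_model(prompt) for _ in range(n)]
    return Counter(answers).most_common(1)[0][0]
```

Majority voting across runs won't fix a systematic bias, but it would catch the coin-flip style errors people keep screenshotting.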
 
Upvote
4 (4 / 0)

flyingtoastr

Smack-Fu Master, in training
80
I won't be surprised if we find out there's a "data center" in India that's actually 200 guys Mechanical Turking it

I work in the electricity sector, and with how much power these guys are gobbling up daily for their data centers that they're just plopping down willy-nilly, they're definitely doing something.

What that something is, I have no idea. Seems wild how many gigawatts of electricity they're jacking just to run fancy predictive text.
 
Upvote
44 (49 / -5)

forkspoon

Ars Scholae Palatinae
687
Subscriptor++
Just after the OpenAI o1 announcement, Hugging Face CEO Clement Delangue wrote, "Once again, an AI system is not 'thinking', it's 'processing', 'running predictions',...

I appreciate what he’s getting at, but are humans not doing such things? We can catch a ball fairly reliably before learning calculus explicitly.

The crux seems to be that it’s difficult to comprehensively itemize differences between human thinking and state of the art AI… processes(?).

IMO that difficulty goes a long way to excusing non-experts who call it “thinking.”
 
Upvote
-4 (17 / -21)

agt499

Ars Tribunus Militum
2,029
Three r's in bookkeeper. It's sure.
Given that it's quite good at generating code that almost works, I do wonder about asking it to "write a python function to count the number of letters in a word", and then asking it to evaluate the output (or is the latter not feasible?).
Of course the python code will look right but not actually work...
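
For what it's worth, a hand-written version of the function being described really is a one-liner, which makes the "looks right but doesn't work" failure mode easy to check against:

```python
# The function being described, written by hand: count how many times a given
# letter appears in a word (case-insensitive). Nothing model-generated here.
def count_letter(word: str, letter: str) -> int:
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # 3
print(count_letter("bookkeeper", "r"))  # 1
```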
 
Upvote
10 (10 / 0)

Ninja Puffin

Ars Centurion
266
Subscriptor
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
The root cause is that it isn't thinking or reasoning at all. The language model is just a really big statistical analysis of which words get used together. It determines which words are most likely to appear next, then picks randomly using the probabilities generated by the language model. Also, it doesn't look like they've even patched that type of question; they just hard-coded an answer for that particular one.
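
To make that concrete, here's a toy sketch of the sampling step being described - the numbers are completely made up for illustration, not taken from any real model:

```python
# Toy sketch of picking the next token at random from a probability distribution.
# The distribution below is invented for illustration only.
import random

next_token_probs = {"two": 0.55, "three": 0.35, "four": 0.10}
tokens, weights = zip(*next_token_probs.items())
print(random.choices(tokens, weights=weights, k=1)[0])  # usually "two", sometimes not
```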
 
Upvote
22 (32 / -10)
I appreciate what he’s getting at, but are humans not doing such things? We can catch a ball fairly reliably before learning calculus explicitly.

The crux seems to be that it’s difficult to comprehensively itemize differences between human thinking and state of the art AI… processes(?).

IMO that difficulty goes a long way to excusing non-experts who call it “thinking.”
In order for a human to learn not to do something, they don't need to re-learn literally everything they have ever experienced in life. These algorithms do. Algorithms do not think or do anything close to thinking. They just predict things, the same way you can draw a line across data points and predict values that aren't there. The dataset and the complexity of what they're doing have just increased.
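
The "draw a line across data points" analogy, in code (a tiny sketch with made-up numbers, assuming NumPy is available):

```python
# Tiny sketch of the curve-fitting analogy: fit a line to a few observed points
# and predict a value that was never in the data. The numbers are made up.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.1, 3.9, 6.2, 7.8])      # roughly y = 2x
slope, intercept = np.polyfit(x, y, 1)  # least-squares line fit
print(slope * 5.0 + intercept)          # prediction at x = 5, close to 10
```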
 
Upvote
4 (15 / -11)

kinpin

Ars Tribunus Militum
1,633
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work.

This is interesting: I see ChatGPT gets it wrong, but both Copilot and Gemini get it right.
 

Upvote
0 (2 / -2)
The strawberry problem is so interesting to me because it's a question that is honestly completely irrelevant to any valuable use of the tool. It (this specific output) only has to be solved for PR/marketing reasons.

Have they really gotten at the root cause, or is this just a patch that solves this particular kind of question? How can you generalize finding this mistake, and when multiple methods get different answers, how do you pick the correct one in a general way?

It seems like we might benefit from an LLM with some humility, one that says, "I tried these three ways and I think this answer is the most likely, but also consider whether it might be one of these other answers."
Given that they train the models on social media content, and that this has been so widely discussed lately, any new models going forward are probably going to recognize the question and respond, "Three, and I see what you're doing there."
 
Upvote
30 (30 / 0)