GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost.
Sounds like this was written by one of my old project managers, famous for nebulous KPIs.

"Everything is a little bit better and it's awesome," he wrote, "but also not exactly in ways that are trivial to point to."
"It is the first model that feels like talking to a thoughtful person to me," he wrote. He then added further down in his post, "a heads up: this isn't a reasoning model and won't crush benchmarks. it's a different kind of intelligence and there's a magic to it i haven't felt before."
OpenAI has likely known about diminishing returns in training LLMs for some time.
So we're officially at the "healing crystal woo" stage of selling AI.

Upon 4.5's release, OpenAI CEO Sam Altman did some expectation tempering on X, writing that the model is strong on vibes but low on analytical strength.
It sounds to me like what people said about GPT-4.0 vs GPT-3.5.

That sounds like what people said about Eliza, or anything that speaks in natural-sounding language.
What kind of intelligence is it? Street smarts? Can it survive on Skid Row?
Are you suggesting co-marketing with Goop? It would make a fitting bookend to the human story.
AI-powered vagina eggs - when you need to hold the future close.
"Whatever happened to those humans?"
It is indeed very easy to profit on something when you got it for free and then turned around and charged money for it.

He is wrong. People are profiting on serving DeepSeek R1 as well as distilled models.
“Once, men turned their thinking over to machines in the hope that this would set them free. But that only permitted other men with machines to enslave them.”
Oh, they delegated all cognitive effort to black boxes they didn't understand. The black boxes said "drink raw milk and kill urself, nerds", and, because humans made sure the electric rock's highest stat was charisma, they listened!
This would be funnier if we didn't just hand HHS over to RFK.
Incidentally, 'Zitron' means 'lemon' in German.

"An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."
Edward Zitron has been pounding the drum in 20K-word posts for a while now as well, basically saying AI:
has no viable consumer product
is currently losing money on each user, and loses more money for each new user added
has largely run out of new data to scrape
incurs massively increased costs for each new generation that do not notably improve the quality of results
has never produced results of a reasonable quality, and throwing more "compute" (as he calls it) at the problem only increases costs

His latest missive, for someone who has 35 minutes to blow: https://www.wheresyoured.at/wheres-the-money/?ref=ed-zitrons-wheres-your-ed-at-newsletter
What are the value and unit of measurement that describe one model giving massively better results than another?

There hasn't been a 'massive exponential increase in costs'. Indeed, often the opposite: for the same compute budget, you get massively better results.
Elo, benchmarks, reputation - and if you use them regularly, you find out for yourself.
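For what it's worth, the Elo scores mentioned here come from pairwise comparisons: leaderboards like LMSYS's Chatbot Arena show raters two anonymous model answers, ask which is better, and update each model's rating. A minimal sketch of the standard Elo update (the K-factor and starting ratings below are illustrative assumptions, not any leaderboard's actual parameters):

```python
def expected_score(r_a: float, r_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def elo_update(r_a: float, r_b: float, a_won: bool, k: float = 32.0):
    """Return both models' updated ratings after one head-to-head vote."""
    e_a = expected_score(r_a, r_b)           # expected win probability for A
    s_a = 1.0 if a_won else 0.0              # actual outcome for A
    return r_a + k * (s_a - e_a), r_b + k * ((1 - s_a) - (1 - e_a))

# Two models start equal; A wins one comparison and gains k/2 points.
a, b = elo_update(1000.0, 1000.0, a_won=True)
```

Note that the update conserves the sum of the two ratings, so Elo only says how models compare to one another, never how good any of them is in absolute terms - which is arguably the point of the question about units of measurement.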
The problem is they've already scraped almost all the world's data, including from TikTok. There is literally not enough remaining human-produced content in the entire world for the models to continue scaling up, which is why they're turning to synthetic data.

What's working in favor of China is TikTok et al. Chinese companies have a never-ending funnel of new data they can use to improve the system with zero cost and zero moral objections. Why bother with synthetic data? If it's any good, the Americans will use it, the Chinese can steal it (again... again) and augment all the other data they're being fed.
It's like the "Food Chain" from the Simpsons back in the 1990s. Except the human is Chinese AI companies and all the animals are the rest of the world they're stealing from.
Shorting Nvidia seems like the obvious move here.

I'm curious to see when the gravy train runs out of track. I imagine it will be similar to the dawn of the century: investors will eventually slow or stop the money train, companies will fold, the secondhand market will be flooded with high-end hardware so new startups and remaining competitors don't have to buy new, hardware companies will see sales go off a cliff, and pandemonium ensues.
Plus ça change, plus c'est la même chose. ("The more things change, the more they stay the same.")
There's a very important part of software tools that you're missing with this point: software can and will be treated as an unerring authority whenever it takes more than a second to realize it has produced an incorrect answer. People do not treat software tools like Doctor Jim, who sometimes has a long day, who sometimes misunderstands what you meant by "minor bleeding."

[...]
They are comparable to human experts at many tasks for a tiny fraction of the inference cost. People complain about hallucination, but hallucination rates are often lower than those of human experts on the same task (e.g., MDs doing a differential diagnosis). Progress continues as we devise new benchmark test sets that aren't saturated.
This may be true for 'public' data that anyone can get, but this data is a fraction of 'human-produced content', and it's the least useful, as it's full of garbage. The truly valuable and curated stuff is behind corporate walls, available to neither OpenAI nor any of their competitors.

As for synthetic data: that's just a fancy way of saying "we're feeding our bullshit machine the bullshit another bullshit machine produced."
Ya don’t say?

Satya Nadella says AI is yet to find a killer app that matches the combined impact of email and Excel.
Any time someone quotes a Rush song on Ars is a good day.