“It’s a lemon”—OpenAI’s largest AI model ever arrives to mixed reviews

"Strong on vibes and low on analytical sense"

Feels like a Freudian slip about how these people actually run their companies!

Data is ignored and execs do whatever they want based on vibes. Or data is simply fabricated to fit what executives ask for. Execs are probably easy to manipulate with fake data. After all, they don't have the competency to verify it, and they wouldn't ever spend the time bothering to learn how!

Edit: I know this isn't really related to the article, but I can't pass up the opportunity to shit on executive doublespeak. It's like it triggers some weird reflex in me, like how walking out into bright sun can make you sneeze.
 
Upvote
197 (213 / -16)

flattail

Ars Centurion
275
Subscriptor
"Everything is a little bit better and it's awesome," he wrote, "but also not exactly in ways that are trivial to point to."
Sounds like this was written by one of my old project managers, famous for nebulous KPIs.

I mean, a bit better and awesome. #TrustMeBro. What more could you want?
 
Upvote
147 (148 / -1)

MHStrawn

Ars Scholae Palatinae
1,248
Subscriptor
"An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."

Edward Zitron has been pounding the drum in 20K-word posts for a while now as well, basically saying AI:

  • has no viable consumer product;
  • is currently losing money on each user, and loses more money with each new user added;
  • has largely run out of new data to scrape;
  • sees massively increased costs for each new generation without notably improved results;
  • has never produced results of a reasonable quality, and throwing more "compute" (as he calls it) at the problem only increases costs.

His latest missive, for anyone with 35 minutes to spare: https://www.wheresyoured.at/wheres-the-money/?ref=ed-zitrons-wheres-your-ed-at-newsletter
 
Upvote
165 (174 / -9)

HMSTechnica

Smack-Fu Master, in training
76
"It is the first model that feels like talking to a thoughtful person to me," he wrote. He then added further down in his post, "a heads up: this isn't a reasoning model and won't crush benchmarks. it's a different kind of intelligence and there's a magic to it i haven't felt before."

That sounds like what people said about Eliza, or anything that speaks in natural-sounding language.

What kind of intelligence is it? Street smarts? Can it survive on Skid Row? Altman using nebulous terms and sprinkling in "magic" is blatant showmanship, hiding the fact that there's nothing behind the curtain.
 
Upvote
93 (93 / 0)

randomcat

Ars Tribunus Militum
3,368
OpenAI has likely known about diminishing returns in training LLMs for some time.

Maybe, but Altman continues to publicly insist that more compute resources = more better everything. Don't know if that's because engineers at OpenAI are sandbagging him with marginal workarounds or if he knows what's up and is lying through his teeth to get more investor cash.
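For context on the diminishing-returns point: published scaling laws model loss as a power law in compute, which means each constant drop in loss costs multiplicatively more. A toy sketch, where the constants `a` and `alpha` are made up for illustration and not any lab's actual fit:

```python
def loss(compute, a=10.0, alpha=0.05):
    """Toy power-law scaling: loss ~ a * C^(-alpha)."""
    return a * compute ** -alpha

# Each 10x jump in compute shaves off a shrinking absolute amount of loss.
for exp in range(20, 25):
    c = 10.0 ** exp
    print(f"1e{exp} FLOPs -> loss {loss(c):.3f}")
```

The gap between successive 10x compute jumps keeps shrinking, which is the "more compute = more better everything" pitch running straight into a power law.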
 
Upvote
64 (65 / -1)

Sajuuk

Ars Legatus Legionis
10,716
Upon 4.5's release, OpenAI CEO Sam Altman did some expectation tempering on X, writing that the model is strong on vibes but low on analytical strength. "It is the first model that feels like talking to a thoughtful person to me," he wrote. He then added further down in his post, "a heads up: this isn't a reasoning model and won't crush benchmarks. it's a different kind of intelligence and there's a magic to it i haven't felt before."
So we're officially at the "healing crystal woo" stage of selling AI.
 
Upvote
96 (99 / -3)

SomeChemist

Smack-Fu Master, in training
30
As someone who works in the sciences and not in programming or tech, I've found that LLMs are decent at translating documents (as you'd expect) and OK at giving executive summaries/30,000-ft. views of a topic. But the thing is, you can get those results from almost any of the LLMs, including the small, open-source ones you can run locally on any decent laptop, so you don't really need something like GPT-4.5. And realistically, the 30,000-ft. view of a topic is something a good Google search 10 years ago would have given you. I recently stumbled upon Ed Zitron, and it's hard to argue that he's wrong about the usefulness of LLMs.
 
Upvote
109 (113 / -4)
That sounds like what people said about Eliza, or anything that speaks in natural-sounding language.

What kind of intelligence is it? Street smarts? Can it survive on Skid Row?
It sounds to me like what people said about GPT-4.0 vs. GPT-3.5.

If you asked GPT-3.5 for advice, it could talk very naturally. But what it actually said was canned bullshit. Not just hallucinations or misunderstandings, but shallow. Like asking a stupid friend.

GPT-4.0, conversely, was capable of answering with some nuance. It didn't talk any more "naturally," but it might tell you something interesting that you hadn't considered: take a different angle, or connect some different ideas. Basically like a smart friend.

I hear the same in Karpathy's review. But who knows; with benchmarks this sucky, I'm suspicious.

Anyway, Chatbot Arena will tell us in a week.
 
Upvote
-5 (17 / -22)
"Everything is a little bit better and it's awesome," he wrote, "but also not exactly in ways that are trivial to point to."

I could say the same thing about the shit I took today compared to the one I took yesterday.

In my example and his example, you cannot verify the claim, but you can be assured that it's all shit.
 
Upvote
65 (69 / -4)

Sajuuk

Ars Legatus Legionis
10,716
Are you suggesting co-marketing with Goop? It would make a fitting bookend to the human story.
"Whatever happened to those humans?"

Oh, they delegated all cognitive effort to black boxes they didn't understand. The black boxes said "drink raw milk and kill urself, nerds", and, because humans made sure the electric rock's highest stat was charisma, they listened!
 
Upvote
52 (55 / -3)
He is wrong. People are profiting from serving DeepSeek R1 as well as distilled models.
It is indeed very easy to profit from something when you got it for free and then turned around and charged money for it.

The problem is there's no money in making new models, and making new models (that aren't just incremental fine-tunes of existing models) requires massive investment.
 
Upvote
72 (72 / 0)
"Whatever happened to those humans?"

Oh, they delegated all cognitive effort to black boxes they didn't understand. The black boxes said "drink raw milk and kill urself, nerds", and, because humans made sure the electric rock's highest stat was charisma, they listened!
“Once, men turned their thinking over to machines in the hope that this would set them free. But that only permitted other men with machines to enslave them.”

- Dune
 
Upvote
87 (89 / -2)
"Whatever happened to those humans?"

Oh, they delegated all cognitive effort to black boxes they didn't understand. The black boxes said "drink raw milk and kill urself, nerds", and, because humans made sure the electric rock's highest stat was charisma, they listened!
This would be funnier if we hadn't just handed HHS over to RFK.
 
Upvote
18 (22 / -4)

LG11

Ars Centurion
215
Subscriptor
"An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."

Edward Zitron has been pounding the drum in 20K-word posts for a while now as well, basically saying AI:

  • has no viable consumer product;
  • is currently losing money on each user, and loses more money with each new user added;
  • has largely run out of new data to scrape;
  • sees massively increased costs for each new generation without notably improved results;
  • has never produced results of a reasonable quality, and throwing more "compute" (as he calls it) at the problem only increases costs.

His latest missive, for anyone with 35 minutes to spare: https://www.wheresyoured.at/wheres-the-money/?ref=ed-zitrons-wheres-your-ed-at-newsletter
Incidentally, 'Zitron' means 'lemon' in German.

Yes, this was a bit off-topic, but perhaps in the spirit of AI saying irrelevant things ;)
 
Upvote
57 (58 / -1)

robco

Ars Scholae Palatinae
766
Subscriptor++
I'm curious to see when the gravy train runs out of track. I imagine it will look like the dot-com bust at the dawn of the century: investors will eventually slow or stop the money train, companies will fold, the secondhand market will be flooded with high-end hardware so new startups and remaining competitors don't have to buy new, hardware companies will see sales go off a cliff, and pandemonium will ensue.

Plus ça change, plus c'est la même chose.
 
Upvote
56 (57 / -1)

stk5

Ars Scholae Palatinae
753
Subscriptor++
There haven't been "massively exponential increased costs." Often it's the opposite: for the same compute budget, you get massively better results.
What are the value and unit of measurement that describe one model giving massively better results than another?
 
Upvote
16 (17 / -1)
What are the value and unit of measurement that describe one model giving massively better results than another?
Elo ratings, benchmarks, reputation, and, if you use them regularly, your own experience.

It depends a bit on what you're asking. If your request has an objective answer, or even a widely agreed-upon one, it's easy to see which AI is best.

If you want a chatbot to write you a poem or a story, you'll probably need to do your own testing, just because people have wildly different notions of what is "good."
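Since Elo keeps coming up: Chatbot Arena's leaderboard is built from pairwise human votes fed through an Elo-style rating update. A minimal sketch of the mechanism, where the k-factor and starting ratings are conventional defaults rather than Arena's exact configuration:

```python
def elo_update(r_a, r_b, winner, k=32):
    """Update two Elo ratings after one head-to-head vote.

    winner: 'a' or 'b'. Returns the new (r_a, r_b) pair.
    """
    # Expected score of A against B under the logistic Elo model.
    expected_a = 1 / (1 + 10 ** ((r_b - r_a) / 400))
    score_a = 1.0 if winner == "a" else 0.0
    delta = k * (score_a - expected_a)
    return r_a + delta, r_b - delta

# Two hypothetical models start at the same baseline; repeated wins
# for A pull the ratings apart while the total stays constant.
ra, rb = 1000.0, 1000.0
for _ in range(10):
    ra, rb = elo_update(ra, rb, "a")
print(round(ra), round(rb))
```

Each update is zero-sum, so what the leaderboard really encodes is relative preference among the models that were compared, not any absolute measure of quality.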
 
Upvote
-13 (6 / -19)
What's working in favor of China is TikTok et al. Chinese companies have a never-ending funnel of new data they can use to improve the system, with zero cost and zero moral objections. Why bother with synthetic data? If it's any good, the Americans will use it, the Chinese can steal it (again... again), and augment all the other data they're being fed.

It's like the "Food Chain" from The Simpsons back in the 1990s, except the human is Chinese AI companies and all the animals are the rest of the world they're stealing from.

The problem is they’ve already scraped almost all the world’s data, including from TikTok. There is literally not enough remaining human-produced content in the entire world for the models to continue scaling up, which is why they’re turning to synthetic data.

(Which is just a fancy way of saying “we’re feeding our bullshit machine the bullshit another bullshit machine produced”)
 
Upvote
63 (67 / -4)
I'm curious to see when the gravy train runs out of track. I imagine it will be similar to the dawn of the century: investors will eventually slow or stop the money train, companies will fold, the secondhand market will be flooded with high-end hardware so new startups and remaining competitors don't have to buy new, hardware companies see sales go off a cliff, pandemonium ensues.

Plus ça change, plus c'est la même chose.
Shorting Nvidia seems like the obvious move here.
 
Upvote
7 (11 / -4)

The Lurker Beneath

Ars Scholae Palatinae
5,589
Subscriptor
What's working in favor of China is TikTok et al. Chinese companies have a never-ending funnel of new data they can use to improve the system, with zero cost and zero moral objections. Why bother with synthetic data? If it's any good, the Americans will use it, the Chinese can steal it (again... again), and augment all the other data they're being fed.

It's like the "Food Chain" from The Simpsons back in the 1990s, except the human is Chinese AI companies and all the animals are the rest of the world they're stealing from.


Might want to rethink the bat.
 
Upvote
16 (16 / 0)

42Kodiak42

Ars Scholae Palatinae
779
[...]

They are comparable to human experts at many tasks for a tiny fraction of the inference cost; people complain about hallucination, but hallucination rates are often lower than those of human experts on the same task (i.e., MDs doing a differential diagnosis). Progress continues as we devise new benchmark test sets that aren't saturated.
There's a very important aspect of software tools that you're missing with this point: software can and will be treated as an unerring authority if it takes more than a second to realize it produced an incorrect answer. People do not treat software tools like Doctor Jim, who sometimes has a long day, who sometimes misunderstands what you meant by "minor bleeding."

People treat software tools like calculators, and calculators are only allowed to be wrong if one of three conditions is met:
  • the user made a mistake on their end;
  • the software has identified that it has failed, or will fail, to produce a correct answer;
  • the software accurately calculates the odds of its answer being incorrect due to random variation inherent to the problem being solved.
Hallucinations are not caused by user error, are not identified by the running software, and the odds of producing a hallucination are not accurately calculated, because hallucinations do not result from uncertainties inherent to the user's query but from undocumented gaps and conflicts in the LLM's training data.
OpenAI's SimpleQA hallucination-rate benchmark cannot tell you the chance of a model hallucinating on a given prompt; it's the percentage of queries in a sample pool that resulted in hallucinations. Using those benchmarks as a "chance of incorrect information" is a statistical overgeneralization.

No hallucination warning label is going to get someone to double check a model's answer before they put it into practice. If a user was willing to do the work to double check an LLM before using the LLM's answer, they wouldn't be using an LLM in the first place.
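The overgeneralization point is easy to see with a toy example: an aggregate benchmark rate is a weighted average over whatever domains the sample pool happened to contain, and can sit nowhere near the rate for your kind of prompt. All numbers below are invented for illustration, not real SimpleQA figures:

```python
# Hypothetical benchmark results: (domain, n_queries, n_hallucinated).
results = [
    ("pop culture", 800, 40),   # 5% hallucination rate
    ("medicine",    200, 90),   # 45% hallucination rate
]

total_q = sum(n for _, n, _ in results)
total_h = sum(h for _, _, h in results)
aggregate_rate = total_h / total_q  # what a headline benchmark reports

per_domain = {d: h / n for d, n, h in results}

print(f"aggregate: {aggregate_rate:.0%}")  # 13%
for d, r in per_domain.items():
    print(f"{d}: {r:.0%}")
```

The headline "13% hallucination rate" is true of the pool and false of both domains, which is exactly why it can't be read as the chance your particular query gets a wrong answer.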
 
Upvote
58 (64 / -6)
The problem is they’ve already scraped almost all the world’s data, including from TikTok. There is literally not enough remaining human-produced content in the entire world for the models to continue scaling up, which is why they’re turning to synthetic data.

(Which is just a fancy way of saying “we’re feeding our bullshit machine the bullshit another bullshit machine produced”)
This may be true for "public" data that anyone can get, but that data is a fraction of human-produced content, and it's the least useful, as it's full of garbage. The truly valuable, curated stuff is behind corporate walls and not available to OpenAI or any of its competitors.
 
Upvote
2 (7 / -5)

t-doggy

Smack-Fu Master, in training
1
An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!"

I’m really curious why Ars would give this person anonymity without explaining who they are beyond "an AI expert." Is the AI expert working at a competitor to OpenAI? Context matters, especially for anons.
 
Upvote
36 (41 / -5)