GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost.
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

What kind of tasks do you use it for when coding? Genuinely curious
"Shorting Nvidia seems like the obvious move here."

A few days too late on that one. Would have been a beautiful bet. Not so much now (probably).
"Software can and will be treated as an unerring authority if it takes more than a second to realize it produced an incorrect answer. People do not treat software tools like Doctor Jim…"

To be fair, people also treat doctors like they are unerring authorities. It is actually a huge problem, and has been throughout all of history.
"To be fair, people also treat doctors like they are unerring authorities. It is actually a huge problem, and has been throughout all of history."

Yeah, how much a person should trust a doctor is a much more complex issue with an incredibly wide variety of contributing factors from the understaffing of healthcare to the inherent conflict between a person's need for things to get better versus limits on our resources, understanding of biology, and information regarding what's actually wrong with a person's body.
The answer isn't to make doctors perfect. It is to hold them accountable.
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

Emphasis mine
"frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."

Though, to be fair, Marcus is right about that too.
"Yeah, how much a person should trust a doctor is a much more complex issue with an incredibly wide variety of contributing factors from the understaffing of healthcare to the inherent conflict between a person's need for things to get better versus limits on our resources, understanding of biology, and information regarding what's actually wrong with a person's body."

Well, we try to solve the problem, but we also simply accept that doctors misdiagnose and mistreat patients with some statistical certainty. Fundamentally, we trust them because we have no alternative, and some of us get unlucky.
My main point is that when we're working with humans, we actually have methods of predicting failure points and ways to account for a reasonable degree of human error. We don't really have a good way of predicting or accounting for the hallucination of an LLM beyond "Don't use them if you'll face consequences for trying a wrong answer."
Maybe, but Altman continues to publicly insist that more compute resources = more better everything. Don't know if that's because engineers at OpenAI are sandbagging him with marginal workarounds or if he knows what's up and is lying through his teeth to get more investor cash.
How can any of this ever make money? Once the platform holders have "good enough" models of their own and it's all just baked into the products everyone is already using… Who cares any more?
Article said: "It's worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture."
Well, if people want to query said model about any sort of real world events then someone is going to have to pay to keep producing new models with updated data sets on a very regular basis.
Which probably makes a lot of the potential economic models even less likely to really work out.
"At this point I think 3.5 is the high water mark for bang-for-buck."

4o-mini is the same price or cheaper and clearly better.
"Unsurprising that it sucks at every "thinking" task, compared to "thinking" models. Somewhat surprising that this gargantuan beast is weaker than the "non-thinking" Sonnet though. I suppose anthropic figured some more things out."

Were I to hazard a guess, I'd expect the issue is all about the data quality.
"4o-mini is the same price or cheaper and clearly better."

Okay. Point being, the big models aren't worth enough to go chasing them. Bigger models… not super optimistic they will be either.
That's not entirely true. While it's obviously ideal to have an LLM with updated knowledge baked in, integrating web search and RAG has been popular for a while now, and models have largely been trained to make use of external results and prioritize them over their own knowledge.
So even if you are using a model with an old cutoff, you can still query it about new events. Producing new models regularly is not being done right now, and is not something I can ever see becoming common given the large costs it would bring.
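For illustration, here is a minimal sketch of that retrieval-augmented pattern. The search_web and call_llm functions are hypothetical placeholders, not any particular vendor's API; the point is only that fresh search results get injected into the prompt and the model is told to prefer them over its stale built-in knowledge.

Code:
# Minimal RAG sketch. `search_web` and `call_llm` are hypothetical placeholders
# for whatever search backend and LLM client you actually use.

def search_web(query: str, max_results: int = 3) -> list[str]:
    """Placeholder: return text snippets from a web search or document index."""
    raise NotImplementedError("wire up a real search backend here")

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a chat model and return its reply."""
    raise NotImplementedError("wire up a real model client here")

def answer_with_retrieval(question: str) -> str:
    # Fetch fresh context so the model's training cutoff matters less.
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer using ONLY the sources below; say if they are insufficient.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)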
"This may be true for 'public' data that anyone can get, but this data is a fraction of 'human produced content', and is the least useful as it's full of garbage. The truly valuable and curated stuff is behind corporate walls and available to neither OpenAI nor any of their competitors."

The truly valuable and curated stuff is NOT behind corporate walls but behind copyrights - unless you consider the stuff that MBAs and consultants spew out to be more valuable than books and other sources of curated knowledge.
"Okay. Point being, the big models aren't worth enough to go chasing them. Bigger models… not super optimistic they will be either."

All depends on the use case. For a quick chat back and forth, likely not. But if you're writing a piece that you are spending an hour or more on, what's a buck or two to use the absolute best quality models for the best results?
"Altman is either an idiot, or he thinks the rest of us are. Bro is seriously all-in on the idea that using far more energy is the way to improve models, at a time when DeepSeek has proven that that is not necessary."

My bet is that a large portion of the work for this model was done a while back, and what do you do if you've already spent many millions training this model? Just scrap it? It is better at some things so why not just release it with a caveat.
"I'm really curious why Ars would give this person anonymity, and not explain who they are a bit more than "an AI expert." Is the AI expert working at a competitor to OpenAI? Context matters, especially for anons."

Plot twist, it's Sam Altman!
"I suppose anthropic figured some more things out."

I figure their instruct dataset is just much much cleaner than OpenAI's. One thing I've noticed about Anthropic is a tendency towards minimalism.
Thanks for this. It was a very interesting read.

"An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."
Edward Zitron has been pounding the drum in 20K word posts for a while now as well, basically saying AI:
- Has no viable consumer product
- Is currently losing money on each user and loses more money for each new user added
- Has largely run out of new data to scrape
- Massively increased costs for each new generation do not notably improve quality of results
- Quality of results has never met a reasonable level, and throwing more "compute" (as he calls it) only increases costs
His latest missive...for someone who has 35 minutes to blow: https://www.wheresyoured.at/wheres-the-money/?ref=ed-zitrons-wheres-your-ed-at-newsletter
Indeed, although I think the critics and enthusiasts come at this from different angles (I count myself in both camps, oddly enough):

i think the "lemon" narrative is missing the point. traditional scaling isn't necessarily failing, it's evolving into something more specialized
mercury breaks speed barriers with diffusion. cod shows that 5-word reasoning steps beat verbose chains. grok3 trained on 200k h100s (and is a great model, may be the best overall rn) while our rumored 20k produced 4.5. these aren't contradictions, they're parallel explorations
what's truly fascinating: we're finally hitting the natural limits of one approach while simultaneously advancing others. as for OpenAI, o1 pro and o3-mini demonstrate they understand the value of specialized reasoning. 4.5 shows us exactly where traditional scaling plateaus, not a failure but a crucial datapoint (kudos to openai for showing their work, this advances the science)
the compute asymmetry is the untold story. when someone can assemble colossus in 4 months while OpenAI is "out of gpus" at launch, that's not a supply chain footnote, it's a fundamental shift in the landscape
what critics miss: 4.5 was never meant to be the final evolution but a necessary reference point. every plateau in one direction enables acceleration in others. we needed to know exactly where the diminishing returns kick in to focus our efforts elsewhere
traditional scaling, reasoning models, diffusion architectures, token efficiency: these aren't competing approaches, they're puzzle pieces. the real breakthrough comes when someone figures out how to combine them effectively (i'm rooting for this guy)
this isn't an ending. it's a branching point
"I figure their instruct dataset is just much much cleaner than OpenAI's. One thing I've noticed about Anthropic is a tendency towards minimalism."

My admittedly idiosyncratic use cases and non-systematic impression is that Claude 3.7 Sonnet gives far more impressively creative/nuanced responses than OpenAI's models.
This applies both to their API itself and their models' responses. OpenAI has more features, but the API is not as clean and the responses are sometimes too verbose.
You know, I actually fully agree with you here, but also can't help but nitpick:

There's a very important part of software tools that you're missing with this point: Software can and will be treated as an unerring authority if it takes more than a second to realize it produced an incorrect answer. People do not treat software tools like Doctor Jim who sometimes has a long day, who sometimes misunderstands what you meant by "minor bleeding."
People treat software tools like calculators, and calculators are only allowed to be wrong if one of three conditions is met:
- The user made a mistake on their end
- The software has identified that it has failed or will fail to produce a correct answer
- The software accurately calculates the odds of its answer being incorrect due to random variations inherent to the problem being solved
Hallucinations are not caused by user error, are not identified by the operational software, and the odds of the operation producing a hallucination are not accurately calculated, because they do not result from uncertainties inherent to the user's query but from undocumented gaps and conflicts in the LLM's training data.
OpenAI's SimpleQA hallucination rate benchmark cannot tell you the chances of a model hallucinating for a given prompt; it's the percentage of queries in a sample pool that resulted in hallucinations. Using those benchmarks as a "chance of incorrect information" is a statistical overgeneralization.
No hallucination warning label is going to get someone to double check a model's answer before they put it into practice. If a user was willing to do the work to double check an LLM before using the LLM's answer, they wouldn't be using an LLM in the first place.
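To make the SimpleQA point concrete, here's a toy illustration with made-up numbers (not real benchmark data): a pooled rate averages over very different kinds of prompts, so it says little about the odds for the specific prompt in front of you.

Code:
# Illustrative only: hypothetical numbers, not real SimpleQA results.
# A pooled "hallucination rate" mixes easy and hard prompts, so it is not
# the probability that any particular prompt gets a hallucinated answer.

pool = {
    # category: (queries in the sample pool, hallucinations observed)
    "well-documented facts": (800, 16),    # 2% within this slice
    "obscure long-tail facts": (200, 120), # 60% within this slice
}

total_queries = sum(n for n, _ in pool.values())
total_hallucinations = sum(h for _, h in pool.values())
print(f"pooled benchmark rate: {total_hallucinations / total_queries:.1%}")  # 13.6%

for category, (n, h) in pool.items():
    print(f"{category}: {h / n:.1%}")  # 2.0% and 60.0%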
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

If you don't check, you can't know it actually works as intended, regardless of whether it compiles or passes tests.
"The truly valuable and curated stuff is NOT behind corporate walls but behind copyrights - unless you consider the stuff that MBAs and consultants spew out to be more valuable than books and other sources of curated knowledge."

AI companies don't care about copyrights, except when they are caught. If the copyrighted work is not behind walls, it will be in the training data.
"By contrast, OpenAI's flagship reasoning model, o1 pro, costs $15 per million input tokens and $60 per million output tokens—significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks."

Don't reasoning models need more tokens for a given task, though, which would make cost comparisons much less straightforward?
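They do use more output tokens, but a rough back-of-the-envelope comparison suggests the gap in per-token price can still dominate. The token counts below are invented for illustration; the o1 pro prices come from the passage quoted above, and the GPT-4.5 figures are its reported launch prices of $75/$150 per million input/output tokens.

Code:
# Rough cost comparison per query. Token counts are hypothetical; prices are
# $/1M tokens (o1 pro from the quote above, GPT-4.5 as reported at launch).
PRICES = {
    "gpt-4.5": (75.00, 150.00),  # (input, output)
    "o1-pro": (15.00, 60.00),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Same 2,000-token prompt; give the reasoning model 5x the output tokens
# to account for its hidden "thinking" tokens.
print(f"GPT-4.5 (500 output tokens):  ${query_cost('gpt-4.5', 2000, 500):.3f}")   # $0.225
print(f"o1 pro (2,500 output tokens): ${query_cost('o1-pro', 2000, 2500):.3f}")  # $0.180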
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

That seems... like a bad idea if you're doing something more than short scripts.