GPT-4.5 offers marginal gains in capability and poor coding performance despite 30x the cost.
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

What kind of tasks do you use it for when coding? Genuinely curious
"Shorting Nvidia seems like the obvious move here."

A few days too late on that one. Would have been a beautiful bet. Not so much now (probably).
"Software can and will be treated as an unerring authority if it takes more than a second to realize it produced an incorrect answer. People do not treat software tools like Doctor Jim…"

To be fair, people also treat doctors like they are unerring authorities. It is actually a huge problem, and has been throughout all of history.
"To be fair, people also treat doctors like they are unerring authorities. It is actually a huge problem, and has been throughout all of history."

Yeah, how much a person should trust a doctor is a much more complex issue with an incredibly wide variety of contributing factors from the understaffing of healthcare to the inherent conflict between a person's need for things to get better versus limits on our resources, understanding of biology, and information regarding what's actually wrong with a person's body.
The answer isn't to make doctors perfect. It is to hold them accountable.
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

Emphasis mine
"frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."

Though, to be fair, Marcus is right about that too.
"Yeah, how much a person should trust a doctor is a much more complex issue with an incredibly wide variety of contributing factors from the understaffing of healthcare to the inherent conflict between a person's need for things to get better versus limits on our resources, understanding of biology, and information regarding what's actually wrong with a person's body."

Well, we try to solve the problem, but we also simply accept that doctors misdiagnose and mistreat patients with some statistical certainty. Fundamentally, we trust them because we have no alternative, and some of us get unlucky.
My main point is that when we're working with humans, we actually have methods of predicting failure points and ways to account for a reasonable degree of human error. We don't really have a good way of predicting or accounting for the hallucination of an LLM beyond "Don't use them if you'll face consequences for trying a wrong answer."
Maybe, but Altman continues to publicly insist that more compute resources = more better everything. Don't know if that's because engineers at OpenAI are sandbagging him with marginal workarounds or if he knows what's up and is lying through his teeth to get more investor cash.
How can any of this ever make money? Once the platform holders have "good enough" models of their own and it's all just baked into the products everyone is already using… Who cares any more?
Article said: "It's worth noting that Claude 3.7 Sonnet is likely a system of AI models working together behind the scenes, although Anthropic has not provided details about its architecture."
Well, if people want to query said model about any sort of real world events then someone is going to have to pay to keep producing new models with updated data sets on a very regular basis.
Which probably makes a lot of the potential economic models even less likely to really work out.
"At this point I think 3.5 is the high water mark for bang-for-buck."

4o-mini is the same price or cheaper and clearly better.
"Unsurprising that it sucks at every "thinking" task, compared to "thinking" models. Somewhat surprising that this gargantuan beast is weaker than the "non-thinking" Sonnet though. I suppose anthropic figured some more things out."

Were I to hazard a guess, I'd expect the issue is all about the data quality.
"4o-mini is the same price or cheaper and clearly better."

Okay. Point being, the big models aren't worth enough to go chasing them. Bigger models… not super optimistic they will be either.
That's not entirely true. While it's obviously ideal to have an LLM with updated knowledge baked in, integrating web search and RAG has been popular for a while now, and models have largely been trained to make use of external results and prioritize them over their own knowledge.
So even if you are using a model with an old cutoff, you can still query it about new events. Producing new models regularly is not being done right now, and is not something I can ever see becoming common given the large costs it would bring.
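For illustration, here is a minimal sketch of that retrieval-augmented pattern. The search_web and call_llm functions are hypothetical placeholders, not any particular vendor's API; the point is only that fresh search results get injected into the prompt and the model is told to prefer them over its stale built-in knowledge.

Code:
# Minimal RAG sketch. `search_web` and `call_llm` are hypothetical placeholders
# for whatever search backend and LLM client you actually use.

def search_web(query: str, max_results: int = 3) -> list[str]:
    """Placeholder: return text snippets from a web search or document index."""
    raise NotImplementedError("wire up a real search backend here")

def call_llm(prompt: str) -> str:
    """Placeholder: send the prompt to a chat model and return its reply."""
    raise NotImplementedError("wire up a real model client here")

def answer_with_retrieval(question: str) -> str:
    # Fetch fresh context so the model's training cutoff matters less.
    snippets = search_web(question)
    context = "\n".join(f"- {s}" for s in snippets)
    prompt = (
        "Answer using ONLY the sources below; say if they are insufficient.\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
    return call_llm(prompt)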
"This may be true for 'public' data that anyone can get, but this data is a fraction of 'human produced content', and is the least useful as it's full of garbage. The truly valuable and curated stuff is behind corporate walls and available to neither OpenAI nor any of their competitors."

The truly valuable and curated stuff is NOT behind corporate walls but behind copyrights - unless you consider the stuff that MBAs and consultants spew out to be more valuable than books and other sources of curated knowledge.
"Okay. Point being, the big models aren't worth enough to go chasing them. Bigger models… not super optimistic they will be either."

All depends on the use case. For a quick chat back and forth, likely not. But if you're writing a piece that you are spending an hour or more on, what's a buck or two to use the absolute best quality models for the best results?
"Altman is either an idiot, or he thinks the rest of us are. Bro is seriously all-in on the idea that using far more energy is the way to improve models, at a time when DeepSeek has proven that that is not necessary."

My bet is that a large portion of the work for this model was done a while back, and what do you do if you've already spent many millions training this model? Just scrap it? It is better at some things so why not just release it with a caveat.
"I'm really curious why Ars would give this person anonymity, and not explain who they are a bit more than "an AI expert." Is the AI expert working at a competitor to OpenAI? Context matters, especially for anons."

Plot twist, it's Sam Altman!
"I suppose anthropic figured some more things out."

I figure their instruct dataset is just much much cleaner than OpenAI's. One thing I've noticed about Anthropic is a tendency towards minimalism.
Thanks for this. It was a very interesting read.

"An AI expert who requested anonymity told Ars Technica, "GPT-4.5 is a lemon!" when comparing its reported performance to its dramatically increased price, while frequent OpenAI critic Gary Marcus called the release a "nothing burger" in a blog post (though to be fair, Marcus also seems to think most of what OpenAI does is overrated)."
Edward Zitron has been pounding the drum in 20K word posts for a while now as well, basically saying AI:
- Has no viable consumer product
- Is currently losing money on each user and loses more money for each new user added
- Has largely run out of new data to scrape
- Massively increased costs for each new generation do not notably improve quality of results
- Quality of results has never met a reasonable level, and throwing more "compute" (as he calls it) only increases costs
His latest missive...for someone who has 35 minutes to blow: https://www.wheresyoured.at/wheres-the-money/?ref=ed-zitrons-wheres-your-ed-at-newsletter
Indeed, although I think the critics and enthusiasts come at this from different angles (I count myself in both camps, oddly enough):

i think the "lemon" narrative is missing the point. traditional scaling isn't necessarily failing, it's evolving into something more specialized
mercury breaks speed barriers with diffusion. cod shows that 5-word reasoning steps beat verbose chains. grok3 trained on 200k h100s (and is a great model, may be the best overall rn) while our rumored 20k produced 4.5. these aren't contradictions, they're parallel explorations
what's truly fascinating: we're finally hitting the natural limits of one approach while simultaneously advancing others. as for OpenAI, o1 pro and o3-mini demonstrate they understand the value of specialized reasoning. 4.5 shows us exactly where traditional scaling plateaus, not a failure but a crucial datapoint (kudos to openai for showing their work, this advances the science)
the compute asymmetry is the untold story. when someone can assemble colossus in 4 months while OpenAI is "out of gpus" at launch, that's not a supply chain footnote, it's a fundamental shift in the landscape
what critics miss: 4.5 was never meant to be the final evolution but a necessary reference point. every plateau in one direction enables acceleration in others. we needed to know exactly where the diminishing returns kick in to focus our efforts elsewhere
traditional scaling, reasoning models, diffusion architectures, token efficiency: these aren't competing approaches, they're puzzle pieces. the real breakthrough comes when someone figures out how to combine them effectively (i'm rooting for this guy)
this isn't an ending. it's a branching point
"I figure their instruct dataset is just much much cleaner than OpenAI's. One thing I've noticed about Anthropic is a tendency towards minimalism."

My admittedly idiosyncratic use cases and non-systematic impression is that Claude 3.7 Sonnet gives far more impressively creative/nuanced responses than OpenAI's models.
This applies both to their API itself and their models' responses. OpenAI has more features, but the API is not as clean and the responses are sometimes too verbose.
You know, I actually fully agree with you here, but also can't help but nitpick:

There's a very important part of software tools that you're missing with this point: Software can and will be treated as an unerring authority if it takes more than a second to realize it produced an incorrect answer. People do not treat software tools like Doctor Jim who sometimes has a long day, who sometimes misunderstands what you meant by "minor bleeding."
People treat software tools like calculators, and calculators are only allowed to be wrong if one of three conditions is met:
- The user made a mistake on their end
- The software has identified that it has failed or will fail to produce a correct answer
- The software accurately calculates the odds of its answer being incorrect due to random variations inherent to the problem being solved
Hallucinations are not caused by user error, are not identified by the operational software, and the odds of the operation producing a hallucination are not accurately calculated, because they do not result from uncertainties inherent to the user's query but from undocumented gaps and conflicts in the LLM's training data.
OpenAI's SimpleQA hallucination rate benchmark cannot tell you the chances of a model hallucinating for a given prompt; it's the percentage of queries in a sample pool that resulted in hallucinations. Using those benchmarks as a "chance of incorrect information" is a statistical overgeneralization.
No hallucination warning label is going to get someone to double check a model's answer before they put it into practice. If a user was willing to do the work to double check an LLM before using the LLM's answer, they wouldn't be using an LLM in the first place.
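To make the SimpleQA point concrete, here's a toy illustration with made-up numbers (not real benchmark data): a pooled rate averages over very different kinds of prompts, so it says little about the odds for the specific prompt in front of you.

Code:
# Illustrative only: hypothetical numbers, not real SimpleQA results.
# A pooled "hallucination rate" mixes easy and hard prompts, so it is not
# the probability that any particular prompt gets a hallucinated answer.

pool = {
    # category: (queries in the sample pool, hallucinations observed)
    "well-documented facts": (800, 16),    # 2% within this slice
    "obscure long-tail facts": (200, 120), # 60% within this slice
}

total_queries = sum(n for n, _ in pool.values())
total_hallucinations = sum(h for _, h in pool.values())
print(f"pooled benchmark rate: {total_hallucinations / total_queries:.1%}")  # 13.6%

for category, (n, h) in pool.items():
    print(f"{category}: {h / n:.1%}")  # 2.0% and 60.0%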
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

If you don't check, you can't know it actually works as intended, regardless of whether it compiles or passes tests.
"The truly valuable and curated stuff is NOT behind corporate walls but behind copyrights - unless you consider the stuff that MBAs and consultants spew out to be more valuable than books and other sources of curated knowledge."

AI companies don't care about copyrights, except when they are caught. If the copyrighted work is not behind walls, it will be in the training data.
"By contrast, OpenAI's flagship reasoning model, o1 pro, costs $15 per million input tokens and $60 per million output tokens—significantly less than GPT-4.5 despite offering specialized simulated reasoning capabilities. Even more striking, the o3-mini model costs just $1.10 per million input tokens and $4.40 per million output tokens, making it cheaper than even GPT-4o while providing much stronger performance on specific tasks."

Don't reasoning models need more tokens for a given task, though, which would make cost comparisons much less straightforward?
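They do use more output tokens, but a rough back-of-the-envelope comparison suggests the gap in per-token price can still dominate. The token counts below are invented for illustration; the o1 pro prices come from the passage quoted above, and the GPT-4.5 figures are its reported launch prices of $75/$150 per million input/output tokens.

Code:
# Rough cost comparison per query. Token counts are hypothetical; prices are
# $/1M tokens (o1 pro from the quote above, GPT-4.5 as reported at launch).
PRICES = {
    "gpt-4.5": (75.00, 150.00),  # (input, output)
    "o1-pro": (15.00, 60.00),
}

def query_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    price_in, price_out = PRICES[model]
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# Same 2,000-token prompt; give the reasoning model 5x the output tokens
# to account for its hidden "thinking" tokens.
print(f"GPT-4.5 (500 output tokens):  ${query_cost('gpt-4.5', 2000, 500):.3f}")   # $0.225
print(f"o1 pro (2,500 output tokens): ${query_cost('o1-pro', 2000, 2500):.3f}")  # $0.180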
"Disagree. I use an LLM bot daily for coding. It usually works as intended, and I move on, with no need to check. Or it doesn't work as intended, in which case I either enhance my prompt, or double check with another source. The latter scenario is maybe 10% of my prompts, if that. In no circumstance would I not be using a LLM in the first place just because 10% of the time I need to do a little extra leg work, because the alternative is to do that legwork 100% of the time."

That seems... like a bad idea if you're doing something more than short scripts.