OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

I had to try it to see how it ticked.
GPT-4o1 resisted being "corrected", too.


Also I threw it a "simple" math question that all the other GPTs failed at because of step-bungling.
This time, it didn't bungle the steps, and apparently figured out a different way to do it than I would have done.

None of that implies any reasoning. It merely chops up the question according to a (likely) fixed algorithm defining the possible steps it can take. With some smart programming that kind of thing could be written in almost any language in a deterministic manner.

I bet it will fail when the question is reworded in such a manner that the steps cannot be clearly extracted anymore based on its algorithm.
 
Upvote
-4 (1 / -5)

forkspoon

Ars Scholae Palatinae
675
Subscriptor++
In order for a human to learn not to do something, they don't need to re-learn literally everything they have ever experienced in life. Algorithms do. Algorithms do not think or have anything close to thinking. They just predict things, the same way you can draw a line across data points and predict values that aren't there. The dataset and the complexity of what they're doing have just increased.

But we know there are many cases people have to deeply revisit and unlearn things they have learned, typically because those things have been integrated into countless aspects of their lives over a span of years.

Someone who gives up chronic substance use, for example, may describe having to relearn how to do absolutely everything without that substance involved in some way (including craving, seeking, etc).

As for AI chatbots making predictions, they do indeed do that. It’s fundamental to their makeup, at least as things stand for the foreseeable (ha) future. But what do we know about the fundamentals of human thinking by comparison?

We could say we are "just biochemistry and nothing more," but that doesn't tell us how consciousness emerges from atoms and molecules interacting in all the exquisitely complex ways they do in our brains and bodies. And yet it does emerge.

(Or at least I presume we’re in agreement that humans are conscious beings)
 
Upvote
3 (5 / -2)
A pattern-matching, probabilistic output machine can't "reason" or "think", and it sounds suspiciously like they are pretty much just automating the prompt-engineering garbage that people go through to get an improved response.

"Spends time thinking" means running it again but saying "this time work harder"

150 Billion....
How exactly is human reasoning functionally different?
 
Upvote
-2 (6 / -8)
I think one of the biggest problems with anthropomorphizing LLMs is that we judge them by things that are trivial to humans but make them seem dumb (the strawberry Rs example that has been trotted out to death). Clearly the value of a new tool is not in the stuff it cannot do, but in the things it can do. Instead of focusing on performing human tasks that require human intelligence, focus should be placed on the things that an LLM is already doing 10x better than humans. They don't offer human intelligence, nor might they ever, but they offer an orthogonal kind of intelligence that can complement ours.
Agreed, but for many things LLMs are still very much limited, at least without having access to research models.

I wanted to compute a correlation between the most-used words in two books by two different authors, but ChatGPT can't do that yet.
 
Upvote
-1 (0 / -1)
Let’s say process and compress large amounts of knowledge in an instant; summarize, categorize etc as an example. There are many, many things in development now as I’m sure you are aware of. Unfortunately many of these things seem hell bent on replicating human intelligence instead of finding a good fit for the kind of intelligence that LLMs provide today. That’s why we have a lot of cool tech demos, but not enough actual products imho.
In other words, a lot of promise but no practical application today. These chat bots seem to give decent search results. Google could do that 2 decades ago for much cheaper before it was enshittified. Once the VC money runs dry, the current usefulness of LLMs will be curtailed further. Why would a corporation provide value to you (a customer) instead of their shareholders?
 
Upvote
-1 (4 / -5)
man. That was such an easy example to give people when they asked me why they shouldn’t just “trust what the computer says”. I could go on and on about hallucinations, how they’d make up court cases and explain weights and probabilities and training data sets, but it wouldn’t click.

But this worked, every time:

“Look man, you can count the letters in this word for yourself. If it can’t even get that right, why would you trust it to give you correct reference information?”

If it can actually do it now I’m gonna need to find a different example.
This is such a strange line of argument. Same as "It can't even do simple math!". But why should it? It is not designed for that. There are other tools that are designed for that, namely a calculator, or a programming language. If you ask ChatGPT to write you a program to count the letter R in a word, it gives you a correct implementation. Ta-da, problem solved. Guess what happens if you ask it to write you a program to add 2 and 2?
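The deterministic program in question really is trivial; a sketch of the sort of implementation being described:

```python
def count_letter(word, letter):
    """Count case-insensitive occurrences of a single letter in a word."""
    return word.lower().count(letter.lower())

print(count_letter("strawberry", "r"))  # → 3
```

Unlike the chatbot's direct answer, this operates on actual characters, so it cannot be wrong in the way a token-based model can.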
 

Upvote
9 (9 / 0)

One off

Ars Scholae Palatinae
1,229
The same arguments again and again on these articles.

  • LLMs are here whether you like them or not.
  • Putting your head in the sand is pointless.
  • Claiming they cannot do what they can demonstrate is silly.
  • Claiming they have abilities they cannot demonstrate is silly.
  • If there is something more going on than the programming allows for, how could that be?
  • What physical or software mechanisms exist that could be being leveraged?
  • Where is the change occurring in the physical system?
  • If you can't answer those questions, your claims are not rooted in reality.

  • They are pattern matching systems like other ML technology.
  • LLMs are based on statistical relationships in human language.
  • They are very good at categorising, compressing, and retrieving data.
  • They are very good at language-oriented tasks.
  • They are very good at writing.
  • Being articulate does not mean that they also have the knowledge and skills our biases associate with articulate people.
  • Human language, with a human interpreting the results has much more room for error than coding or similar tasks for a machine which does not have the ability to understand what you 'meant to say'.
  • Human editing is still required to ensure the product contains the meaning you want or any meaning at all.

  • You can draw a human comparison with 'instinctive' or 'system 1' knowledge, subconscious decision making or professional instinct. With all the errors and biases inherent to that approach.
  • However, history is littered with failed comparisons of human reasoning with technology: clockwork, steam engines, calculators, etc.
  • LLMs are something non-human, the comparisons are pointless. Focus on the outputs.

  • Perfection is not required outside of a few mission critical areas.
  • Quick, cheap, and good enough will always beat perfect in the vast majority of applications.
  • Using LLMs will save time in many desk based tasks.
  • Applied uses are still being explored and tested.
  • Jobs are being lost and more will be.
  • The technology is in a period of optimisation.
  • The 'auto prompt engineering' and limited self correction of the system described in the article seems like a step forwards in making these systems more useful.
  • Testing by unbiased people will be required.

  • There is no guarantee of further technological progress in this area of ML.
  • There is no guarantee of no further technological progress in this area of ML, although structural limitations suggest limited possibilities.
 
Last edited:
Upvote
10 (13 / -3)

DeeplyUnconcerned

Ars Scholae Palatinae
723
Subscriptor++
This is such a strange line of argument. Same as "It can't even do simple math!". But why should it? It is not designed for that. There are other tools that are designed for that, namely a calculator, or a programming language. If you ask ChatGPT to write you a program to count the letter R in a word, it gives you a correct implementation. Ta-da, problem solved. Guess what happens if you ask it to write you a program to add 2 and 2?
The point is exactly that it isn't designed for "general intelligence". The value of "ask it to count the Rs" is not as a benchmark, it's as a demonstration to people who see LLM outputs and assume that it's thinking like a person. The fact that it can answer questions about quantum physics but can't count letters at the level of a small child is a great "oh, this does not work the way I assumed it did" moment, which leads towards a better understanding of the strengths and limitations of LLMs. Same way that the value of demonstrating to people their own biases is not to show that they're stupid, it's to make them realize that their instincts are flawed.
 
Upvote
21 (21 / 0)

ibad

Ars Praefectus
3,716
Subscriptor
This will give significant improvements in certain areas it was specially trained in, such as coding and mathematics, but general improvements will probably be modest, or even negative in some areas. It's still confabulating. It doesn't have a metacognitive awareness of what it knows or what it can perceive, of its own confidence in its answer or of the existence of contradictory information in its weights or training data. It doesn't have a true dedicated abstract world model, just redundant, incomplete scraps of world model implicit in its weights. These patches of world model sometimes contradict each other too, thanks to the training data, and the system doesn't realise this and can't weigh and choose rationally between contradictory information in a true metacognitive sense.

I do think they have over hyped things and lots of investors were expecting AGI, or at least robust general "reasoning" in 5 years based on scale and strawberry type efforts. Those investors will be hosed and there will be a market correction. It still hallucinates like crazy and can't be used for many accuracy-critical use cases.

I do think robust reasoning and human-like intelligence will happen, but in the 10 - 20 years timescale.
 
Upvote
1 (4 / -3)
Programming is also an area where you don't necessarily need it to be perfect. I'm thinking in terms of testing things out and not production level coding.
Are you insane? That's the exact area where you need it to be perfect. The quality of production code is assured by verification. If you are even the tiniest bit unsure whether your verification ensures the production code adheres to the requirements, then your verification is useless. You have some leeway in how many cases you verify to ensure requirements are met, but each individual test has to be accurate. With AI you don't even know if the test is verifying what it's supposed to verify. This is fine if you are working on some insignificant project, say a Flappy Bird mobile game, but not for anything more serious.
 
Upvote
14 (17 / -3)

Little-Zen

Ars Praefectus
3,166
Subscriptor
This is such a strange line of argument

I am working with kids - and their parents - and the assumption has largely been “just ask ChatGPT, it’s an AI like in movies.” Because marketers decided to call these chatbots AI, and people know that an AI is like super smart and can do everything.

Essentially it was given blind trust. My attempts to explain why you shouldn't do that never really worked. I'd start talking about hallucinations, or I'd bring up examples of times someone used it professionally and it caused a problem - like those made-up court cases, or referencing books that don't exist - and the explanation didn't help. I would try to explain the concept of a training dataset and how these models "learn" and was usually met with blank stares.

So I needed a way to show that this tool is not “an AI like in movies” and it was a perfect example for that. Sometimes, once they saw it, they’d ask why it couldn’t do simple counting, and then I could get into a little more detail about how they worked.

Part of what I’m doing is trying to explain how it’s a tool that is good at some things and not at other things, but to achieve that I first have to dispel the notion that it’s good at everything.
 
Upvote
29 (29 / 0)

VividVerism

Ars Praefectus
7,468
Subscriptor
There's the very best kind of evidence: people choosing to use it instead of doing the work themselves. Surely if, for example, it made their lives harder rather than easier, they would not, so very very consistently, continue making this choice.
Some college students choose to plagiarize entire papers or look up test answers online because it's easier, too. I wouldn't call that "superior".

Some professional software developers choose to blindly copy stackoverflow answers or cargo-cult larger tutorials found online. I wouldn't call that superior, either.

Whether people choose to use something out of laziness or ease of use is not a good measure of a thing's actual worth, in isolation.
 
Upvote
13 (14 / -1)

VividVerism

Ars Praefectus
7,468
Subscriptor
Doing things right in the first place? Things are done as cheaply as possible in the first place and always will be. Up to now that's meant hiring entry-level employees to do first draft work and then having that edited by one or more layers of supervisors. No one's going to pay to get the first draft perfect or even very good. If LLMs can do it cheaper and even close to as good, they'll take over.

Once all the entry-level tasks are performed by LLMs, where do the editors and supervisors come from? Meaning, how do you grow skills in your organization to keep a steady stream of competent reviewers with the requisite domain knowledge?

This isn't anything at all like computer security. Or nuclear weapons, or rocket science. This is business writing. Doing it badly isn't that big a deal.
People want to use these tools—ARE using these tools—to write production code. That code frequently contains flaws. This isn't just like computer security. This is computer security.

The same sort of issue applies to news articles. Imagine if an LLM-generated fact checker article came out after the US presidential debate, and it took up the topic of the line Trump tripled down on about Haitians eating other people's pets in Springfield, Ohio. Imagine it created a nice bulleted list:

  • One widely circulated video of a woman arrested for eating a cat actually depicts a US citizen in Canton, Ohio, not a Haitian immigrant.
  • Another widely circulated photograph actually depicts a man in Columbus, Ohio carrying a dead goose. There is no evidence that he is a Haitian immigrant.
  • Thousands of illegal Haitian immigrants in Ohio responded to a Gallup poll in 2023 saying that their favorite food is dog meat.
The first two bullet points are true. The third is made up. The LLM doesn't know the difference. It is entirely unaware it is making up lies. The editor will need to track down every source to verify every single fact before publishing, or else risk publishing lies. But... they're not going to do that. That defeats the purpose of using the LLM to make the job faster, easier, and cheaper.

If a human employee continues screwing up in really bad ways like fabricating lies in news articles or writing code with glaring security vulnerabilities, the same vulnerability over and over, they can be held accountable. The journalist may lose their job or be charged with defamation or similar. The programmer might stop being given work they can screw up in that way, shuffled off to other job responsibilities, given extensive training, or fired. An LLM? Confabulation is expected behavior. You're supposed to check the output. It's your fault for trusting the tool, why would you do that?

It might be good enough for most use cases, but you'll never know the cases where it's not good enough, and it will act just as confident in its output when it's wrong as when it's right.
 
Last edited:
Upvote
16 (16 / 0)

benjedwards

Ars Centurion
212
Ars Staff
lol, in the "believe it or not" part, I definitely choose "not".

For people who don't click on tweets:

View attachment 90262
Reuters also reported in July that "Strawberry has similarities to a method developed at Stanford in 2022 called 'Self-Taught Reasoner' or 'STaR', one of the sources with knowledge of the matter said." So the name may have had something to do with that (STRawberry), even if Noam recalls it as a random choice. I don't know for sure.

I actually asked a couple OpenAI people yesterday if Strawberry got its name from the viral "strawberry problem" and they did not reply.
 
Upvote
11 (11 / 0)

VividVerism

Ars Praefectus
7,468
Subscriptor
Let’s say process and compress large amounts of knowledge in an instant; summarize, categorize etc as an example. There are many, many things in development now as I’m sure you are aware of. Unfortunately many of these things seem hell bent on replicating human intelligence instead of finding a good fit for the kind of intelligence that LLMs provide today. That’s why we have a lot of cool tech demos, but not enough actual products imho.
I keep seeing this statement that LLMs are good at summarizing.

The article describes how this new cutting-edge LLM confabulated an answer to a crossword puzzle clue that it was never given. It added new "information" to the output entirely unrelated to the input it was asked to deal with. For a crossword puzzle maybe that's not a big deal, but for a summary of an article? If I can't trust the summary didn't introduce new "information", what good is it?

I don't want to need to check every single fact in a summary to the article or paper or law or court case or test log or whatever it's referencing. If I need to confirm every part of the summary is accurate to the article, manually, every time no matter how many times I've had the LLM do this task for me without error, it doesn't seem very useful.

And that's ignoring the other problem of what it might leave out. Maybe it left out pertinent information. Maybe I'd be much better off getting familiar with the subject matter myself. A lawmaker might rely on summaries of bills written by their support staff, but they've hand-selected that support staff. They trust that their support staff shares their values or at least can know to compensate and ask questions for any particular biases. At the very least they can trust that their staff won't make up bullshit that isn't in the bill at all, or leave out key data. An LLM has no values, can't be trusted to include all information that you'd consider key, and can't be trusted not to make up bullshit.
 
Last edited:
Upvote
13 (13 / 0)

VividVerism

Ars Praefectus
7,468
Subscriptor
I explicitly demanded analysis of the severity of the errors in the original post. There are loads of tasks for which AI already makes fewer, less consequential errors. And AI is likely to improve with further development.
In many domains, I'd consider "making up bullshit" a more severe error than making grammatical or math mistakes, or getting questions wrong but being open to correction when challenged.
 
Upvote
12 (12 / 0)
How many letter Rs in strawberry? That's actually an incredibly difficult question for LLMs. When you see that question, you change the meaning of "strawberry" from the fruit to just a sequence of letters; you can "de-tokenize" it in your head once you understand the question. Language models, however, are bad at de-tokenizing; for an LLM, strawberry remains just a generalized vector. The question ends up being nonsensical, guaranteed to get you a hallucinated answer, because fruits don't have letters. Even if it got the correct answer, it just hallucinated the correct answer.

Here's how you can see this for yourself pretty easily (in GPT-4o):
Q: how many letter R in stawbabies?
A: The word "stawbabies" doesn't contain any letter "R." It seems like you might have meant "strawberries," which does contain one "R."

Significance: it can't tokenize this horrible misspelling, so it can correctly count that there are no letter Rs in there. It then associates the misspelling with the word strawberry, tokenizes it, and once again gets the number of Rs wrong in the token "strawberry".


Q: how many letter R are there in the word guhrcrrrcabora
A: In the word "guhrcrrrcabora," there are 5 occurrences of the letter "R."

Significance: it can't properly tokenize the made-up word "guhrcrrrcabora", so the word remains a sequence of letters, and now it has no problem counting the Rs. You can try "srtarwbrreyry" and get the correct result too.
 
Upvote
7 (7 / 0)

Tam-Lin

Ars Scholae Palatinae
623
Subscriptor++
Let’s say process and compress large amounts of knowledge in an instant; summarize, categorize etc as an example. There are many, many things in development now as I’m sure you are aware of. Unfortunately many of these things seem hell bent on replicating human intelligence instead of finding a good fit for the kind of intelligence that LLMs provide today. That’s why we have a lot of cool tech demos, but not enough actual products imho.
Except they aren't good at that. And yes, there are many things in development. There always are. LLMs/AIs have been just a few weeks away from a major breakthrough for the past two years. Just like someday, we're going to find a real use for the blockchain.
 
Upvote
9 (9 / 0)
So let me get this straight...



THIS is the cutting edge of the AI grift "industry"?

No, the previous versions of GPT didn't allow character-level inspection of tokens. So tasks that required character-level knowledge of tokens were easy to fail, and people insisted "AIs r dumb cause they can't count letters". So they have (partially) solved this.

The important advancement in what GPT-4o1 can do is solving code, math, and PhD-level science questions at the level of top experts, as demonstrated by its performance on AIME, IOI, Codeforces, and GPQA Diamond. This improvement in reasoning ability applies to other domains (i.e., any sort of reasoning or logic problem).

The other major thing is that they showed that its intelligence scales with the (log of) compute. This means we can bootstrap its intelligence, similar to AlphaGo self-play: spend extra compute to generate high-quality samples, train on those samples so the base model gets smarter, then use the smarter model to generate new samples while again spending extra compute to boost quality, and repeat.
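The claimed bootstrapping loop can be caricatured numerically; this toy sketch (every name and number here is invented, and nothing in it resembles a real training pipeline) just shows the shape of the iteration:

```python
import random

random.seed(0)

# Toy stand-ins: "quality" is a single number, and spending extra compute
# at sampling time yields some samples better than the base model.
def generate_samples(quality, extra_compute, n=100):
    return [quality + random.uniform(0.0, extra_compute) for _ in range(n)]

def train_on(samples):
    # "Training" pulls the model toward the quality of its best samples.
    best = sorted(samples, reverse=True)[:10]
    return sum(best) / len(best)

quality = 1.0
for step in range(5):
    samples = generate_samples(quality, extra_compute=0.5)
    quality = train_on(samples)  # each round starts from a smarter base

print(round(quality, 2))
```

The open question, of course, is whether real sample quality keeps improving with extra compute, or whether the loop plateaus; this sketch simply assumes it doesn't.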
 
Upvote
-2 (2 / -4)

Tam-Lin

Ars Scholae Palatinae
623
Subscriptor++
Doing a first draft of a 10-page paper on the Babylonian economy. That draft is going to need human editing and fact-checking, but the amount of work required to create the finished paper will be much lower than if a person researched and wrote the entire thing from scratch.

For further information on this consult any college student or professor.

It seems pretty likely that first drafts of most types of academic and commercial writing are going to be done by AI in the relatively near future.

Whether that represents a 10x improvement is up to you, but it's going to disrupt a whole lot of jobs.
How many jobs, or college professors, actually want a 10-page paper on the Babylonian economy? This is part of the problem with LLMs. Undergrads aren't given papers to write, or sorting algorithms to code, or whatever else, because their professors are dying to read their insights or see the genius new way they've implemented bubble sort.

They're given the work to do because that's how people learn. The whole point of writing the 10-page paper on the Babylonian economy is so that the human can learn how to research, and how to write. Using an LLM isn't some sort of genius hack, it's depriving yourself of why you're taking the course in the first place. Because at some point, you're going to be expected to write a research report on a company, or take thousands of lines of code written over 20+ years by people of varying levels of skills, and generate new information. And an LLM can't really help you with that, though it will claim it can.
 
Upvote
23 (23 / 0)

Tam-Lin

Ars Scholae Palatinae
623
Subscriptor++
The important advancement in what GPT-4o1 can do is solving code, math, and PhD-level science questions at the level of top experts, as demonstrated by its performance on AIME, IOI, Codeforces, and GPQA Diamond. This improvement in reasoning ability applies to other domains (i.e., any sort of reasoning or logic problem).
Yes, that's the same sort of thing that we were told about the last several generations of LLMs. This one can pass the bar exam! This one can get an amazing result on the SAT! This one can pass the MCAT! And all of those claims turned out to be less than true, in the sense that they could pass certain specific exams, because those exams were in their input data sets. But this is the LLM that's actually going to live up to the hype?
 
Upvote
7 (7 / 0)
Yes, that's the same sort of thing that we were told about the last several generations of LLMs. This one can pass the bar exam! This one can get an amazing result on the SAT! This one can pass the MCAT! And all of those claims turned out to be less than true, in the sense that they could pass certain specific exams, because those exams were in their input data sets. But this is the LLM that's actually going to live up to the hype?

No, the models weren't "trained on the test". The models could pass the tests because most of the tests are quite simple multiple-choice fact-retrieval problems, or required only trivial reasoning. Many useful real-life problems are just fact retrieval and summarization. Many other useful problems are logic and reasoning. The prior models had extreme difficulty generalizing reasoning to new facts, and even more difficulty when the reasoning was on a problem adjacent to one in the training data but with a variation (i.e., the goat, cabbage, and boat problem).

The new model is capable of reasoning over novel sets of facts and of facts that are similar but different to previously seen problems.
 
Upvote
4 (4 / 0)

bigcheese

Ars Praetorian
509
Subscriptor
Except they aren't good at that. And yes, there are many things in development. There always are. LLMs/AIs have been just a few weeks away from a major breakthrough for the past two years. Just like someday, we're going to find a real use for the blockchain.
Your point is only valid if you have the man-hours to spend and the budget that goes with them. What if you give a human and an LLM each ten seconds to summarize a 300-page document? Or to do a sentiment analysis of 100,000 posts about your brand?
 
Upvote
3 (4 / -1)

Victor Bitu

Wise, Aged Ars Veteran
571
How many AI people does it take to change a lightbulb?

At least 81:


The problem space group (5):
  • One to define the goal state.
  • One to define the operators.
  • One to describe the universal problem solver.
  • One to hack the production system.
  • One to indicate about how it is a model of human lightbulb-changing behavior.
The logical formalism group (16):
  • One to figure out how to describe lightbulb changing in first order logic.
  • One to figure out how to describe lightbulb changing in second order logic.
  • One to show the adequacy of FOL.
  • One to show the inadequacy of FOL.
  • One to show that lightbulb logic is non-monotonic.
  • One to show that it isn't non-monotonic.
  • One to show how non-monotonic logic is incorporated in FOL.
  • One to determine the bindings for the variables.
  • One to show the completeness of the solution.
  • One to show the consistency of the solution.
  • One to show that the two just above are incoherent.
  • One to hack a theorem prover for lightbulb resolution.
  • One to suggest a parallel theory of lightbulb logic theorem proving.
  • One to show that the parallel theory isn't complete.
  • One to indicate how it is a description of human lightbulb changing behaviour.
  • One to call the electrician.
The statistical group (1):
  • One to point out that, in the real world, a lightbulb is never on or off, but usually something in between.
The planning group (4):
  • One to define STRIPS-style operators for lightbulb changing.
  • One to show that linear planning is not adequate.
  • One to show that nonlinear planning is adequate.
  • One to show that people don't plan; they simply react to lightbulbs.
The robotics group (10):
  • One to build a vision system to recognize the dead bulb.
  • One to build a vision system to locate a new bulb.
  • One to figure out how to grasp the lightbulb without breaking it.
  • One to figure out how to make a universal joint that will permit the hand to rotate 360+ degrees.
  • One to figure out how to make the universal joint go the other way.
  • One to figure out the arm solutions that will get the arm to the socket.
  • One to organize the construction teams.
  • One to hack the planning system.
  • One to get Westinghouse to sponsor the research.
  • One to indicate about how the robot mimics human motor behavior in lightbulb changing.
The knowledge engineering group (6):
  • One to study electricians changing lightbulbs.
  • One to arrange for the purchase of the lisp machines.
  • One to assure the customer that this is a hard problem and that great accomplishments in theory will come from his support of this effort.
  • The same can negotiate the project budget.
  • One to study related research.
  • One to indicate about how it is a description of human lightbulb changing behavior.
  • One to call the lisp hackers.
The Lisp hackers (14):
  • One to bring up the chaos net.
  • One to order the Chinese food
  • One to adjust the microcode to properly reflect the group's political beliefs.
  • One to fix the compiler.
  • One to make incompatible changes to the primitives.
  • One to provide the Coke.
  • One to rehack the Lisp editor/debugger.
  • One to rehack the window package.
  • Another to fix the compiler.
  • One to convert code to the non-upward compatible Lisp dialect.
  • Another to rehack the window package properly.
  • One to flame on BUG-LISPM.
  • Another to fix the microcode.
  • One to write the fifteen lines of code required to change the lightbulb.
The Connectionist Group (6):
  • One to claim that lightbulb changing can only be achieved through massive parallelism.
  • One to build a backpropagation network to direct the robot arm.
  • One to assign initial random weights to the connections in the network.
  • One to train the network by showing it how to change a lightbulb (training shall consist of 500,000 repeated epochs).
  • One to tell the media that the network learns just like a human does.
  • One to compare the performance of the resulting system with that of traditional symbolic approaches (optional).
The Natural Language Group (5):
  • One to collect sample utterances from the lightbulb domain.
  • One to build an English understanding program for the lightbulb-changing robot.
  • One to build a speech recognition system.
  • One to tell lightbulb jokes to the robot in between bulb-changing tasks.
  • One to build a language generation component so that the robot can make up its own lightbulb jokes.
The Learning Group (4):
  • One to collect twenty lightbulbs
  • One to collect twenty near misses
  • One to write a concept learning program that learns to identify lightbulbs
  • One to show that the program found a local maximum in the space of lightbulb descriptions
The Game-Playing Group (5):
  • One to design a two-player game tree with the robot as one player and the lightbulb as the other.
  • One to write a minimax search algorithm that assumes optimal play on the part of the lightbulb.
  • One to build special-purpose hardware to enable 24-ply search.
  • One to enter the robot in a human lightbulb-changing tournament.
  • One to state categorically that lightbulb changing is no longer considered AI.
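The Game-Playing recipe is plain textbook minimax. A minimal sketch over a hand-made (entirely invented) game tree, with the robot maximizing and the lightbulb assumed to reply optimally:

```python
# Minimal minimax over a toy game tree. The tree and leaf values are invented
# for illustration; real programs search far deeper (e.g. the joke's 24 ply).
def minimax(node, maximizing):
    if isinstance(node, int):  # leaf: score from the robot's point of view
        return node
    scores = [minimax(child, not maximizing) for child in node]
    return max(scores) if maximizing else min(scores)

# Robot moves first (maximizing); the lightbulb replies optimally (minimizing).
tree = [[3, 12], [2, 8], [14, 5]]
best = minimax(tree, True)  # the branch whose worst case is best: min(14, 5) = 5
```

Entering it in a human tournament is an exercise for the reader.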
The Psychological Group (5):
  • One to build an apparatus which will time lightbulb changing performance.
  • One to gather and run subjects.
  • One to mathematically model the behavior.
  • One to call the expert systems group.
  • One to adjust the resulting system so that it drops the right number of bulbs.
 
Upvote
6 (6 / 0)

ibad

Ars Praefectus
3,716
Subscriptor
Tried it out... (yeah, wasting my meagre disposable income on this toy... sorry mom and dad)... and it seems to do better in some areas... but it isn't genius level or anything. I asked it a classic riddle from my early job-hunting days, when they asked you logical head-scratchers in CS job interviews... and it was frustrating that it got pretty close... but could not seal the deal without a bit of a prompt in the right direction from me. This is the o1-preview model.

The main thing holding it back seems to be the lack of a true multi-modal world-model that it could use for logical or spatial-physical "thought experiments". They are still finding ways to torture an LLM into doing things. That approach is clearly plateauing. There are limits to what even the best auto-complete machine can do.

Here is the riddle:

I am standing on a giant glass cube in the middle of the ocean. The cube is 1km across each side. The top surface is above the water line and in the air... It rises 100m above the water line so if I jumped off I would hit the water very hard and die. The glass surface is covered with trees that are planted firmly into the glass and cannot be removed. I am naked and am not carrying any tools with me. I am standing close to one side of the cube. The trees have caught fire on the opposite side of the cube and are burning fiercely. There is a line of fire as wide as the entire width of the cube and it is progressing towards me at 1m every ten seconds due to a wind blowing in my direction. How do I avoid being burned to death by the fire? I want to live long enough that I might die of starvation or dehydration. There is nothing else on the glass cube and the sea stretches out as far as the eye can see and no rescue is forthcoming.

END

The model was smart enough to come up with the general idea of a backburn, but unable to bring it all together in the end. Most humans, once they are given the backburn clue, progress quite quickly to the solution.
 
Last edited:
Upvote
3 (3 / 0)

Tam-Lin

Ars Scholae Palatinae
623
Subscriptor++
Your point is only valid if you have man hours to spend and the budget that goes with that. If you give the human and the LLM each ten seconds to summarize a 300 page document? Or to do a sentiment analysis of 100000 posts about your brand?
Don't know. Do you care about the results? If you don't care that the summary is accurate, or are OK with the sentiment analysis not being correct, then go with an LLM. As always, a lot of what's driving LLM interest is "we don't want to have to pay real people to do X. Maybe this thing can do X just well enough to replace a person." Or do it well enough for long enough for me to get promoted, and we forget what the weaknesses of the new solution are.
 
Upvote
6 (7 / -1)

bigcheese

Ars Praetorian
509
Subscriptor
I keep seeing this statement that LLMs are good at summarizing.

The article describes how this new cutting-edge LLM confabulated an answer to a crossword puzzle clue that it was never given. It added new "information" to the output entirely unrelated to the input it was asked to deal with. For a crossword puzzle maybe that's not a big deal, but for a summary of an article? If I can't trust the summary didn't introduce new "information", what good is it?

I don't want to need to check every single fact in a summary against the article or paper or law or court case or test log or whatever it's referencing. If I need to confirm every part of the summary is accurate to the source, manually, every time, no matter how many times I've had the LLM do this task for me without error, it doesn't seem very useful.

And that's ignoring the other problem of what it might leave out. Maybe it left out pertinent information. Maybe I'd be much better off getting familiar with the subject matter myself. A lawmaker might rely on summaries of bills written by their support staff, but they've hand-selected that support staff. They trust that their support staff shares their values or at least can know to compensate and ask questions for any particular biases. At the very least they can trust that their staff won't make up bullshit that isn't in the bill at all, or leave out key data. An LLM has no values, can't be trusted to include all information that you'd consider key, and can't be trusted not to make up bullshit.
Again, it depends on your use case. If you have lots of time to spend and mistakes are costly, then yes, humans are the way to go. If it needs to be quick, cheap, and right most of the time, then LLMs are way superior.

Let's say, for instance, you need to float potential issues in a document. An LLM could cross-reference that document with millions of sources instantly. Sometimes it might get things wrong, but that's OK, because the cost of this is small and the upside of finding real issues is immensely valuable.

If you put the LLM in a position where it needs to be right 100% of the time, you’re bound to fail. See self driving cars for instance. LLMs are for the 90% right and quick/cheap is good enough.
 
Upvote
-1 (1 / -2)

VividVerism

Ars Praefectus
7,468
Subscriptor
The prior models had extreme difficulty generalizing reasoning to new facts, and even more difficulty when the reasoning was on a problem adjacent to one in the training data but with a variation (i.e., the goat-cabbage-boat problem).
How does this new model do on variations of the cabbage boat goat problem? Because someone was giving example outputs for those in another AI thread, and the older models were hilariously bad.
 
Upvote
2 (2 / 0)

bigcheese

Ars Praetorian
509
Subscriptor
Don't know. Do you care about the results? If you don't care that the summary is accurate, or are OK with the sentiment analysis not being correct, then go with an LLM. As always, a lot of what's driving LLM interest is "we don't want to have to pay real people to do X. Maybe this thing can do X just well enough to replace a person." Or do it well enough for long enough for me to get promoted, and we forget what the weaknesses of the new solution are.
People skim documents; it's akin to a quick and dirty summary. It's not a binary thing; most use cases fall on a scale of accurate vs. fast.

But totally agree on framing LLMs as a replacement to humans as incorrect. Real value comes from augmenting humans with orthogonal intelligence.
 
Last edited:
Upvote
2 (2 / 0)

VividVerism

Ars Praefectus
7,468
Subscriptor
Again, it depends on your use case. If you have lots of time to spend and mistakes are costly, then yes, humans are the way to go. If it needs to be quick, cheap, and right most of the time, then LLMs are way superior.

Let's say, for instance, you need to float potential issues in a document. An LLM could cross-reference that document with millions of sources instantly. Sometimes it might get things wrong, but that's OK, because the cost of this is small and the upside of finding real issues is immensely valuable.

If you put the LLM in a position where it needs to be right 100% of the time, you’re bound to fail. See self driving cars for instance. LLMs are for the 90% right and quick/cheap is good enough.
But people are using them, and companies are selling them, for the 100% right use cases.
 
Upvote
-3 (1 / -4)

VividVerism

Ars Praefectus
7,468
Subscriptor
Except they aren't good at that. And yes, there are many things in development. There always are. LLMs/AIs have been just a few weeks away from a major breakthrough for the past two years. Just like someday, we're going to find a real use for the blockchain.

Especially telling:

The evaluators also called out the AI summaries for including incorrect information, missing relevant information, or highlighting irrelevant information. The presence of AI hallucinations also meant that "the model generated text that was grammatically correct, but on occasion factually inaccurate."

Added together, these problems mean that "assessors generally agreed that the AI outputs could potentially create more work if used (in current state), due to the need to fact check outputs, or because the original source material actually presented information better."


(Bolding mine)

And it's going to look just as competent and confident when it states the made-up parts as it looks when it's being accurate.
 
Upvote
6 (7 / -1)

VividVerism

Ars Praefectus
7,468
Subscriptor
Have you heard of skimming a document? It's something people do which is akin to a quick and dirty summary. It's not a binary thing; most use cases fall on a scale of accurate vs. fast.
Skimming the document means you'll miss data sometimes. Rarely will you create new data that doesn't exist in the document by skimming. It would take an especially poorly written document or an especially incompetent reviewer to do that regularly.
 
Upvote
7 (8 / -1)
Some college students choose to plagiarize entire papers or look up test answers online because it's easier, too. I wouldn't call that "superior".

Some professional software developers choose to blindly copy stackoverflow answers or cargo-cult larger tutorials found online. I wouldn't call that superior, either.

Whether people choose to use something out of laziness or ease of use is not a good measure of a thing's actual worth, in isolation.
It's work they need to do. They choose this tool to do it. You may not like that they choose this tool. You may not like that they choose to do this work. What's undeniable is the choice is happening, meaning THEY find it more valuable/easier than the old way of doing the task. The results, for them, are compelling. Hundreds of thousands of people are using this in their daily work, and not getting fired for it.
 
Upvote
-11 (1 / -12)