OpenAI’s new “reasoning” AI models are here: o1-preview and o1-mini

DK2

Ars Centurion
314
Subscriptor++
Yeah, if they just said "hey, this is good for summarization and brainstorming, and certain professions like coders and lawyers/paralegals will see efficiency gains," I'd be far less of a jerk about their janky tech.

WE ARE SOLVING GENERAL INTELLIGENCE is total bullshit, but that's what gets you the $150 billion valuation. Being honest gets you thrown out of the VC office.
The VCs don't believe the hype. They're looking at revenues and exit prices and the AGI stuff is irrelevant to that. If you can automate even 10% of commercial writing you're basically going to mint money, which means a hockey-stick revenue line that gets you the IPO in 2 years.

Altman talks AGI because that makes him more important, but I don't think the money people give a shit.
 
Upvote
15 (17 / -2)

xoe

Ars Scholae Palatinae
7,496
Subscriptor
Doing a first draft of a 10-page paper on the Babylonian economy. That draft is going to need human editing and fact-checking, but the amount of work required to create the finished paper will be much lower than if a person researched and wrote the entire thing from scratch.

For further information on this consult any college student or professor.

It seems pretty likely that first drafts of most types of academic and commercial writing are going to be done by AI in the relatively near future.

Whether that represents a 10x improvement is up to you, but it's going to disrupt a whole lot of jobs.
Is there any evidence that such a workflow is superior to a human only workflow in any way by any amount?
 
Upvote
16 (17 / -1)

DK2

Ars Centurion
314
Subscriptor++
Is there any evidence that such a workflow is superior to a human only workflow in any way by any amount?
Superior by what measure? If LLMs can produce a first draft that's 80% as good as a human at 10% the price, the LLMs are going to take over that market.

LLMs are tools. Thinking about them as intelligences is a category error.
 
Upvote
-5 (5 / -10)

WXW

Ars Scholae Palatinae
1,081
I feel like they'd be better off optimizing it for programming.

That's the one task that I routinely have success with and I can absolutely understand the monetary value of it.

Any other task I've tried with generative AI tools tends to result in frustration. Times where it almost gets it right, but not quite. Or times where it worked yesterday, but not today.

Programming is also an area where you don't necessarily need it to be perfect. I'm thinking in terms of testing things out and not production level coding.
Programming is the thing I've tried again and again with ChatGPT, and I don't think I've ever gotten a correct piece of code... I guess it depends on the type of code.
 
Upvote
1 (2 / -1)
I'm an 'AI' optimist. These techniques can do valuable things; you just have to make sure that you are measuring reasonably. If I set the AI a task, what percentage of the time will it be correct? How serious will the errors be? Then compare to your average human doing the same thing. People make errors too; the AI only has to be better at these tasks than the average human for it to very quickly save companies money. The cost to run these models (vs. train them) is nothing compared to the cost of paying a human to sit in a chair and type. A very large number of currently human-driven tasks can be offloaded to this kind of AI.
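The back-of-the-envelope comparison can be sketched in a few lines of Python. Every figure below is invented purely for illustration, not a measurement of any real model or worker:

```python
# Toy cost model: compare an AI and a human on the same task by
# combining how often each errs with how costly those errors are.
# All numbers here are made-up illustrations, not measurements.

def expected_cost_per_task(labor_cost, error_rate, avg_error_cost):
    """Expected total cost of one task: labor plus expected error cost."""
    return labor_cost + error_rate * avg_error_cost

# Hypothetical figures: the human errs less often, but costs far more to run.
human = expected_cost_per_task(labor_cost=40.0, error_rate=0.02, avg_error_cost=200.0)
ai = expected_cost_per_task(labor_cost=0.50, error_rate=0.10, avg_error_cost=200.0)

print(f"human: ${human:.2f} per task")  # 40 + 0.02*200 = $44.00
print(f"ai:    ${ai:.2f} per task")     # 0.50 + 0.10*200 = $20.50
```

The point of the toy model: even a much higher error rate can pencil out if the errors are cheap to catch and the labor cost difference is large; it stops penciling out when avg_error_cost dominates.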
I’m pretty confident that if OpenAI just paid some guy five bucks to answer each time a “how many letters” query came up, they’d have saved a significant amount of money by now.
 
Upvote
9 (10 / -1)

xoe

Ars Scholae Palatinae
7,496
Subscriptor
Superior by what measure? If LLMs can produce a first draft that's 80% as good as a human at 10% the price, the LLMs are going to take over that market.

LLMs are tools. Thinking about them as intelligences is a category error.
By any measure. Total time to complete task, quality of end result, total cost of completing task.
 
Upvote
-1 (3 / -4)

mdrejhon

Ars Tribunus Militum
2,927
Subscriptor
I had to try it to see how it ticked.
GPT-4o1 resisted being "corrected", too.

Screenshot 2024-09-12 at 6.10.52 PM.png

Also I threw it a "simple" math question that all the other GPT's failed at because of step-bungling.
This time, it didn't bungle the steps, and apparently figured out a different way to do it than I would have done.

Screenshot 2024-09-12 at 6.12.26 PM.png
Screenshot 2024-09-12 at 6.14.20 PM.png
 
Upvote
15 (15 / 0)
So, I just took this strawberry spelling anomaly on spec, but trying it out on just the boring regular gpt-4o-2024-08-06 (via the API) gives me the following:

How many "r"s are there in the word "strawberry"? Step through your reasoning.

To determine how many "r"s are in the word "strawberry," let's go through the word letter by letter:
  1. s - no "r"
  2. t - no "r"
  3. r - this is the first "r"
  4. a - no "r"
  5. w - no "r"
  6. b - no "r"
  7. e - no "r"
  8. r - this is the second "r"
  9. r - this is the third "r"
  10. y - no "r"
Counting the "r"s we identified, we have:
  • The first "r" at position 3
  • The second "r" at position 8
  • The third "r" at position 9
Thus, there are a total of 3 "r"s in the word "strawberry."

I then pushed it further:

Great! Using the same strategy, can you tell me how many "s"es there are in the sentence "Ars technica's posters who comment on LLM articles aren't always the most informed, and the discussion sometimes degrades into polemics"?

Certainly! Let's go through the sentence letter by letter to count the number of "s"es.
The sentence is:
"Ars technica's posters who comment on LLM articles aren't always the most informed, and the discussion sometimes degrades into polemics"
  1. A - no "s"
  2. r - no "s"
  3. s - this is the first "s"
<snip> I won't bore you with the iteration through every single letter of that sentence; suffice to say that it lists every letter in turn and counts them up correctly: </snip>
  1. p - no "s"
  2. o - no "s"
  3. l - no "s"
  4. e - no "s"
  5. m - no "s"
  6. i - no "s"
  7. c - no "s"
  8. s - this is the fourteenth "s"
Counting all the "s"es, we find there are a total of 14 "s"es in the sentence.

(You can check via: echo "Ars technica's posters who comment on LLM articles aren't always the most informed, and the discussion sometimes degrades into polemics" | perl -ne 'print tr/s/s/, "\n"'. 14 is the correct answer.)

As always, asking an LLM to step through its reasoning massively increases the accuracy and processing ability of the LLM. I have no idea how it actually did this, btw -- I was fully expecting it to fail, or at least start to lose count and get confused once it got beyond a few repetitions.

Incidentally, for anyone still thinking that an LLM is just a random word generator with no additional processing ability: please explain to me how it managed to do this, because I'm pretty sure that sentence doesn't appear in any training data, let alone someone responding by counting up the "s"es one by one.
 
Upvote
-5 (7 / -12)
I guess they get some points for being self-aware with the naming, at least. If anyone doesn't know the reference, it's a pretty famous example of the problem with how tokens work:

View attachment 90214

The Strawberry code name was in use months before anyone ran that now-well-known problem. It is mere coincidence.

Tokenization not being character-aware has long been known. Tokenization means the model doesn't see characters; it sees vectors representing the meanings associated with each token. Tokenization is essentially translation into a different language that contains only semantic meaning and completely lacks knowledge of typographic form. Semantic meanings sometimes encode relationships to a word's character-level spelling, but that is rare.

The reason for tokenization is that there are relationship graphs ('attention') between tokens, so the fewer tokens, the fewer relationships have to be tracked. The most naive graph is all-to-all, so memory grows with the square of the number of tokens. Also, for common enough words, the vector representing meaning is usually most useful to aggregate at the word level, except for rare words.
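A toy sketch of the point about tokens, in Python. The vocabulary here is invented for demonstration; real tokenizers (BPE and friends) learn their pieces from data:

```python
# Toy illustration of why tokenized models can't "see" letters.
# This hand-made vocabulary is hypothetical; real tokenizers learn theirs.
vocab = {"straw": 101, "berry": 102, " ": 103}

def tokenize(text, vocab):
    """Greedy longest-match tokenization: returns token IDs, not characters."""
    ids = []
    while text:
        for piece in sorted(vocab, key=len, reverse=True):
            if text.startswith(piece):
                ids.append(vocab[piece])
                text = text[len(piece):]
                break
        else:
            raise ValueError("no matching token")
    return ids

# The model receives [101, 102]; the three r's are invisible at this level.
# With n tokens, naive all-to-all attention tracks n*n pairs, which is why
# fewer tokens per word is worth the loss of character information.
print(tokenize("strawberry", vocab))  # [101, 102]
```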
 
Upvote
4 (6 / -2)
I don't actually believe that. Do you have some documentation?

According to Noam Brown (the researcher behind Claudico, the no-limit hold 'em bot; he is one of the main researchers on this version of GPT):

Believe it or not, the name Strawberry does not come from the “How many r’s are in strawberry” meme. We just chose a random word. As far as we know it was a complete coincidence.

https://x.com/polynoamial/status/1834312400419652079
Altman was dropping Strawberry hints in early August (August 7).

https://x.com/sama/status/1821207141635780938?lang=en
I can find a mention of the issue June 19th, so conceivably the idea is from the bug/meme,

https://community.openai.com/t/incorrect-count-of-r-characters-in-the-word-strawberry/829618
 
Upvote
4 (5 / -1)
I think one of the biggest problems with anthropomorphizing LLMs is that we judge them by things that are trivial to humans but make them seem dumb (the strawberry r's have been beaten to death). Clearly the value of a new tool is not in the stuff it cannot do, but in the things it can do. Instead of focusing on performing human tasks that require human intelligence, focus should be placed on the things that an LLM already does 10x better than humans. They don't offer human intelligence, nor might they ever, but they offer an orthogonal kind of intelligence that can complement ours.
The issue with counting r’s is not that people are using LLMs to count r’s, it’s that it’s really hard to predict ahead of time what LLMs will get wrong.

It’s hard to determine what to use an LLM for with any confidence when the people making and selling LLMs aren’t clear about what they’re good at and bad at when they’re pitching them. Partly because it doesn’t look like they know either - they were swearing blind that it didn’t store original data until people started pulling it out in queries. Partly because they’re trying to have it both ways - LLMs will revolutionise all knowledge work! But double-check everything they say because they just make shit up a lot of the time.
 
Upvote
15 (17 / -2)
"Not AI", "not thinking", "not general", etc., etc. Denialism is strong, yet it still absolutely does something: it solves never-before-seen problems; it's not just regurgitating something it already knows. It's intelligent. How much? We don't know. Let's be a bit more humble.

Here's a nice overview from the team:

https://x.com/OpenAI/status/1834320155989664067
OpenAI o1 thinks before it answers and can produce a long internal chain-of-thought before responding to the user.

o1 ranks in the 89th percentile on competitive programming questions, places among the top 500 students in the US in a qualifier for the USA Math Olympiad, and exceeds human PhD-level accuracy on a benchmark of physics, biology, and chemistry problems.


In other words o1 is smarter than 99% of people out there. "Still not intelligent", yeah.
 
Upvote
-12 (6 / -18)

DK2

Ars Centurion
314
Subscriptor++
By any measure. Total time to complete task, quality of end result, total cost of completing task.
There are about a billion studies of LLMs out there. Have at it if you want statistics.

I'm comfortable they're already good enough to replace a fair amount of first draft commercial and professional writing, and they're a lot cheaper, a lot faster and a lot more reliable. Maybe that means 50%, maybe 10%. Either way it's going to be a big deal. Almost certainly already is, particularly in edu.
 
Upvote
-7 (5 / -12)
Yeah, if they just said "hey, this is good for summarization and brainstorming, and certain professions like coders and lawyers/paralegals will see efficiency gains," I'd be far less of a jerk about their janky tech.
Generally agreed, though they’d still need to include the caveat that it might get details wrong in the summary, and for THE LOVE OF GOD don’t rely on it for details of legal cases. Which reinforces your point if anything.
 
Upvote
2 (2 / 0)

Aurich

Director of Many Things
37,816
Ars Staff
Upvote
7 (8 / -1)

Aurich

Director of Many Things
37,816
Ars Staff
Yeah just found the bug report a little bit ago and updated as you were posting. So certainly conceivable that it was from the bug report/meme.
Sorry, didn't mean to be ships in the night.

Anyways, no big deal, but I don't know how they can claim it's random, who would believe that. Just lean into it.
 
Upvote
3 (4 / -1)
Superior by what measure? If LLMs can produce a first draft that's 80% as good as a human at 10% the price, the LLMs are going to take over that market.

LLMs are tools. Thinking about them as intelligences is a category error.
Hang on - the people who write papers on Babylonian history are almost entirely college students. And they aren’t paid to do it - they pay to do it.

The product here is automated cheating on college assignments, which seems a pretty dicey business case for the valuation. Leaving moral qualms aside, there’s the risk universities wise up and move to alternate assessment methods just to stop people using your product.

My broader issue with the LLM business case is that it’s based on an assumption that fixing dodgy output is a better process than doing things right in the first place. There’s a lot of work in the business process sphere that says the opposite, and it’s well established in computer security that security should be built in rather than bolted on.
 
Upvote
7 (10 / -3)
Sorry, didn't mean to be ships in the night.

Anyways, no big deal, but I don't know how they can claim it's random, who would believe that. Just lean into it.

Heh, I tend to post and then a few seconds later decide I need more details or clarification, or find a better source, etc., and update, so I open to edit, then ADHD down a rabbit hole, and by the time I've edited, more time than I realized has passed.
 
Upvote
3 (3 / 0)
That’s only true if you assume all errors (and all patterns of errors) are equally bad.
I explicitly demanded analysis of the severity of the errors in the original post. There are loads of tasks for which AI already makes fewer, less consequential errors. And AI is likely to improve with further development.
 
Upvote
5 (6 / -1)

DK2

Ars Centurion
314
Subscriptor++
Hang on - the people who write papers on Babylonian history are almost entirely college students. And they aren’t paid to do it - they pay to do it.

The product here is automated cheating on college assignments, which seems a pretty dicey business case for the valuation. Leaving moral qualms aside, there’s the risk universities wise up and move to alternate assessment methods just to stop people using your product.

My broader issue with the LLM business case is that it’s based on an assumption that fixing dodgy output is a better process than doing things right in the first place. There’s a lot of work in the business process sphere that says the opposite, and it’s well established in computer security that security should be built in rather than bolted on.
Doing things right in the first place? Things are done as cheaply as possible in the first place and always will be. Up to now that's meant hiring entry-level employees to do first draft work and then having that edited by one or more layers of supervisors. No one's going to pay to get the first draft perfect or even very good. If LLMs can do it cheaper and even close to as good, they'll take over.

This isn't anything at all like computer security. Or nuclear weapons, or rocket science. This is business writing. Doing it badly isn't that big a deal.

As far as the cheating thing goes, so what? That's already a substantial business and the example was illustrative. I'm not suggesting actually building a business on Babylon papers.
 
Upvote
-10 (3 / -13)
Is there any evidence that such a workflow is superior to a human only workflow in any way by any amount?
There's the very best kind of evidence: people choosing to use it instead of doing the work themselves. Surely if, for example, it made their lives harder rather than easier, they would not, so very very consistently, continue making this choice.
 
Upvote
-7 (4 / -11)
There's the very best kind of evidence: people choosing to use it instead of doing the work themselves. Surely if, for example, it made their lives harder rather than easier, they would not, so very very consistently, continue making this choice.

Are they using it because it's quick and convenient or because the quality of the end result is better? We don't have that data.
Easier lines up with convenience more often than quality.

Now, doing things for convenience is perfectly fine, but then one needs to be clear about which dimension is being valued.
 
Upvote
11 (13 / -2)

Psyborgue

Ars Tribunus Angusticlavius
7,602
Subscriptor++
So the real question is, given an arbitrary sequence of characters - or better yet, an arbitrary sequence of multi-letter tokens - can it still determine the correct occurrences of a given letter generally?
Fails if you intentionally misspell a word. It gets so excited about correcting your spelling that it forgets its hardcoded letter counting.

View attachment 90232
This will not fail:

https://github.com/mdegans/misanthropic/blob/dev/examples/strawberry.rs
strawberry.rs said:
### System

You are a helpful assistant. You cannot count letters in a word by yourself because you see in tokens, not letters. Use the count_letters tool to overcome this limitation.

### User

Count the number of r's in 'strawberry'

### Assistant

````json
{"type":"tool_use","id":"toolu_01RTeVZAot5dv6Hz7CAtzVQc","name":"count_letters","input":{"letter":"r","string":"strawberry"}}
````

### Tool

````json
{"type":"tool_result","tool_use_id":"toolu_01RTeVZAot5dv6Hz7CAtzVQc","content":"3","is_error":false}
````

### Assistant

The number of r's in 'strawberry' is 3.

It isn't very hard at all to fix this issue y'all are making a big deal about. A handful of lines of code: informing the agent about its limitations, plus a tool.

The fact that a language model sees tokens rather than letters doesn't say anything whatsoever about intelligence. It's a different alphabet. That is all.

Words are made up of tokens, not letters, but the agent doesn't necessarily know this. Fix that, provide a remedy, and they get it right 100% of the time. Not 90% or 95%, but 100%. And that's not even with a very intelligent model. Want more flexibility? Provide a Python interpreter.
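For anyone who'd rather not read Rust, here's roughly the same fix sketched in Python. The tool name and payload shapes mirror the transcript above; the dispatcher is a hypothetical stand-in for a real API client loop:

```python
# Minimal sketch of the tool-use fix: give the model an exact,
# character-level counting tool it can call instead of guessing.
# The dispatch loop below is a stand-in for a real API client.

def count_letters(letter: str, string: str) -> str:
    """The tool itself: exact counting the tokenized model can't do."""
    return str(string.lower().count(letter.lower()))

TOOLS = {"count_letters": count_letters}

def handle_tool_call(call: dict) -> dict:
    """Run a model-requested tool call and package the result for the model."""
    result = TOOLS[call["name"]](**call["input"])
    return {"type": "tool_result", "tool_use_id": call["id"], "content": result}

# A tool_use message shaped like the one in the transcript:
call = {"id": "toolu_01", "name": "count_letters",
        "input": {"letter": "r", "string": "strawberry"}}
print(handle_tool_call(call)["content"])  # "3"
```

The system prompt does the rest: tell the agent it sees tokens, not letters, and must use the tool for counting questions.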
 
Upvote
-6 (3 / -9)
We believe that a hidden chain of thought presents a unique opportunity for monitoring models.

Or perhaps they're hiding the fact that they simply copied someone else's idea and called it their own. Personally, I find this statement highly suspicious, as I already published a proof of concept for a similar idea way back in February 2024, and this wouldn't be the first time I've witnessed others attempting to take credit for it.

https://thenewstack.io/limited-compute-resources-low-parameter-rag-can-help/
Does anyone from the Ars staff care to comment on this? I sent you guys a tip ages ago to indicate that I was working on this and nobody ever followed up.

Edit:
you didn't invent the idea
I literally wrote an app to do all the reading for people, and they still respond by not reading what I've written.

It seems you can lead a horse to water, but you can't force it to drink. Not even when you offer to help by directly submerging their noses in the trough.
 
Upvote
-12 (0 / -12)

Psyborgue

Ars Tribunus Angusticlavius
7,602
Subscriptor++
I sent you guys a tip ages ago to indicate that I was working on this and nobody ever followed up.
A lot of people had the idea of having some sort of inner voice. Westworld is prior art and they cribbed it from this guy:

https://en.m.wikipedia.org/wiki/The_Origin_of_Consciousness_in_the_Breakdown_of_the_Bicameral_Mind
Bing uses "inner_voice", Claude, "thinking" XML tags. It's demonstrated in the GPT-4 model card. Many more examples. You can use whatever you choose so long as it's descriptive.

The point is, you didn't invent the idea. Whoever did is probably lost to history. The whole idea has been around probably as long as humans: some inner voice you think in before speaking out loud.
 
Upvote
8 (8 / 0)
Are they using it because it's quick and convenient or because the quality of the end result is better? We don't have that data.
Easier lines up with convenience more often than quality.

Now, doing things for convenience is perfectly fine, but then one needs to be clear about which dimension is being valued.
A lot of the proposed work for LLMs rely on perverse incentives within workplaces. “Forced to write documents no-one will care about? Code that you’ll never need to debug or understand? Use LLMs to automate your useless busywork and code that should have been a template!” It’s a market, sure, but it’s not exactly how this stuff is being sold.
 
Upvote
14 (15 / -1)