AI firms follow DeepSeek’s lead, create cheaper models with “distillation”

quamquam quid loquor

Ars Tribunus Militum
2,378
Subscriptor++
“In a world where things are moving so fast . . . you could actually spend a lot of money, doing it the hard way, and then the rest of the field is right on your heels,” IBM’s Cox said. “So it is an interesting and tricky business landscape.”

The business answer to this is quite simple and also limiting. All frontier models will be inaccessible and only the distilled models will be released.

This raises the question: can you effectively distill a model from a distilled model? If so, then there is truly no moat.
 
Upvote
60 (62 / -2)

deadman12-4

Ars Scholae Palatinae
2,796
still waiting for the day the courts catch up and say ALL the AIs are illegal because they were made with countless copyrighted material, and their existence is one gigantic piece of theft.

And then courts are pressured to ignore the law because China doesn't give a shit about any laws, since they seem to make theft a virtue. And then it becomes a public secret that copyrighted material isn't protected anymore because AI.

edit: I'm referring to the CCP - the Chinese government, NOT the average Chinese person who very much doesn't think theft is ok.
 
Upvote
2 (29 / -27)

cyberfunk

Ars Scholae Palatinae
1,168
Distillation is nothing new; OpenAI has been doing it for quite some time with their models.

What's new here is that DeepSeek came up with a more efficient method of distilling, and that's what should be the focus. Distilling and/or creating lower-resolution models has always been a way to get better inference performance, typically thanks to lower memory requirements.

My layman's understanding, which may be somewhat inaccurate:

Traditionally, knowledge distillation is a process where a large, complex model (the "teacher") transfers its learned knowledge to a smaller, more efficient model (the "student"). OpenAI has utilized this technique by training the student model to mimic the outputs of the teacher model. The student learns from the softened probabilities (derived from the teacher's logits) that the teacher provides for each class or output, capturing not just the final decision but also the distribution over all possible outcomes. This process compresses the knowledge, enabling smaller models to perform reasonably well without the hefty computational requirements of their larger counterparts.
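
To make that teacher/student setup concrete, here's a minimal sketch of the classic soft-target distillation loss in PyTorch. The toy tensors stand in for real models, and the temperature and loss weighting are illustrative assumptions on my part, not anything from the article:

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels,
                          temperature=2.0, alpha=0.5):
        # Soften both distributions with temperature > 1 so the student
        # sees the teacher's full distribution over outcomes, not just
        # its final decision (the argmax).
        soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
        log_soft_student = F.log_softmax(student_logits / temperature, dim=-1)
        # KL divergence between the softened distributions; the T^2
        # factor keeps gradient magnitudes comparable across temperatures.
        kd = F.kl_div(log_soft_student, soft_teacher,
                      reduction="batchmean") * temperature ** 2
        # Blend with the ordinary hard-label loss.
        ce = F.cross_entropy(student_logits, labels)
        return alpha * kd + (1 - alpha) * ce

    # Toy usage: a batch of 4 examples over 10 classes.
    student_logits = torch.randn(4, 10, requires_grad=True)
    teacher_logits = torch.randn(4, 10)
    labels = torch.randint(0, 10, (4,))
    distillation_loss(student_logits, teacher_logits, labels).backward()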

However, this method often focuses on output replication rather than understanding the underlying reasoning processes of the teacher model. The student model learns what the teacher predicts but doesn’t necessarily grasp why those predictions are made. This can lead to a loss of nuanced understanding and may limit the student model's ability to generalize or perform on tasks requiring deeper comprehension.

DeepSeek's distillation technique takes a novel step by emphasizing the transfer of the teacher model's internal representations and reasoning mechanisms, not just its outputs. Instead of the student model simply imitating the teacher's final answers, it learns to emulate the intermediate processes—the way the teacher analyzes, interprets, and derives conclusions from the data.
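
One common way to express that "match the intermediate representations" idea is what the literature calls hint or feature distillation; the post doesn't establish DeepSeek's exact mechanism, so treat this as an illustrative sketch of the general technique, not their method:

    import torch
    import torch.nn as nn

    # Toy hidden states standing in for real transformer activations.
    teacher_hidden = torch.randn(4, 32, 1024)  # (batch, seq, teacher dim)
    student_hidden = torch.randn(4, 32, 256)   # (batch, seq, student dim)

    # A learned projection maps student features into the teacher's
    # space so the two can be compared directly.
    proj = nn.Linear(256, 1024)
    feature_loss = nn.functional.mse_loss(proj(student_hidden), teacher_hidden)
    # A full objective would combine this with the usual output-level terms.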

 
Last edited:
Upvote
58 (60 / -2)

Renx

Ars Praetorian
415
large language models will still be required for “high intelligence and high stakes tasks” where “businesses are willing to pay more for a high level of accuracy and reliability.”
Ahahaha hahaha.

"For years we have said that we can do the work of people faster and cheaper, never mind some quality issues... now that there is someone doing it even faster and even cheaper, we think SURELY they'll pay for our slower and expensiver quality product."
 
Upvote
79 (80 / -1)

metavirus

Ars Praetorian
558
Subscriptor++
So there’s no moat AND these massively loss-making endeavors have even less medium-term ability to make any future profits on these loss-makers due to there being no moat and no first-mover advantage? So why are we pouring a trillion dollars in new capital expenditures over the next decade and boiling the oceans with all the new fossil-fuel electricity generation we (/China) are going to need for this? Collective hysteria.
 
Upvote
49 (51 / -2)
Ahahaha hahaha.

"For years we have said that we can do the work of people faster and cheaper, never mind some quality issues... now that there is someone doing it even faster and even cheaper, we think SURELY they'll pay for our slower and expensiver quality product."

Yup. I've been saying for months that the business value of these things is absolutely real. The problem is that you can capture 95+% of the business value with free models already, and the free models keep getting better.
 
Upvote
34 (34 / 0)

Navalia Vigilate

Ars Tribunus Militum
2,867
Subscriptor++
still waiting for the day the courts catch up and say ALL the AIs are illegal because they were made with countless copyrighted material, and their existence is one gigantic piece of theft.

And then courts are pressured to ignore the law because China doesn't give a shit about any laws, since they seem to make theft a virtue. And then it becomes a public secret that copyrighted material isn't protected anymore because AI.

edit: I'm referring to the CCP - the Chinese government, NOT the average Chinese person who very much doesn't think theft is ok.
All evidence points towards all AI companies lifting their data from everywhere they can and without permission. Why pretend they are not?
 
Upvote
23 (25 / -2)

deadman12-4

Ars Scholae Palatinae
2,796
All evidence points towards all AI companies lifting their data from everywhere they can and without permission. Why pretend they are not?
let me google that for you. Look at the NYTimes lawsuit, for example. Or how literally EVERY content provider online has had problems for years with countless AI scrapers stealing everything from the site over and over and over and over.
Your question kinda feels like "prove to me that ALL snow is cold... This snow is cold, but what about that snow over there?"
 
Upvote
0 (6 / -6)

mer_mer

Seniorius Lurkius
2
Subscriptor
The DeepSeek framing of this article is very strange. The big "Distillation Moment" was Alpaca in March 2023, which was able to use the open source LLaMA 7B model and $500 of OpenAI API credits to create a nice instruction-following model. Of course I'm sure the big labs were aware of this well before. DeepSeek did distill their model, but that was really an afterthought and the community was going to make a bunch of distillations anyways.

If anything, the current trend in 2025 is to move away from task-specific distillations since the smaller general-purpose models like Gemini flash are so close to the frontier and so cheap.

“We’re going to use [distillation] and put it in our products right away,” said Yann LeCun, Meta’s chief AI scientist. “That’s the whole idea of open source. You profit from everyone and everyone else’s progress as long as those processes are open.”

This is a pretty nonsensical statement and almost certainly some kind of misquote. Meta already extensively uses distillation.

The FT should have asked Ars for help in reporting this; Ars articles are way better than this.
 
Upvote
33 (33 / 0)

altsuperego

Ars Scholae Palatinae
840
Wouldn’t this magnify errors and hallucinations from the base model over time? Like a xerox of a xerox?

I think what DeepSeek did was take one of the open source models and retrain it with outputs from GPT. In that case they are basically averaging the models together, which is known to improve results in machine learning. They can also curate the transfer data and exclude known errors. Given the GPT-4.5 results we saw last week, it suggests these models have gotten too big and are now overfitting the data.
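
Mechanically, that kind of retraining is often done as sequence-level distillation: harvest the teacher's text outputs, curate them, and fine-tune the open model on the result. A rough sketch, where query_teacher is a hypothetical placeholder for whatever API the teacher sits behind:

    # Hedged sketch of sequence-level distillation. query_teacher() is a
    # hypothetical placeholder, not a real client library.
    def query_teacher(prompt: str) -> str:
        # A real implementation would call the teacher's serving API.
        return "teacher answer for: " + prompt

    prompts = [
        "Summarize this email in one sentence: ...",
        "Is this message spam? Answer yes or no: ...",
    ]

    # 1. Collect (prompt, answer) pairs. Because the transfer data is just
    #    text, it can be curated: drop pairs with known errors before they
    #    ever reach the student.
    transfer_set = [(p, query_teacher(p)) for p in prompts]
    transfer_set = [(p, a) for p, a in transfer_set if a.strip()]

    # 2. Fine-tune the student with the ordinary next-token objective on
    #    "prompt + answer" sequences (e.g. with Hugging Face's Trainer).
    #    Only the teacher's sampled text is needed; no logits or weights.
    training_texts = [p + "\n" + a for p, a in transfer_set]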
 
Upvote
9 (11 / -2)

Dachannien

Ars Scholae Palatinae
996
Subscriptor
The business answer to this is quite simple and also limiting. All frontier models will be inaccessible and only the distilled models will be released.

This raises the question: can you effectively distill a model from a distilled model? If so, then there is truly no moat.
Most likely, this depends on how much error there is at the fringes of the distilled model's expertise. The more error there is, the more likely a redistiller is to target too much breadth, incorporating more error and lowering performance even on core expertise. But the best possible result would be that a redistilled model can only acquire, at most, the expertise given to the distilled model.
 
Upvote
0 (0 / 0)

mikael110

Wise, Aged Ars Veteran
144
This article appears very poorly researched, and seems to completely misunderstand what made DeepSeek's V3 and R1 special and cheap in the first place.

It was not distillation; that has been used for years and was already commonplace. What made DeepSeek's models special is that they designed a very efficient architecture, trained it in FP8 precision (half the width of the industry-standard BF16), and wrote a lot of custom software to push as much performance out of their limited hardware as they possibly could.
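
As a back-of-the-envelope illustration of the FP8-vs-BF16 point (a minimal sketch, assuming a recent PyTorch with float8 dtypes; real FP8 training also needs per-tensor scaling, which this omits):

    import torch

    # Storage cost of the same weight matrix at the two precisions.
    w = torch.randn(4096, 4096)
    w_bf16 = w.to(torch.bfloat16)        # 2 bytes per element
    w_fp8 = w.to(torch.float8_e4m3fn)    # 1 byte per element

    print(w_bf16.element_size() * w_bf16.nelement())  # 33554432 bytes
    print(w_fp8.element_size() * w_fp8.nelement())    # 16777216 bytes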

On that note, DeepSeek actually had a special event last week where they open sourced a core part of their training and inference infrastructure every day. I'm kind of surprised it received no coverage on Ars, as it has been a pretty amazing thing. Many of the projects they open sourced have already been used to speed up other open source inference engines. They even open sourced a custom distributed file system designed specifically for loading datasets during training.
 
Last edited:
Upvote
39 (39 / 0)
This makes it sound like there should be, in future, some index of distilled models and a way to select the most appropriate for a given inquiry. Like, "this is a legal question, employ the Westlaw LLM"

Many of the AI services let you pick specialized models based on task: reasoning models, research models, general-response models, different levels of distillation ('flash' and 'mini' models).
 
Upvote
4 (4 / 0)
The business answer to this is quite simple and also limiting. All frontier models will be inaccessible and only the distilled models will be released.

This raises the question: can you effectively distill a model from a distilled model? If so, then there is truly no moat.
I see no reason you couldn't, but for it to work best it'd need to be a subset of a subset of what each model is good at – you could train a model to summarise e-mails, then maybe from that train one to identify spam.

The real question is what are you sacrificing?

The funny thing with AI models is that even the really well-trained ones aren't very good at reproducing their training data. If you ask one all of the same questions it was trained on, you won't get 100% accuracy out of it, because that's not really the point. The point of training the model is the hope that even at a ~90% accuracy rate on its training data, it might also be 80% accurate on similar prompts it wasn't trained on, and that'll be good enough to use.

But I have to assume each "distillation" (hate that term) is going to make the results worse and worse, as something has to be lost each time; contrary to the term, you're not necessarily getting a more "pure" result as in real chemical distillation.
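
A toy numeric illustration of that worry, under the purely made-up assumption that each distillation generation retains a fixed fraction of the previous generation's accuracy:

    # Toy model, not a measured result: each distillation generation is
    # assumed to retain a fixed fraction of its parent's accuracy.
    accuracy = 0.90   # starting accuracy of the teacher
    retention = 0.95  # assumed fraction kept per distillation step

    for generation in range(1, 6):
        accuracy *= retention
        print(f"generation {generation}: {accuracy:.3f}")
    # Prints 0.855, 0.812, 0.772, 0.733, 0.696: under this assumption
    # the loss compounds geometrically, and nothing gets "purer".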
 
Last edited:
Upvote
0 (1 / -1)

Bongle

Ars Praefectus
4,292
Subscriptor++
That presents a challenge to many of the business models of leading AI firms. Even if developers use distilled models from companies like OpenAI, they cost far less to run, are less expensive to create, and, therefore, generate less revenue. Model-makers like OpenAI often charge less for the use of distilled models as they require less computational load.
This isn't quite right - all else equal if you can provide your product (a chatbot) for less money but the output quality is the same, you should be able to charge about the same for it. If the AI industry actually has a product worth paying for, them being able to provide it with less computing resources is only a problem for nVidia, Azure, and electricity providers, not OpenAI. Every LLM provider should be tripping over themselves to implement DeepSeek's techniques to save money immediately.

The challenge for the AI industry is that it's becoming ever more clear how insanely commoditized it is: everyone and their dog seems to be able to make a GPT4-level model, and so with tons of competition, no lock-in, and no moat, the price you can command is rapidly going to fall to barely more than the cost to provide it. As a market should operate.

ETA: I guess if we model "The AI Industry" as the interests of a bunch of wannabe monopolists like A16Z or Sam Altman, then the quote makes more sense: presiding over a boring, low margin, high-competition, commoditized SaaS is not the thrilling high-margin play they're aiming for. Nor does it justify their company's valuations.
 
Last edited:
Upvote
17 (17 / 0)

Maxipad

Ars Tribunus Militum
2,691
So there’s no moat AND these massively loss-making endeavors have even less medium-term ability to make any future profits on these loss-makers due to there being no moat and no first-mover advantage? So why are we pouring a trillion dollars in new capital expenditures over the next decade and boiling the oceans with all the new fossil-fuel electricity generation we (/China) are going to need for this? Collective hysteria.
They're all hoping that it'll be "The Next Big Thing" and they can make a lot of money.

It's as simple as that. Greed, just greed.
 
Upvote
7 (8 / -1)

xWidget

Ars Tribunus Militum
2,780
This isn't quite right - all else equal if you can provide your product (a chatbot) for less money but the output quality is the same, you should be able to charge about the same for it. If the AI industry actually has a product worth paying for, them being able to provide it with less computing resources is only a problem for nVidia, Azure, and electricity providers, not OpenAI. Every LLM provider should be tripping over themselves to implement DeepSeek's techniques to save money immediately.

The challenge for the AI industry is that it's becoming ever more clear how insanely commoditized it is: everyone and their dog seems to be able to make a GPT4-level model, and so with tons of competition, no lock-in, and no moat, the price you can command is rapidly going to fall to barely more than the cost to provide it. As a market should operate.

ETA: I guess if we model "The AI Industry" as the interests of a bunch of wannabe monopolists like A16Z or Sam Altman, then the quote makes more sense: presiding over a boring, low margin, high-competition, commoditized SaaS is not the thrilling high-margin play they're aiming for. Nor does it justify their company's valuations.
I think the initial goal was to be able to replace a $4000/mo spreadsheet pusher employee with a $500/mo GPU timeshare.

But then it turns out the AI can't really replace the spreadsheet pusher without supervision, and they can't charge nearly as much as the employee costs either.

Oops.
 
Upvote
12 (12 / 0)

Bongle

Ars Praefectus
4,292
Subscriptor++
I think the initial goal was to be able to replace a $4000/mo spreadsheet pusher employee with a $500/mo GPU timeshare.

But then it turns out the AI can't really replace the spreadsheet pusher without supervision, and they can't charge nearly as much as the employee costs either.

Oops.
But even if they could do that, the industry experience since ChatGPT came out would look like:

March:
OpenAI: Look, you can replace your junior employee with the Oligarch-tier ChatGPT for $500/mth!

April:
Anthropic, Gemini, MS, bunch of random startups: Look, you can replace your junior employees with our models for $250/mth!

May:
DeepSeek or Zuckerberg or some nerd on HuggingFace: Here's an open-weights model and you can replace your junior employees for the cost of compute!

The "replacing employees with fancy autocorrect" thing is still not great. But the pattern thus far seems to be that everyone gets matched on capability and undercut on price months after a new capability comes out.
 
Upvote
11 (11 / 0)

graylshaped

Ars Legatus Legionis
61,472
Subscriptor++
Traditionally, knowledge distillation is a process where a large, complex model (the "teacher") transfers its learned knowledge to a smaller, more efficient model (the "student").
Focusing in on this statement: Assuming the "teacher" has been trained on reliable sources, I'm of a mind to think this is the best use of LLMs. Develop a deep, broad set of associations, home in on one specific topic and dive deep, augmenting and refining from more specialized--and probably in many cases proprietary--knowledge sources.
 
Upvote
2 (2 / 0)

Fastcat

Seniorius Lurkius
16
Subscriptor++
the price you can command is rapidly going to fall to barely more than the cost to provide it.
Except the price is already far below the cost to provide the services. Every LLM provider is losing money hand over fist, and that's even with the hyperscalers giving them huge discounts. Even ChatGPT Pro loses money.
 
Upvote
6 (6 / 0)
What baffles me is that no one in the industry seems to appreciate the inherent problem with using the output of a model to train another, as opposed to using its internal state to produce a subset of itself. The fact that they call the process distillation reveals how little they appreciate the inappropriateness of the analogy: in chemistry, the distillate is known in advance to be a subset of the input, and the extraction mechanism is fully and very precisely (chemically and thermodynamically) understood.

But the entrails of ANN-based LLMs are so opaque that, first of all, there is simply zero knowledge about the internal compounds of the product to be "distilled" or how they are organized/combined, and to top it off, the extraction mechanism is not understood at all: we just observe that the produced tokens of the LLM are 80-90% coherent and feed them as training data to a similarly opaque neuronal database.

In practical terms, this mostly works (to the extent that hallucinations do not pollute the end result, which seems highly unlikely given the fanciful results I can get from ChatGPT on any given day), but this is at best analogue engineering and at worst almost pseudo-scientific.

It would be infinitely more efficient if the knowledge were encoded in a symbolic manner, which would then lend itself to mathematical transformation/processing. Then there would be no need to train a smaller model from the big one: define the symbolic boundaries of the domain that you wish to extract and derive it directly from the bigger one, without having to play an incredibly expensive game of teach-fail-repeat-until-success.

It baffles me that they are spending so many resources to create gigantic analogue opaque black boxes through an extremely wasteful and inefficient process, then use those black boxes, of which they understand nothing scientifically speaking, to train smaller ones in an equally inefficient way in order to produce another black box, in the hope (mostly reasonable) that it will contain a viable subset of the large one. I.e., we do not obtain a precise extract of the original but rather an expensive (because of the training loop, even if shorter) approximate analogy, hallucinations most likely included.

I cannot shake the feeling that if we invested the amount of money this industry is pouring into wasteful, approximative and opaque systems into efficient symbolic encoding of natural-language knowledge and processing of such content, we would have made way more progress toward actual automated reasoning than ANN-based LLMs will ever be able to achieve, and for way fewer resources.

To be clear, I don't deny that LLMs do produce usable results, I use them on a regular basis (even if with quite mixed results) and they could prove to be useful in many domains provided we can massively reduce their energy costs but they are only an opaque engineering tool, not a scientifically usable model of computing and/or reasoning (yet?).
 
Last edited:
Upvote
2 (4 / -2)
What a strange misleading article.

Distillation isn't new. It's in no way unique to DeepSeek. Everyone does it to save cost and reduce latency. Been this way for years. The article's premise about industry trends would have been accurate in 2020-2021.

DeepSeek rocked the industry because they had a handful of innovations in their architecture, allowing them to train on cheaper hardware. Distillation is an unimportant detail in all of this.

This article needs some serious editing. It has many details that are misleading and incorrect.
 
Upvote
4 (5 / -1)

nzod

Ars Praefectus
3,330
This article appears very poorly researched, and seems to completely misunderstand what made DeepSeek's V3 and R1 special and cheap in the first place.

It was not distillation; that has been used for years and was already commonplace. What made DeepSeek's models special is that they designed a very efficient architecture, trained it in FP8 precision (half the width of the industry-standard BF16), and wrote a lot of custom software to push as much performance out of their limited hardware as they possibly could.

On that note, DeepSeek actually had a special event last week where they open sourced a core part of their training and inference infrastructure every day. I'm kind of surprised it received no coverage on Ars, as it has been a pretty amazing thing. Many of the projects they open sourced have already been used to speed up other open source inference engines. They even open sourced a custom distributed file system designed specifically for loading datasets during training.
And that's actual open source, as opposed to the open-weight models from Meta and others that are routinely called "open source".
 
Upvote
5 (5 / 0)

Zeppos

Ars Tribunus Militum
2,165
Subscriptor
Cool, so after years and billions of dollars, AI firms have invented school.
Teacher here... the moment I read that an AI performed better if you ask it to think step by step, I was sure my future was going to be bright.
We have many other tricks. I'd be happy to share them, if you sign a license.
Cheers!
(no, not serious)
 
Upvote
5 (5 / 0)

Nalyd

Ars Tribunus Militum
2,822
Subscriptor
still waiting for the day the courts catch up and say ALL the AIs are illegal because they were made with countless copyrighted material,
Yeah… you’re gonna want to stop waiting for this and get on with your life.

Instead of the student model simply imitating the teacher's final answers, it learns to emulate the intermediate processes—the way the teacher analyzes, interprets, and derives conclusions from the data.
Interesting, it’s like college vs high school.
 
Upvote
-1 (0 / -1)