CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

steelcobra · Jul 24, 2024

gruberduber said:
Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?

If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.

Look at the resumes of half these CEOs etc... and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure should be higher if you get paid millions because you're supposed to be so special and important.

How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.

"It's not what you know, it's who you know."

Networking is the name of the game. That's why rich parents fund their kids while they work as unpaid interns to F500 CEOs, so they're meeting people and getting them contacts on boards to become future C-Suiters.

Mungus the Unhyphenated · Jul 24, 2024

azazel1024 said:
Fun little anecdote about Crowdstrikes CEO George Kurtz.

In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]

Pulled from the Wiki article about him, but verified through ZDNet article and a couple of others I poked at.

So, not his first rodeo of insufficient testing and bad practices in a group he is leading...

Oh god, I was there for that. Where I was working, we stopped the update push from the local relay servers before everything was b0rked. It was still a long day.

Thank goodness the McAfee architecture relied on local relays; an external cloud service shoving the update onto all machines would have caused, well, the same as the CrowdStrike snafu.

Xavin · Jul 24, 2024

starglider said:
I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.

It's actually many more than that, those are just the ones unlucky enough to poll for an update before they realized something was wrong and pulled it.

NetMage · Jul 24, 2024

evan_s said:
I wonder if this same crash would be exploitable as a denial of service attack by crashing the machines or possibly even a Root level code execution? I won't be at all surprised if this is followed up by one or both of those things.

I think by the time you (malware) have enough access to be able to corrupt a channel definition file on a device, you could just crash Windows yourself more easily. The files are not normally accessible.

Of course, that doesn’t mean CrowdStrike channel updates wouldn’t be an awesome supply chain attack target.

whquaint · Jul 24, 2024

tjukken said:
"full Root Cause Analysis". Well, one of the problems was that their software is running as root.

Knee slapper

evan_s · Jul 24, 2024

NetMage said:
I think by the time you (malware) have enough access to be able to corrupt a channel definition file on a device, you could just crash Windows yourself more easily. The files are not normally accessible.

Of course, that doesn’t mean CrowdStrike channel updates wouldn’t be an awesome supply chain attack target.

One would hope so but given how badly they failed at this I wouldn't bet on it. For all I know they may have set up file associations so you just need to get the file downloaded anywhere and get windows to try to open it.

Lord Grey · Jul 24, 2024

Anyone want to bet that CrowdStrike's "Content Validator" is just badly-written regex?

psko · Jul 24, 2024

I assume that .sys with all zeros was zeroed somewhere in the journey to deployment, but: 1) why it was not signed or at least to have a checksum 2) why not tested on real Windows and blessed that this file with certain signature/checksum is OK 3) why CrowdStrike does not check signatures/checksum of each file after downloading the package (even if the whole package is checked, and assuming that they distribute in packages - however, now I read that they download one by one?) 4) why after writing the file they don't check the signature/checksum (to ensure no disk-errors or active malware altering the contents) 5) finally, why kernel driver loading that file does not check the signature/checksum before running/interpreting or whatever it does with it?

ad 5) could that sys be used to inject malicious code to be run in ring-0? This is kinda worrying...... if they can accept zeroed .sys file and interpret it/run it just like that, then probably could be used in a very nasty way?

I get it why they need to use ring-0 driver for it - I've just heard that Microsoft wanted to provide such APIs for AV companies and apparently EU blocked the idea as it could harm competition (YT Dave's Garage). Is EU to be blamed, too? Or maybe MS wanted to charge zillions for having the access to such APIs and smaller AV companies could not effort it? I dunno.

RamboIT · Jul 24, 2024

MedicinalGoat said:
QA/QC doesn't return tangible value on the quarterly report so it must be a waste.

It's a fixed cost that bean counters turned into a variable cost, thanks to Agile. And we all know how the variance goes, as close to zero as possible.

joncaplan · Jul 24, 2024

cyberfunk said:
I would like to know how you propose to run malware detection and analysis from an unprivileged position that's not allowed to do things like arbitrary memory and process and file inspections ?

Good question. Got an interesting answer to that one in a video that was posted by a retired Microsoft engineer, Dave Plummer.
Microsoft had developed a set of APIs to allow this type of inspection from user space processes. He explains that it was nixed by European regulators over concerns about Microsoft using their control over these APIs to harm smaller competitors in the security software space. This starts at about 5:20 in the video

CeeTee · Jul 24, 2024

Test is a fine idea, but how about a bit more robustness and exception handling in the driver itself - e.g. a null dereference etc should be anticipated and defended against.

Kjella · Jul 24, 2024

DistinctivelyCanuck said:
I worked at a large "northern European" HQ'ed telecoms company: (not saying which one...) In several of their divisions one of their big pushes is for every single test/QA person to be capable of writing test automation code, and if you're not capable of writing test automation code, you're on the layoff list. (or already gone) Because "manual" QA is too slow.

I don't think you need to narrow it down because I think it's all of them. For the last 20 years I've heard nothing but Agile, DevOps, CI/CD, TDD etc. saying any code should be ready to deploy the moment it passes the tests and that anyone following a waterfall style release pipeline with handovers and quality gates is a dinosaur that should be put out to pasture. Preferably in good combination with the belief that you'll get good code from a revolving door of consultants because they're only taking over work that's "done". Oh lordy...

Maltz · Jul 24, 2024

starglider said:
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was a result of this ultra-improbable cascade of a crash dump in the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine's being compromised, and the attackers just getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.

Pilots are actually trained about "accident chains" and to keep them in mind when making decisions to try to break such chains. Any really big mistake is just the culmination of a trail of little ones. Details matter.

joncaplan · Jul 24, 2024

GrumpyExSpaceDude said:
This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you"

"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."

https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html

I'd give a more generous reading of this disclaimer. I believe the intended meaning of not being fault tolerant here is that If a fault occurs, the process or system may go down, which is fine in many domains which don't have critical requirements for responsiveness. However, if I'm developing for or managing a system which requires fault tolerance, such as avionics, a medical device or control system for a chemical or nuclear facility, where faults must be handled without affecting the responsiveness or behavior of the software, I would appreciate a vendor clearly stating that their software component is not fault tolerant, so I could exclude it from consideration.

Fatesrider · Jul 24, 2024

So where'd they get the "Content Validator"?

I got the impression that while they're pointing and shouting at what they want the rest of us to think was the the scene of the crime, they're quietly erasing the evidence of their authorship of said content validator while no one is looking.

Frosty Grin · Jul 24, 2024

real mikeb_60 said:
Many of the aircraft investigations I've seen have a strong flavor of "how well the holes in the swiss cheese slices lined up" to them. Somehow, I don't expect that in this case, but it's possible.

In this case, it's all holes, no cheese. That's what's shocking. It would be one thing if they tested the update on real computers and it passed the tests somehow. But they weren't. It would be one thing if their staggered release failed - but they didn't have it.

bifrost · Jul 24, 2024

joncaplan said:
Good question. Got an interesting answer to that one in a video that was posted by a retired Microsoft engineer, Dave Plummer.
Microsoft had developed a set of APIs to allow this type of inspection from user space processes. He explains that it was nixed by European regulators over concerns about Microsoft using their control over these APIs to harm smaller competitors in the security software space. This starts at about 5:20 in the video

Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.

cyberfunk · Jul 24, 2024

Ganz said:
From the response:

They should have been doing this already. This is what the lawsuits should hinge on.

Yes , that was the point I was making .

nsimpson · Jul 24, 2024

While the escape of the bad config file into the wild was bad, the real problem here is that their application code didn't do any sanity checking on the data before trying to use it. It's bad enough for a modern application to crash because of badly formatted data, it's totally unnacpetable for code running in the Kernel to fail this way. It's made even more maddening when you know that they managed to do the exact same thing to a bunch of RHEL instances a few months ago and this didn't prompt them to take any action!

cbreak · Jul 24, 2024

cyberfunk said:
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.

And that they wrote their crappy code to run in kernel space instead of doing at least the parsing of files and analysis in userland.
And that they obviously didn't even do basic testing / fuzzing of said parser code.
And that they didn't fail safely instead of crashing the whole system in an unrecoverable way.

Rosyna · Jul 24, 2024

FinallyAnAccount said:
So are they saying that they relied on a content validator instead of pushing to an actual system? The fact that this happened to every Windows system it touched is damning.

From the CrowdStrike blog:

Rapid Response Content is delivered as “Template Instances,” which are instantiations of a given Template Type

These “Template Types” sound like binary representations of classes that are saved to disk and then read back into memory to instantiate structures.

Think NSArchiver on macOS.

Yes, I explicitly compared it to NSArchiver because NSArchiver was a massive target of malicious substitution attacks as it did not conform to NSSecureCoding.

Sounds like CrowdStrike might have a similar problem in that they a deserializing code without having proper checks that the serialized version is good?

blackBrant · Jul 24, 2024

Ideally, software vendors selling to the federal government would be required to have their software development process audited. There is an organization that understands what good looks like:
https://www.sei.cmu.edu/about/history-of-innovation-at-the-sei/index.cfm

steelcobra · Jul 24, 2024

cbreak said:
And that they wrote their crappy code to run in kernel space instead of doing at least the parsing of files and analysis in userland.
And that they obviously didn't even do basic testing / fuzzing of said parser code.
And that they didn't fail safely instead of crashing the whole system in an unrecoverable way.

On that last line, it's a general OS safety issue. Windows hard crashes to prevent that kind of activity from causing damage to the system and OS from a bad system call.

And Linux and Mac do it too, they just weren't forced to allow third parties access to the kernel ring.

cbreak · Jul 24, 2024

steelcobra said:
No, on this Microsoft is stuck, and beholden to their EU anti-trust agreements from the oughts forcing them to allow 3rd party software to operate like this.
https://www.tomshardware.com/softwa...oid-crowdstrike-like-calamities-in-the-future

Bullshit.

They could easily write some userland API for security software and provide these capabilities safely. Or at least they should try to, who knows if MS still has enough skilled devs to actually pull it off. Apparently they weren't able to, so they wanted to still use kernel level access themselves. And THAT is the problem.

steelcobra · Jul 24, 2024

cbreak said:
Bullshit.

They could easily write some userland API for security software and provide these capabilities safely. Or at least they should try to, who knows if MS still has enough skilled devs to actually pull it off. Apparently they weren't able to, so they wanted to still use kernel level access themselves. And THAT is the problem.

Did you miss that this was an EU-mandated requirement, not a choice they were given?

cbreak · Jul 24, 2024

steelcobra said:
On that last line, it's a general OS safety issue. Windows hard crashes to prevent that kind of activity from causing damage to the system and OS from a bad system call.

And Linux and Mac do it too, they just weren't forced to allow third parties access to the kernel ring.

Neither is windows. Either microsoft or crowdstrike could have easily provided a userland API (via a minimal kernel module), and used that in userland to do all the heavy lifting. No one forced crowdstrike to parse untrusted and potentially (and obviously actually) malformed data in kernel space.

cbreak · Jul 24, 2024

steelcobra said:
Did you miss that this was an EU-mandated requirement, not a choice they were given?

It was absolutely a choice. And they chose poorly. Try reading next time.

NetMage · Jul 24, 2024

Rosyna said:
These “Template Types” sound like binary representations of classes that are saved to disk and then read back into memory to instantiate structures.

No, the template types are pre-defined functions that do certain types of scanning or logging and they are sent out and updated with full CrowdStrike code updates, which obeyed e.g. N-1 update delays. The channel definition file contains parameters to be passed to execute the procedures that were included the previous code updates.

Which means the N-1 was meaningless for template code, since it’s execution was controlled by the channel definition sent later. It will most likely not be executed for a while until even delayed users have the untested code ready to be executed (!).

cbreak · Jul 24, 2024

bifrost said:
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.

That would mean the malware already has root-equivalent permissions, so it could also inject code into the kernel and kill it that way (for example by replacing the channel files used by crowdstrike with even more malicious ones).
And it'd be a gigantic red flag for the kernel side component of the detector that something's up. That's not what a stealthy malware would want to do.

Great_Scott · Jul 24, 2024

markgo said:
So they admit they deployed worldwide all at once. That has been against best practices for large scale deployments for more than two decades. Sue them into oblivion.

I find the Crowdstrike admission odd.

How can there be testing bugs when they aren't doing any testing?

charliebird · Jul 24, 2024

I've worked with Crowdstrike several times and I've always found them to be arrogant and condescending. I hope they learn from this incident that they need some modesty and humility in their approach.

Rosyna · Jul 24, 2024

Maarten said:
It seems that they still have an unchecked NULL pointer dereference in the main code. Even with the faulty "content configuration update" that code should not crash.

Why are you under the impression there was a NULL dereference?

Especially since a lot of the versions of the borked file had difference garbage in it.

granolagumbo · Jul 24, 2024

Please pardon my complete ignorance.... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission critical machines before allowing the update to be applied to the entire organization? We used to do that with any of the Windows updates and had them cause problems on a few occasions.

cbreak · Jul 24, 2024

Rosyna said:
Why are you under the impression there was a NULL dereference?

Especially since a lot of the versions of the borked file had difference garbage in it.

View: https://youtu.be/pCxvyIx922A?t=169

Maarten · Jul 24, 2024

Rosyna said:
Why are you under the impression there was a NULL dereference?

Especially since a lot of the versions of the borked file had difference garbage in it.

At least one of the files had all zeros, and there is a trace out there that shows more details.

View: https://www.youtube.com/watch?v=pCxvyIx922A&t=180s

Rosyna · Jul 24, 2024

bifrost said:
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.

That’s where strong process isolation comes in. It’d have to be added to Windows first for Microsoft to make a non-kernel EDR API.

Rosyna · Jul 24, 2024

cbreak said:
View: https://youtu.be/pCxvyIx922A?t=169

Yeah, that’s wrong. Only some people got a 291*.*32.sys file full of zeroes. Others just got garbage.

And the reverse engineering of the code that crashed shows that it explicitly checks for NULL before dereferencing.

NetMage · Jul 24, 2024

granolagumbo said:
Please pardon my complete ignorance.... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission critical machines before allowing the update to be applied to the entire organization?

No.

cbreak · Jul 24, 2024

Rosyna said:
Yeah, that’s wrong. Only some people got a 291*.*32.sys file full of zeroes. Others just got garbage.

And the reverse engineering of the code that crashed shows that it explicitly checks for NULL before dereferencing.

Well, obviously it didn't check correctly, because the debugger clearly shows that a pointer in the null page is attempted to be dereferenced. Either that, or the debugger is lying.

cbreak · Jul 24, 2024

Rosyna said:
That’s where strong process isolation comes in. It’d have to be added to Windows first for Microsoft to make a non-kernel EDR API.

Can a normal user really kill system level / other user's processes on windows? That OS is even more garbage than I thought...

CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

Ars Tribunus Angusticlavius

Ars Tribunus Militum

Ars Legatus Legionis

Ars Tribunus Angusticlavius

Ars Praefectus

Ars Tribunus Angusticlavius

Smack-Fu Master, in training

Smack-Fu Master, in training

Ars Praetorian

Seniorius Lurkius

Smack-Fu Master, in training

Ars Tribunus Militum

Ars Scholae Palatinae

Seniorius Lurkius

Ars Legatus Legionis

Ars Legatus Legionis

Wise, Aged Ars Veteran

Ars Scholae Palatinae

Seniorius Lurkius

Ars Praefectus

Ars Tribunus Angusticlavius

Seniorius Lurkius

Ars Tribunus Angusticlavius

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Praefectus

Ars Praefectus

Ars Tribunus Angusticlavius

Ars Praefectus

Ars Tribunus Militum

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Seniorius Lurkius

Ars Praefectus

Ars Tribunus Militum

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Tribunus Angusticlavius

Ars Praefectus

Ars Praefectus

nproxy.org