CrowdStrike blames testing bugs for security update that took down 8.5M Windows PCs

steelcobra

Ars Tribunus Angusticlavius
9,411
Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?

If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.

Look at the resumes of half these CEOs etc... and it's a trail of failure. But it's never them that suffers for it. In any sane world the consequence of failure should be higher if you get paid millions because you're supposed to be so special and important.

How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.
"It's not what you know, it's who you know."

Networking is the name of the game. That's why rich parents fund their kids while they work as unpaid interns to F500 CEOs: so they're meeting people and making the board contacts they need to become future C-suiters.
 
Upvote
24 (24 / 0)
Fun little anecdote about CrowdStrike's CEO George Kurtz.

In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]

Pulled from the Wiki article about him, but verified through the ZDNet article and a couple of others I poked at.

So, not his first rodeo with insufficient testing and bad practices in a group he's leading...
Oh god, I was there for that. Where I was working, we stopped the update push from the local relay servers before everything was b0rked. It was still a long day.

Thank goodness the McAfee architecture relied on local relays; an external cloud service shoving the update onto all machines would have caused, well, the same as the CrowdStrike snafu.
 
Upvote
27 (27 / 0)

Xavin

Ars Legatus Legionis
30,578
Subscriptor++
I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
It's actually many more than that; those are just the ones unlucky enough to poll for the update before CrowdStrike realized something was wrong and pulled it.
 
Upvote
9 (9 / 0)

NetMage

Ars Tribunus Angusticlavius
8,028
Subscriptor
I wonder if this same crash would be exploitable as a denial-of-service attack by crashing the machines, or possibly even root-level code execution? I won't be at all surprised if this is followed up by one or both of those things.
I think by the time you (malware) have enough access to be able to corrupt a channel definition file on a device, you could just crash Windows yourself more easily. The files are not normally accessible.

Of course, that doesn’t mean CrowdStrike channel updates wouldn’t be an awesome supply chain attack target.
 
Upvote
7 (7 / 0)

evan_s

Ars Tribunus Angusticlavius
6,410
Subscriptor
I think by the time you (malware) have enough access to be able to corrupt a channel definition file on a device, you could just crash Windows yourself more easily. The files are not normally accessible.

Of course, that doesn’t mean CrowdStrike channel updates wouldn’t be an awesome supply chain attack target.

One would hope so, but given how badly they failed at this, I wouldn't bet on it. For all I know, they may have set up file associations so you just need to get the file downloaded anywhere and get Windows to try to open it.
 
Upvote
-8 (0 / -8)

psko

Smack-Fu Master, in training
66
I assume the .sys file full of zeros was zeroed somewhere on its journey to deployment, but:

1) Why wasn't it signed, or at least checksummed?
2) Why wasn't it tested on real Windows, with the file carrying that exact signature/checksum blessed as OK?
3) Why doesn't CrowdStrike check the signature/checksum of each file after downloading the package (even if the whole package is checked, and assuming they distribute in packages; though now I read the files are downloaded one by one?)
4) Why, after writing the file, don't they verify its signature/checksum (to ensure no disk errors or active malware altered the contents)?
5) Finally, why doesn't the kernel driver loading that file check its signature/checksum before running/interpreting it, or whatever it does with it?

Re: 5) Could that .sys be used to inject malicious code to run in ring 0? This is kinda worrying... if they accept a zeroed .sys file and interpret/run it just like that, then it could probably be abused in a very nasty way?
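(Points 3 and 4 are cheap to do in userland. Here's a minimal sketch of the idea, mine rather than anything CrowdStrike actually does; the file name echoes the reported channel file, and the digest is a placeholder that would really come from a signed manifest. It uses OpenSSL's simple SHA-256 interface; link with -lcrypto.)

```c
#include <openssl/sha.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hash a file on disk and compare against a known-good digest,
 * refusing to load it on any mismatch. */
static int file_digest_ok(const char *path,
                          const unsigned char expected[SHA256_DIGEST_LENGTH])
{
    FILE *f = fopen(path, "rb");
    if (!f)
        return 0;

    SHA256_CTX ctx;
    SHA256_Init(&ctx);

    unsigned char buf[4096];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, f)) > 0)
        SHA256_Update(&ctx, buf, n);
    fclose(f);

    unsigned char actual[SHA256_DIGEST_LENGTH];
    SHA256_Final(actual, &ctx);
    return memcmp(actual, expected, SHA256_DIGEST_LENGTH) == 0;
}

int main(void)
{
    /* Placeholder: the real digest would come from a signed manifest. */
    static const unsigned char expected[SHA256_DIGEST_LENGTH] = {0};

    if (!file_digest_ok("C-00000291-00000000-00000032.sys", expected)) {
        fprintf(stderr, "channel file failed integrity check; not loading\n");
        return EXIT_FAILURE;
    }
    puts("digest OK");
    return 0;
}
```

A bare hash like this would catch a file zeroed or corrupted in transit or on disk (assuming the digest was computed before the corruption); a signature over the manifest would be needed to also catch deliberate tampering.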

I get why they need a ring-0 driver for this. I've heard that Microsoft wanted to provide such APIs for AV companies, and apparently the EU blocked the idea because it could harm competition (YT: Dave's Garage). Is the EU to blame, too? Or maybe MS wanted to charge zillions for access to such APIs and smaller AV companies couldn't afford it? I dunno.
 
Upvote
4 (8 / -4)

joncaplan

Seniorius Lurkius
30
Subscriptor
I would like to know how you propose to run malware detection and analysis from an unprivileged position that's not allowed to do things like arbitrary memory, process, and file inspections?
Good question. Got an interesting answer to that one in a video posted by a retired Microsoft engineer, Dave Plummer.
Microsoft had developed a set of APIs to allow this type of inspection from user-space processes. He explains that it was nixed by European regulators over concerns about Microsoft using their control over these APIs to harm smaller competitors in the security software space. This starts at about 5:20 in the video.
 
Upvote
11 (12 / -1)

Kjella

Ars Tribunus Militum
1,992
I worked at a large "northern European"-HQ'ed telecoms company (not saying which one...). In several of their divisions, one of the big pushes is for every single test/QA person to be capable of writing test automation code; if you're not capable of writing test automation code, you're on the layoff list (or already gone), because "manual" QA is too slow.
I don't think you need to narrow it down, because I think it's all of them. For the last 20 years I've heard nothing but Agile, DevOps, CI/CD, TDD, etc. saying any code should be ready to deploy the moment it passes the tests, and that anyone following a waterfall-style release pipeline with handovers and quality gates is a dinosaur that should be put out to pasture. Preferably in combination with the belief that you'll get good code from a revolving door of consultants, because they're only taking over work that's "done". Oh lordy...
 
Upvote
24 (24 / 0)

Maltz

Ars Scholae Palatinae
1,015
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was the result of an ultra-improbable cascade: a crash dump from the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine being compromised, the attackers getting ridiculously lucky as they combed through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.

This one is just . . . they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
Pilots are actually trained about "accident chains" and to keep them in mind when making decisions to try to break such chains. Any really big mistake is just the culmination of a trail of little ones. Details matter.
 
Upvote
12 (12 / 0)

joncaplan

Seniorius Lurkius
30
Subscriptor
This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you"

"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."

https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html
I'd give a more generous reading of this disclaimer. I believe the intended meaning of "not fault-tolerant" here is that if a fault occurs, the process or system may go down, which is fine in many domains that don't have critical responsiveness requirements. However, if I'm developing for or managing a system that does require fault tolerance, such as avionics, a medical device, or the control system for a chemical or nuclear facility, where faults must be handled without affecting the responsiveness or behavior of the software, I would appreciate a vendor clearly stating that their software component is not fault-tolerant, so I could exclude it from consideration.
 
Upvote
22 (22 / 0)

Frosty Grin

Ars Legatus Legionis
18,457
Many of the aircraft investigations I've seen have a strong flavor of "how well the holes in the Swiss cheese slices lined up" to them. Somehow, I don't expect that in this case, but it's possible.

In this case, it's all holes, no cheese. That's what's shocking. It would be one thing if they had tested the update on real computers and it somehow passed the tests. But they didn't test it. It would be one thing if their staggered release had failed, but they didn't have one.
 
Upvote
12 (12 / 0)

bifrost

Wise, Aged Ars Veteran
190
Good question. Got an interesting answer to that one in a video posted by a retired Microsoft engineer, Dave Plummer.
Microsoft had developed a set of APIs to allow this type of inspection from user-space processes. He explains that it was nixed by European regulators over concerns about Microsoft using their control over these APIs to harm smaller competitors in the security software space. This starts at about 5:20 in the video.
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.
 
Upvote
0 (3 / -3)
While the escape of the bad config file into the wild was bad, the real problem here is that their application code didn't do any sanity checking on the data before trying to use it. It's bad enough for a modern application to crash because of badly formatted data; it's totally unacceptable for code running in the kernel to fail this way. It's made even more maddening when you know that they managed to do the exact same thing to a bunch of RHEL instances a few months ago, and this didn't prompt them to take any action!
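(The kind of sanity check being described isn't exotic. A minimal sketch with an invented channel-file layout, since CrowdStrike's real format isn't public; the point is that an all-zeros or truncated file is rejected with an error code before anything tries to interpret it:)

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define CHANNEL_MAGIC 0x43484E4Cu   /* "CHNL", invented for this sketch */

struct channel_header {
    uint32_t magic;
    uint32_t version;
    uint32_t body_len;              /* bytes following the header */
};

enum parse_result { PARSE_OK, PARSE_ERR_SHORT, PARSE_ERR_MAGIC, PARSE_ERR_LEN };

enum parse_result validate_channel(const uint8_t *buf, size_t len)
{
    struct channel_header hdr;

    if (len < sizeof hdr)
        return PARSE_ERR_SHORT;     /* truncated file */
    memcpy(&hdr, buf, sizeof hdr);  /* avoid unaligned access */

    if (hdr.magic != CHANNEL_MAGIC)
        return PARSE_ERR_MAGIC;     /* an all-zeros file fails here */
    if (hdr.body_len != len - sizeof hdr)
        return PARSE_ERR_LEN;       /* length field must match reality */

    return PARSE_OK;                /* only now does anything parse the body */
}
```

A validation failure then becomes "skip this channel file and log it," not a bugcheck.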
 
Upvote
20 (20 / 0)

cbreak

Ars Praefectus
5,710
Subscriptor++
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.

What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.
And that they wrote their crappy code to run in kernel space instead of doing at least the parsing of files and analysis in userland.
And that they obviously didn't even do basic testing / fuzzing of said parser code.
And that they didn't fail safely instead of crashing the whole system in an unrecoverable way.
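(On the fuzzing point: a libFuzzer harness for a file parser is a few lines. A sketch, reusing the hypothetical validate_channel() from the earlier example; an all-zeros input like the one that shipped is among the first things a fuzzer will try.)

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical parser from the earlier sketch, compiled in separately. */
enum parse_result { PARSE_OK, PARSE_ERR_SHORT, PARSE_ERR_MAGIC, PARSE_ERR_LEN };
enum parse_result validate_channel(const uint8_t *buf, size_t len);

/* Build: clang -g -fsanitize=fuzzer,address fuzz_channel.c parser.c */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size)
{
    /* The parser must never crash, whatever the bytes are; ASan turns
     * any out-of-bounds read inside it into an immediate report. */
    (void)validate_channel(data, size);
    return 0;
}
```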
 
Upvote
12 (13 / -1)

Rosyna

Ars Tribunus Angusticlavius
6,882
So are they saying that they relied on a content validator instead of pushing to an actual system? The fact that this happened to every Windows system it touched is damning.

From the CrowdStrike blog:
Rapid Response Content is delivered as “Template Instances,” which are instantiations of a given Template Type

These “Template Types” sound like binary representations of classes that are saved to disk and then read back into memory to instantiate structures.

Think NSArchiver on macOS.

Yes, I explicitly compared it to NSArchiver because NSArchiver was a massive target of malicious substitution attacks as it did not conform to NSSecureCoding.

Sounds like CrowdStrike might have a similar problem, in that they are deserializing data without proper checks that the serialized version is good?
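(For those who missed the NSArchiver saga: Apple's fix, NSSecureCoding, makes the decoder instantiate only an explicit allowlist of classes rather than whatever type the bytes claim to be. A rough sketch of the same idea in C, with all names invented:)

```c
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

typedef int (*template_ctor)(const uint8_t *body, size_t len);

/* Two known template types; stubs for illustration. */
static int make_regex_filter(const uint8_t *b, size_t n) { (void)b; (void)n; return 0; }
static int make_path_watch(const uint8_t *b, size_t n)   { (void)b; (void)n; return 0; }

/* Allowlist: tag -> constructor. An unknown tag is an error, never a jump. */
static const template_ctor allowed[] = {
    [1] = make_regex_filter,
    [2] = make_path_watch,
};

int instantiate(uint16_t tag, const uint8_t *body, size_t len)
{
    if (tag >= sizeof allowed / sizeof allowed[0] || allowed[tag] == NULL) {
        fprintf(stderr, "rejecting unknown template tag %u\n", (unsigned)tag);
        return -1;
    }
    return allowed[tag](body, len);
}
```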
 
Upvote
-1 (1 / -2)

steelcobra

Ars Tribunus Angusticlavius
9,411
And that they wrote their crappy code to run in kernel space instead of doing at least the parsing of files and analysis in userland.
And that they obviously didn't even do basic testing / fuzzing of said parser code.
And that they didn't fail safely instead of crashing the whole system in an unrecoverable way.
On that last line, it's a general OS safety issue: Windows hard crashes rather than let a bad call like that go on to damage the system and OS.

And Linux and Mac do it too; they just weren't forced to allow third parties access to the kernel ring.
 
Upvote
4 (5 / -1)

cbreak

Ars Praefectus
5,710
Subscriptor++
No, on this Microsoft is stuck, and beholden to their EU anti-trust agreements from the oughts forcing them to allow 3rd party software to operate like this.
https://www.tomshardware.com/softwa...oid-crowdstrike-like-calamities-in-the-future
Bullshit.

They could easily write some userland API for security software and provide these capabilities safely. Or at least they should try to; who knows if MS still has enough skilled devs to actually pull it off. Apparently they weren't able to, so they wanted to keep using kernel-level access themselves. And THAT is the problem.
 
Upvote
-4 (7 / -11)

steelcobra

Ars Tribunus Angusticlavius
9,411
Bullshit.

They could easily write some userland API for security software and provide these capabilities safely. Or at least they should try to; who knows if MS still has enough skilled devs to actually pull it off. Apparently they weren't able to, so they wanted to keep using kernel-level access themselves. And THAT is the problem.
Did you miss that this was an EU-mandated requirement, not a choice they were given?
 
Upvote
-1 (7 / -8)

cbreak

Ars Praefectus
5,710
Subscriptor++
On that last line, it's a general OS safety issue: Windows hard crashes rather than let a bad call like that go on to damage the system and OS.

And Linux and Mac do it too; they just weren't forced to allow third parties access to the kernel ring.
Neither is Windows. Either Microsoft or CrowdStrike could have easily provided a userland API (via a minimal kernel module) and used that in userland to do all the heavy lifting. No one forced CrowdStrike to parse untrusted and potentially (and, obviously, actually) malformed data in kernel space.
 
Upvote
13 (15 / -2)

NetMage

Ars Tribunus Angusticlavius
8,028
Subscriptor
These “Template Types” sound like binary representations of classes that are saved to disk and then read back into memory to instantiate structures.
No, the template types are pre-defined functions that do certain types of scanning or logging, and they are sent out and updated with full CrowdStrike code updates, which obeyed e.g. N-1 update delays. The channel definition file contains parameters to be passed to execute the procedures that were included in the previous code updates.

Which means the N-1 delay was meaningless for template code, since its execution was controlled by the channel definition sent later. The new code will most likely sit unexecuted for a while, until even delayed users have the untested code ready to be triggered (!).
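(A toy illustration of that point, names invented: the function below could pass N-1 staging for weeks, while the branch it actually takes is decided by a channel file that arrives with no delay at all.)

```c
#include <stdint.h>
#include <stdio.h>

/* "Template type": shipped with the staged code update. */
static void scan_template(uint32_t mode, uint32_t arg_count)
{
    if (mode == 0) {
        puts("default path, exercised everywhere");
    } else {
        /* Only reachable once a channel file supplies mode != 0, so the
         * staged binary may never have executed this line in the field. */
        printf("rare path with %u args\n", (unsigned)arg_count);
    }
}

int main(void)
{
    /* Stand-in for the channel (data) update, which bypasses the N-1
     * delay that applied to scan_template() itself. */
    scan_template(7, 21);
    return 0;
}
```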
 
Upvote
-1 (1 / -2)

cbreak

Ars Praefectus
5,710
Subscriptor++
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.
That would mean the malware already has root-equivalent permissions, so it could also inject code into the kernel and kill it that way (for example, by replacing the channel files used by CrowdStrike with even more malicious ones).
And it'd be a gigantic red flag for the kernel side component of the detector that something's up. That's not what a stealthy malware would want to do.
 
Upvote
2 (3 / -1)

Great_Scott

Ars Tribunus Militum
2,189
Subscriptor
So they admit they deployed worldwide all at once. That has been against best practices for large-scale deployments for more than two decades. Sue them into oblivion.
I find the CrowdStrike admission odd.

How can there be testing bugs when they aren't doing any testing?
 
Upvote
11 (12 / -1)

Rosyna

Ars Tribunus Angusticlavius
6,882
It seems that they still have an unchecked NULL pointer dereference in the main code. Even with the faulty "content configuration update" that code should not crash.
Why are you under the impression there was a NULL dereference?

Especially since a lot of the versions of the borked file had different garbage in them.
 
Upvote
1 (2 / -1)
Please pardon my complete ignorance... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission-critical machines before allowing it to be applied to the entire organization? We used to do that with Windows updates, and they did cause problems on a few occasions.
 
Upvote
-3 (5 / -8)

Maarten

Ars Tribunus Militum
1,831
Subscriptor++
Upvote
6 (7 / -1)

Rosyna

Ars Tribunus Angusticlavius
6,882
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.
That’s where strong process isolation comes in. It’d have to be added to Windows first for Microsoft to make a non-kernel EDR API.
 
Upvote
4 (4 / 0)

NetMage

Ars Tribunus Angusticlavius
8,028
Subscriptor
Please pardon my complete ignorance... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission-critical machines before allowing it to be applied to the entire organization?
No.
 
Upvote
10 (10 / 0)

cbreak

Ars Praefectus
5,710
Subscriptor++
Yeah, that’s wrong. Only some people got a 291*.*32.sys file full of zeroes. Others just got garbage.

And the reverse engineering of the code that crashed shows that it explicitly checks for NULL before dereferencing.
Well, obviously it didn't check correctly, because the debugger clearly shows an attempted dereference of a pointer in the null page. Either that, or the debugger is lying.
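(Both statements can be literally true at once: a small non-zero garbage value used as a pointer passes an explicit NULL check and still faults in the unmapped first page. A minimal sketch, not taken from the actual Falcon disassembly; 0x9c stands in for the kind of near-null faulting address reported in the crash dumps.)

```c
#include <stdint.h>
#include <stdio.h>

struct entry { uint64_t key; uint64_t value; };

static uint64_t lookup(const struct entry *e)
{
    if (e != NULL)           /* passes: e is 0x9c, which is not 0 */
        return e->value;     /* faults: 0x9c + 8 is still in the null page */
    return 0;
}

int main(void)
{
    /* Garbage from a corrupt channel file, misinterpreted as a pointer. */
    const struct entry *bogus = (const struct entry *)(uintptr_t)0x9c;
    return (int)lookup(bogus);   /* access violation, not a clean failure */
}
```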
 
Upvote
4 (5 / -1)