"It's not what you know, it's who you know."Someone help me out here. What is the exact dollar-value salary range where you start failing upwards?
If my one-man business was this negligent, I'd never work again. But if you get paid a fortune to run a global corp, when you fail spectacularly through sheer incompetence you just get moved to another c-suite job at a different company, and do it again. Repeat until you retire.
Look at the resumes of half these CEOs etc. and it's a trail of failure. But they're never the ones who suffer for it. In any sane world the consequences of failure would be higher if you get paid millions because you're supposed to be so special and important.
How much do you have to be paid before everyone suddenly decides that you don't face consequences anymore? Just curious.
Oh god, I was there for that. Where I was working, we stopped the update push from the local relay servers before everything was b0rked. It was still a long day.
Fun little anecdote about CrowdStrike's CEO, George Kurtz:
In October 2009, McAfee promoted him to chief technology officer and executive vice president.[13] Six months later, McAfee accidentally disrupted its customers' operations around the world when it pushed out a software update that deleted critical Windows XP system files and caused affected systems to bluescreen and enter a boot loop. "I'm not sure any virus writer has ever developed a piece of malware that shut down as many machines as quickly as McAfee did today," Ed Bott wrote at ZDNet.[6]
Pulled from the Wiki article about him, but verified through the ZDNet article and a couple of others I poked at.
So, not his first rodeo with insufficient testing and bad practices in a group he's leading...
It's actually many more than that; those are just the ones unlucky enough to poll for an update before they realized something was wrong and pulled it.
I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
I think by the time you (malware) have enough access to be able to corrupt a channel definition file on a device, you could just crash Windows yourself more easily. The files are not normally accessible.
I wonder if this same crash would be exploitable as a denial of service attack by crashing the machines or possibly even a root-level code execution? I won't be at all surprised if this is followed up by one or both of those things.
Knee slapper"full Root Cause Analysis". Well, one of the problems was that their software is running as root.
I think by the time you (malware) have enough access to be able to corrupt a channel definition file on a device, you could just crash Windows yourself more easily. The files are not normally accessible.
Of course, that doesn’t mean CrowdStrike channel updates wouldn’t be an awesome supply chain attack target.
It's a fixed cost that bean counters turned into a variable cost, thanks to Agile. And we all know how the variance goes, as close to zero as possible.
QA/QC doesn't return tangible value on the quarterly report so it must be a waste.
Good question. Got an interesting answer to that one in a video that was posted by a retired Microsoft engineer, Dave Plummer.
I would like to know how you propose to run malware detection and analysis from an unprivileged position that's not allowed to do things like arbitrary memory and process and file inspections?
I don't think you need to narrow it down because I think it's all of them. For the last 20 years I've heard nothing but Agile, DevOps, CI/CD, TDD etc. saying any code should be ready to deploy the moment it passes the tests, and that anyone following a waterfall-style release pipeline with handovers and quality gates is a dinosaur that should be put out to pasture. Preferably in combination with the belief that you'll get good code from a revolving door of consultants because they're only taking over work that's "done". Oh lordy...
I worked at a large "northern European" HQ'ed telecoms company (not saying which one...). In several of their divisions, one of the big pushes was for every single test/QA person to be capable of writing test automation code; if you weren't, you were on the layoff list (or already gone), because "manual" QA is too slow.
Pilots are actually trained about "accident chains" and to keep them in mind when making decisions, to try to break such chains. Any really big mistake is just the culmination of a trail of little ones. Details matter.
Typically in a case like this, the root cause analysis is kind of crazy to read. Obviously, the company is at fault, but the accident chain is always a pretty wild series of coincidences that always leaves me with some degree of feeling "there but for the grace of God go I." The most recent Microsoft hack was the result of an ultra-improbable cascade: a crash dump from the super-secure system ending up on a less-secure dev's machine for debugging, an obscure bug that didn't redact data in the crash dump, the dev's machine being compromised, the attackers getting ridiculously lucky as they were combing through it, etc. Airline disasters have the same tone to them; it's always a chain of improbable stuff, even if there were serious mistakes along the way.
This one is just... they didn't stage the rollout and didn't, like, test it? WTF? I think I'm more careful deploying updates to "critical" systems on my home network than these guys are with updates to 8.5 million machines.
I'd give a more generous reading of this disclaimer. I believe the intended meaning of "not fault tolerant" here is that if a fault occurs, the process or system may go down, which is fine in many domains that don't have critical responsiveness requirements. However, if I'm developing for or managing a system that does require fault tolerance, such as avionics, a medical device, or a control system for a chemical or nuclear facility, where faults must be handled without affecting the responsiveness or behavior of the software, I would appreciate a vendor clearly stating that their software component is not fault tolerant, so I could exclude it from consideration.
This very interesting post goes into the terms of service and basically concludes that "we told you not to use this software on critical systems and if you did, it's on you":
"THE OFFERINGS AND CROWDSTRIKE TOOLS ARE NOT FAULT-TOLERANT AND ARE NOT DESIGNED OR INTENDED FOR USE IN ANY HAZARDOUS ENVIRONMENT REQUIRING FAIL-SAFE PERFORMANCE OR OPERATION."
https://www.hackerfactor.com/blog/index.php?/archives/1038-When-the-Crowd-Strikes-Back.html
Many of the aircraft investigations I've seen have a strong flavor of "how well the holes in the swiss cheese slices lined up" to them. Somehow, I don't expect that in this case, but it's possible.
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.
Good question. Got an interesting answer to that one in a video that was posted by a retired Microsoft engineer, Dave Plummer.
Microsoft had developed a set of APIs to allow this type of inspection from user-space processes. He explains that it was nixed by European regulators over concerns about Microsoft using their control over these APIs to harm smaller competitors in the security software space. This starts at about 5:20 in the video.
Yes, that was the point I was making.
From the response:
They should have been doing this already. This is what the lawsuits should hinge on.
And that they wrote their crappy code to run in kernel space instead of doing at least the parsing of files and analysis in userland.
As has been previously addressed, maybe they shouldn't be running certain parts in Ring 0, but AV software necessarily has to run with very high privileges to do the job. It's an unfortunately necessary evil.
What's indefensible is that their CI/CD mechanisms were so shoddy that they didn't catch this.
So are they saying that they relied on a content validator instead of pushing to an actual system? The fact that this happened to every Windows system it touched is damning.
Rapid Response Content is delivered as “Template Instances,” which are instantiations of a given Template Type
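To make the "content validator instead of an actual system" point concrete, here's a sketch of one kind of automated gate a release pipeline could run before any content ships: a fuzz harness hammering the content parser with malformed input. Everything here is hypothetical; parse_content_blob is a placeholder name, not a real CrowdStrike function, and the harness just uses the standard libFuzzer entry point.

```c
/* Hypothetical sketch only. Build: clang -g -fsanitize=fuzzer,address fuzz_content.c && ./a.out corpus/ */
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the real content/channel-file parser under test. */
static int parse_content_blob(const uint8_t *data, size_t size) {
    if (size < 8) return -1;   /* too short to even hold a header */
    /* ... real parsing would go here ... */
    return 0;
}

/* libFuzzer feeds this millions of mutated inputs; any crash, hang, or
 * sanitizer report fails the run -- and should fail the release pipeline. */
int LLVMFuzzerTestOneInput(const uint8_t *data, size_t size) {
    (void)parse_content_blob(data, size);
    return 0;
}
```

The point is the shape of the gate, not the code: if garbage input can crash the parser, the pipeline finds out before 8.5 million machines do.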
On that last line, it's a general OS safety issue. Windows hard crashes to prevent that kind of activity from causing damage to the system and OS from a bad system call.
And that they wrote their crappy code to run in kernel space instead of doing at least the parsing of files and analysis in userland.
And that they obviously didn't even do basic testing / fuzzing of said parser code.
And that they didn't fail safely instead of crashing the whole system in an unrecoverable way.
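For what "fail safely" could look like in practice, here's a minimal sketch. The record layout, limits, and names are invented for illustration, not CrowdStrike's actual channel-file format; the idea is just that every length and count gets checked before use, and the caller falls back to the last known-good content instead of bringing anything down.

```c
#include <stdint.h>
#include <stdio.h>
#include <string.h>

#define REC_MAGIC   0xC0DEu
#define MAX_PARAMS  16

/* Invented header layout for an untrusted content record. */
typedef struct {
    uint16_t magic;
    uint16_t param_count;     /* number of uint32_t parameters that follow */
} record_header;

/* Returns 0 on success, -1 on any malformed input. Never reads past `len`. */
static int parse_record(const uint8_t *buf, size_t len,
                        uint32_t params[MAX_PARAMS], uint16_t *count)
{
    record_header h;
    if (len < sizeof h) return -1;                 /* truncated header */
    memcpy(&h, buf, sizeof h);
    if (h.magic != REC_MAGIC) return -1;           /* garbage / zeroes */
    if (h.param_count > MAX_PARAMS) return -1;     /* absurd count */
    if (len - sizeof h < (size_t)h.param_count * sizeof(uint32_t))
        return -1;                                 /* body shorter than header claims */
    memcpy(params, buf + sizeof h, h.param_count * sizeof(uint32_t));
    *count = h.param_count;
    return 0;
}

int main(void) {
    uint8_t all_zero[64] = {0};                    /* like the reported zero-filled file */
    uint32_t params[MAX_PARAMS];
    uint16_t count;
    if (parse_record(all_zero, sizeof all_zero, params, &count) != 0)
        puts("content rejected -- keep running with the previous known-good rules");
    return 0;
}
```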
Bullshit.
No, on this Microsoft is stuck, and beholden to their EU anti-trust agreements from the oughts forcing them to allow 3rd party software to operate like this.
https://www.tomshardware.com/softwa...oid-crowdstrike-like-calamities-in-the-future
Did you miss that this was an EU-mandated requirement, not a choice they were given?
Bullshit.
They could easily write some userland API for security software and provide these capabilities safely. Or at least they should try to; who knows if MS still has enough skilled devs to actually pull it off. Apparently they weren't able to, so they wanted to keep using kernel-level access themselves. And THAT is the problem.
Neither is Windows. Either Microsoft or CrowdStrike could have easily provided a userland API (via a minimal kernel module) and used that in userland to do all the heavy lifting. No one forced CrowdStrike to parse untrusted and potentially (and obviously actually) malformed data in kernel space.
On that last line, it's a general OS safety issue. Windows hard crashes to prevent that kind of activity from causing damage to the system and OS from a bad system call.
And Linux and Mac do it too; they just weren't forced to allow third parties access to the kernel ring.
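A rough sketch of the "minimal kernel module plus userland heavy lifting" split described above. It's deliberately POSIX-flavored and everything here (the device path, the record format) is hypothetical; the point is only the architecture: the kernel side hands over raw events, and all parsing of untrusted content happens in a user-space daemon, so malformed data can at worst kill that process rather than the machine.

```c
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <unistd.h>

/* All parsing / rule matching of untrusted content lives here, in user space. */
static int handle_event(const uint8_t *buf, size_t len) {
    if (len < 4) return -1;          /* reject anything too short to be a record */
    /* ... defensive parsing and detection logic ... */
    return 0;
}

int main(void) {
    /* Hypothetical device node exposed by a tiny, dumb kernel component that
     * only collects events and knows nothing about their contents. */
    int fd = open("/dev/edr_events", O_RDONLY);
    if (fd < 0) { perror("open /dev/edr_events"); return 1; }

    uint8_t buf[4096];
    for (;;) {
        ssize_t n = read(fd, buf, sizeof buf);   /* one raw event per read */
        if (n <= 0) break;
        if (handle_event(buf, (size_t)n) != 0)
            fprintf(stderr, "dropped malformed event (%zd bytes)\n", n);
    }
    close(fd);
    return 0;
}
```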
It was absolutely a choice. And they chose poorly. Try reading next time.
Did you miss that this was an EU-mandated requirement, not a choice they were given?
No, the template types are pre-defined functions that do certain types of scanning or logging, and they are sent out and updated with full CrowdStrike code updates, which obeyed e.g. N-1 update delays. The channel definition file contains parameters to be passed to execute the procedures that were included in the previous code updates.
These "Template Types" sound like binary representations of classes that are saved to disk and then read back into memory to instantiate structures.
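A hypothetical illustration of that split, with invented names, IDs, and parameter counts (not CrowdStrike's actual format): the template functions ship with the code update, and a channel-file entry only selects one and supplies parameters, which the loader validates before calling anything.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

typedef int (*template_fn)(const uint32_t *params, uint16_t count);

/* These ship with the full sensor/code update (and so go through N-1 delays). */
static int tmpl_match_process(const uint32_t *p, uint16_t n) { (void)p; (void)n; return 0; }
static int tmpl_match_file(const uint32_t *p, uint16_t n)    { (void)p; (void)n; return 0; }

static const struct { uint16_t id; uint16_t expected_params; template_fn fn; } templates[] = {
    { 1, 8, tmpl_match_process },
    { 2, 3, tmpl_match_file },
};

/* A channel-file entry ("Template Instance") only carries an id plus parameters.
 * Unknown ids and wrong parameter counts are rejected, never blindly indexed. */
static int run_instance(uint16_t id, const uint32_t *params, uint16_t count) {
    for (size_t i = 0; i < sizeof templates / sizeof templates[0]; i++) {
        if (templates[i].id != id) continue;
        if (count != templates[i].expected_params) return -1;  /* malformed instance */
        return templates[i].fn(params, count);
    }
    return -1;                                                 /* unknown template type */
}

int main(void) {
    uint32_t params[8] = {0};
    printf("valid instance:   %d\n", run_instance(1, params, 8));   /*  0 = ran    */
    printf("wrong arg count:  %d\n", run_instance(1, params, 7));   /* -1 = reject */
    printf("unknown template: %d\n", run_instance(9, params, 8));   /* -1 = reject */
    return 0;
}
```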
That would mean the malware already has root-equivalent permissions, so it could also inject code into the kernel and kill it that way (for example by replacing the channel files used by CrowdStrike with even more malicious ones).
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.
I find the CrowdStrike admission odd.
So they admit they deployed worldwide all at once. That has been against best practices for large-scale deployments for more than two decades. Sue them into oblivion.
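For anyone who hasn't had to build one, "staged" can be as simple as the sketch below (the names and hash choice are illustrative, not any vendor's real scheme): each host hashes into a stable bucket, the update is enabled for a small percentage first, and the percentage is only widened once crash and telemetry rates look sane.

```c
#include <stdint.h>
#include <stdio.h>

/* FNV-1a: cheap, stable hash so a given host always lands in the same bucket. */
static uint32_t fnv1a(const char *s) {
    uint32_t h = 2166136261u;
    while (*s) { h ^= (uint8_t)*s++; h *= 16777619u; }
    return h;
}

/* A host takes the update only if its bucket falls under the current rollout %. */
static int update_enabled(const char *host_id, uint32_t rollout_percent) {
    return fnv1a(host_id) % 100u < rollout_percent;
}

int main(void) {
    const char *hosts[] = { "host-0001", "host-0002", "host-0003", "host-0004" };
    uint32_t stage = 1;   /* e.g. 1% canary first, then 10, 50, 100 over hours or days */
    for (size_t i = 0; i < sizeof hosts / sizeof hosts[0]; i++)
        printf("%-10s -> %s\n", hosts[i], update_enabled(hosts[i], stage) ? "update" : "hold");
    return 0;
}
```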
Why are you under the impression there was a NULL dereference?
It seems that they still have an unchecked NULL pointer dereference in the main code. Even with the faulty "content configuration update" that code should not crash.
At least one of the files had all zeros, and there is a trace out there that shows more details.
Why are you under the impression there was a NULL dereference?
Especially since a lot of the versions of the borked file had different garbage in it.
That's where strong process isolation comes in. It'd have to be added to Windows first for Microsoft to make a non-kernel EDR API.
Those user-space processes would be searched for and killed by the malware installer script before it downloads and runs the main payload.
No.
Please pardon my complete ignorance... and I don't want to blame the victims, but was there a way for organizations using the software to test the update on non-mission-critical machines before allowing the update to be applied to the entire organization?
Well, obviously it didn't check correctly, because the debugger clearly shows an attempt to dereference a pointer in the null page. Either that, or the debugger is lying.
Yeah, that's wrong. Only some people got a 291*.*32.sys file full of zeroes. Others just got garbage.
And the reverse engineering of the code that crashed shows that it explicitly checks for NULL before dereferencing.
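Both observations can be true at the same time, which may be the source of the disagreement. A minimal sketch, assuming nothing about the actual disassembly: a pointer-sized value pulled out of corrupt content can be a small non-zero number, which passes an `!= NULL` test but still points into the first, never-mapped page, so dereferencing it faults at a "null page" address anyway.

```c
#include <stdint.h>
#include <stdio.h>

#define PAGE_SIZE 0x1000u   /* the first page is left unmapped on Windows and Linux alike */

int main(void) {
    uint64_t raw = 0x9c;    /* made-up garbage read out of a corrupt content file */
    const uint32_t *p = (const uint32_t *)(uintptr_t)raw;

    if (p != NULL)
        printf("passes the NULL check, yet %p is still an unmapped address\n", (const void *)p);

    /* A check that actually protects against this has to reject the whole null
     * page (and ideally verify the pointer lands inside the buffer it was
     * supposed to come from), not just compare against 0: */
    if ((uintptr_t)p < PAGE_SIZE)
        puts("rejected: pointer falls inside the null page");

    /* Dereferencing p here would still be an access violation / bugcheck. */
    return 0;
}
```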
Can a normal user really kill system-level / other users' processes on Windows? That OS is even more garbage than I thought...
That's where strong process isolation comes in. It'd have to be added to Windows first for Microsoft to make a non-kernel EDR API.
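To the question above: not directly. A quick experiment you can run from a normal, non-elevated account (pick the PID of a SYSTEM process or another user's process): OpenProcess with PROCESS_TERMINATE fails with access denied, so ordinary users can only kill their own processes; terminating system or other users' processes generally takes administrative rights, which the "malware installer" scenario above already presumes.

```c
/* Build with a Windows toolchain, e.g.: cl kill_test.c */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char **argv) {
    if (argc != 2) { fprintf(stderr, "usage: %s <pid>\n", argv[0]); return 1; }
    DWORD pid = (DWORD)strtoul(argv[1], NULL, 10);

    /* Ask only for the right to terminate; this is where the OS says no. */
    HANDLE h = OpenProcess(PROCESS_TERMINATE, FALSE, pid);
    if (h == NULL) {
        DWORD err = GetLastError();
        printf("OpenProcess failed, error %lu%s\n", err,
               err == ERROR_ACCESS_DENIED ? " (access denied)" : "");
        return 1;
    }
    printf("handle obtained -- TerminateProcess(h, 1) would succeed for PID %lu\n", pid);
    CloseHandle(h);
    return 0;
}
```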