CrowdStrike fixes start at “reboot up to 15 times” and get more complex from there

limeos

Wise, Aged Ars Veteran
190
Move fast and break things! Got to be agile!
First, Agile doesn't really mean breaking your application. Second, if you break something, then you break it early, during development and testing, not in production.

This sounds more like a QA screwup due to inadequate processes or insufficient manpower.
 
Upvote
5 (5 / 0)

MagicDot

Ars Scholae Palatinae
878
Subscriptor
First, Agile doesn't really mean breaking your application. Second, if you break something, then you break it early, during development and testing, not in production.

This sounds more like a QA screwup due to inadequate processes or insufficient manpower.

Yes and no. Agile has definitely allowed a sloppiness to encroach into software development and the constant push for quick releases always comes at the cost of quality...ALWAYS. Although Agile has borrowed some good aspects from JAD, I do think it has generally spread like a brain fever. In my experience (25+ years), it generates far more bugs than software developed using something like waterfall. I tell my interns to view Agile with a critical eye, because some day it will be the punchline of engineer jokes.
 
Upvote
9 (9 / 0)

Stern

Ars Praefectus
3,749
Subscriptor++
Agile has definitely allowed a sloppiness to encroach into software development and the constant push for quick releases always comes at the cost of quality...ALWAYS.
That is not true. I've worked for one company where the switch to agile methods brought massive quality improvements through a combination of test automation and shorter, incremental development cycles. The focus was also on a steady release cadence rather than speed. I've done waterfall too, and that company had far more issues in its code.
 
Upvote
2 (4 / -2)

meisanerd

Ars Centurion
1,041
Subscriptor
I guess some companies will pivot to an A/B strategy where half of the endpoints get CrowdStrike protection and the other half gets an alternative.

This will of course increase the system administration bills but you get resilience in return.
This kinda defeats the purpose of XDR, though. XDR's main power is in being able to aggregate activity across all of your devices, so you can have your router go "hey, something is talking on this suspicious port," trace it back to computer X, have it go "well, I am running program Y, which is also doing these suspicious activities," and kill the software. Or determine, "wait, that software is new, but good, so let it continue to use that port." As soon as you split your security software stack apart, you lose visibility across your entire network. So you get resilience against these very infrequent issues (and most of these companies would still be down for a while even if only half of their computers were BSODing), at the expense of actual security.
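
To make that concrete, here is a toy sketch of the kind of cross-device join an XDR backend does. It's Python with made-up event fields, not any real product's schema:

Code:
from dataclasses import dataclass

@dataclass
class NetworkAlert:
    host: str   # endpoint the router saw talking
    port: int   # suspicious destination port

@dataclass
class ProcessEvent:
    host: str      # endpoint reporting the telemetry
    process: str   # program name
    port: int      # port the process has open

def correlate(alert: NetworkAlert, events: list) -> list:
    """Trace a router-side alert back to the process on the same host
    that has the same port open: the cross-device visibility you lose
    if half your fleet reports to a different vendor."""
    return [e for e in events if e.host == alert.host and e.port == alert.port]

# Toy usage: the router flags host "X" talking on port 4444, and endpoint
# telemetry from that host shows program "Y" bound to the same port.
alert = NetworkAlert(host="X", port=4444)
events = [ProcessEvent(host="X", process="Y", port=4444),
          ProcessEvent(host="X", process="browser", port=443)]
print(correlate(alert, events))   # -> the ProcessEvent for program "Y"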

Running a basic home network or small office, yeah, you can get away with a random mishmash of software, but when you are dealing with thousands of endpoints in an enterprise situation, you want everything talking to the main C&C setup, with as little deviation as possible. That way, sysadmins know how the system runs and aren't dealing with "well, CrowdStrike does things this way, Bitdefender does things that way," which can also cause issues if they forget to do something in CS that BD does automatically...
 
Upvote
2 (3 / -1)

iljitsch

Ars Tribunus Angusticlavius
9,000
Subscriptor++
Dave Plummer, a former Microsoft engineer, goes over the CrowdStrike incident
[...]
One interesting bit is he says the "content updates" include p-code that the Crowdstrike driver executes, and that the driver is set up to essentially run unsigned code from these update files at the kernel level. That's actually pretty scary stuff.
I came away from that video completely shocked. They made so many enormously risky decisions:
  • They have a device driver running inside the kernel. I guess that one is on Microsoft for not providing a better way to get the type of access this stuff needs.
  • And the driver is flagged as required to boot the machine, which means that if that driver doesn't work, you almost certainly can't get booted up far enough to download new updates that might solve any issues. Now this may be needed to properly lock everything down, but again, I'm sure Microsoft could have provided a safer method to get the same results if that is the case.
  • That device driver in turn runs unsigned code pushed out by Crowdstrike. Still inside the kernel.
  • Without any sanity checking, so an all-zero file is accepted.
  • They apparently just jump to pointers they find in those pushed updates. So if someone ever manages to insert a maliciously crafted one, they can literally do EVERYTHING on that PC that is not protected by a secure enclave.
And then of course they pushed out a corrupt update. I'm interested to learn how that happened. I'm thinking the most obvious way is that the update was generated as normal, cleared automated testing, and was then overwritten with all zeros.
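
Just to make the "no sanity checking" point above concrete: this is roughly the kind of validation you'd expect before anything in the kernel interprets a pushed content file. A minimal Python sketch with an invented header layout and an HMAC standing in for a real signature check, not CrowdStrike's actual channel-file format:

Code:
import hashlib
import hmac

MAGIC = b"CHNL"        # hypothetical magic bytes for a content/channel file
HEADER_LEN = 4 + 32    # magic + SHA-256 HMAC of the payload (invented layout)

def validate_content_file(blob: bytes, signing_key: bytes) -> bytes:
    """Reject obviously bad update files before anything interprets them.
    An all-zero file is rejected before anything runs it."""
    if len(blob) < HEADER_LEN:
        raise ValueError("file too short to contain a header")
    if blob[:4] != MAGIC:
        raise ValueError("bad magic bytes: not a content file")
    expected, payload = blob[4:HEADER_LEN], blob[HEADER_LEN:]
    actual = hmac.new(signing_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, actual):
        raise ValueError("signature mismatch: refuse to load")
    return payload

# An all-zero "update" is rejected immediately instead of being executed:
try:
    validate_content_file(b"\x00" * 1024, signing_key=b"vendor-key")
except ValueError as err:
    print("rejected:", err)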

This is just so bad on so many levels.

Yesterday I heard someone talk about a way to fix this remotely by having these PCs boot off of the network and then swing dead chickens and chant incantations just right. Any more info on this?

It would probably make sense to just make all of these PCs network boot always. That way relief is simply a server side change and a reboot away.
 
Upvote
5 (7 / -2)
What I don't get is, given how broadly applicable the problems were, how in the world did this pass internal testing at CrowdStrike?

I could understand if the problem was extremely rare, but this seems like it basically hit everyone who downloaded the update.
That assumes a lot about the quality of (or existence of) their internal testing.
 
Upvote
3 (3 / 0)

akw0088

Wise, Aged Ars Veteran
167
Here I thought this was some obscure news story that, while interesting and widespread, didn't affect me personally.

Came into work Monday to my laptop locked up with "Page Fault in Non-Paged Area" and no one around. Thought we'd had a layoff or something for a bit, until I realized everyone was at IT getting their laptops fixed.
 
Upvote
9 (9 / 0)

stormcrash

Ars Tribunus Angusticlavius
8,968
First, Agile doesn't really mean breaking your application. Second, if you break something, then you break it early, during development and testing, not in production.

This sounds more like a QA screwup due to inadequate processes or insufficient manpower.
In theory Agile doesn't mean that; in practice it often does lead to "eff it, push it to prod to get the work item out of my sprint, it passed x unit tests or worked on my dev box."
 
Upvote
5 (5 / 0)

Maarten

Ars Tribunus Militum
1,831
Subscriptor++
What I don't get is, given how broadly applicable the problems were, how in the world did this pass internal testing at CrowdStrike?

I could understand if the problem was extremely rare, but this seems like it basically hit everyone who downloaded the update.
According to this article, a few months ago the CrowdStrike software for Linux also caused issues with an update. It turns out that some supposedly supported Linux configurations (Debian, Rocky Linux, and therefore likely Red Hat as well) are not part of the testing matrix at CrowdStrike. I find it hard to believe that some configurations of Windows – and apparently fairly common ones at that – would not be part of the testing, but at least there is precedent for incomplete testing.
 
Upvote
3 (3 / 0)

meisanerd

Ars Centurion
1,041
Subscriptor
I suspect the file got corrupted in transfer after it was tested.
You mean between testing and deployment servers internally at CrowdStrike? Because there is no way it got corrupted upon download by thousands of systems worldwide in exactly the same way. And even then, everything should be checked against a checksum. Even internally, so they can make sure the file they tested is the same file they are deploying.
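
Checking a hash at every hop is trivial, too. Something like this (a Python sketch with made-up file names) between the test and deploy steps would catch an artifact that was overwritten after it passed testing:

Code:
import hashlib

def sha256_of(path: str) -> str:
    """Hash a file in chunks so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the hash of the artifact that actually passed testing...
tested = sha256_of("channel_update_tested.bin")   # hypothetical file names
# ...and refuse to ship anything that doesn't match it bit for bit.
staged = sha256_of("channel_update_staged.bin")
if staged != tested:
    raise SystemExit("staged artifact differs from the tested one; aborting deploy")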
 
Upvote
11 (11 / 0)

Got Nate?

Ars Scholae Palatinae
1,231
If Apple would need deep access beyond their API, then that just means their API is insufficient and needs improvement.
Apple quite frequently goes around the published APIs, thus failing to dogfood them, leaving gaping holes that need improvement. They also quite frequently DO dogfood their published APIs, and those usually turn out pretty great. And then there's SwiftUI.
OK, I was going to be snarky (as this has been possible for decades). But is this really true? Certainly this has been getting harder and harder over recent years. But AFAIK it's still possible to install kernel extensions in macOS. Or at least that was still the case fairly recently.
Still possible, but it requires 3 reboots (among other things) along the way
 
Upvote
1 (1 / 0)

iljitsch

Ars Tribunus Angusticlavius
9,000
Subscriptor++
Apple quite frequently goes around the published APIs, thus failing to dogfood them, leaving gaping holes that need improvement. They also quite frequently DO dogfood their published APIs, and those usually turn out pretty great. And then there's SwiftUI.

Still possible, but it requires 3 reboots (among other things) along the way
Some time ago Apple managed to push out an update that killed Ethernet interfaces: https://www.digitaltrends.com/computing/mac-update-breaks-ethernet-fix/

So good luck fixing that breakage.

Nobody is perfect. Even though they are worth billions and have no excuses for this level of fail.

I do not trust Apple very much. But they're still one or more orders of magnitude better than Crowdstrike.
 
Upvote
3 (3 / 0)

Ed1024

Ars Scholae Palatinae
811
Subscriptor++
I was just thinking with a glass of wine in my hand (the best kind of thinking) and had an “uh oh” moment.

Like the poster above said, it seems like the parser they've stuck in the kernel is not really much of a parser, more of an executor. Why this hasn't been abused before, IDK, but maybe it has...?

Also, over the next couple of weeks, most people working in medium to large companies or floating about in private and public spaces are going to witness PCs all over the place being physically accessed to fix the blue-screening. How many of these "engineers" are going to be hackers with USB sticks? It's probably one of the largest windows for illicit action with a good chance of going undetected that we've ever seen.
 
Upvote
0 (1 / -1)
The over-dependency on enterprise "cybersecurity" providers is inherently dangerous, exactly as this event shows. I realize it checks the liability boxes so senior execs can say they performed due diligence and get their massive pay regardless of how horrific a cock-up happens on their watch, but from a system design point of view, it's questionable. At best.

It's not especially a problem with security software. It's that the trend is for third parties to have hooks deep in the critical path of your organisation, with little to no oversight or business continuity available. This is presented as a win-win: the vendor wins a subscription revenue model that is almost impossible for the customer to exit, and the organisation wins by being able to get rid of its above-minimum-wage staff because the SaaS manages everything for them.

The problem is that there's no fallback for the organisation when things go wrong.
 
Upvote
4 (4 / 0)

steelcobra

Ars Tribunus Angusticlavius
9,386

View: https://youtu.be/wAzEJxOo1ts?si=g2ED-tz8hmXhBTM0


This video explains it pretty well. CrowdStrike runs as a boot-loaded kernel-mode driver where the "driver" itself is signed, but they push code updates to it that it runs in kernel mode. This particular update was making bad system calls referencing an empty table, which of course causes Windows to stop, because something that's not supposed to happen is happening every time CS's software starts running.

And then you have to manually boot every affected machine into safe mode and delete a specific file.

To make it worse, according to a commenter, CS has the ability to tag updates so they skip deployment staging policies, so instead of this only hitting his test group, then non-critical machines, then everything, it went live for all users.
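
For anyone unfamiliar, that staged-rollout gating is conceptually simple. A rough Python sketch; the ring names and the "skip staging" flag are invented here to mirror the behavior described, not CrowdStrike's actual implementation:

Code:
# Hypothetical staged-rollout gate: an update only reaches the next ring
# after the previous ring has run it without problems. A vendor-side flag
# that bypasses the gate entirely is exactly what defeats the protection.
RINGS = ["test_group", "non_critical", "all_hosts"]

def next_ring(update, healthy_rings):
    if update.get("skip_staging"):   # the escape hatch described above
        return "all_hosts"           # goes everywhere at once
    for ring in RINGS:
        if ring not in healthy_rings:
            return ring              # deploy here, then wait for a health signal
    return None                      # fully rolled out

print(next_ring({"skip_staging": False}, {"test_group"}))  # -> non_critical
print(next_ring({"skip_staging": True}, set()))            # -> all_hosts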
 
Upvote
7 (7 / 0)

doubleyewdee

Ars Scholae Palatinae
786
Subscriptor++
Those systems would still boot.

[mike drop]

If a system boots and isn't reachable on a network or is blocking any services from starting because it incorrectly thinks they're malware, requiring physical access to restore functionality, how is that actually better?

Also, be kinder to Mike next time. He's a soft boy.
 
Upvote
4 (4 / 0)

mohnish82

Wise, Aged Ars Veteran
180
It's a universal problem, not limited to Microsoft. A bad kernel module will kill a Linux install as well. A driver, pretty much by definition, has to run with kernel-level privileges; and at that level, a mistake in the code cannot be trapped: it's going to bring the system down.

Some things that are currently kernel modules can be moved to userspace - but some things cannot. (And doing so does bring certain tradeoffs - for example, GPU drivers can be in userspace, but there is a performance hit in doing so. Given how complex GPU drivers are these days, that's a worthwhile tradeoff IMO; but it is a tradeoff.)

CrowdStrike made the choice - rightly or wrongly - to implement their code as a kernel level driver. Their code caused these crashes. Ergo, CrowdStrike is wholly to blame for this. Microsoft might be able to implement improvements that allow more stuff that's currently in the kernel to move into userspace - but that's a separate issue.

If you write stuff that runs with kernel-level privileges, it is on you to make sure your code is robust. There is only so much that the OS vendor can do to limit the damage in that scenario.
Microsoft should permit customers to mark a driver as boot start: false if they choose, allowing a manual override of (CrowdStrike etc.'s) hard-coded boot start: true.
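
For context, a Windows driver's start behavior lives in the registry under its service key, in the Start value (0 = boot start, 2 = automatic, 3 = demand start, 4 = disabled). A read-only Python sketch of inspecting it; I'm assuming the sensor's service key is named CSAgent, which may differ, and actually changing the value on security software is of course exactly what the vendor tries to prevent:

Code:
import winreg  # Windows-only standard library module

SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\CSAgent"  # assumed service name
START_TYPES = {0: "boot", 1: "system", 2: "automatic", 3: "manual", 4: "disabled"}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY) as key:
    start, _ = winreg.QueryValueEx(key, "Start")  # returns (value, registry type)
    print("Start =", start, "(" + START_TYPES.get(start, "unknown") + ")")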
 
Upvote
-2 (0 / -2)

steelcobra

Ars Tribunus Angusticlavius
9,386
Microsoft should permit customers to mark a driver as boot start: false if they choose, allowing a manual override of (CrowdStrike etc.'s) hard-coded boot start: true.
It's enterprise security software; it's deployed en masse and designed so users can't mess with it. In theory it's designed so that sysadmins can designate deployment rings, so this kind of thing only hits a small test group. Instead, CrowdStrike hid the fact that they can tell a patch to deploy to everyone anyway. Nobody knew they might have to plan for something like that.
 
Upvote
5 (5 / 0)

mohnish82

Wise, Aged Ars Veteran
180
It's enterprise security software; it's deployed en masse and designed so users can't mess with it. In theory it's designed so that sysadmins can designate deployment rings, so this kind of thing only hits a small test group. Instead, CrowdStrike hid the fact that they can tell a patch to deploy to everyone anyway. Nobody knew they might have to plan for something like that.

".. users can mess with it .." is something you assumed and ran with. Giving an option, of course, means giving it to the authorized user.

I'm not discounting CrowdStrike's hand in this. But the point is that there will always be a company like CrowdStrike. Microsoft needs to step up its game and think about a good solution.

E.g., Apple could have said that web tracking via browsers on iOS is unavoidable and done nothing, or passed the blame down to whoever implements intrusive tracking. But they tried to do something about it, by forcing the use of WebKit on their system to keep it in check.

Note: .. Ars community - please don't start on .. Apple did this .. only to .. because ... their etc. etc. That's not the point of the example.
 
Upvote
-4 (0 / -4)

SparkE

Wise, Aged Ars Veteran
117
It's hard to top this one, lol.
Have you tried turning it off and on and off and on and off and on and off and on and off and on and off and on and off and on and off and on and off and on and off and on again?
Apparently doing the same thing over and over again and expecting different results is no longer the height of insanity. 🤔
 
Upvote
6 (6 / 0)

sjl

Ars Tribunus Militum
2,724
Giving an option, of course, means giving it to the authorized user.
You're thinking on a small, individual-user scale, not an enterprise scale. The rules and requirements change radically when there are hundreds or thousands of systems to manage. Giving end users the ability to turn off security-critical software is exactly the sort of thing that cybersecurity teams are dead set against. It means you end up with non-compliant systems, and in some industries, that's an absolute no. Even in industries where it's not an absolute no, it's still something the IT team would be extremely hesitant about.

I've done my time as a systems administrator. I've had full-blown superuser privileges on mission-critical systems. And yet my current employer has my laptop so locked down that there's a bunch of stuff I have to go to corporate IT to get done, despite having the knowledge and skills to do it myself. And bluntly, whilst it is a hassle for me at times, I get it, and I can't really argue against it, because I don't really need that level of access to do my job on a day-to-day basis. Removing that level of access reduces the potential attack surface significantly. Same deal with not allowing users to disable security software.
 
Upvote
5 (5 / 0)