CrowdStrike fixes start at “reboot up to 15 times” and get more complex from there

limeos

Wise, Aged Ars Veteran
190
Move fast and break things! Got to be agile!
First, Agile doesn't really mean breaking your application. Second, if you break something, then you break it early, during development and testing, not in production.

This sounds more like a QA screwup due to inadequate processes or insufficient manpower.
 
Upvote
5 (5 / 0)

MagicDot

Ars Scholae Palatinae
878
Subscriptor
First, Agile doesn't really mean breaking your application. Second, if you break something, then you break it early, during development and testing, not in production.

This sounds more like a QA screwup due to inadequate processes or insufficient manpower.

Yes and no. Agile has definitely allowed a sloppiness to encroach into software development and the constant push for quick releases always comes at the cost of quality...ALWAYS. Although Agile has borrowed some good aspects from JAD, I do think it has generally spread like a brain fever. In my experience (25+ years), it generates far more bugs than software developed using something like waterfall. I tell my interns to view Agile with a critical eye, because some day it will be the punchline of engineer jokes.
 
Upvote
9 (9 / 0)

Stern

Ars Praefectus
3,749
Subscriptor++
Agile has definitely allowed a sloppiness to encroach into software development and the constant push for quick releases always comes at the cost of quality...ALWAYS.
That is not true. I've worked for one company where the switch to agile methods brought massive quality improvements through a combination of test automation and shorter, incremental development cycles. The focus was also on a steady release cadence rather than speed. I've done waterfall too, and that company had far more issues in its code.
 
Upvote
2 (4 / -2)

meisanerd

Ars Centurion
1,041
Subscriptor
I guess some companies will pivot to an A/B strategy where half of the endpoints get CrowdStrike protection and the other half gets an alternative.

This will of course increase the system administration bills but you get resilience in return.
This kinda defeats the purpose of XDR, though. XDR's main power is in being able to aggregate activity across all of your devices, so you can have your router go "hey, something is talking on this suspicious port," trace it back to computer X, have it go "well, I am running program Y, which is also doing these suspicious activities," and kill the software. Or determine, "wait, that software is new, but good, so let it continue to use that port." As soon as you split your security software stack apart, you lose visibility across your entire network. So you get resilience against these very infrequent issues (and most of these companies would still be down for a while even if only half of their computers were BSODing), at the expense of actual security.
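
To make that concrete, here is a toy sketch of the kind of cross-device join an XDR backend does. It's Python with made-up event fields, not any real product's schema:

Code:
from dataclasses import dataclass

@dataclass
class NetworkAlert:
    host: str   # endpoint the router saw talking
    port: int   # suspicious destination port

@dataclass
class ProcessEvent:
    host: str      # endpoint reporting the telemetry
    process: str   # program name
    port: int      # port the process has open

def correlate(alert: NetworkAlert, events: list) -> list:
    """Trace a router-side alert back to the process on the same host
    that has the same port open: the cross-device visibility you lose
    if half your fleet reports to a different vendor."""
    return [e for e in events if e.host == alert.host and e.port == alert.port]

# Toy usage: the router flags host "X" talking on port 4444, and endpoint
# telemetry from that host shows program "Y" bound to the same port.
alert = NetworkAlert(host="X", port=4444)
events = [ProcessEvent(host="X", process="Y", port=4444),
          ProcessEvent(host="X", process="browser", port=443)]
print(correlate(alert, events))   # -> the ProcessEvent for program "Y"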

Running a basic home network or small office, yeah, you can get away with a random mishmash of software, but when you are dealing with thousands of endpoints in an enterprise situation, you want everything talking to the main C&C setup, with as little deviation as possible. That way, sysadmins know how the system runs and aren't dealing with "well, CrowdStrike does things this way, Bitdefender does things that way," which can also cause issues if they forget to do something in CS that BD does automatically...
 
Upvote
2 (3 / -1)

iljitsch

Ars Tribunus Angusticlavius
9,000
Subscriptor++
Dave Plummer, a former Microsoft engineer, goes over the CrowdStrike incident
[...]
One interesting bit is he says the "content updates" include p-code that the Crowdstrike driver executes, and that the driver is set up to essentially run unsigned code from these update files at the kernel level. That's actually pretty scary stuff.
I came away from that video completely shocked. They made so many enormously risky decisions:
  • They have a device driver running inside the kernel. I guess that one is on Microsoft for not providing a better way to get the type of access this stuff needs.
  • And the driver is flagged as required to boot the machine, which means that if that driver doesn't work, you almost certainly can't get booted up far enough to download new updates that might solve any issues. Now this may be needed to properly lock everything down, but again, I'm sure Microsoft could have provided a safer method to get the same results if that is the case.
  • That device driver in turn runs unsigned code pushed out by Crowdstrike. Still inside the kernel.
  • Without any sanity checking, so an all-zero file is accepted.
  • They apparently just jump to pointers they find in those pushed updates. So if someone ever manages to insert a maliciously crafted one, they can literally do EVERYTHING on that PC that is not protected by a secure enclave.
And then of course they pushed out a corrupt update. I'm interested to learn how that happened. I'm thinking the most obvious way is that the update was generated as normal, cleared automated testing, and was then overwritten with all zeros.
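
Just to make the "no sanity checking" point above concrete: this is roughly the kind of validation you'd expect before anything in the kernel interprets a pushed content file. A minimal Python sketch with an invented header layout and an HMAC standing in for a real signature check, not CrowdStrike's actual channel-file format:

Code:
import hashlib
import hmac

MAGIC = b"CHNL"        # hypothetical magic bytes for a content/channel file
HEADER_LEN = 4 + 32    # magic + SHA-256 HMAC of the payload (invented layout)

def validate_content_file(blob: bytes, signing_key: bytes) -> bytes:
    """Reject obviously bad update files before anything interprets them.
    An all-zero file is rejected before anything runs it."""
    if len(blob) < HEADER_LEN:
        raise ValueError("file too short to contain a header")
    if blob[:4] != MAGIC:
        raise ValueError("bad magic bytes: not a content file")
    expected, payload = blob[4:HEADER_LEN], blob[HEADER_LEN:]
    actual = hmac.new(signing_key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, actual):
        raise ValueError("signature mismatch: refuse to load")
    return payload

# An all-zero "update" is rejected immediately instead of being executed:
try:
    validate_content_file(b"\x00" * 1024, signing_key=b"vendor-key")
except ValueError as err:
    print("rejected:", err)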

This is just so bad on so many levels.

Yesterday I heard someone talk about a way to fix this remotely by having these PCs boot off of the network and then swing dead chickens and chant incantations just right. Any more info on this?

It would probably make sense to just make all of these PCs network boot always. That way relief is simply a server side change and a reboot away.
 
Upvote
5 (7 / -2)
What I don't get is, given how broadly applicable the problems were, how in the world did this pass internal testing at CrowdStrike?

I could understand if the problem was extremely rare, but this seems like it basically hit everyone who downloaded the update.
That assumes a lot about the quality of (or existence of) their internal testing.
 
Upvote
3 (3 / 0)

akw0088

Wise, Aged Ars Veteran
167
Here I thought this was some obscure news story that, while interesting and widespread, didn't affect me personally.

Came into work Monday to my laptop locked up with "Page Fault in Non-Paged Area" and no one around. Thought we'd had a layoff or something for a bit, until I realized everyone was at IT getting their laptops fixed.
 
Upvote
9 (9 / 0)

stormcrash

Ars Tribunus Angusticlavius
8,968
First, Agile doesn't really mean breaking your application. Second, if you break something, then you break it early, during development and testing, not in production.

This sounds more like a QA screwup due to inadequate processes or insufficient manpower.
In theory Agile doesn't mean that; in practice it often does lead to "eff it, push it to prod to get the work item out of my sprint, it passed x unit tests or worked on my dev box."
 
Upvote
5 (5 / 0)

Maarten

Ars Tribunus Militum
1,831
Subscriptor++
What I don't get is, given how broadly applicable the problems were, how in the world did this pass internal testing at CrowdStrike?

I could understand if the problem was extremely rare, but this seems like it basically hit everyone who downloaded the update.
According to this article, a few months ago the CrowdStrike software for Linux also caused issues with an update. It turns out that some supposedly supported Linux configurations (Debian, Rocky Linux, and therefore likely Red Hat as well) are not part of the testing matrix at CrowdStrike. I find it hard to believe that some configurations of Windows – and apparently fairly common ones at that – would not be part of the testing, but at least there is precedent for incomplete testing.
 
Upvote
3 (3 / 0)

meisanerd

Ars Centurion
1,041
Subscriptor
I suspect the file got corrupted in transfer after it was tested.
You mean between testing and deployment servers internally at CrowdStrike? Because there is no way it got corrupted upon download by thousands of systems worldwide in exactly the same way. And even then, everything should be checked against a checksum. Even internally, so they can make sure the file they tested is the same file they are deploying.
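
Checking a hash at every hop is trivial, too. Something like this (a Python sketch with made-up file names) between the test and deploy steps would catch an artifact that was overwritten after it passed testing:

Code:
import hashlib

def sha256_of(path: str) -> str:
    """Hash a file in chunks so large artifacts don't need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

# Record the hash of the artifact that actually passed testing...
tested = sha256_of("channel_update_tested.bin")   # hypothetical file names
# ...and refuse to ship anything that doesn't match it bit for bit.
staged = sha256_of("channel_update_staged.bin")
if staged != tested:
    raise SystemExit("staged artifact differs from the tested one; aborting deploy")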
 
Upvote
11 (11 / 0)

Got Nate?

Ars Scholae Palatinae
1,231
If Apple would need deep access beyond their API, then that just means their API is insufficient and needs improvement.
Apple quite frequently goes around the published APIs, thus failing to dogfood them, leaving gaping holes that need improvement. They also quite frequently DO dogfood their published APIs, and those usually turn out pretty great. And then there's SwiftUI.
OK, I was going to be snarky (as this has been possible for decades). But is this really true? Certainly this has been getting harder and harder over recent years. But AFAIK it's still possible to install kernel extensions in macOS. Or at least that was still the case fairly recently.
Still possible, but it requires 3 reboots (among other things) along the way
 
Upvote
1 (1 / 0)

iljitsch

Ars Tribunus Angusticlavius
9,000
Subscriptor++
Apple quite frequently goes around the published APIs, thus failing to dogfood them, leaving gaping holes that need improvement. They also quite frequently DO dogfood their published APIs, and those usually turn out pretty great. And then there's SwiftUI.

Still possible, but it requires 3 reboots (among other things) along the way
Some time ago Apple managed to push out an update that killed Ethernet interfaces: https://www.digitaltrends.com/computing/mac-update-breaks-ethernet-fix/

So good luck fixing that breakage.

Nobody is perfect. Even though they are worth billions and have no excuses for this level of fail.

I do not trust Apple very much. But they're still one or more orders of magnitude better than Crowdstrike.
 
Upvote
3 (3 / 0)

Ed1024

Ars Scholae Palatinae
811
Subscriptor++
I was just thinking with a glass of wine in my hand (the best kind of thinking) and had an “uh oh” moment.

Like the poster above said, it seems like the parser they've stuck in the kernel is not really much of a parser, more of an executor. Why this hasn't been abused before, IDK, but maybe it has...?

Also, over the next couple of weeks, most people working in medium to large companies or floating about in private and public spaces are going to witness PCs all over the place being physically accessed to fix the blue-screening. How many of these "engineers" are going to be hackers with USB sticks? It's probably one of the largest windows for illicit action with a good chance of going undetected that we've ever seen.
 
Upvote
0 (1 / -1)
The over-dependency on enterprise "cybersecurity" providers is inherently dangerous, exactly as this event shows. I realize it checks the liability boxes so senior execs can say they performed due diligence and get their massive pay regardless of how horrific a cock-up happens on their watch, but from a system design point of view, it's questionable. At best.

It's not especially a problem with security software. It's that the trend is for third parties to have hooks deep in the critical path of your organisation, with little to no oversight or business continuity available. This is presented as a win-win: the vendor wins a subscription revenue model that is almost impossible for the customer to exit, and the organisation wins by being able to get rid of its above-minimum-wage staff because the SaaS manages everything for them.

The problem is that there's no fallback for the organisation when things go wrong.
 
Upvote
4 (4 / 0)

steelcobra

Ars Tribunus Angusticlavius
9,386

View: https://youtu.be/wAzEJxOo1ts?si=g2ED-tz8hmXhBTM0


This video explains it pretty well. CrowdStrike runs as a boot-loaded kernel-mode driver where the "driver" itself is signed, but they push code updates to it that it runs in kernel mode. This particular update was making bad system calls referencing an empty table, which of course causes Windows to stop, because something that's not supposed to happen is happening every time CS's software starts running.

And then you have to manually boot every affected machine into safe mode and delete a specific file.

To make it worse, according to a commenter, CS has the ability to tag updates so they skip deployment staging policies, so instead of this only hitting his test group, then non-critical machines, then everything, it went live for all users.
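
For anyone unfamiliar, that staged-rollout gating is conceptually simple. A rough Python sketch; the ring names and the "skip staging" flag are invented here to mirror the behavior described, not CrowdStrike's actual implementation:

Code:
# Hypothetical staged-rollout gate: an update only reaches the next ring
# after the previous ring has run it without problems. A vendor-side flag
# that bypasses the gate entirely is exactly what defeats the protection.
RINGS = ["test_group", "non_critical", "all_hosts"]

def next_ring(update, healthy_rings):
    if update.get("skip_staging"):   # the escape hatch described above
        return "all_hosts"           # goes everywhere at once
    for ring in RINGS:
        if ring not in healthy_rings:
            return ring              # deploy here, then wait for a health signal
    return None                      # fully rolled out

print(next_ring({"skip_staging": False}, {"test_group"}))  # -> non_critical
print(next_ring({"skip_staging": True}, set()))            # -> all_hosts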
 
Upvote
7 (7 / 0)

doubleyewdee

Ars Scholae Palatinae
786
Subscriptor++
Those systems would still boot.

[mike drop]

If a system boots and isn't reachable on a network or is blocking any services from starting because it incorrectly thinks they're malware, requiring physical access to restore functionality, how is that actually better?

Also, be kinder to Mike next time. He's a soft boy.
 
Upvote
4 (4 / 0)

mohnish82

Wise, Aged Ars Veteran
180
It's a universal problem, not limited to Microsoft. A bad kernel module will kill a Linux install as well. A driver, pretty much by definition, has to run with kernel-level privileges; and at that level, a mistake in the code cannot be trapped: it's going to bring the system down.

Some things that are currently kernel modules can be moved to userspace - but some things cannot. (And doing so does bring certain tradeoffs - for example, GPU drivers can be in userspace, but there is a performance hit in doing so. Given how complex GPU drivers are these days, that's a worthwhile tradeoff IMO; but it is a tradeoff.)

CrowdStrike made the choice - rightly or wrongly - to implement their code as a kernel level driver. Their code caused these crashes. Ergo, CrowdStrike is wholly to blame for this. Microsoft might be able to implement improvements that allow more stuff that's currently in the kernel to move into userspace - but that's a separate issue.

If you write stuff that runs with kernel-level privileges, it is on you to make sure your code is robust. There is only so much that the OS vendor can do to limit the damage in that scenario.
Microsoft should permit customers to mark a driver as boot start: false if they choose, allowing a manual override of (CrowdStrike etc.'s) hard-coded boot start: true.
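
For context, a Windows driver's start behavior lives in the registry under its service key, in the Start value (0 = boot start, 2 = automatic, 3 = demand start, 4 = disabled). A read-only Python sketch of inspecting it; I'm assuming the sensor's service key is named CSAgent, which may differ, and actually changing the value on security software is of course exactly what the vendor tries to prevent:

Code:
import winreg  # Windows-only standard library module

SERVICE_KEY = r"SYSTEM\CurrentControlSet\Services\CSAgent"  # assumed service name
START_TYPES = {0: "boot", 1: "system", 2: "automatic", 3: "manual", 4: "disabled"}

with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SERVICE_KEY) as key:
    start, _ = winreg.QueryValueEx(key, "Start")  # returns (value, registry type)
    print("Start =", start, "(" + START_TYPES.get(start, "unknown") + ")")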
 
Upvote
-2 (0 / -2)

steelcobra

Ars Tribunus Angusticlavius
9,386
Microsoft should permit customers to mark a driver as boot start: false if they choose, allowing a manual override of (CrowdStrike etc.'s) hard-coded boot start: true.
It's enterprise security software; it's deployed en masse and designed so users can't mess with it. In theory it's designed so that sysadmins can designate deployment rings, so this kind of thing only hits a small test group. Instead, CrowdStrike hid the fact that they can tell a patch to deploy to everyone anyway. Nobody knew they might have to plan for something like that.
 
Upvote
5 (5 / 0)

mohnish82

Wise, Aged Ars Veteran
180
It's enterprise security software; it's deployed en masse and designed so users can't mess with it. In theory it's designed so that sysadmins can designate deployment rings, so this kind of thing only hits a small test group. Instead, CrowdStrike hid the fact that they can tell a patch to deploy to everyone anyway. Nobody knew they might have to plan for something like that.

".. users can mess with it .." is something you assumed and ran with. Giving an option, of course, means giving it to the authorized user.

I'm not discounting CrowdStrike's hand in this. But the point is that there will always be a company like CrowdStrike. Microsoft needs to step up its game and think about a good solution.

E.g., Apple could have said that web tracking via browsers on iOS is unavoidable and done nothing, or passed the blame down to whoever implements intrusive tracking. But they tried to do something about it, by forcing the use of WebKit on their system to keep it in check.

Note: .. Ars community - please don't start on .. Apple did this .. only to .. because ... their etc. etc. That's not the point of the example.
 
Upvote
-4 (0 / -4)

SparkE

Wise, Aged Ars Veteran
117
It's hard to top this one, lol.
Have you tried turning it off and on and off and on and off and on and off and on and off and on and off and on and off and on and off and on and off and on and off and on again?
Apparently doing the same thing over and over again and expecting different results is no longer the height of insanity. 🤔
 
Upvote
6 (6 / 0)

sjl

Ars Tribunus Militum
2,724
Giving an option, of course, means giving it to the authorized user.
You're thinking on a small, individual-user scale, not an enterprise scale. The rules and requirements change radically when there are hundreds or thousands of systems to manage. Giving end users the ability to turn off security-critical software is exactly the sort of thing that cybersecurity teams are dead set against. It means you end up with non-compliant systems, and in some industries, that's an absolute no. Even in industries where it's not an absolute no, it's still something the IT team would be extremely hesitant about.

I've done my time as a systems administrator. I've had full-blown superuser privileges on mission-critical systems. And yet my current employer has my laptop so locked down that there's a bunch of stuff I have to go to corporate IT to get done, despite having the knowledge and skills to do it myself. And bluntly, whilst it is a hassle for me at times, I get it, and I can't really argue against it, because I don't really need that level of access to do my job on a day-to-day basis. Removing that level of access reduces the potential attack surface significantly. Same deal with not allowing users to disable security software.
 
Upvote
5 (5 / 0)