Global Microsoft outage hits due to CrowdStrike definition update

I mean, the really crazy part to me is who allowed this update to roll out in a large org without running it through a test deployment group first? Seems like a lot of companies....

Most of this started because CrowdStrike does crappy QA...but never trust a vendor...test any and every update before you push it out to your environment.
 
So this was an AV/security definition update - not a traditional software patch or the like. Most security software pulls these often - think 3-4 times a day - so they're not like a traditional "test/qa/production" patch. You also REALLY don't want them to be, since they're fixes/workarounds/blocks for just-discovered zero-day exploits and other vulnerabilities. If you delay installing them, you risk getting hit since everything is constantly under attack. It appears that the file was corrupted somehow.

No one is sure why this one went so wrong, but there have been billions of these pushed without issue.
 
So this was an AV/security definition update - not a traditional software patch or the like. Most security software pulls these often - think 3-4 times a day - so they're not like a traditional "test/qa/production" patch. You also REALLY don't want them to be, since they're fixes/workarounds/blocks for just-discovered zero-day exploits and other vulnerabilities. If you delay installing them, you risk getting hit since everything is constantly under attack. It appears that the file was corrupted somehow.

No one is sure why this one went so wrong, but there have been billions of these pushed without issue.

Yeah, I'm aware - I've been in IT for 20 years. We have a strict update testing regime here, even for zero days, for software that can crush your prod systems, such as security software that sits at the kernel level. Everything, and I do mean everything, goes through test before it's deployed unless it goes through the exception process. Auto updates are for chumps who trust vendors...and I sure as heck don't trust vendors.
 
What exactly is that software doing with those kernel modules, anyway? I mean, when it is working.

Is that known, or are they excused from the question because they are a commercial software vendor?
 
I mean, the really crazy part to me is who allowed this update to roll out in a large org without running it through a test deployment group first? Seems like a lot of companies....

Most of this started because CrowdStrike does crappy QA...

Is that true in general? If true, then I would be recommending a different vendor to my top management.
...but never trust a vendor...test any and every update before you push it out to your environment.
You would think ...
 
Yeah, I'm aware - I've been in IT for 20 years. We have a strict update testing regime here, even for zero days, for software that can crush your prod systems, such as security software that sits at the kernel level. Everything, and I do mean everything, goes through test before it's deployed unless it goes through the exception process. Auto updates are for chumps who trust vendors...and I sure as heck don't trust vendors.
Can't turn it off with CrowdStrike that I'm aware of - or with most other SaaS-based solutions.

Heck, most cyber insurance companies require the threat feed updates now.
 
This is even more of a cluster f trying to fix your VMs if they're in the cloud, such as Azure.

No console access...

Basically, the fix is to create another "repair" VM in the same resource group... attach the disk from the problem machine to it... then mount it in the repair machine... you can then browse to the file that needs to be deleted...
Unmount it... and attach it back to the problem VM and boot her up.
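
Roughly scriptable like this with the Azure CLI driven from Python - all the resource, VM and disk names are made up, and it works on a copy of the OS disk plus an OS-disk swap rather than moving the original disk, so treat it as a sketch, not a recipe:

```python
# Sketch of the repair-VM flow described above, driven via the Azure CLI.
# Assumes: az CLI is logged in, a pre-built "repair-vm" already exists in the
# same resource group, and all names below are hypothetical.
import subprocess

RG = "my-resource-group"            # hypothetical
BROKEN_VM = "broken-vm"             # hypothetical
REPAIR_VM = "repair-vm"             # hypothetical, already running in the same RG
FIXED_DISK = "broken-vm-osdisk-fixed"

def az(*args: str) -> str:
    """Run an az CLI command and return its stdout."""
    return subprocess.run(["az", *args], check=True,
                          capture_output=True, text=True).stdout.strip()

# 1. Stop the broken VM so its OS disk can be copied consistently.
az("vm", "deallocate", "-g", RG, "-n", BROKEN_VM)

# 2. Find its OS disk and make a copy we can repair.
os_disk_id = az("vm", "show", "-g", RG, "-n", BROKEN_VM,
                "--query", "storageProfile.osDisk.managedDisk.id", "-o", "tsv")
az("disk", "create", "-g", RG, "-n", FIXED_DISK, "--source", os_disk_id)

# 3. Attach the copy to the repair VM as a data disk.
az("vm", "disk", "attach", "-g", RG, "--vm-name", REPAIR_VM, "--name", FIXED_DISK)

# Manual step: log in to the repair VM, mount the disk, and delete the offending
# file (reportedly C-00000291*.sys under Windows\System32\drivers\CrowdStrike).
input("Press Enter once the bad file has been removed from the mounted disk...")

# 4. Detach the repaired copy and swap it in as the broken VM's OS disk.
az("vm", "disk", "detach", "-g", RG, "--vm-name", REPAIR_VM, "--name", FIXED_DISK)
fixed_disk_id = az("disk", "show", "-g", RG, "-n", FIXED_DISK,
                   "--query", "id", "-o", "tsv")
az("vm", "update", "-g", RG, "-n", BROKEN_VM, "--os-disk", fixed_disk_id)

# 5. Boot her up.
az("vm", "start", "-g", RG, "-n", BROKEN_VM)
```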
 
What exactly is that software doing with those kernel modules, anyway? I mean, when it is working.

Is that known, or are they excused from the question because they are a commercial software vendor?
Searching for active exploitation of known vulnerability vectors. If you see software doing X, and we know X leads to ransomware, and said package isn't on a known list of valid things and talking to a valid endpoint - stop activity X.
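
In toy Python, that kind of rule is roughly the shape below - made-up names and lists, not how Falcon actually implements it:

```python
# Toy sketch of a behavioral EDR-style rule: block a risky action unless the
# process and its destination are explicitly trusted. All names are hypothetical.
from dataclasses import dataclass

TRUSTED_PROCESSES = {"backup-agent.exe", "sqlservr.exe"}          # known-good packages
TRUSTED_ENDPOINTS = {"updates.example-vendor.com"}                # known-good destinations
RISKY_BEHAVIORS = {"mass_file_encryption", "lsass_memory_read"}   # "X leads to ransomware"

@dataclass
class Activity:
    process: str
    behavior: str
    endpoint: str

def verdict(activity: Activity) -> str:
    """Return 'block' for risky behavior from untrusted sources, else 'allow'."""
    if activity.behavior not in RISKY_BEHAVIORS:
        return "allow"
    if activity.process in TRUSTED_PROCESSES and activity.endpoint in TRUSTED_ENDPOINTS:
        return "allow"
    return "block"   # stop activity X and report it

print(verdict(Activity("totally-legit.exe", "mass_file_encryption", "198.51.100.7")))  # block
```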
 
This is even more of a cluster f trying to fix your VMs if they're in the cloud, such as Azure.

No console access...

Basically, the fix is to create another "repair" VM in the same resource group... attach the disk from the problem machine to it... then mount it in the repair machine... you can then browse to the file that needs to be deleted...
Unmount it... and attach it back to the problem VM and boot her up.
Restore from backup?
 
I believe that was one of the options presented...
It was MS's suggestion, but that does not help with a system that may have new data on it since the last backup. There are methods, but what a pain in the end for large-scale deployments. The "15 reboots" trick MS says to try looks more and more appealing.
 
This is even more of a cluster f trying to fix your VMs if they're in the cloud, such as Azure.

No console access...

Basically, the fix is to create another "repair" VM in the same resource group... attach the disk from the problem machine to it... then mount it in the repair machine... you can then browse to the file that needs to be deleted...
Unmount it... and attach it back to the problem VM and boot her up.

I have done that to fat-fingered FreeBSD AWS instances once or twice. Or was it to move them to different disks? Don't remember.

I was actually thinking about placing such recovery sub-images by default so that the reaction time in case of error is faster.
 
Yeah 100% fake. Here's the source image he used in his chop.

[attached: the original source image]

I mean even without that, who would let an intern with no in-company experience touch production code?

And what kind of person who could get hired as a software engineer would be stupid enough to post an image of themselves on social media saying that they're pushing code out to production right before a weekend and then taking the rest of the day off? There's no way anyone can be that dumb...

Right?

Please tell me I'm right.

Anyway, I have a friend who works in IT and he had to deal with this at his company. He texted us about it. Luckily our company wasn't hit much, AFAIK. There's some benefit to waiting to roll out Windows updates.
 
I think this would be a reason for not associating an OS with a Hotmail account.
 
Can't turn it off with CrowdStrike that I'm aware of - or with most other SaaS-based solutions.

Heck, most cyber insurance companies require the threat feed updates now.

Now you know why we don't have CrowdStrike in house. I'd bet that after this incident you might see that feature become available.

Incredibly short-sighted to think auto updates from any vendor are always OK. Back when I ran our TippingPoint gear, it was a minimum two-day test of any new sigs before we pushed them to prod. If a large zero-day hit, we could push, but the CTO had to sign off along with everyone under him. In my 6ish years we only ever had one they made us push through.
 
Now you know why we don't have CrowdStrike in house. I'd bet that after this incident you might see that feature become available.

Incredibly short-sighted to think auto updates from any vendor are always OK. Back when I ran our TippingPoint gear, it was a minimum two-day test of any new sigs before we pushed them to prod. If a large zero-day hit, we could push, but the CTO had to sign off along with everyone under him. In my 6ish years we only ever had one they made us push through.
It's not an update though. It's a definition file. There have been literally billions of these pushed without issue. The change rate on them is getting to the point where you could have people working 24/7 and not keep up - it's not like the old days with weekly or monthly ones; hard to test when there are multiple a day!
 
Understood. I know I wouldn't have any software in my systems/environment for which I can't test updates or signature files. It's literally introducing significant third-party risk into your environment.
 
When worlds collide…poor Merc team…at least CrowdStrike is paying them.

[attached image]
 
It's not an update though. It's a definition file. There have been literally billions of these pushed without issue. The change rate on them is getting to the point where you could have people working 24/7 and not keep up - it's not like the old days with weekly or monthly ones; hard to test when there are multiple a day!
Some definitions can come as fast as every 15 minutes.
 
If it was a changed definition file that led to this kernel panic, that is even worse. It means there's a memory error in the definition-file parser or in the processing. Such a memory error can easily lead to something exploitable.
 
If it was a changed definition file that led to this kernel panic, that is even worse. It means there's a memory error in the definition-file parser or in the processing. Such a memory error can easily lead to something exploitable.
Nah, the definition flags part of the OS itself as a problem and kills it. The system BSODs and you have a bad time.
 
It's not an update though. It's a definition file. There have been literally billions of these pushed without issue. The change rate on them is getting to the point where you could have people working 24/7 and not keep up - it's not like the old days with weekly or monthly ones; hard to test when there are multiple a day!
There has to be some testing though, right? Surely they are not updating and pushing straight to production without any kind of quick validation?
 
There has to be some testing though, right? Surely they are not updating and pushing straight to production without any kind of quick validation?
Who, customers? Nope. It pulls as they update. It's a SaaS model - and it's a rapidly moving target. How do you test 4 updates a day?

It’s been that way for 8+ years. This is the first time it’s gone sideways in a noticeable way. It’s 3+ times a day. Across billions of devices. It tends to be insanely reliable.

Now the vendor? They test. This got corrupted somewhere.
 
If it was a changed definition file that led to this kernel panic, that is even worse. It means there's a memory error in the definition-file parser or in the processing. Such a memory error can easily lead to something exploitable.
Yup. Totally nuts garbage data that triggered… something. Better error handling must exist. Don't crash the whole system - panic the app.
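
As a toy sketch of that fail-closed idea - a purely hypothetical file format and loader, nothing to do with CrowdStrike's real channel-file layout - the loader validates the blob and keeps the old definitions instead of crashing:

```python
# Toy loader that validates a (hypothetical) definition blob before use.
# A malformed or truncated file is rejected and the previous definitions
# stay active, instead of the parser reading past the end of the buffer.
import struct

MAGIC = b"DEFS"                   # hypothetical 4-byte magic
HEADER = struct.Struct("<4sI")    # magic + record count

def load_definitions(blob: bytes) -> list[bytes]:
    if len(blob) < HEADER.size:
        raise ValueError("truncated header")
    magic, count = HEADER.unpack_from(blob, 0)
    if magic != MAGIC:
        raise ValueError("bad magic, refusing to parse")
    records, offset = [], HEADER.size
    for _ in range(count):
        if offset + 4 > len(blob):
            raise ValueError("record length field runs past end of file")
        (length,) = struct.unpack_from("<I", blob, offset)
        offset += 4
        if offset + length > len(blob):
            raise ValueError("record body runs past end of file")
        records.append(blob[offset:offset + length])
        offset += length
    return records

def apply_update(blob: bytes, current: list[bytes]) -> list[bytes]:
    """Fail closed: keep the old definitions if the new blob doesn't validate."""
    try:
        return load_definitions(blob)
    except ValueError as err:
        print(f"rejected definition update: {err}")
        return current

# A corrupted blob - e.g. all zeroes - is rejected, not blindly parsed.
print(len(apply_update(b"\x00" * 1024, current=[])))   # 0 -> old (empty) set kept
```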
 
Understood. I know I wouldn't have any software in my systems/environment for which I can't test updates or signature files. It's literally introducing significant third-party risk into your environment.
Sadly, the cyber insurance company will tell you that without it you have significant external risk - to the point they won't provide coverage.

I mean hell - there have been days with a dozen updates. How the heck would that be tested on the consumer side 😂
 
Who, customers? Nope. It pulls as they update. It's a SaaS model - and it's a rapidly moving target. How do you test 4 updates a day?

It’s been that way for 8+ years. This is the first time it’s gone sideways in a noticeable way. It’s 3+ times a day. Across billions of devices. It tends to be insanely reliable.

Now the vendor? They test. This got corrupted somewhere.
Yup, not the customer - the vendor should be testing this. How did they miss it?
 
The bugcheck was PAGE_FAULT_IN_NONPAGED_AREA. It's a kernel driver, csagent.sys. If the kernel driver tries to load a file and fails to handle that failure, you get a page fault. I'm 100% sure their CEO wants a full internal investigation to make sure this doesn't happen again. I mean, the stock lost 20% of its value. The board of directors will demand changes, if nothing else. For me, I do hope they stop this "dozen updates a day" crap and stagger the updates across their agents instead of pushing to all of them at once.
 
Who, customers? Nope. It pulls as they update. It's a SaaS model - and it's a rapidly moving target. How do you test 4 updates a day?

Could let 10% of clients update at a time. If it's 100% crashing, the deployment gets stuck at 10% because they never check back in after the update, and you only have to fix 10% of your machines. It's just a smoke test, not a thorough test, but it can also go pretty quickly if you really need stuff pushed fast. You can also do a growth thing where every client that updates successfully unlocks two update slots, etc. The key thing is some sort of feedback loop other than waiting for the world to stop.
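
Something like this on the vendor side - all names and numbers are made up, this is just the feedback-loop idea described above, not anyone's actual pipeline:

```python
# Sketch of a canary rollout with a post-update check-in as the feedback loop.
# Fleet size, wave sizes and the health threshold are all made-up numbers.
import random
import time

def push_update(host: str) -> None:
    """Placeholder for telling one host to pull the new definition file."""
    pass

def checked_in_after_update(host: str) -> bool:
    """Placeholder: did the host phone home healthy after applying the update?
    Simulated here; a real system would wait on agent telemetry."""
    return random.random() > 0.02   # pretend ~2% of hosts fail to check in

def staged_rollout(fleet: list[str], first_wave_pct: float = 0.10,
                   min_healthy: float = 0.95, growth: int = 2) -> bool:
    """Roll out in waves; halt if too few canaries check back in."""
    remaining = list(fleet)
    wave_size = max(1, int(len(fleet) * first_wave_pct))
    while remaining:
        wave, remaining = remaining[:wave_size], remaining[wave_size:]
        for host in wave:
            push_update(host)
        time.sleep(0)   # stand-in for "wait for check-ins"
        healthy = sum(checked_in_after_update(h) for h in wave) / len(wave)
        if healthy < min_healthy:
            print(f"halting rollout: only {healthy:.0%} of the wave checked in")
            return False
        # every successful wave unlocks a bigger one (the "growth thing")
        wave_size *= growth
    return True

if __name__ == "__main__":
    fleet = [f"host-{i}" for i in range(10_000)]
    print("completed" if staged_rollout(fleet) else "stopped early")
```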
 
The bugcheck was PAGE_FAULT_IN_NONPAGED_AREA. It's a kernel driver, csagent.sys. If the kernel driver tries to load a file and fails to handle that failure, you get a page fault.

No, it is simple memory corruption, hence my remark that the error in the code (not the one in the definitions update) could have led to security vulnerabilities.

Also compare this to all the recent talk about memory-safe languages, especially Rust.
 
Could let 10% of clients update at a time. If it's 100% crashing, the deployment gets stuck at 10% because they never check back in after the update, and you only have to fix 10% of your machines. It's just a smoke test, not a thorough test, but it can also go pretty quickly if you really need stuff pushed fast. You can also do a growth thing where every client that updates successfully unlocks two update slots, etc. The key thing is some sort of feedback loop other than waiting for the world to stop.
Not a horrible set of ideas by any means, assuming you're allowing that kind of communication. But it's still 100% automated - things can go wrong.
 
No, it is simple memory corruption, hence my remark that the error in the code (not the one in the definitions update) could have led to security vulnerabilities.

Also compare this to all the recent talk about memory-safe languages, especially Rust.
Very, very valid. Extremely so. Congrats - you found a buffer overflow that doesn't allow access but works as a DoS.
 
No, it is simple memory corruption, hence my remark that the error in the code (not the one in the definitions update) could have led to security vulnerabilities.

Also compare this to all the recent talk about memory-safe languages, especially Rust.
It's not, it's a definition update to the pipeline sensor.
The pipeline sensor is not kernel-level; it's a monitor that supervises communication between Windows processes.
In this case it flagged a normal operating process as bad and killed critical Windows communication processes.
It's part of the Falcon system, and the pipeline sensor monitors for memory-based attacks and exploits.

The update was to counter an active C2 framework malware package that was circulating.

It looks like a very stupid error: instead of "if pattern matches this signature, kill it and report", they released "if pattern doesn't match this signature, kill it and report".

So it kills all memory interactions except the one they were aiming for.
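
Taking that description at face value (it's one poster's account, not a confirmed root cause), the difference is a single inverted condition - a toy Python illustration with stand-in names, not real Falcon rule syntax:

```python
# Toy illustration of the inverted-condition bug described above.
# "SIGNATURE" and the event strings are stand-ins, not real rule syntax.
import re

SIGNATURE = re.compile(r"evil_c2_beacon")   # the one pattern they were aiming for

def intended(event: str) -> str:
    # "if pattern matches this signature, kill it and report"
    return "kill and report" if SIGNATURE.search(event) else "allow"

def shipped(event: str) -> str:
    # "if pattern DOESN'T match this signature, kill it and report"
    return "kill and report" if not SIGNATURE.search(event) else "allow"

print(intended("svchost.exe -> rpc call"))   # allow
print(shipped("svchost.exe -> rpc call"))    # kill and report -- normal traffic dies
print(shipped("evil_c2_beacon handshake"))   # allow -- the one target survives
```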
 