
CrowdStrike and Microsoft Outage

On July 19th, CrowdStrike deployed a faulty configuration update for its Falcon sensor software. The affected devices were PCs and servers running the Windows operating system with the CrowdStrike Falcon sensor installed, including Windows 10 and 11 machines, Windows virtual machines, and workloads on the Microsoft Azure platform. Affected systems worldwide began crashing and rebooting shortly after the faulty content was pushed at 04:09 UTC (just after midnight Eastern). Machines running macOS and Linux were NOT affected, although a number of sources noted that a similar issue had hit devices running Linux in April 2024.

At 05:27 UTC (approximately 1:27 AM Eastern), CrowdStrike deployed corrected content. Devices that booted with this later content were not affected.

NATURE OF THE ISSUE:

CrowdStrike’s CEO, George Kurtz, confirmed the issue was due to a faulty kernel-level configuration file and NOT the result of a cyber attack. Given the AT&T data breach disclosed just a week earlier, on July 12th, and the fact that there have already been 10 major cyber attacks or data breaches in 2024, it was not surprising that the general public, along with affected businesses of every type and size, from airlines and hospitals to federal agencies and retail stores, immediately assumed that another cyber incident was behind the ensuing outages. Thankfully this was not a cyber attack; however, the incident does point out just how vulnerable the organizations we critically rely on are.

THE FIX:

Should any organization still be unable to resolve the issue, a number of firms, including CyberSecOp, can help with the relatively straightforward but painstaking fix for the outage outlined below.

  • Affected machines can be restored by booting into Safe Mode or the Windows Recovery Environment and deleting any .sys file whose name begins with C-00000291- and whose timestamp is 0409 UTC in the %windir%\System32\drivers\CrowdStrike\ directory (see the sketch after this list). 

  • This process must be done locally on each individual device.

  • Someone must reboot each affected computer individually, with manual intervention on every system.

  • NOTE: Some Azure customers have had success by rebooting affected virtual machines numerous times (10, 12, even 15 reboots were not unheard of) while connected to Ethernet.

  • NOTE: Microsoft has also recommended restoring from a backup taken before July 18th. 
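
For teams scripting the cleanup once a machine can boot again, the sketch below shows the file-matching step from the first bullet in Python. It is illustrative only: it assumes the standard driver path, approximates the "0409 UTC" criterion using the file's modification time, and requires a host that can already run Python with administrative rights. The manual Safe Mode / Recovery Environment procedure above remains the documented remediation.

```python
"""Illustrative sketch only: locate (and optionally delete) the faulty
CrowdStrike channel files described in the list above. It assumes the
standard %windir%\\System32\\drivers\\CrowdStrike path, uses the file's
modification time as the '0409 UTC' timestamp check, and requires a
machine that can already boot far enough to run Python with admin
rights; the manual Safe Mode / WinRE steps remain the documented fix."""

import argparse
import os
from datetime import datetime, timezone
from pathlib import Path


def has_0409_utc_timestamp(path: Path) -> bool:
    """Approximate the advisory's '0409 UTC' criterion via the file's mtime."""
    mtime = datetime.fromtimestamp(path.stat().st_mtime, tz=timezone.utc)
    return (mtime.hour, mtime.minute) == (4, 9)


def find_faulty_channel_files() -> list[Path]:
    """Return C-00000291*.sys files in the CrowdStrike driver folder."""
    windir = os.environ.get("SystemRoot", r"C:\Windows")
    crowdstrike_dir = Path(windir) / "System32" / "drivers" / "CrowdStrike"
    if not crowdstrike_dir.is_dir():
        return []
    return [p for p in sorted(crowdstrike_dir.glob("C-00000291*.sys"))
            if has_0409_utc_timestamp(p)]


def main() -> None:
    parser = argparse.ArgumentParser(
        description="List (or delete) the faulty C-00000291*.sys channel files"
    )
    parser.add_argument("--delete", action="store_true",
                        help="delete the matching files instead of only listing them")
    args = parser.parse_args()

    matches = find_faulty_channel_files()
    if not matches:
        print("No matching C-00000291*.sys files found.")
    for path in matches:
        if args.delete:
            path.unlink()  # deletion requires administrative rights
            print(f"Deleted {path}")
        else:
            print(f"Found {path}")


if __name__ == "__main__":
    main()
```

Run without arguments, the script only reports what it finds; adding --delete performs the removal, so it should be tested on a single recovered host before any wider use.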

HOW AND WHY THIS HAPPENED

Though at this time we cannot be 100% certain what caused this issue, one of the primary culprits appears to be insufficient testing and validation of the configuration update prior to its release. As noted in last week's posting from CyberSecOp, it is absolutely critical for organizations of all sizes, across all market segments, to properly plan and establish policies for the use, deployment, and ongoing updating of their technology ecosystem. In this case, whether CrowdStrike has the proper plans and controls in place to screen and test its upgrades, patches, and releases is not in question; what is being questioned, far more importantly, is whether those controls were followed. Further, are the organizations affected by this latest outage, and their managed services providers, properly testing and validating ANY changes to their environments before deploying them into production? Is a CMDB (configuration management database) in place?
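
By way of illustration only, the sketch below shows the kind of staged-rollout gate such change-management policies imply: a content update is promoted from an internal test ring to a canary ring to production only while each ring stays under a crash-rate threshold. The ring names, threshold, and example numbers are hypothetical assumptions, not CrowdStrike's or any other vendor's actual release process.

```python
"""Hypothetical sketch of a staged-rollout (ring) gate for content updates.
Ring names, the crash-rate threshold, and the example numbers are
illustrative assumptions, not any vendor's actual release process."""

from dataclasses import dataclass


@dataclass
class RingResult:
    ring: str
    hosts_updated: int
    hosts_crashed: int

    @property
    def crash_rate(self) -> float:
        return self.hosts_crashed / self.hosts_updated if self.hosts_updated else 0.0


# Promotion order and the (hypothetical) tolerated crash rate per ring.
ROLLOUT_RINGS = ["internal-test", "canary", "production"]
MAX_CRASH_RATE = 0.001  # 0.1% of updated hosts


def may_promote(previous_ring: RingResult) -> bool:
    """Promote only if the previous ring updated hosts and stayed under the threshold."""
    return previous_ring.hosts_updated > 0 and previous_ring.crash_rate <= MAX_CRASH_RATE


if __name__ == "__main__":
    # Example: the update crashes every host in the internal test ring,
    # so the gate blocks promotion to canary (and therefore to production).
    test_ring = RingResult("internal-test", hosts_updated=50, hosts_crashed=50)
    print("Promote to canary?", may_promote(test_ring))  # -> False
```

The point of the gate is simply that a change which fails in a small, observable ring never reaches the production ring at all.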

THE MORAL OF THE STORY …

Organizations must develop, iterate on, and adhere to robust policies and procedures to enhance their change management processes. From a risk management standpoint, organizations should reconsider their operational strategies to ensure that dependency on a single vendor does not impact all of their operations. Diversifying vendors and creating competitive hedges can be crucial in times of disaster. Questions organizations should consider asking their managed services providers include:

  • Was this a planned update?

  • What testing was conducted in non-production environments, and what were the results?

  • What CMDB policies and procedures were overlooked?