CrowdStrike Down

CrowdStrike was one of the world's largest and most respected IT security providers. Then a routine content update made it the culprit behind what has been called the worst IT disaster in history.

A single software update crashed roughly 8.5 million Windows computers, disrupting airlines, banks, hospitals, and governments around the world.

Timeline of the CrowdStrike IT Disaster - July 19, 2024 (IDT)

  • 07:00 IDT
    The CrowdStrike team begins preparing to roll out a new update for its flagship product, CrowdStrike Falcon Sensor. The update is intended to help the sensor identify the latest malicious software, a routine activity performed several times daily.

  • 07:09 IDT
    The update is deployed, and PCs with Falcon product sensors worldwide download the new information. Initially, everything appears to be functioning normally.

  • Shortly after 07:09 IDT
    Issues begin to surface. Windows PCs that received the update abruptly experience the Blue Screen of Death. This is widespread, affecting all PCs with the update. The affected PCs attempt to reboot three times before entering recovery mode to prevent further damage.

  • 08:27 IDT
    Realizing the severity of the situation, the CrowdStrike team halts the rollout and deploys a fix. However, the damage is already widespread, setting the stage for one of the worst IT disasters in history.
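
To put the rollout window in perspective, the two timestamps above can be converted to UTC and subtracted. This is a minimal sketch that assumes "IDT" means Israel Daylight Time (UTC+3); the variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: "IDT" in the timeline is Israel Daylight Time, i.e. UTC+3.
IDT = timezone(timedelta(hours=3), name="IDT")

deployed = datetime(2024, 7, 19, 7, 9, tzinfo=IDT)   # update goes live
halted = datetime(2024, 7, 19, 8, 27, tzinfo=IDT)    # rollout halted, fix deployed

deployed_utc = deployed.astimezone(timezone.utc)
halted_utc = halted.astimezone(timezone.utc)
exposure_minutes = int((halted - deployed).total_seconds() // 60)

print(f"deployed: {deployed_utc:%H:%M} UTC")
print(f"halted:   {halted_utc:%H:%M} UTC")
print(f"exposure window: {exposure_minutes} minutes")
```

Under that assumption the faulty update was available for 78 minutes, from 04:09 to 05:27 UTC.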

The Disaster Unfolds

  • Afternoon in Asia and Australia (09:00 - 15:00 IDT)
    Workers notice disruptions as cafe and retail systems suddenly stop working. Inventory management systems also malfunction, so new orders can be neither loaded nor tracked. Office workers hit the Blue Screen of Death and are prompted to enter a BitLocker recovery key to decrypt and recover their systems.

  • 09:00 IDT
    In Europe, people starting their workday encounter the same issues. Office equipment fails to start, rendering employees unable to begin their work.

  • 10:00 IDT
    Various US states, including Arizona and Alaska, see their 911 services suddenly go offline. Emergency response workers scramble to find a solution, but confusion reigns. Simultaneously, hospitals like Penn Medicine in Pennsylvania and Northwell Health in New York experience system crashes, forcing the delay of non-urgent visits and surgeries.

The Aviation Crisis

  • 11:00 IDT
    The Federal Aviation Administration (FAA) identifies similar issues: flight navigation systems fail. With most flights relying on automated systems, this becomes a major problem. The FAA orders every flight from Delta, American Airlines, United, and Allegiant Airlines to be grounded. Over 5,000 flights are grounded, and 35,000 are delayed, leaving over 1 million passengers stranded.

CrowdStrike Responds

  • 12:45 IDT
    At the height of the panic, CrowdStrike CEO George Kurtz issues a public statement via Twitter: “CrowdStrike is actively working with customers impacted by a defect found in a single content update for Windows hosts. Mac and Linux hosts are not impacted. This is not a security incident or cyberattack. The issue has been identified, isolated and a fix has been deployed.” However, the fix requires manual intervention on every affected computer, more than 8 million in total.

The Aftermath

  • 13:00 - 14:00 IDT
    Public transport in the Northeast United States grinds to a halt as trains and buses cannot depart. Passengers are left waiting with no idea when services will resume.

  • 15:00 IDT
    Global and regional banks experience outages, and their customer portals stop functioning. At the same time, many computers across the US federal government become unusable. An investigation begins, with IT support staff working around the clock to restore systems.

  • 17:00 IDT
    Some flights cautiously resume, and some companies have their computers back online, allowing business to resume. However, many issues persist.

  • 19:00 IDT
    Federal agencies struggle with the issues, and hospitals still report outages. President Biden is briefed on the crisis. CrowdStrike’s CEO apologizes for the outage and offers support to affected customers. A report on the event is released, with guidance on repairing affected systems.

Much of the world was back online within a day, but major problems lingered because each device had to be fixed manually. The affected machines made up less than 1% of all Windows devices, but these weren’t just any devices: they sat at the heart of banks, airlines, hospitals, governments, and more. Even at less than 1%, the impact was massive. Fortune 500 companies are estimated to have lost close to $5.4 billion. All of this came from one small content update.

What Happened?

So, how did all of this happen? How did one update impact the IT world so significantly and so quickly?

It comes down to how CrowdStrike works. CrowdStrike’s Falcon Sensor is not ordinary antivirus software; it is endpoint protection that runs far deeper in the system.
Think of the human immune system. It does a fantastic job of looking for threats and can neutralize them automatically. But when it malfunctions, the consequences can be terrible.
Falcon Sensor is very similar. It operates at the lowest possible level of your computer: not just among user software, but in the “kernel,” the essential program that runs your operating system and talks to hardware.

This means that Falcon Sensor can see everything at the lowest level of monitoring, making it highly secure and the best line of defense. But that also comes with downsides and risks. There are fewer barriers between it and the hardware.
If something goes wrong in user software, only that program crashes. If something goes wrong in code running inside the kernel, the whole device crashes. This is precisely what happened.
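
The user-software half of that contrast is easy to demonstrate. The sketch below is only an illustration of process isolation, not anything from CrowdStrike: `run_isolated` is a hypothetical helper that runs a snippet in a separate user-mode process and reports its exit code.

```python
import subprocess
import sys

def run_isolated(snippet: str) -> int:
    """Run a Python snippet in its own user-mode process and return the exit code."""
    result = subprocess.run(
        [sys.executable, "-c", snippet],
        capture_output=True,
        text=True,
    )
    return result.returncode

# The child reads one element past the end of a 20-item list and dies.
crash_code = run_isolated("values = [0] * 20; print(values[20])")

# The parent is unaffected and can keep launching other work.
ok_code = run_isolated("print('still fine')")

print("crashing child exit code:", crash_code)
print("healthy child exit code:", ok_code)
```

The failed child takes nothing else down with it. Kernel-mode code has no such supervisor above it, which is why an equivalent mistake in a driver stops the whole machine.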

CrowdStrike didn’t release a major update. It was a small piece of new information meant to help identify new malicious software. However, the information was faulty and triggered a logic error inside the kernel-level sensor, which then tried to read memory that was not valid. When Windows detected the invalid access, it immediately halted the computer to prevent further damage. The Blue Screen of Death was Windows protecting the computer from something worse.

The Cause (TL;DR)

CrowdStrike runs automated checks on these updates using a tool called the “Content Validator.” Such checks can only go so far in catching bugs, but because CrowdStrike had rolled out thousands of similar updates in the past, it did not consider anything more necessary. By CrowdStrike’s account, trust built up from earlier tests and successful deployments meant that no additional testing, such as dynamic checks, was performed, so the bad update reached customers and caused the global outage. Testing the update on a single Windows PC would probably have exposed the problem.

The company dropped the ball and published the update without thorough testing.
Given how much code companies like CrowdStrike deploy, some bad code will likely slip through no matter how much testing they do.
And that brings us to why cybersecurity experts are furious at CrowdStrike.
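They’re furious not because CrowdStrike wrote some faulty code but because of how it was deployed: to every customer at once, without a staged rollout that could have caught the crash on a small group of machines first.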

Root Cause Analysis — Channel File 291

The CrowdStrike Falcon sensor leverages powerful on-sensor AI and machine learning models to protect customer systems by identifying and remediating the latest advanced threats. These models are continually updated and strengthened with insights from threat telemetry and intelligence gathered by CrowdStrike's security teams. The data begins as filtered and aggregated information on each sensor in a local graph store. The sensor correlates this context with live system activity to identify behaviors and indicators of attack (IOAs).

A vital part of this process is the Sensor Detection Engine, which combines built-in Sensor Content with Rapid Response Content delivered from the cloud. Rapid Response Content allows the sensor to gather telemetry, identify indicators of adversary behavior, and enhance detection capabilities without requiring code changes on the sensor.

Rapid Response Content is delivered through Channel Files and interpreted by the sensor’s Content Interpreter, which uses a regular-expression-based engine. Each Rapid Response Content channel file is associated with a specific Template Type built into the sensor. This Template Type provides the Content Interpreter with activity data and graph context to be matched against the Rapid Response Content.

With the release of sensor version 7.11 in February 2024, CrowdStrike introduced a new Template Type designed to detect novel attack techniques that abuse named pipes and other Windows interprocess communication (IPC) mechanisms. This new IPC Template Type was developed, tested, and integrated into the sensor following standard procedures. IPC Template Instances are delivered to sensors via a corresponding Channel File numbered 291.

However, the new IPC Template Type defined 21 input parameter fields, while the integration code that invoked the Content Interpreter with Channel File 291’s Template Instances supplied only 20 input values. This mismatch went unnoticed through multiple layers of build validation and testing, including sensor release testing and stress testing of the Template Type with the initial IPC Template Instances.

On July 19, 2024, two additional IPC Template Instances were deployed. One of these introduced a non-wildcard matching criterion for the 21st input parameter. This led to a new version of Channel File 291 requiring the sensor to inspect the 21st input parameter. The Content Validator evaluated the new Template Instances based on the expectation that 21 inputs would be provided.

As a result, sensors that received the new version of Channel File 291 were exposed to a latent out-of-bounds read issue in the Content Interpreter. When the new IPC Template Instances were evaluated against an IPC notification from the operating system, the Content Interpreter attempted to access the 21st value, which did not exist, leading to a system crash.
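
The mechanism described above can be reproduced in miniature. The following is a simplified Python analogy, not CrowdStrike's code: `evaluate`, `WILDCARD`, the sample values, and the `pipe-.*` pattern are all hypothetical. It shows why 20 supplied values were harmless while the 21st criterion was a wildcard, and why a non-wildcard 21st criterion forces an out-of-range read.

```python
import re

DEFINED_FIELDS = 21    # fields declared by the template type (per the text)
SUPPLIED_VALUES = 20   # values the integration code actually passed in
WILDCARD = None        # hypothetical marker: "match anything, do not inspect"

def evaluate(criteria, inputs):
    """Return True if every non-wildcard criterion matches its input value."""
    for index, pattern in enumerate(criteria):
        if pattern is WILDCARD:
            continue                  # wildcard: inputs[index] is never read
        value = inputs[index]         # out of range when index == 20
        if re.fullmatch(pattern, value) is None:
            return False
    return True

inputs = [f"value{i}" for i in range(SUPPLIED_VALUES)]

# Earlier instances: the 21st criterion is a wildcard, so index 20 is never read.
old_instance = [WILDCARD] * DEFINED_FIELDS
old_instance[0] = r"value0"

# New instance: a non-wildcard 21st criterion forces a read of inputs[20].
new_instance = list(old_instance)
new_instance[20] = r"pipe-.*"

old_result = evaluate(old_instance, inputs)
try:
    evaluate(new_instance, inputs)
    new_result = "no error"
except IndexError:
    new_result = "IndexError"

print(old_result, new_result)
```

Python turns the bad read into a catchable `IndexError`. In kernel-mode C++ the equivalent read simply touches whatever lies past the end of the array, and no exception handler can recover from it.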

Technical Details

Here are the key technical components involved:

  • Content Interpreter: Part of the sensor C++ code that tests input strings against regexes.
  • Template Types: Predefined fields used by threat detection engineers to create Rapid Response Content.
  • Template Type Definitions File: Defines the parameters for each Template Type, including the expected number of inputs.
  • Rapid Response Content: Bundled Template Instances delivered via channel files.
  • Content Validator: Validates channel files against the Template Type Definitions file.
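
How these components relate can be sketched as follows. This is a hypothetical model, not CrowdStrike's validator: `TemplateTypeDefinition`, `static_check`, and `runtime_aware_check` are invented names. The first check compares an instance only with the definition, as the text describes; the second shows one kind of extra check that would also have flagged the mismatch.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass(frozen=True)
class TemplateTypeDefinition:
    """Hypothetical stand-in for one entry of the Template Type Definitions file."""
    name: str
    expected_inputs: int

def static_check(definition: TemplateTypeDefinition,
                 criteria: List[Optional[str]]) -> bool:
    """Compare the instance with the definition only."""
    return len(criteria) == definition.expected_inputs

def runtime_aware_check(definition: TemplateTypeDefinition,
                        criteria: List[Optional[str]],
                        supplied_inputs: int) -> List[str]:
    """Also flag criteria that inspect a field the caller never supplies."""
    problems = []
    if len(criteria) != definition.expected_inputs:
        problems.append("criteria count does not match the definition")
    for index, pattern in enumerate(criteria):
        if pattern is not None and index >= supplied_inputs:
            problems.append(f"field {index + 1} is inspected but never supplied")
    return problems

ipc = TemplateTypeDefinition(name="IPC", expected_inputs=21)
criteria: List[Optional[str]] = [None] * 21   # None stands for a wildcard
criteria[20] = "pipe-.*"                      # non-wildcard 21st field

passes_static = static_check(ipc, criteria)
problems = runtime_aware_check(ipc, criteria, supplied_inputs=20)

print(passes_static)
print(problems)
```

The instance passes the definition-only check because it really does have 21 criteria. Only a check that knows how many values arrive at runtime can see the gap.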

Crash Dump Analysis

The crash occurred due to an out-of-bounds memory read caused by the mismatch between the number of inputs provided and expected. Below is an excerpt from the crash dump analysis:

1: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

PAGE_FAULT_IN_NONPAGED_AREA (50)
Invalid system memory was referenced. This cannot be protected by try-except.
Typically the address is just plain bad or it is pointing at freed memory.
Arguments:
Arg1: ffffd6030000006a, memory referenced.
Arg2: 0000000000000000, X64: bit 0 set if the fault was due to a not-present PTE.
    bit 1 is set if the fault was due to a write, clear if a read.
    bit 3 is set if the processor decided the fault was due to a corrupted PTE.
    bit 4 is set if the fault was due to attempted execute of a no-execute PTE.
    - ARM64: bit 1 is set if the fault was due to a write, clear if a read.
    bit 3 is set if the fault was due to attempted execute of a no-execute PTE.
Arg3: fffff8020ebc14ed, If non-zero, the instruction address which referenced the bad memory address.
Arg4: 0000000000000002, (reserved)

The out-of-bounds read occurred when the code tried to read the 21st input value, which had never been supplied. The faulting driver, csagent.sys, read past the end of the input array, and the access landed on invalid memory.
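
The Arg2 value in the dump confirms that the fault was a read. Here is a small sketch that decodes it using only the x64 bit meanings printed in the dump itself; `decode_arg2` and `X64_FAULT_BITS` are illustrative names, not part of any debugger API.

```python
# Bit meanings copied from the x64 description in the dump above.
X64_FAULT_BITS = {
    0: "not-present PTE",
    3: "corrupted PTE",
    4: "execute of a no-execute PTE",
}

def decode_arg2(arg2_hex: str) -> dict:
    """Decode bugcheck 0x50 Arg2 according to the bit list shown in the dump."""
    value = int(arg2_hex, 16)
    return {
        # Bit 1 distinguishes writes (set) from reads (clear).
        "access": "write" if value & (1 << 1) else "read",
        "flags": [label for bit, label in X64_FAULT_BITS.items()
                  if value & (1 << bit)],
    }

decoded = decode_arg2("0000000000000000")
print(decoded)
```

With Arg2 equal to zero, bit 1 is clear, so the access was a read and none of the other listed conditions are flagged.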

This line shows the code instruction that caused the error:

csagent+0xe14ed:
fffff802`0ebc14ed 458b08 mov r9d,dword ptr [r8] ds:ffffd603`0000006a=????????

The invalid pointer in r8 led to an attempt to read beyond the allocated memory, resulting in a system crash.
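
The addresses in the excerpt can be cross-checked with a little arithmetic. This sketch uses only values quoted above; `parse_address` is a hypothetical helper, and the module base is derived here from the address and offset rather than reported by the dump.

```python
def parse_address(text: str) -> int:
    """Parse a WinDbg-style 64-bit hex address, ignoring the ` separator."""
    return int(text.replace("`", ""), 16)

instruction = parse_address("fffff802`0ebc14ed")    # from the disassembly line
arg3 = parse_address("fffff8020ebc14ed")            # Arg3 of the bugcheck
offset = 0xE14ED                                    # from "csagent+0xe14ed"
faulting_read = parse_address("ffffd603`0000006a")  # address read through r8
arg1 = parse_address("ffffd6030000006a")            # Arg1 of the bugcheck

# Derived value: where csagent would have to be loaded for the offset to fit.
module_base = instruction - offset

print(instruction == arg3)      # faulting instruction matches Arg3
print(faulting_read == arg1)    # address read matches Arg1
print(hex(module_base))
```

Both comparisons hold, so the disassembled `mov` is the instruction the bugcheck blames, and the address it dereferenced is the one reported as "memory referenced."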
