How One Line of Code Almost Blew Up the Internet

The digital realm is woven into the fabric of modern society, and at its core lies a vast network of code. Every line of that code has the potential to sustain the digital world as we know it or to disrupt it entirely. Let's delve into a fascinating incident that nearly shook the internet's foundations: a single line of code in Cloudflare's system that almost led to a catastrophic leak of private information. The incident is a stark reminder of how fragile the digital infrastructure we take for granted really is, and of the critical importance of rigorous testing and code review. Buckle up for a deep dive into the incident that almost blew up the internet on February 24, 2017.

The Bug That Shook The Internet

On February 24, 2017, the internet faced a potential catastrophe. An incident involving Cloudflare, a company that provides performance and security services to millions of websites, threatened to spill private data across the web. The issue was a memory leak caused by a bug in Cloudflare's HTML parser. Tavis Ormandy of Google’s Project Zero first reported the bug after noticing corrupted web pages returned by some HTTP requests run through Cloudflare.

Cloudflare's edge servers, under certain unusual circumstances, were returning memory containing private data like HTTP cookies, authentication tokens, and HTTP POST bodies. Some of this data had even been cached by search engines, increasing the potential damage. While this sounds alarming, it's worth noting that Cloudflare customer SSL private keys were not leaked due to the isolation of SSL connections through an instance of NGINX unaffected by this bug.

Taking Immediate Action

The seriousness of the bug led to the immediate formation of a cross-functional team drawn from software engineering, infosec, and operations in San Francisco and London. The team worked around the clock in 12-hour shifts to keep a 24-hour cycle going. Their goals were to understand the cause and extent of the memory leakage and to coordinate with Google and other search engines to remove any cached HTTP responses. Thanks to their dedication, a problem that the industry-standard 90-day disclosure window would have allowed three months to resolve was fully dealt with globally in under seven hours, with initial mitigation in just 47 minutes.

The most significant impact period was between February 13 and February 18, when around 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulted in memory leakage. That is only about 0.00003% of requests, but given the sheer volume of traffic Cloudflare handles, the implications were significant.

Unraveling The Root Cause

The root of the problem lay in Cloudflare's parsing system. Cloudflare uses a parser to modify HTML pages as they pass through its edge servers, powering features such as inserting the Google Analytics tag, rewriting http:// links to https://, and more. The parser was generated with Ragel, a state machine compiler, and over the years it had become too complex to maintain. Consequently, Cloudflare began transitioning to a new parser named cf-html, which was faster and easier to maintain.

The memory leak bug had been latent in the Ragel-based parser for years but was only triggered when the new parser, cf-html, was introduced: the transition subtly changed the buffering, enabling the leak. As soon as the correlation with cf-html was identified, Cloudflare disabled two features that used it, Email Obfuscation and Automatic HTTPS Rewrites, stopping almost all memory leaks within seconds of each kill switch being flipped. A third feature, Server-Side Excludes, was also vulnerable; it had no global kill switch, so it was patched and the fix deployed worldwide within roughly three hours.

The Infamous One Line of Code

The root cause of the bug was a pointer error in the C code generated from Ragel. The end of the buffer was checked with the equality operator (==), which allowed the pointer to step past the end of the buffer, resulting in a buffer overrun. Had the check used the '>=' operator instead of '==', the overrun would have been caught.

The Ragel code contained a bug that let the pointer advance by more than one byte at a time, jumping clean over the end of the buffer so that the equality check never fired. The error was triggered when a web page ended with a broken HTML tag, such as <script type=. In such cases, the buffer would be overrun. Statistically, such broken tags occur at the end of about 0.06% of websites.

Incident Timeline

  • At an unspecified date, Cloudflare decided that their Ragel-based parser had become too complex to maintain, so they started to write a new parser, named cf-html, to replace it. This parser was used to modify HTML pages on the fly as they passed through Cloudflare's edge servers, for functions such as obfuscating email addresses, enabling AMP, inserting Google Analytics tags, and more.

  • A bug existed in Cloudflare's use of the Ragel-based parser that would result in a memory leak. However, this bug was not exposed until the introduction of cf-html, which subtly changed the buffering and enabled the leakage. The bug was not in the Ragel parser or cf-html itself, but in the way Cloudflare used Ragel.

  • On February 13, Cloudflare changed the Email Obfuscation feature, one of the features that used cf-html. This change became the primary trigger of the memory leak.

  • Between February 13 and February 18, approximately 1 in every 3,300,000 HTTP requests through Cloudflare potentially resulted in memory leakage, which is about 0.00003% of requests.

  • On an unspecified Friday, Tavis Ormandy from Google’s Project Zero contacted Cloudflare to report a security problem with Cloudflare's edge servers. He was observing corrupted web pages being returned by some HTTP requests run through Cloudflare. The edge servers were running past the end of a buffer and returning memory that contained private information such as HTTP cookies, authentication tokens, HTTP POST bodies, and other sensitive data.

  • Cloudflare responded by forming a cross-functional team from software engineering, infosec, and operations in San Francisco and London. They worked to fully understand the underlying cause, understand the effect of the memory leakage, and to work with Google and other search engines to remove any cached HTTP responses.

  • 47 minutes after receiving details of the problem, Cloudflare activated the global kill for the Email Obfuscation feature, stopping almost all memory leaks. 3 hours and 5 minutes later, they activated the global kill for the Automatic HTTPS Rewrites feature. They confirmed via test URIs that they were no longer seeing memory leakage and had Google double-check that they saw the same thing.

  • Cloudflare discovered that a third feature, Server-Side Excludes, was also vulnerable. Since it was an older feature, it did not have a global kill switch. They implemented a global kill for Server-Side Excludes and deployed a patch to their fleet worldwide, which took roughly three hours.

  • The root cause of the bug was a pointer error. The Ragel code that Cloudflare wrote caused the pointer to jump over the end of the buffer, past the point where an equality check could catch the overrun. The error occurred when a web page ended with a broken HTML tag. From their statistics, they found that such broken tags at the end of the HTML occur on about 0.06% of websites.

  • Cloudflare stated that they fixed the bug and its consequences globally in under 7 hours with an initial mitigation in 47 minutes. They also emphasized that Cloudflare customer SSL private keys were not leaked and that they have not discovered any evidence of malicious exploits of the bug or other reports of its existence.

A Close Call and Lessons Learned

This incident was a close call for the internet. With a significant number of websites relying on Cloudflare's services, the potential leak of private data could have had far-reaching consequences. However, the swift action taken by Cloudflare's team, their transparency about the problem, and the steps taken to mitigate the issue serve as an excellent example of handling security crises.

Moreover, the incident underscores the importance of rigorous testing and code review processes, especially when dealing with critical systems handling sensitive data. Even a seemingly innocuous code change can have unforeseen and widespread effects, reminding us all of the profound responsibility and potential impact of a single line of code.
