The Outage Begins
The day started like any other Tuesday. However, at approximately 10:30 AM Eastern Time, things began to go awry in Amazon's Northern Virginia region, known as "us-east-1". Over the next three minutes, several regional AWS services started experiencing issues. This wasn't a complete outage for all of these services, but their functionality was significantly impaired, causing a ripple effect that disrupted the operations of many businesses and services relying on AWS.
The affected services included the AWS Management Console, Route 53, API Gateway, EventBridge, Amazon Connect, and EC2. The outage affected all Availability Zones in us-east-1 and disrupted a number of global services homed in this region, including AWS account root logins, Single Sign-On (SSO), and the Security Token Service (STS).
7:30 AM PST: An automated activity to scale capacity of one of the AWS services triggered unexpected behavior, resulting in a significant surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network. This caused delays in communication between these networks, leading to increased latency and errors.
9:28 AM PST: AWS teams completed work to move internal DNS traffic away from the congested network paths, which fully resolved the DNS resolution errors. This reduced the load on the impacted networking devices and improved the availability of several affected services, but did not fully resolve the AWS service impact or eliminate the congestion.
1:34 PM PST: Congestion significantly improved as the operations teams applied remediation actions.
2:22 PM PST: All network devices fully recovered. However, the AWS Security Token Service (STS) did not fully recover until 4:28 PM PST.
4:37 PM PST: API Gateway largely recovered, but customers may have continued to experience low error rates and throttling for several hours as API Gateway fully stabilized.
5:00 PM PST: Fargate API error rates began returning to normal.
6:40 PM PST: EventBridge experienced high event delivery latency until this time as it processed the backlog of events.
The impact of the outage was broad and far-reaching. It caused various problems for services like Netflix, Disney Plus, Roomba, Ticketmaster, and the Wall Street Journal. It also affected many Amazon services, including Prime Music, Ring doorbells, logistics apps in their fulfillment centers, and some parts of the Amazon.com shopping site.
The outage highlighted the interconnectedness of digital services and the potential domino effect that can occur when a primary cloud service provider experiences issues. It was a stark reminder of the potential risks of relying heavily on a single cloud service provider.
The Root Cause
After the dust settled, AWS provided a detailed explanation of what had caused the outage. The root cause was an unexpected behavior triggered by an automated system scaling up an internal service running on AWS's private internal network. This led to a significant surge of connection activity that overwhelmed the networking devices between the internal network and the main AWS network, resulting in delays in communication between these networks.
These delays increased latency and errors for services communicating between these networks, leading to persistent congestion and performance issues on the devices connecting the two networks. This congestion immediately impacted the availability of real-time monitoring data for AWS's internal operations teams, which impaired their ability to find and resolve the source of congestion.
The Aftermath and Lessons Learned
The outage had a significant impact on AWS's reputation. However, it also provided valuable lessons for AWS and its users. AWS has since taken several actions to prevent a recurrence of this event. They have turned off the scaling activities that triggered the event and are working on fixing a latent issue that prevented their systems from adequately backing off during the event.
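AWS has not published the internal implementation behind the back-off fix, but the behavior it describes, clients easing off a congested service instead of retrying in lockstep, is commonly realized as capped exponential back-off with jitter. A minimal sketch of that pattern in Python (the function names and parameters here are illustrative assumptions, not AWS's code):

```python
import random
import time

def backoff_delay(attempt, base=0.1, cap=20.0):
    """Capped exponential back-off with 'full jitter': wait a random
    time between 0 and min(cap, base * 2**attempt), so simultaneous
    retries spread out rather than synchronize into a surge."""
    return random.uniform(0.0, min(cap, base * 2 ** attempt))

def call_with_retries(operation, max_attempts=6):
    """Retry a flaky operation, sleeping a jittered, growing delay
    between attempts instead of hammering a recovering service."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # give up after the final attempt
            time.sleep(backoff_delay(attempt))
```

Without the jitter, every client that failed at the same moment would retry at the same moment, recreating exactly the kind of connection surge that congested the network devices in this incident.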
This incident serves as a reminder of the importance of designing robust and resilient systems. It also underscores the need for effective communication during operational issues. AWS has acknowledged the need to improve customer communication during such events and is planning significant upgrades.
While the AWS outage was a significant event that caused widespread disruption, it also provided valuable lessons. It highlighted the importance of resilience in system design and the need for effective communication during operational issues. As AWS continues to learn from this event and improve its systems, users can also take this opportunity to evaluate their systems and consider how they can enhance their resilience to such events in the future.
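One concrete resilience pattern users can apply on their side is a client-side circuit breaker: after repeated failures against a dependency, stop calling it for a cooling-off period rather than adding load to an already congested service. A minimal sketch, with an illustrative class and thresholds of my own choosing (not an AWS or SDK API):

```python
import time

class CircuitBreaker:
    """Tracks consecutive failures against a dependency; 'opens'
    (rejects calls) once a threshold is hit, then allows a trial
    call again after a cooling-off period."""

    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None  # monotonic timestamp when tripped

    def allow(self):
        if self.opened_at is None:
            return True  # closed: calls flow normally
        if time.monotonic() - self.opened_at >= self.reset_timeout:
            return True  # half-open: permit one trial call
        return False     # open: fail fast, shed load

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()
```

Failing fast like this keeps a degraded dependency from tying up threads and connection pools in every upstream service, which is one way the ripple effects described above get contained.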
In the world of cloud computing, outages are inevitable. However, they can be mitigated through robust system design, effective communication, and continuous learning from past incidents. The December 2021 AWS outage was a stark reminder of this reality. As we progress, cloud service providers and users must take these lessons to heart, continually striving to improve system resilience and response strategies.