The Great AT&T Network Outage of 1990

On a seemingly ordinary Monday afternoon, January 15, 1990, at 2:25 pm, AT&T's Network Operations Center began receiving alarming notifications.

Their extensive network was experiencing an unusual number of warning signals, marking the onset of a massive network outage. This outage, which persisted for nine hours, had severe repercussions for airlines, businesses, and individuals, resulting in financial losses exceeding $60 million. This incident provides valuable insights into the complexities and vulnerabilities of large-scale network systems.

The Backbone of American Communications

AT&T's long-distance network was a crucial part of the American communications infrastructure, carrying over 70% of the nation's long-distance traffic and routing more than 115 million telephone calls daily. To support this vast network, AT&T deployed 114 computer-operated electronic switches, known as 4ESS switches, across the United States, each capable of handling up to 700,000 calls per hour. These switches were interconnected through a signaling network called CCSS7 (Common Channel Signaling System 7), which coordinated communications between them.

How the Switching System Worked

When a call was initiated, the local switch analyzed possible routes to complete it. The call would then be directed through a series of switches until it reached the destination switch, which would connect it to the recipient's phone line. If the destination switch was unavailable, the originating switch would notify the caller that the recipient was unreachable.

Figure: A call travels from the caller's phone to the local switch, through a series of intermediate switches, and on to the destination switch, which connects it to the recipient's phone. If the destination switch is unavailable, a notification is routed back to the caller.
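To make the decision logic concrete, here is a minimal sketch of the routing behavior described above. The structure and names (network_switch, route_call) are illustrative assumptions, not AT&T's actual switching software:

#include <stdio.h>
#include <stdbool.h>

/* Minimal sketch of the call-routing logic described above. The data
 * structures and names are illustrative assumptions, not the 4ESS software. */

#define MAX_HOPS 8

struct network_switch {
    const char *name;
    bool        available;   /* can this switch accept the call right now? */
};

/* Forward the call along a precomputed route; if any hop, including the
 * destination, cannot accept it, report back to the caller. */
static bool route_call(const struct network_switch route[], int hops) {
    for (int i = 0; i < hops; i++) {
        if (!route[i].available) {
            printf("%s unavailable: notifying caller that the recipient is unreachable\n",
                   route[i].name);
            return false;
        }
        printf("call forwarded through %s\n", route[i].name);
    }
    printf("destination switch connects the call to the recipient's line\n");
    return true;
}

int main(void) {
    struct network_switch route[MAX_HOPS] = {
        { "local switch (originating)", true },
        { "intermediate switch A",      true },
        { "intermediate switch B",      true },
        { "destination switch",         true },   /* flip to false to see the failure path */
    };
    route_call(route, 4);
    return 0;
}

In the real network each 4ESS switch computed routes dynamically and signaled over CCSS7 rather than walking a fixed list, but the two outcomes, a completed call or an "unreachable" notification back to the caller, are the same as shown here.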

The Incident Begins

A switch in New York was scheduled for routine maintenance because it was nearing its load limit and experiencing overload. During this maintenance window, the switch alerted the other switches that it would stop accepting requests for about four seconds, and incoming requests were held in a queue in the meantime.

The Cascading Failure

  1. Reset and Overwrite: After the reset, the New York switch began pulling requests from its queue and forwarding them to the destination switch. The destination switch updated its records to indicate that the New York switch was back online. However, less than 10 milliseconds later, a second message arrived while the first was still being processed, and due to a critical software defect it overwrote the communication information.

  2. Backup Activation: The software in the destination switch detected the corrupted data and immediately switched over to a backup while attempting to reset itself. Unfortunately, another pair of closely timed messages triggered the same fault in the backup.

  3. Propagation and Cascading Effect: Upon recovery, the destination switch replayed its backlog and sent the same closely spaced messages on to other switches, triggering the fault in them as well and producing a cascading effect across the network, as illustrated in the sketch below.
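To make the sequence above concrete, here is a toy, self-contained walk-through of the cascade. It assumes a simplified linear chain of switches with invented city names; it illustrates the failure pattern rather than modeling the real CCSS7 network:

#include <stdio.h>

/* Toy walk-through of the cascade, under simplifying assumptions: the switches
 * form a simple chain, a recovering switch replays its queued requests as two
 * closely spaced messages, and a neighbor hit by the second message while
 * still processing the first corrupts its records and must reset. The city
 * names and chain topology are illustrative, not the real CCSS7 network. */

#define NUM_SWITCHES 6

int main(void) {
    const char *chain[NUM_SWITCHES] = {
        "New York", "Chicago", "St. Louis", "Denver", "Dallas", "Atlanta",
    };

    /* The New York switch finishes its maintenance reset and starts
     * draining its queue toward its neighbor. */
    printf("%s: back online, replaying queued requests\n", chain[0]);

    for (int i = 1; i < NUM_SWITCHES; i++) {
        /* Step 1: the second message lands within ~10 ms, while the first is
         * still being processed, and overwrites the communication data. */
        printf("%s: second message from %s arrived mid-update -> records corrupted\n",
               chain[i], chain[i - 1]);

        /* Step 2: the switch fails over to its backup, which is hit by the
         * same closely timed messages and resets as well. */
        printf("%s: backup hit by the same message pair -> full reset\n", chain[i]);

        /* Step 3: on recovery, it replays its own backlog downstream, so the
         * fault propagates through the chain. */
        printf("%s: recovered, replaying backlog toward %s\n",
               chain[i], (i + 1 < NUM_SWITCHES) ? chain[i + 1] : "(end of chain)");
    }
    return 0;
}

The point the toy captures is that recovery itself spreads the fault: every switch that resets and comes back online reproduces the exact pattern of closely spaced messages that broke its neighbor.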

Identifying and Resolving the Issue

To stabilize the network, AT&T rolled their software back to the previous working version. Engineers then spent long hours rigorously reading code and running tests to isolate the defect, and after much debugging they pinpointed the issue to a specific block of code:

if (ring_write_buffer is empty) {
    send message to status map indicating sending switch is active;
} else {
    break;  // This statement caused the issue by exiting the enclosing
            // switch statement without processing the incoming message.
}

This single break statement led to the cascading network failure and the significant financial losses.
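Why did a break inside an else clause skip message handling entirely? In C, break does not terminate an if/else; it terminates the nearest enclosing switch or loop. The fragment below is a simplified, self-contained reconstruction of that control flow; the identifiers and surrounding structure are assumptions made for illustration, not AT&T's actual 4ESS source:

#include <stdio.h>
#include <stdbool.h>

enum msg_type { INCOMING_MESSAGE, OTHER_MESSAGE };

static bool sending_switch_out_of_service = true;
static bool ring_write_buffer_empty       = false;  /* false triggers the buggy path */

static void mark_switch_in_service(void)     { puts("status map: peer marked active"); }
static void process_incoming_message(void)   { puts("incoming message processed"); }
static void do_optional_parameter_work(void) { puts("optional parameter work (can clobber unprocessed data)"); }

int main(void) {
    enum msg_type message = INCOMING_MESSAGE;

    switch (message) {
    case INCOMING_MESSAGE:
        if (sending_switch_out_of_service) {
            if (ring_write_buffer_empty) {
                mark_switch_in_service();
            } else {
                /* Intended to skip only the status update, but a break here
                 * exits the enclosing switch, not the if/else. */
                break;
            }
        }
        process_incoming_message();  /* silently skipped whenever the break fires */
        break;
    default:
        break;
    }

    do_optional_parameter_work();  /* now runs on data that was never processed */
    return 0;
}

With ring_write_buffer_empty set to false, the break fires, process_incoming_message() is skipped, and the optional-parameter work that follows operates on data that was never handled, which is exactly the overwrite described in the cascade above.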

Key Takeaways

  1. Continuous Testing: Just because your software has been released and is operational doesn't mean it's bug-free. The software update responsible for the outage had been released months before the incident and had undergone extensive testing. This highlights the importance of thorough and continuous testing, especially of edge cases, to ensure system reliability.

  2. Understanding and Mitigating Systemic Risk: The cascading impact of a single software defect across the entire network illustrates the interconnectedness and vulnerability of complex systems. To minimize the potential for widespread failures, it is crucial to design systems with built-in fault tolerance, redundancy, and resilience. Additionally, an effective recovery plan can significantly mitigate the impact of unexpected incidents and restore operations swiftly.