How GitHub's Database Self-Destructed in 43 Seconds

Art By Midjourney

On October 21st, 2018, GitHub, the world's leading software development platform, experienced a significant operational mishap. During routine maintenance on their East Coast network, connectivity to a network hub was severed, creating a partition between the East Coast data center and everything else. For those moments, GitHub was cut off from roughly half of its servers. Although the maintenance crew restored the connection within 43 seconds, GitHub went on to suffer roughly 24 hours of degraded service: users couldn't log in, stale data was served, and much of the site was effectively read-only.

The Architecture of GitHub's Database

GitHub stores all of its metadata, such as issues, pull requests, comments, notifications, and actions, in MySQL. They use a standard primary-replica setup: writes go to the primary, which replicates to one or more read replicas. This improves read throughput and availability, since reads can be distributed across all of the replicas, and if the primary fails, one of its replicas can be promoted to take its place.
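As a rough illustration of that split, a read/write-routing layer might look something like the sketch below. The hostnames and the mysql-connector-python client are assumptions made for the example, not GitHub's actual stack.

```python
import random
import mysql.connector  # assumed client library, not necessarily what GitHub uses

# Hypothetical hosts -- illustrative only, not GitHub's real topology.
PRIMARY = "mysql-primary.east.example"
REPLICAS = ["mysql-replica-1.east.example", "mysql-replica-2.west.example"]

def connect(host):
    return mysql.connector.connect(host=host, user="app", password="secret", database="github")

def run_read(query, params=()):
    # Reads can be served by any replica, spreading load across the cluster.
    conn = connect(random.choice(REPLICAS))
    try:
        cur = conn.cursor()
        cur.execute(query, params)
        return cur.fetchall()
    finally:
        conn.close()

def run_write(query, params=()):
    # Every write must go to the single primary for the cluster.
    conn = connect(PRIMARY)
    try:
        cur = conn.cursor()
        cur.execute(query, params)
        conn.commit()
    finally:
        conn.close()
```

If the primary becomes unreachable, promotion essentially amounts to re-pointing PRIMARY at one of the replicas, which is the step Orchestrator automates later in the story.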

GitHub has many services and features, and to maintain separation of concerns, it runs many database clusters, each with its own designated primary. All primaries were in the East Coast data center, with replicas spread across both coasts. Certain replicas on the West Coast functioned as intermediate primaries; in other words, a replica can have replicas of its own, which can in turn have more replicas. This may seem overcomplicated, but it has a few advantages. In terms of cost, a single replication layer means a lot of cross-data-center traffic, which is very expensive; with an intermediate primary, the cross-data-center hop is made only once. Intermediate primaries also reduce the load on the primary, since it only replicates to its direct descendants.
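As a sketch, the chained setup can be pictured as a small replication tree. The hostnames below are made up for illustration and are not GitHub's real clusters.

```python
# Hypothetical replication tree -- each key replicates to the hosts in its list.
# The single cross-country hop is the edge from the East Coast primary to the
# West Coast intermediate primary, which then fans out to local replicas.
replication_tree = {
    "mysql-primary.east": [
        "mysql-replica-1.east",
        "mysql-replica-2.east",
        "mysql-intermediate.west",   # intermediate primary
    ],
    "mysql-intermediate.west": [
        "mysql-replica-1.west",
        "mysql-replica-2.west",
    ],
}

def count_cross_coast_edges(tree):
    # Only one edge crosses the coasts, no matter how many West Coast
    # replicas hang off the intermediate primary.
    return sum(
        1
        for source, targets in tree.items()
        for target in targets
        if source.split(".")[-1] != target.split(".")[-1]
    )

print(count_cross_coast_edges(replication_tree))  # 1
```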

The Role of Orchestrator

GitHub has a homegrown tool to manage all of this called Orchestrator, which comes with a UI to help visualize and manage database topologies. Most importantly, it supports automatic primary failover: if Orchestrator detects that neither it nor the replicas can reach the primary, it will promote a new primary, typically within 30 seconds. However, if Orchestrator is not configured carefully, it can create topologies that the application does not support.
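Conceptually, that failure detection is "holistic": the primary is only declared dead when Orchestrator cannot reach it and the replicas that should be streaming from it agree it is gone. The snippet below is a simplified sketch of that idea, not Orchestrator's actual implementation.

```python
# Simplified sketch of holistic failure detection -- not Orchestrator's real code.

def should_fail_over(orchestrator_can_reach_primary: bool,
                     replicas_report_broken_replication: list[bool]) -> bool:
    """Declare the primary dead only when two independent vantage points agree."""
    if not replicas_report_broken_replication:
        return False  # no replicas to corroborate the failure
    # 1. Orchestrator itself cannot connect to the primary, and
    # 2. every replica reports that replication from it has broken,
    #    ruling out a problem local to Orchestrator's own network.
    return (not orchestrator_can_reach_primary
            and all(replicas_report_broken_replication))

# During the 2018 partition, both conditions held when the East Coast primaries
# were observed from the West Coast, so a failover was triggered.
print(should_fail_over(False, [True, True, True]))  # True -> promote a new primary
```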

The Incident and Its Aftermath

During the incident, replication lag meant that some recent writes existed in the East Coast clusters but had not yet reached the West Coast, and in-flight writes timed out and failed for a few seconds. Orchestrator nodes on the West Coast and in the US East cloud established a quorum and failed over, since the East Coast primaries were unreachable to both Orchestrator and the replicas. Replicas in the US West data center were promoted to new primaries, and write traffic was restored. Once the connection came back, the East Coast and West Coast databases each contained writes not present in the other, leaving them inconsistent.
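To see why this is hard to undo, here is a toy model of the two diverged write sets. The IDs are arbitrary; a real MySQL deployment would compare GTID sets or binlog positions instead.

```python
# Toy model of the split brain: each side kept accepting writes, identified
# here by arbitrary IDs (a real deployment would compare GTID sets).
east_writes = {"w1", "w2", "w3", "w4"}        # applied in US East, never replicated west
west_writes = {"w1", "w2", "w5", "w6", "w7"}  # applied in US West after the promotion

only_in_east = east_writes - west_writes  # {'w3', 'w4'}
only_in_west = west_writes - east_writes  # {'w5', 'w6', 'w7'}

# Neither side is a superset of the other, so pointing traffic back at US East
# would silently drop the West Coast writes, while keeping US West as primary
# strands the East Coast writes until they are reconciled.
print(sorted(only_in_east), sorted(only_in_west))
```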

With the primaries now on the West Coast, GitHub's engineers were facing cross-country write latency that made the website unusable for many users. They were stuck between a rock and a hard place: reversing the failover would mean taking the West Coast primaries out of service and discarding the writes that existed only there, causing data loss for many users. Some engineers believed failing back was the correct option, but others disagreed, arguing that data integrity was paramount.

The Recovery Process

Eventually, GitHub prioritized data integrity over speed of recovery and site usability, turning off certain features to prevent further database impact while drafting a plan. The plan was to fail forward: keep US West as the primary until the East Coast clusters were fully synchronized, then fail back. The East Coast clusters required a restore from backup, and the engineers preserved the unreplicated East Coast writes to reconcile later.
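The gating condition for failing back is simply that the restored East Coast clusters have caught up. Below is a sketch of that gate; get_replication_lag_seconds is a hypothetical helper standing in for however lag is actually measured (in MySQL at the time, Seconds_Behind_Master from SHOW SLAVE STATUS, or a heartbeat table).

```python
import time

# Sketch of the "fail forward, then catch up" gate -- hypothetical helpers,
# not GitHub's actual tooling.

def get_replication_lag_seconds(cluster: str) -> float:
    """Placeholder: report how far the cluster's restored East Coast replicas
    lag behind the US West primary (e.g. via a heartbeat table)."""
    raise NotImplementedError

def ready_to_fail_back(clusters: list[str], max_lag_seconds: float = 1.0) -> bool:
    # Only fail back once every restored cluster is essentially caught up.
    return all(get_replication_lag_seconds(c) <= max_lag_seconds for c in clusters)

def wait_until_caught_up(clusters: list[str], poll_interval_seconds: float = 30.0) -> None:
    # US West stays primary for the entire wait; the East Coast clusters are
    # restored from backup and then replicate forward until the lag reaches ~0.
    while not ready_to_fail_back(clusters):
        time.sleep(poll_interval_seconds)
```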

The recovery process took longer than expected: the estimated restore time turned out to be a miscalculation, and the outage also left a backlog of webhooks and Pages builds that took time to work through. Despite every effort to avoid situations like this, things sometimes don't go as planned. In the end, GitHub suffered roughly 24 hours of degraded service from a maintenance mishap before restoring normal operations and returning the site status to green.

Lessons Learned and Future Initiatives

The incident led GitHub to rethink Orchestrator's capabilities and to establish crisper, more transparent status reporting. The long-term goal is for the application to tolerate the complete failure of a single data center without user-visible impact.

GitHub's architecture is more mature than that of many of its competitors, but with great scale comes great complexity. The incident served as a stark reminder of the challenges of managing large-scale, distributed systems, and it highlighted the importance of robust disaster recovery plans and of designing systems to handle unexpected failures gracefully.

In conclusion, the incident at GitHub was a significant event in software development and database management. It was a valuable lesson for other organizations about the importance of robust, resilient systems and the potential pitfalls of managing large-scale, distributed databases. It also underscored the importance of prioritizing data integrity over the speed of recovery, even in the face of significant service degradation.

Sources