On a seemingly ordinary day at GitLab HQ, a disaster was about to unfold. GitLab, an open-source code sharing platform, operates fully remotely, and on this day, one of their team members was about to sign off for the day, unaware of the catastrophe that lay ahead.
The Beginning of the Crisis
At 6 PM UTC, the team member was paged because of high database load. Upon investigation, he noticed a dramatic increase in the total number of snippets and assumed the root cause was spam. This was a fair assumption, as GitLab had experienced similar, though less severe, spam issues over the past week. Over the next few hours, the team blocked suspected spam IPs and deleted spam users.
By 9 PM, they were alerted to an elevated number of locks in the database. Whenever a transaction writes to a record, the database takes a lock on that record, forcing subsequent writes to the same record to wait until the first finishes. This ensures concurrent writes do not interfere with each other. More spam meant more writes, more locks, and more latency. The engineers continued searching for other sources of spam.
The Alarm Bells Ring
Suddenly, a different alarm went off, this time for database replication lag. GitLab had two databases: a primary and a secondary replica. Users would write to the primary database, which would then forward the same writes to the secondary. This process of forwarding identical writes to the secondary database is called replication, and over 4 gigabytes of data on the primary had failed to replicate to the secondary. This was a novel issue without proper documentation, so the team member stayed online to support the team.
The Failed Attempt to Restore
The team decided to use the PostgreSQL command pg_basebackup to create a backup from the live database. The plan was to remove the existing incomplete data on the secondary, run pg_basebackup to copy the primary's current data across, and then restart replication from there. However, the command failed, complaining that the primary database did not have enough replication clients.
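The attempted restore would have looked something like this sketch; the hostname, user, and paths are placeholders, not GitLab's, and the flags are those of a PostgreSQL 9.x-era pg_basebackup:

```shell
# Run on the secondary, as the postgres user (placeholder names throughout).
# 1. Clear out the incomplete data on the standby:
rm -rf /var/lib/postgresql/9.6/main/*
# 2. Stream a fresh base backup from the live primary:
pg_basebackup \
    -h primary.example.com \            # the live primary (placeholder host)
    -U replication_user \               # a role with REPLICATION privilege
    -D /var/lib/postgresql/9.6/main \   # the standby's empty data directory
    -X stream \                         # stream WAL alongside the base backup
    -P                                  # report progress
```

Note that step 1, clearing the standby's data directory, is exactly the kind of command that is only safe on the machine you think you are on.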
In response, the team member SSH-ed onto the primary database and increased this value in the config. Upon attempting to reload Postgres, it complained that there were too many open connections, a symptom of the maximum connection count being set too high. He lowered that value, and this time the settings applied without issue.
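In a stock postgresql.conf, the two knobs in play here would be max_wal_senders (how many replication clients the primary will serve) and max_connections; the values below are illustrative, not GitLab's actual configuration:

```
# postgresql.conf -- illustrative values, not GitLab's actual settings
max_wal_senders = 10     # replication clients the primary will serve;
                         # too low and pg_basebackup is refused
max_connections = 2000   # lowered here; when set too high, Postgres
                         # refuses to apply the configuration
```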
However, when he ran pg_basebackup again, it appeared to hang and do nothing. Frustration began to kick in. The engineers thought that perhaps the earlier pg_basebackup attempts, made before the configuration changes, had left behind stale files in the data directory that were interfering with the current run. The fix would be to remove these files and try again.
The Catastrophic Mistake
"Well might as well give it a shot," thought the team member, "a hard reset to start on a clean slate, so to say." He prepared the command to
rm -rf the directory and ran it in his shell session. Immediately after pressing enter, he noticed the shell in which he ran the command was the one connected to the live production primary database. He tried to stop the command, but it was too late. Of the over 300 gigabytes of data, only 4.5 was left. The secondary database had previously been wiped of data before running the backup command. GitLab now officially had zero data in any of their production database servers.
The Recovery Process
The team scrambled to find a copy of the production data. They checked for the database backups that were supposed to be uploaded to S3, but found nothing there. They also checked for disk snapshots, only to find that none had ever been taken for the database servers, as the team had assumed the other backup procedures were sufficient. Lastly, they checked for logical volume (LVM) snapshots and file system backups.
GitLab had a staging database for testing which periodically captured snapshots from the primary database to remain up to date with production. These snapshots were normally captured once every 24 hours, but luckily, the team member had taken a snapshot six hours before the outage.
Now there were two choices: they could copy either the entire LVM snapshot or just the Postgres data directory from staging to production. The amount of data was similar in both options, but they opted to copy the data directory as that would be easier to restore from.
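For context, restoring from an LVM snapshot along these lines might look like the following sketch; the volume names, sizes, and paths are all placeholders, not GitLab's:

```shell
# Take a point-in-time LVM snapshot of the volume holding the data directory
# (placeholder names throughout):
lvcreate --snapshot --size 20G --name pg-snap /dev/vg0/pg-data
mkdir -p /mnt/pg-snap
mount -o ro /dev/vg0/pg-snap /mnt/pg-snap   # mount the snapshot read-only
# Copying only the Postgres data directory, as the team chose to do:
rsync -a /mnt/pg-snap/ prod-db:/var/lib/postgresql/
```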
However, their staging environment ran on Azure Classic in a different region, without premium storage, and could not be retroactively upgraded. Data transfer rates were therefore limited to around 60 megabits per second. Copying the data to production took a solid 18 hours, and nearly 24 hours after the outage began, GitLab was back to normal operation.
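The 18-hour figure is roughly consistent with the link speed. A back-of-envelope calculation, assuming roughly 300 GB of data and a sustained 60 Mbps (both approximate), gives about 11 hours of raw transfer time, with the rest going to overhead, verification, and bringing services back up:

```python
data_gb = 300   # approximate size of the Postgres data directory, in GB
rate_mbps = 60  # approximate sustained transfer rate, in megabits per second

bits = data_gb * 1e9 * 8              # total bits to move
seconds = bits / (rate_mbps * 1e6)    # raw transfer time in seconds
hours = seconds / 3600

print(round(hours, 1))  # 11.1 hours of pure transfer time
```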
The only caveat was that all database data such as projects, issues, and snippets created in the six hours between the LVM snapshot and the outage were permanently lost. This affected around 5000 projects, 5000 comments, and 700 users.
Post-incident, it was discovered that the replication lag had actually been caused by a background job trying to delete a GitLab employee's account, which a troll had falsely reported for abuse, in combination with the load from the spam.
The team also discovered that the database backups were never uploaded to S3 because the server taking them was running the wrong version of Postgres, so the backups silently failed. These failures should have triggered warning emails, but nobody ever received them because DMARC authentication had never been enabled on the backup server, so the emails never arrived.
The incident highlighted the importance of having a second pair of eyes to review commands before running them, especially in a production environment. It also emphasized the need for thorough load testing and regular testing of backup procedures.
In the end, the team member was not fired. Many factors led to that rm -rf moment, and many more led to the 18-hour recovery that followed, none of which were any single person's fault. GitLab's CEO personally apologized for the outage, and GitLab never accidentally deleted their production database ever again.
This incident serves as a stark reminder of the importance of robust backup procedures, careful command execution, and thorough testing in maintaining the integrity of production databases.