Imagine waking up to an sms, "we lost a cluster." On that day, with a one-line configuration change, we accidentally removed all of the compute from one of our busiest production clusters, causing a multi-hour outage. This presentation will cover the incident from the days leading up to it, to our full recovery, our customers' response to it, and how we implemented changes based on our learnings. It will go into detail about the configuration of our CI/CD pipeline, details about the specific change that caused the outage. Thankfully, we had a disaster recovery plan in place. We will discuss which parts of our disaster recovery plan worked, and critically, the few parts that didn't work. The session will cover a combination of technical and management content.
Click here to view captioning/translation in the MeetingPlay platform!