Last week wasn’t a good week for the stability of high-profile systems. United Airlines had a ground stop due to an “automation failure”, the New York Stock Exchange suffered a problem that halted trading, and the Wall Street Journal had issues with site availability. United is blaming last week’s outage on a bad router, the NYSE is attributing its interruption to a technical “glitch”, and there’s no word yet on what caused the Wall Street Journal to stop working.
My guess is that several of these sites suffered a release-related outage. This is the most common source of problems in high-profile, high-availability systems. Teams coordinate to push a set of changes to a production system, and due to various factors production systems tend to exhibit different bugs than staging and QA environments do. The worst defects tend to be identified in production, and in a complex system such as an airline or a stock exchange it’s difficult to recreate production conditions in non-production environments.
If you work for a large company, site outages are nothing new. In fact, most websites suffer several mini-outages every week that never rise to the level of headline news. A large e-commerce website might stop accepting orders for a few seconds during a major release, or a site like Netflix might introduce several minutes of downtime during a major release. These mini-outages are common, and while many companies strive to bring release-related outages to zero, some production changes still require brief downtime. My own banking provider is constantly warning me that Sundays might involve several hours of downtime.
I’m sure one of these outages was due to a code or configuration change related to a software release. A playbook was created to track the change, a team was assembled on a conference bridge to monitor it, and a team of specialists was given the green light to flip the necessary switches and update the software. Somewhere in the operations centers of both United and the NYSE a graph started to flat-line and alarms started to go off. Executives and managers were woken up as the emergency grew more urgent, and instead of completing a playbook an entire IT department was pulled into an ongoing emergency to get the trading network or the airline operational again. This is how releases turn into emergency production support calls, and when this happens without a rollback plan, that’s when headlines are generated.
Someone on some release team had to call into a conference line and inform operations that, no, the planes cannot fly and the stock exchange can’t open because someone failed to account for every possible contingency. Teams are under pressure, managers are under even more pressure, and every time you read one of these headlines you know there’s a new group of IT managers refreshing their resumes. When a business is on the line, it’s the release managers who are held accountable for these failures to avert disaster.
Software releases introduce risk, and this is simply a fact of the industry. The only way to avoid that risk completely is to stop delivering software, and that’s not something businesses can afford to do. Today’s businesses face a difficult challenge: we are being asked to move faster, to support DevOps, and to get out of the way of developers who need to move quickly. At the same time, our world is increasingly dependent on software, and when that software fails (even for a few minutes) it can generate headline news.
For this reason it is important for release managers everywhere to understand that they are on the front lines of uptime and availability. We aren’t simply making playbooks and deployment plans to deliver software on time and under budget; it is our job to think through contingency plans in the face of failure and to provide the business with a fallback plan should a release cause an unintended side effect.
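To make that idea concrete, here is a minimal, hypothetical sketch of what building a fallback into a release step might look like. None of these names come from Plutora, United, or the NYSE; ReleaseStep, deploy, health_check, and rollback are illustrative placeholders for whatever mechanisms a real playbook would invoke. The point is simply that the rollback path is decided before the change is made, not improvised on a conference bridge at 2 a.m.

```python
# Hypothetical sketch: every release step carries its own rollback path,
# agreed on before the change is pushed. All names here are illustrative.

class ReleaseStep:
    def __init__(self, name, deploy, health_check, rollback):
        self.name = name
        self.deploy = deploy              # callable that applies the change
        self.health_check = health_check  # callable returning True when the system is healthy
        self.rollback = rollback          # callable that restores the previous state

    def execute(self):
        """Apply the change, verify health, and fall back automatically on failure."""
        try:
            self.deploy()
            if not self.health_check():
                raise RuntimeError(f"{self.name}: post-deployment health check failed")
            return True
        except Exception as error:
            # The contingency plan runs instead of an all-hands emergency call.
            print(f"Rolling back {self.name}: {error}")
            self.rollback()
            return False


# Example usage with trivial stand-in callables:
step = ReleaseStep(
    name="update routing configuration",
    deploy=lambda: print("deploying new configuration"),
    health_check=lambda: False,  # simulate a failed post-deployment check
    rollback=lambda: print("restoring previous configuration"),
)
step.execute()
```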
When businesses like United and the NYSE suffer outages like those that happened last week, it’s very likely that a software delivery and/or a configuration change contributed to the failures. After all of the analysis is done and the root causes have been identified, I have no doubt that at the center of the problem lies a series of Excel spreadsheets capturing an inaccurate plan, or a series of unopened emails from a release manager sitting in someone’s Outlook inbox.
That both the United and NYSE failures required multiple hours (and in the case of the NYSE more than a day) to remedy suggests that these organizations may lack a unified plan for reacting to unknown failure conditions. At Plutora we work with companies to facilitate release plans, but we also help them use our tools to record playbooks for disaster recovery and emergency response. There’s very little difference between the steps required to orchestrate the delivery of software to production and the steps required to mitigate an ongoing failure.
The reality of today’s enterprise releases is that they are largely managed by custom spreadsheets and lengthy email chains just begging to be ignored. If we’re going to evolve as an industry, and if release managers are going to be acknowledged for the value they bring, it is time for companies to adopt tools like Plutora so they can manage complex releases that span multiple departments with a tool that also provides a plan for dealing with unknown failures.
Image CC-BY 2.0 from https://www.flickr.com/photos/caribb