This is a general outline of an incident response playbook to handle and recover from a production outage. This document won’t be perfectly applicable to your own production system, but can be used as a starting point to write your own incident response playbook.
We will start with an outline of what we consider the production system for the purposes of this document. These assumptions are intentionally broad so that this document is valuable to a larger audience.
https://some-website.com: the user-facing single-page app, with its static assets served from a CDN
https://api.some-website.com: the backend API, sitting behind a load balancer and backed by a database
https://third-party-api.com: a third-party API the backend depends on
We are also going to assume there is an actual and obvious outage. Users are complaining about the site being down, and a quick check by several employees around the world has confirmed the website is down. A whole other blog post could be written on the topic of simply verifying an outage is real.
Create a war room in the form of a video conference call and invite people into it. Then call their phone numbers. When they pick up, do not attempt to explain anything; just tell them, “I need your help. Production is down, please get into the war room I invited you into.”
Who should be invited and how many people depends on the size and structure of your organization, but at a minimum invite:
Make it clear who is leading this incident response. All information should be funneled to this person to allow them to make decisions. Hopefully it is obvious who the best leader is; if two people are equally qualified, just pick one at random, it doesn’t matter. If nobody can lead, then go back to step 1 and call for more help.
There are two forms of bleeding: your users suffering and the production system falling over. Both can be addressed by putting up a maintenance page. If possible, inform users with a matter-of-fact message. Do not attempt to deflect or assign blame, as that may backfire later. Just state the facts, “There is a production issue and we are looking into it.”
The maintenance page also stems the flow of traffic to the production system. This will either resolve the issue (the production system was being overloaded with traffic) or reduce the noise (a system still in a failed state stands out when there is no traffic).
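As a rough sketch of what “putting up a maintenance page” can look like at the application layer, assuming a Node.js/Express backend (an assumption for illustration, not something this playbook prescribes), a middleware can short-circuit every request with a 503 and the matter-of-fact message. In practice you may prefer to serve the page from the CDN or load balancer so it still works when the backend itself is down.

```typescript
import express, { Request, Response, NextFunction } from "express";

const app = express();

// Hypothetical kill switch: set MAINTENANCE_MODE=true and restart/redeploy
// to put the whole site behind a maintenance response.
const maintenanceMode = process.env.MAINTENANCE_MODE === "true";

app.use((req: Request, res: Response, next: NextFunction) => {
  if (!maintenanceMode) {
    return next();
  }
  // Matter-of-fact message only: state the facts, no blame.
  res
    .status(503)
    .set("Retry-After", "600") // hint to clients to retry in ~10 minutes
    .send("There is a production issue and we are looking into it.");
});

// ...normal routes continue below...
app.get("/", (_req: Request, res: Response) => {
  res.send("ok");
});

app.listen(3000);
```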
The leader should assign somebody to put up the maintenance page. In general the leader should avoid doing anything that can be assigned to somebody else on the call. The leader should be focused on gathering information and disseminating decisions.
If the outage is resolved simply by putting up a maintenance page, skip down to step 6.
Using the architecture diagram of the production system, we need to figure out which part is failing. We can follow the path of the network traffic to see where it runs into the issue. This can be done top-down or bottom-up, and both directions can be run simultaneously if there are enough people in the call.
Follow the issue starting from the user’s perspective. Open the browser’s dev tools network panel and look at which requests are failing. From here we can already determine if there is an issue loading the SPA assets from the CDN or if there is an error with the backend API.
Let’s say the backend API is returning HTTP status code 500; you now have more information. You may need to update the maintenance page and/or get backend developers into the war room. With this new information, we drill down to the next level: the load balancer.
Open the load balancer console. Has the load balancer failed? What does the load balancer see as the health of all the application servers? Let’s say the load balancer is reporting all application servers as unhealthy; you now have more information. We can rinse and repeat, following the network traffic from the user-facing entry points all the way down to the underlying infrastructure.
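To make the top-down walk concrete, here is a minimal sketch that probes each externally visible hop of the sample system and prints the status it sees. It assumes Node 18+ for the global fetch, and the /health path on the API is a placeholder for whatever health endpoint your backend actually exposes.

```typescript
// Probe each hop of the sample production system from the outside in.
// Assumes Node 18+ (global fetch); the /health path is a placeholder.
const hops = [
  { name: "CDN / SPA assets", url: "https://some-website.com" },
  { name: "Backend API", url: "https://api.some-website.com/health" },
  { name: "Third-party API", url: "https://third-party-api.com" },
];

async function probe(): Promise<void> {
  for (const hop of hops) {
    const started = Date.now();
    try {
      const res = await fetch(hop.url, { redirect: "follow" });
      console.log(`${hop.name}: HTTP ${res.status} in ${Date.now() - started}ms`);
    } catch (err) {
      // A network-level failure (DNS, TLS, connection refused) is itself a clue.
      console.log(`${hop.name}: request failed (${(err as Error).message})`);
    }
  }
}

probe();
```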
We can also tackle this from the other direction in parallel.
Follow the issue starting from the infrastructure perspective. From our sample production system, we can either pick the CDN, database, or third-party API. From experience you’ll be able to gravitate towards the problematic ones. Let’s say we start with the database.
Look at the status of the database replica set. Is there a proper primary? Can the primary communicate to all the secondaries? Do the secondaries agree who is primary? Look at the database metrics. Are any of the database nodes failing? Is the CPU load abnormally high on any of the nodes? Can you connect and query all the nodes? Did the database run out of disk space?
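As an illustration only, assuming the database happens to be a MongoDB replica set (the playbook does not require any particular database), a read-only status check could look like the sketch below. The connection string comes from a placeholder environment variable and nothing here mutates state.

```typescript
import { MongoClient } from "mongodb";

// Read-only replica set health check. MONGO_URL is a placeholder, and MongoDB
// is assumed purely for illustration.
async function checkReplicaSet(): Promise<void> {
  const client = new MongoClient(process.env.MONGO_URL ?? "mongodb://localhost:27017");
  try {
    await client.connect();
    const status = await client.db("admin").command({ replSetGetStatus: 1 });

    for (const member of status.members) {
      // stateStr is PRIMARY, SECONDARY, RECOVERING, and so on; health is 0 or 1.
      console.log(`${member.name}: ${member.stateStr}, health=${member.health}`);
    }

    const primaries = status.members.filter(
      (m: { stateStr: string }) => m.stateStr === "PRIMARY"
    );
    if (primaries.length !== 1) {
      console.log(`Suspect: expected exactly one PRIMARY, found ${primaries.length}`);
    }
  } finally {
    await client.close();
  }
}

checkReplicaSet().catch((err) => console.error("Could not reach the replica set:", err));
```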
If you find anything suspect, immediately inform the leader. But do not take any corrective action. A response needs to be coordinated by the leader or else you may make the situation worse.
If there is nothing wrong with the database, look at other low-level infrastructure. If everything seems to be functioning correctly, move one level up to the application servers. Coordinate with the leader, especially if the top-down investigation is being done in parallel; it may mean you should switch to a more likely cause. For example, if the top-down investigation determines the issue is with loading assets from the CDN, then there is no point spending time investigating the database.
In general, bottom-up is a hail mary and could be a waste of time, but somebody who can intuitively jump straight to the issue can save a lot of time.
Once you have determined which part of the production system is causing the outage, the leader should shift to deciding on corrective action. Avoid the temptation to investigate the true root cause; that comes later, in the incident post-mortem. Our immediate goal is to return production to service, and knowing which part is failing is usually enough to pick a corrective action.
Can we isolate the failure? Does this part of the system have a maintenance mode? Can we use feature flags to turn off that part? Can we change the data such that this feature goes unused? Can we open circuit breakers? Was there a change made to this part of the system? Is it safe to roll back that change?
These actions may not fix the entire issue, but if we can go from a full outage to a partial outage with one broken feature, then that is a decision the leader can make.
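As a sketch of what “turn off that part” can mean in code, here is a hypothetical feature-flag check wrapped around a single route. The flag store, the DISABLED_FEATURES variable, and the tournament-brackets route are all made up for illustration; substitute whatever flag system you already have.

```typescript
import express, { Request, Response } from "express";

// Hypothetical flag store: in a real system this might be LaunchDarkly, a
// config service, or a database row. Here it is just an environment variable,
// e.g. DISABLED_FEATURES=tournament-brackets,chat
const disabledFeatures = new Set(
  (process.env.DISABLED_FEATURES ?? "").split(",").filter(Boolean)
);

function isFeatureEnabled(name: string): boolean {
  return !disabledFeatures.has(name);
}

const app = express();

// The failing feature is isolated behind its flag while the rest of the site
// keeps serving traffic. The route name is made up for illustration.
app.get("/api/tournament-brackets", (req: Request, res: Response) => {
  if (!isFeatureEnabled("tournament-brackets")) {
    // Feature-level, matter-of-fact maintenance message.
    res.status(503).json({ error: "This feature is temporarily unavailable." });
    return;
  }
  // ...normal handler...
  res.json({ brackets: [] });
});

app.listen(3000);
```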
If there is no existing way to isolate the feature, one can be hacked in. The product manager and software developer would need to work closely with the leader if this emergency code change is needed. Hacks can be as dirty as they need to be; just ensure they are safe and do not make the problem worse. Isolate user flows, put up internal maintenance messages for the particular feature, or even simply disappoint the user by redirecting them back to the homepage. The severity of the hack needs to be balanced against the severity of the outage. The leader can decide if disappointing a few users with a bad experience is worth bringing the rest of the site back up.
If there are no better options, there are a few hail mary corrective actions that may fix the issue. Have you tried turning it off and on again? Simply restarting a server may correct the issue, as it generally puts the server back into a known good state. But don’t be surprised if it immediately goes back into a failed state.
If the corrective action is taking a long time, add the appropriate update to the maintenance page to keep your users informed. Also consider escalating the issue by pulling more people into the war room, especially if they can help implement fixes or run investigations in parallel. In the extreme, if the outage has been ongoing for several hours, consider switching people out. People need to eat and sleep.
Restoration to full service does not need to happen immediately if the rest of the site is running. If possible defer restoration to normal working hours and end the war room.
Restoration of the failed system needs to be done carefully. Some systems need to be warmed up and will fail if given full traffic suddenly; often this is caused by cold caches. Consider other dependent systems: the failure of a system may invalidate the assumptions of other systems that depend on it. For example, if those systems assumed they only needed to connect once at server startup, they may never see the restoration of the failed system. You may still experience knock-on issues.
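The connect-once-at-startup assumption mentioned above is a common source of these knock-on issues. A sketch of the more forgiving pattern, retrying the connection with capped exponential backoff, is shown below; the connect parameter stands in for whatever client setup your services actually use.

```typescript
// Retry a connection with capped exponential backoff instead of assuming a
// single attempt at startup will always succeed. `connect` is a placeholder
// for your real client setup (database driver, API client, etc.).
async function connectWithRetry(
  connect: () => Promise<void>,
  maxAttempts = 10
): Promise<void> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await connect();
      console.log(`Connected on attempt ${attempt}`);
      return;
    } catch (err) {
      if (attempt === maxAttempts) {
        throw new Error(
          `Could not connect after ${maxAttempts} attempts: ${(err as Error).message}`
        );
      }
      const delayMs = Math.min(30_000, 500 * 2 ** attempt); // cap backoff at 30s
      console.warn(`Attempt ${attempt} failed, retrying in ${delayMs}ms`);
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
}
```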
After everybody involved has rested, but before they start anything else, make sure they submit all the information they gathered during the incident. This will assist in writing the incident post-mortem. The post-mortem is the critical first step to ensure we come out of this incident better off than we started.
We need to understand the root cause of the outage and take action to prevent it from ever happening again. A more formal root cause analysis may be required. Outages due to lack of funding need to be escalated: outages cost the business money, and if the business can pay to prevent them and come out ahead, it is a no-brainer.
Does maintaining production uptime excite you? You’re in luck, Battlefy is hiring.