I recently joined my team in troubleshooting a complex infrastructure problem affecting the private cloud that hosts our community EHR.
From years of experience doing this, here are my lessons learned.
1. Once the problem is identified, the first step is to ascertain the scope. Call the users to determine what they are experiencing. Test the application or infrastructure yourself. Do not trust the monitoring tools if they indicate all is well but the users are complaining.
2. If the scope of the outage is large and the root cause is unknown, raise alarm bells early. It's far better to launch an early all-hands intervention and accept occasional false alarms than to intervene too late and have an extended outage because of a slow response.
3. Bring visibility to the process by having hourly updates, frequent bridge calls, and multiple eyes on the problem. Sometimes technical people become so focused that they do not have a sense of the time passing or insight into what they do not know. A multi-disciplinary approach with pre-determined progress reports prevents working in isolation and the pursuit of solutions that are unlikely to succeed.
4. Although frequent progress reports are important, you must allow the technical people to do their work. Senior management feels a great deal of pressure to resolve the situation. However, if 90% of the incident response effort is spent informing senior management and managing hovering stakeholders, then the heads-down work to resolve the problem cannot get done.
5. Remember Occam's Razor: the simplest explanation is usually the correct one. In our recent incident, all the evidence pointed to a malfunctioning firewall component. However, all vendor testing and diagnostics indicated the firewall was functioning perfectly. Some hypothesized we had a very specific denial-of-service attack. Others suggested a failure of Windows networking components within the operating systems of the servers. Others thought we had an unusual virus attack. We removed the firewall from the network and everything came back up instantly. It's generally true that complex problems can be explained by a single simple failure.
6. It's very important to set deadlines in the response plan to avoid the "just one more hour and we'll solve it" problem. This is especially true if the outage is the result of a planned infrastructure change. Set a backout deadline and stick to it. Just as when I climb/hike, I set a point to turn around. Summiting is optional, but returning to the car is mandatory. Setting milestones for changes in course and sticking to your plan regardless of emotion is key.
7. Overcommunicate with the users. Most stakeholders are willing to tolerate downtime if you explain the actions being taken to restore service. Senior management needs to show commitment, presence, and leadership during the incident.
8. Do not let pride get in the way. It's hard to admit mistakes and challenging to acknowledge what you do not know. There should be no blame or finger-pointing during an outage resolution. After-action debriefs can examine the root cause and suggest process changes to prevent outages in the future. Focus on getting the users back up rather than protecting your ego.
9. Do not declare victory prematurely. It's tempting to assume the problem has been fixed and tell the users all is well. I recommend at least 24 hours of uninterrupted service under full user load before declaring victory.
10. Overall, IT leaders should focus on their trajectory, not their day-to-day position. Outages can bring many emotions - fear for your job, anxiety about your reputation, sadness for the impact on the user community. Realize that time heals all and that individual outage incidents will be forgotten. By taking a long view of continuous quality improvement and evolution of functionality rather than being paralyzed by short-term outage incidents, you will succeed over time.
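The two time-based rules above (the backout deadline in lesson 6 and the 24-hour waiting period in lesson 9) are easy to state and hard to follow under pressure, so it can help to write them down as explicit checks before the change begins. The following is only a minimal sketch in Python: the function names, the four-hour change window, and the timestamps are all hypothetical and are not taken from our actual response plan.

```python
from datetime import datetime, timedelta


def must_back_out(now: datetime, backout_deadline: datetime, resolved: bool) -> bool:
    """Lesson 6: once the agreed deadline passes without a fix, back out."""
    return (not resolved) and now >= backout_deadline


def can_declare_victory(
    now: datetime,
    last_disruption: datetime,
    required_stable: timedelta = timedelta(hours=24),
) -> bool:
    """Lesson 9: require a full stable period since the last disruption."""
    return now - last_disruption >= required_stable


if __name__ == "__main__":
    # Hypothetical planned change starting at 08:00 with a 4-hour backout window.
    change_start = datetime(2024, 1, 1, 8, 0)
    deadline = change_start + timedelta(hours=4)

    # Five hours in and still unresolved: the plan says turn around.
    print(must_back_out(change_start + timedelta(hours=5), deadline, resolved=False))

    # Service restored at 14:00; check again the next morning at 09:00.
    restored_at = datetime(2024, 1, 1, 14, 0)
    print(can_declare_victory(datetime(2024, 1, 2, 9, 0), restored_at))
```

The point of the sketch is that both decisions depend only on the clock and on facts agreed in advance, not on how close the team feels to a solution.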
Outages are painful, but they can bring people together. They can build trust, foster communication, and improve processes by testing downtime plans in a real-world scenario. The result of our recent incident was a better plan for the future, improved infrastructure, and a universal understanding of the network design among the entire team - an excellent long-term outcome. I apologized to all the users for a very complex firewall failure, and we've moved on to the next challenge: regaining the trust of our stakeholders and enhancing clinical care with secure, reliable, and robust infrastructure.