Banking outages are not fun. Customers without banking services are not happy. The impact of a major systems outage is both direct (deals not booked, services not provided) and indirect (reputational damage). I probably don’t need to spend much time convincing you that banking outages are unpleasant, so let me focus on how to recover as quickly as possible and how to avoid outages in the first place.
During an outage, the focus must be on recovery and restoring service. Deeper diagnosis and introspection can wait until service is restored.
Rapid recovery from major outages requires clarity and structure:
- There needs to be a map of the ground: a detailed view of how the systems connect and where they reside.
- A small war room with the most knowledgeable people in the organization needs to be assembled. Emphasis on small and knowledgeable.
- It is important that subject matter experts are “on deck” and ready to answer questions should they be needed in the war room.
If it is hard to imagine who would be in that room, or if these maps of the ground don’t exist, recovery will be delayed and the risk of overconfident guesswork during the outage will increase.
- A clear and structured recovery plan is needed. This plan should be roughly divided into four stages: analysis, identification, remediation, and validation.
  - Analysis should begin with a list of the potential causes of the outage, prioritized by probability, along with an assessment of the remediation approach and risk for each.
  - Identification is the process of working through that list to confirm or rule out each hypothesis until the cause is proven.
  - Once the cause is identified, remediation focuses on restoring service. It is important that all changes be tracked at a detailed level so that they can be rolled back if needed.
  - Validation includes understanding the root cause. If the remediation was a short-term fix or workaround, validation also includes planning the long-term solution.
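To make the analysis and remediation stages concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `Hypothesis` and `ChangeLog` names, the probability figures, and the example configuration setting are hypothetical and not taken from any real incident tooling. It shows a cause list ranked by probability and a change log that records an undo step for every change so that the changes can be rolled back in reverse order.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class Hypothesis:
    """One candidate cause of the outage (all fields are illustrative)."""
    cause: str
    probability: float      # rough estimate between 0 and 1
    remediation: str        # how the cause would be fixed
    remediation_risk: str   # e.g. "low", "medium", "high"


def prioritize(hypotheses: List[Hypothesis]) -> List[Hypothesis]:
    """Analysis stage: order candidate causes, most probable first."""
    return sorted(hypotheses, key=lambda h: h.probability, reverse=True)


@dataclass
class ChangeLog:
    """Remediation stage: record every change together with its undo step."""
    entries: List[Tuple[str, Callable[[], None]]] = field(default_factory=list)

    def apply(self, description: str,
              do: Callable[[], None], undo: Callable[[], None]) -> None:
        """Make a change and remember how to reverse it."""
        do()
        self.entries.append((description, undo))

    def roll_back(self) -> List[str]:
        """Undo all recorded changes, newest first; return what was undone."""
        undone = []
        while self.entries:
            description, undo = self.entries.pop()
            undo()
            undone.append(description)
        return undone


# Hypothetical example data: two candidate causes and one tracked change.
ranked = prioritize([
    Hypothesis("expired certificate", 0.2, "renew certificate", "low"),
    Hypothesis("exhausted connection pool", 0.7, "raise pool size", "medium"),
])

settings = {"connection_pool_size": 50}
log = ChangeLog()
log.apply(
    "raise connection pool size to 200",
    do=lambda: settings.update(connection_pool_size=200),
    undo=lambda: settings.update(connection_pool_size=50),
)
```

In a real war room the same structure could just as well live in a shared document. The point is only that each hypothesis carries its probability and remediation risk, and each change carries its own way back.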
The best way to handle major outages, however, is to avoid them altogether!
One of the greatest sources of outages is banking systems complexity. Once a system becomes complex, the organization that supports it usually becomes specialized and fragmented as well. It becomes difficult or impossible for one person or a small team to understand all of the moving parts and how they connect. New solutions either avoid existing components (creating sprawl) or integrate with them without fully appreciating the impact.
In my opinion, a great recipe for avoidance is complexity reduction. Methodically reducing banking systems complexity also reduces the risk of outages.
Fewer moving parts, fewer things to break, less breakage. The answer is quite literally simple.
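One way to picture both the "map of the ground" and the effect of having fewer moving parts is a small dependency map. The sketch below is illustrative only: the system names and the `DEPENDS_ON` structure are hypothetical, and a real bank's map would be far larger. It counts the connections between systems and lists everything that is affected, directly or indirectly, when one system fails.

```python
from collections import deque
from typing import Dict, List, Set

# Hypothetical systems map: each system lists the systems it depends on.
DEPENDS_ON: Dict[str, List[str]] = {
    "mobile_app": ["api_gateway"],
    "api_gateway": ["payments", "accounts"],
    "payments": ["core_ledger"],
    "accounts": ["core_ledger"],
    "core_ledger": [],
}


def connection_count(depends_on: Dict[str, List[str]]) -> int:
    """Count the connections between systems (a crude complexity measure)."""
    return sum(len(deps) for deps in depends_on.values())


def impacted_by(failed: str, depends_on: Dict[str, List[str]]) -> Set[str]:
    """Return every system that directly or indirectly depends on `failed`."""
    # Invert the map so we can walk from a system to its dependents.
    dependents: Dict[str, List[str]] = {name: [] for name in depends_on}
    for system, deps in depends_on.items():
        for dep in deps:
            dependents.setdefault(dep, []).append(system)

    # Breadth-first walk outward from the failed system.
    seen: Set[str] = set()
    queue = deque([failed])
    while queue:
        current = queue.popleft()
        for dependent in dependents.get(current, []):
            if dependent not in seen:
                seen.add(dependent)
                queue.append(dependent)
    return seen
```

With this example map, a failure in `core_ledger` reaches every other system, while a failure in `payments` reaches only `api_gateway` and `mobile_app`. Removing a component or a connection shrinks both the connection count and the set of systems a single failure can touch.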