Watching the oil pour into the Gulf was painful on several levels. It was awful to watch the seemingly endless spewing a mile under the ocean’s surface, through a series of frustrating and feeble attempts to stop it. It was unclear who was driving toward a solution or what was actually being done, and resolution seemed to take a very long time. BP’s stock price took a hit during that period, and damage was done both to the environment and to BP’s global reputation.
Anyone who has had the unfortunate experience of being involved in a major systems outage probably found the process especially painful, because it had many of the same characteristics: fingers of blame pointing in all directions, many theories and solutions being posited, an escalation of criticality, and people losing their jobs or being sent to run operations in a distant enclave.
A report released today attributes the disaster to “bad management” and cost cutting, which led to a series of unfortunate mistakes. Had the decisions along this chain of events been made differently, the explosion could have been avoided; a major finding of the report is that the companies involved failed to recognize and appreciate risk.
The report highlights systemic issues that extend beyond the oil industry: they apply to any mission-critical engineering effort. The key question is: should we do this right, or should we cut some corners? Knowing where to bolster the design and where to cut back is seemingly more art than science, but it should factor in two key ingredients: risk and efficiency.
A major systems failure or a major confidential data leak is a comparable event within the banking industry, and the BP spill offers several lessons in that context. Some of the transferable learnings include:
- Understanding the risks of seemingly mundane technical details. Yes, I know it can be boring, but architecture’s role is to lay out the options and alternatives in terms of their implications and to make recommendations based on risk impact and probability.
- Accountability for design. Someone on the team who understands the world on the ground should be accountable for how it is all going to fit together and should help with the difficult discussions around soundness, cost, and design. This person should have a track record, some authority to speak up when things don’t look right, and the wisdom to stop the train only when real emergencies exist.
- A design risk management framework: some decisions have very low impact should they go sideways; others are much more consequential. An adaptive design-review process should filter out the significant decisions and ensure that their implications and risks have been drawn out and weighed.
- Traceability of design decisions, both upstream and downstream: the most significant outages and failures I have seen result from chained rather than single events. Drawing out these relationships and understanding their risks and impacts is a key element of design.
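The adaptive review filter described above can be sketched in a few lines. This is a hypothetical illustration, not anything prescribed in the report: decisions are scored by impact times probability, and only those above a threshold are escalated to formal review. All names, scales, and the threshold value are assumptions for the sake of the example.

```python
from dataclasses import dataclass

@dataclass
class DesignDecision:
    """A design decision scored on simple 1-5 scales (illustrative only)."""
    name: str
    impact: int       # 1 (minor) .. 5 (catastrophic)
    probability: int  # 1 (rare) .. 5 (almost certain)

    @property
    def risk_score(self) -> int:
        # A common lightweight heuristic: risk = impact x probability.
        return self.impact * self.probability

def needs_formal_review(decision: DesignDecision, threshold: int = 9) -> bool:
    """Low-scoring decisions are fast-tracked; high-scoring ones go to review."""
    return decision.risk_score >= threshold

decisions = [
    DesignDecision("log line format", impact=1, probability=2),
    DesignDecision("failover strategy for payments DB", impact=5, probability=3),
]

for d in decisions:
    route = "formal review" if needs_formal_review(d) else "fast track"
    print(f"{d.name}: score {d.risk_score} -> {route}")
```

The point of the filter is efficiency: most decisions pass straight through, so the review board’s attention is concentrated on the handful of choices that could actually sink the system.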
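The traceability point can also be made concrete. One minimal sketch, with entirely made-up system names, is to record which components depend on which and then walk the graph to see how far a single failure chains downstream:

```python
from collections import deque

# Hypothetical dependency map: component -> components that depend on it.
# The names are illustrative, not from the report or any real system.
dependencies = {
    "network switch firmware": ["message bus"],
    "message bus": ["trade capture", "settlement batch"],
    "trade capture": ["regulatory reporting"],
    "settlement batch": [],
    "regulatory reporting": [],
}

def downstream_impact(start: str) -> list[str]:
    """Breadth-first walk of everything affected if `start` fails."""
    seen, queue, order = {start}, deque([start]), []
    while queue:
        node = queue.popleft()
        for dep in dependencies.get(node, []):
            if dep not in seen:
                seen.add(dep)
                order.append(dep)
                queue.append(dep)
    return order

# A single low-level failure chains into several downstream systems.
print(downstream_impact("network switch firmware"))
```

Even a toy graph like this makes the chained-event pattern visible: the mundane component at the bottom of the stack turns out to sit upstream of the systems the regulators care about.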
Essentially, the insights from the BP presidential report highlight that an emphasis on safety and soundness in the design process, and on balancing cost against risk, is essential to any mission-critical engineering effort. This is a very painful lesson for BP, I’m sure, but hopefully other companies and industries can learn from it.