You can’t prepare for every “black swan” event – consider the current supply chain disruptions impacting the holiday season and creating inflationary pressures. Even planned technology upgrades or simple configuration changes can have catastrophic consequences.
SkyWest recently reported in its quarterly earnings that migrating critical systems to a newly built server in October caused a server outage. That outage led to the cancellation of 1,700 flights, disrupted other major airlines and thousands of passengers, and created a potential loss of $15 to $20 million.
By their nature, disasters – especially black swan events brought on by the pandemic – are not easy to predict. But as an IT leader, you can better prepare for them and reduce the business impact by focusing on three key areas: enforcing change management controls, managing risks, and ensuring business continuity governance.
1. Enforce change management controls
Change management controls are the subject of many audit findings at publicly traded companies. It’s easy to treat them as a “check-the-box” exercise that merely appeases internal and external auditors. Yet even one poorly managed, untested, or unauthorized change could have an adverse, material impact.
To mitigate potential internet outages due to configuration changes, every system change should include an appropriate risk-impact assessment, planning, testing, approval, documentation, automation, and a communications strategy. Fully test all changes before deploying them to production, but take care not to impede the pace of innovation: rightsize the risk-impact assessment and testing to your company’s culture, industry, and risk appetite.
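As a rough illustration, the checklist above could be enforced as a simple pre-deployment gate. This is a minimal sketch, not a reference to any particular change management tool. The `ChangeRequest` class, the step names, and the rightsizing rule (low-risk changes skip the formal communications strategy) are all assumptions made for the example.

```python
from dataclasses import dataclass, field

# Checklist steps taken from the paragraph above; the identifiers are hypothetical.
REQUIRED_STEPS = (
    "risk_impact_assessment",
    "planning",
    "testing",
    "approval",
    "documentation",
    "automation",
    "communications",
)


@dataclass
class ChangeRequest:
    title: str
    risk_level: str  # "low", "medium", or "high" (illustrative scale)
    completed_steps: set = field(default_factory=set)


def missing_steps(change):
    """Return the outstanding checklist steps, in checklist order."""
    # Rightsizing assumption: low-risk changes do not need a communications strategy.
    required = [
        step for step in REQUIRED_STEPS
        if not (change.risk_level == "low" and step == "communications")
    ]
    return [step for step in required if step not in change.completed_steps]


def ready_for_production(change):
    """A change may be deployed only when no required step is outstanding."""
    return not missing_steps(change)


if __name__ == "__main__":
    change = ChangeRequest(
        title="Migrate critical systems to new server",
        risk_level="high",
        completed_steps={"risk_impact_assessment", "planning", "testing"},
    )
    print(ready_for_production(change), missing_steps(change))
```

Where the rightsizing line is drawn, and which steps a low-risk change may skip, depends on your own culture, industry, and risk appetite.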
Change management controls should be part of any software development or configuration process – whether you use waterfall, agile, or DevOps. This includes appropriate segregation of duties (SOD) controls, which still apply in “break the glass” emergencies, when developers may need emergency access to a production environment where they do not typically have access rights.
Many cloud providers publish status pages that report platform outages. Ensure that your teams subscribe to these status pages and that all contracts include clauses requiring providers to notify your teams of any major planned upgrades or issues in a timely way.
2. Conduct risk assessments and business impact analysis
Whether it’s a social media outage or an airline booking system made inaccessible by a service outage, risk assessments – both internal ones and external ones involving key technology providers – can help identify risks before they materialize into disasters. A risk assessment is the part of a risk management program that identifies threats and vulnerabilities to the assets used in achieving business objectives.
Have your team determine the likelihood of risk occurrence and the potential business impact if a risk occurs – bearing in mind limits on resources, time, and budget. Business impacts could include financial, reputational/brand, customer, legal/regulatory, and operational impact categories.
Once risks are identified and impacts are evaluated and scored, implement an appropriate risk response. The risk treatment options are to accept the risk, mitigate it with new or existing controls, transfer it to third parties (often through insurance or risk sharing), or avoid it by ceasing the business activity related to it.
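The scoring and response steps can be sketched in a few lines. This is one common approach (likelihood multiplied by impact on a 1 to 5 scale), not a prescribed method. The scales, the thresholds in `suggest_response`, and the sample risks are illustrative assumptions that you would replace with your own risk appetite and register.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # 1 (rare) to 5 (almost certain); illustrative scale
    impact: int      # 1 (negligible) to 5 (severe); illustrative scale

    @property
    def score(self):
        # Simple likelihood x impact scoring, ranging from 1 to 25.
        return self.likelihood * self.impact


def suggest_response(risk, appetite=6):
    """Suggest a starting-point treatment; all thresholds are assumptions."""
    if risk.score <= appetite:
        return "accept"      # within risk appetite
    if risk.score >= 20:
        return "avoid"       # too severe to keep doing the activity
    if risk.impact >= 4 and risk.likelihood <= 2:
        return "transfer"    # rare but costly: insurance or risk sharing
    return "mitigate"        # reduce with new or existing controls


if __name__ == "__main__":
    register = [
        Risk("Untested configuration change reaches production", 3, 4),
        Risk("Cloud provider regional outage", 2, 5),
        Risk("Reporting dashboard unavailable", 2, 2),
    ]
    # Review the highest-scoring risks first.
    for risk in sorted(register, key=lambda r: r.score, reverse=True):
        print(risk.score, suggest_response(risk), risk.name)
```

The suggestion is only a prompt for discussion. The final treatment decision also has to weigh the limits on resources, time, and budget noted above.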
A risk assessment can be coupled with a business impact analysis (BIA) that provides input into business continuity and disaster planning. A BIA identifies recovery time objectives (RTOs), recovery point objectives (RPOs), critical processes, dependence on critical systems, and many other areas. It also applies the 80/20 rule: rather than creating costly recovery strategies for 100 percent of critical business functions, focus on the 20 percent of business processes that are the most critical and need to be recovered quickly in a disaster event.
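To make the 80/20 idea concrete, here is a minimal sketch that ranks BIA entries and keeps the most time-critical 20 percent. The `BusinessProcess` fields, the ranking rule (shortest RTO first, then highest estimated hourly loss), and the sample figures are hypothetical. A real BIA would weigh many more factors.

```python
from dataclasses import dataclass


@dataclass
class BusinessProcess:
    name: str
    rto_hours: float    # recovery time objective
    rpo_hours: float    # recovery point objective
    hourly_loss: float  # estimated downtime cost per hour (hypothetical figure)


def recovery_priorities(processes, percent=20):
    """Return the top `percent` of processes that must be recovered first."""
    # Shortest RTO first; ties are broken by the larger estimated hourly loss.
    ranked = sorted(processes, key=lambda p: (p.rto_hours, -p.hourly_loss))
    # Integer ceiling so that at least one process is always selected.
    count = max(1, (len(ranked) * percent + 99) // 100)
    return ranked[:count]


if __name__ == "__main__":
    bia = [
        BusinessProcess("Customer booking", 1, 0.25, 50_000),
        BusinessProcess("Crew scheduling", 1, 1, 80_000),
        BusinessProcess("Payroll", 72, 24, 1_000),
        BusinessProcess("Marketing site", 48, 24, 500),
        BusinessProcess("Internal wiki", 168, 24, 50),
    ]
    for process in recovery_priorities(bia):
        print(process.name, process.rto_hours, process.rpo_hours)
```

The processes returned are the ones that justify the more expensive recovery strategies. The rest can fall back on slower, cheaper options.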
Once a BIA is completed, organizations can determine their recovery strategies to maintain continuity of operations during a disaster. Business continuity plans should be based on the BIA and updated at least annually. Disaster recovery plans for applications, services, and data centers should be documented, tested, and maintained.
3. Establish governance for business continuity management and crisis communications
Finally, establish the appropriate governance for business continuity management (BCM). Tone at the top matters when it comes to placing the appropriate emphasis on organizational structure, roles and responsibilities, policies, and funding for BCM initiatives.
Governance includes involving the appropriate stakeholders in BCM and planning clearly for crisis management. Crisis management should cover crisis planning, crisis response, and crisis communications. Think about what information employees, board members, customers, vendors, and the media should know, and assign the appropriate spokespeople to address the issue.
It’s also a good idea to prepare statements in advance that can be revised as updates and facts come in – this is how companies can uphold transparency and honesty without causing any unnecessary alarm.
As much as you may try, you can’t prepare for – or predict – everything. COVID-19 is a black swan event of unprecedented magnitude, global scale, and impact. However, with the right focus on change management, risk management, and governance, you can help your business better prepare for the next major disaster.