When we talk about disaster recovery, most people immediately think of “smoking hole in the ground” scenarios: The data center building is engulfed in an inferno or demolished by a tornado. In reality, even small disasters can be devastating, and too often IT teams overlook disaster recovery planning.
Looking beyond fires and tornadoes, some disasters are caused by system failure, and many also result from human error: A network engineer accidentally plugs two network cables into the wrong hub. A database administrator tries to get ahead on weekend maintenance and accidentally commits new changes. Overheated systems shut down after a worker changes a data center thermostat setting from Fahrenheit to Celsius.
In planning your response, consider the “small” disasters that can be just as destructive as a data center loss: A database administrator doesn’t check the status of a backup before deleting a database. A water leak from an overhead sprinkler system damages racks of systems in a data center. A construction worker accidentally cuts into the only fiber data connection for the data center.
7 steps to building a disaster recovery plan
What should you include in your next disaster recovery plan? The specifics may differ depending on the systems involved, but at a high level you will want to focus on the following priorities:
1. Identify critical people and vendors
In most instances, the IT team performs disaster recovery, while business continuity is the responsibility of the business units. Depending on the application, the same people may fill both roles.
In the event of a disaster, immediately reach out to technology folks to respond to the disaster, internal partners to keep the business running, and external vendors to help you get back to business quickly. Disasters don’t necessarily occur during working hours, so make sure you have copies of contact information available off-site.
2. Identify critical systems and applications
Conduct a risk analysis or other prioritization exercise to determine which systems are most critical to your business, and create disaster recovery plans for these systems first. For example, in most organizations, development or test systems can wait a few days while you restore production systems.
One way to prioritize systems and applications is to rate each one on two components: the likelihood of a failure and the business impact of that failure. The combination of these components determines the criticality of the overall system.
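As a rough illustration of combining likelihood and impact, here is a minimal Python sketch. The 1-to-5 scales, the choice to multiply the two ratings, and the system names are all assumptions made for this example; your own risk analysis may weight things differently.

```python
from dataclasses import dataclass


@dataclass
class System:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (frequent)
    impact: int      # assumed scale: 1 (minor) to 5 (severe)

    @property
    def criticality(self) -> int:
        # One simple way to combine the two components: multiply them.
        return self.likelihood * self.impact


def prioritize(systems):
    """Return systems ordered from most to least critical."""
    return sorted(systems, key=lambda s: s.criticality, reverse=True)


# Hypothetical inventory, for illustration only.
systems = [
    System("dev-sandbox", likelihood=3, impact=1),
    System("payroll-db", likelihood=2, impact=5),
    System("public-website", likelihood=4, impact=4),
]

ranked = [s.name for s in prioritize(systems)]
print(ranked)
```

With these made-up ratings, the production-facing systems land at the top of the list and the development sandbox at the bottom, which matches the order in which you would write recovery plans.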
3. Are your RTO and RPO realistic and achievable?
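How quickly can you recover a system, and how old will your data be when you get it back up? These questions address two important factors in disaster recovery planning. RTO (recovery time objective) is the target for how long it takes to recover your system, and RPO (recovery point objective) is the maximum acceptable age of the data you recover. To judge whether the objectives are realistic, compare them against what your backup schedule and restore procedures can actually deliver.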
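One way to sanity-check these objectives is to compare them with numbers you already have. The sketch below is only illustrative: the function name, the inputs, and the assumption that a backup captures the state of the data at the moment it starts are all hypothetical, and the restore time is assumed to come from an actual recovery drill.

```python
from datetime import timedelta


def check_objectives(rto, rpo, backup_interval, backup_duration,
                     measured_restore_time):
    """Compare stated objectives with what the current setup can deliver.

    Assumes each backup reflects the data as of its start time. In the
    worst case, a failure hits just before a backup finishes, so the
    newest usable copy is one full interval plus one backup run old.
    """
    worst_case_data_age = backup_interval + backup_duration
    return {
        "rto_met": measured_restore_time <= rto,
        "rpo_met": worst_case_data_age <= rpo,
        "worst_case_data_age": worst_case_data_age,
    }


# Hypothetical figures for a single application.
result = check_objectives(
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
    backup_interval=timedelta(hours=24),
    backup_duration=timedelta(hours=1),
    measured_restore_time=timedelta(hours=3),
)
print(result)
```

In this made-up case the restore drill fits within the RTO, but a nightly backup cannot satisfy a one-hour RPO, so either the objective or the backup schedule would need to change.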
4. Design for redundancy and fail-over
As you create a disaster recovery plan for an application, look closely at the systems it connects to. What inter-dependencies exist? How does your application rely on other systems and applications?
Where possible, create an architecture that remains flexible in the face of an outage. In one example, you might run production systems from two different data centers so one production system can “fail over” to the other during an outage. As you design fail-over into your architecture, look for single points of failure, and find ways to address them.
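A simple inventory can make single points of failure easier to spot. The following Python sketch assumes you can list the instances behind each tier of an application; the tier and instance names are invented for illustration, and a real review would also consider shared power, network paths, and upstream dependencies.

```python
from typing import Dict, List


def single_points_of_failure(tiers: Dict[str, List[str]]) -> List[str]:
    """Return the tiers that have fewer than two instances."""
    return sorted(name for name, instances in tiers.items()
                  if len(instances) < 2)


# Hypothetical application inventory.
tiers = {
    "web server": ["web-1", "web-2"],
    "integration server": ["int-1"],
    "database server": ["db-1"],
    "fiber uplink": ["carrier-a"],
}

spofs = single_points_of_failure(tiers)
print(spofs)
```

Each tier this reports is a candidate for a second instance, a second data center, or at least a documented manual workaround.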
5. What is your vendor's disaster recovery plan?
Many organizations now outsource applications and leverage cloud- and vendor-hosted systems. While this simplifies your IT, it means the cloud and other outsourced vendors become even more critical to your operations.
Don’t be lulled into a sense of security when outsourcing applications and services. Just because you have outsourced part of your systems to a vendor or other upstream provider doesn’t mean you can ignore disaster recovery planning for those systems.
Discuss disaster recovery plans with your upstream vendor to understand how they will bring your applications and data back online in the event of a disaster. Review business continuity with your internal partners to ensure the business can continue running if the vendor’s site is down.
6. How do you go back to normal?
It’s tempting to focus only on the actual recovery portion of a disaster recovery plan – for example, your plan may involve moving production to a test server. But you can’t run production on the secondary system forever. After you’ve recovered on the test system, how do you plan to go back to a normal state?
Take this opportunity to review processes and procedures in your own organization. Do you even have a disaster recovery plan, and if so, what does it look like? Document your response and review your plans with others in your business line. Ensure everyone in the organization knows how to bring systems back online and how to continue business operations in the face of an outage. If you can do both, you will set yourself up for success.
7. Plan for redundancy
Planning for disaster recovery is also a good time to look at redundancy. What can you do to make your systems less prone to the failure of a single component? What happens to your applications if just one server in the process flow — web server, integration server, database server, or another server — becomes unavailable?
I look at systems through three different lenses:
- Redundancy: If you lose one web server, is there another that can take the load? Does the load transfer to that other server transparently? The best failure scenario is when no one notices that one component of the system had a problem because you had enough redundancy built in to prevent an outage.
- Disaster recovery: If you are unfortunate enough to experience an outage, how quickly can you bring things back online? What are the critical systems? Which systems are less important? Typically, the development and test environments get the lowest priority, and production systems get first attention.
- Business continuity: IT can address disaster recovery, the act of bringing systems back online. But while applications are down, how can the business continue to operate? Business continuity is, by definition, the responsibility of the business owners.