jim.shamlin.com

7: When the Unthinkable Happens

This chapter deals with disaster recovery. The author doesn't define what constitutes a "disaster," but the description of the concept implies that it involves "switching operations to an alternate service-delivery environment." Hence a "disaster" implies that the primary service environment is completely offline (destroyed by natural disaster, or by something less severe such as a server with a fried motherboard).

Disaster recovery plans should be created when the system is designed, so that they can provide guidance in the event that a disaster should occur. Ideally, they will cover several levels of disaster, and have contingency plans should the primary plan fail.

Assess and Declare

Every disaster starts off as a "problem," and the typical details-gathering and diagnosis determine whether it is to be classified as a disaster or dealt with as a problem. The true scale may also become evident only as a recovery is attempted and the problem is found to be larger than expected. Importantly, you should be reluctant to declare a disaster until you are relatively certain that it's not a lesser order of problem.

Once a disaster is declared, communication is critical. Not only do the staff involved in the recovery need to be spun into action, but individuals who use the system need to be alerted to the problem (to prevent panic, and to keep additional problem reports from pouring in).

Especially when it comes to online operations, you might also need to consider communicating with the public with an eye toward damage control. This is especially true if the disaster will affect a lot of people at a peak time (outages in e-commerce sites during Christmas shopping season undermined consumer confidence in e-commerce in general, and destroyed consumer confidence in several specific companies).

Enact the Recovery

(EN: The term "recovery" here is contrary to the previous chapter's definition - in which getting the system operational is merely a "resolution" and full recovery does not take place until the original systems are back to full operational status.)

The author underscores the importance of a disaster recovery plan. Especially in times of catastrophe, a paralysis sets in and people wonder what to do. Having a set of procedures can help to shorten the period of paralysis and decrease panic.

The author also suggests that the plan include an estimated cost of disaster recovery (without saying why or how), to be compared against the cost of allowing the disaster to continue, as a way of justifying the expense of recovery, which can be considerable.
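Since the author gives no method, the comparison can only be sketched under assumptions. The function below is a minimal illustration, not the author's procedure: the function name, the parameters, and the simple "loss per hour times expected outage length" model are all hypothetical.

```python
def recovery_is_justified(recovery_cost, loss_per_hour, expected_outage_hours):
    """Compare the cost of recovery with the cost of letting the outage run.

    All three inputs are hypothetical estimates that a plan would have to
    supply; the linear loss model is an assumption made for illustration.
    """
    # Projected loss if nothing is done and the outage runs its course.
    projected_loss = loss_per_hour * expected_outage_hours
    # Recovery is justified only when doing nothing would cost more.
    return projected_loss > recovery_cost
```

With made-up figures, a recovery costing 50,000 is justified against an eight-hour outage losing 10,000 per hour, but not against one losing 1,000 per hour.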

(EN: In my experience, knowledge of a disaster's cost has been counterproductive - an executive runs about screaming how many dollars are being lost each second the system is down, which increases panic and undermines the morale of the rescue team.)

It's also mentioned that the "degree" of disaster recovery has changed. Prior to the Internet, computer systems were helpful, but not entirely necessary to most businesses. Nowadays, a shutdown brings the business to a halt. The employees can't do anything, and the connections to suppliers and customers are severed.

Beyond the Recovery

In some instances, a disaster may be ongoing, and the business will need to continue to operate on its "temporary" solution for quite a long time before it can regroup (for example, if an office building burns down, you're not going to rebuild it and move back in the next week).

Most businesses have a "continuity plan" for preserving their operations in case of a disaster with a protracted period of recovery, and IT is often a critical member of the planning group.

(EN: The author goes into quite a lot of detail, but it's all general and hypothetical. There are better sources of information about continuity plans.)

Loose Topics

The author goes over a handful of "different aspects of disaster recovery", which means a sort of succotash of random information.

The author distinguishes between a cold site (a space that is designated to be used in case of disaster, but not all equipment is on-site and powered up) and a hot site (a site that is fully equipped and powered up, even when not in use). Naturally, a hot site is in a greater state of readiness, and the company can resume operations more quickly, but it entails significant ongoing expense when compared to the cold site. There's some mention of vendors who provide a hot-site environment for several firms, to decrease the cost, but the problem is that if there is a widespread disaster (earthquake or flood), many customers may be affected and the facilities would be insufficient.

An "online-ready" site refers to a location where resources (typically duplicate data, though it may contain entire duplicate systems) are kept that serve as a hot backup of the main systems and can be utilized in an emergency. The main difference from a hot site is that an online-ready site is maintained by a vendor, and your personnel will not be able to enter or work on their premises, but must access the systems remotely. Because the systems are accessible by network, location is unimportant, and a company may have multiple online-ready sites in various locations.

The author lists a few "best practices": maintaining at least two separate data centers in different geographic locations, having multiple sites to which data is mirrored on a regular basis, and having multiple copies of systems on site (development, test, staging, production) such that, in a pinch, the business can run on its staging servers if there's a problem in the production environment.
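The last practice, falling back to staging when production fails, can be sketched as a simple priority-ordered selection. This is a minimal illustration under stated assumptions: the `Environment` type, the `healthy` flag, and the function name are hypothetical, and how health is actually determined is outside the scope of the sketch.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Environment:
    """Hypothetical record for one copy of the system."""
    name: str
    healthy: bool


def pick_serving_environment(envs: Sequence[Environment]) -> Optional[Environment]:
    """Return the first healthy environment, in the order given.

    The caller lists environments in order of preference, for example
    production first and staging second.
    """
    for env in envs:
        if env.healthy:
            return env
    # No environment is usable; a fuller recovery plan would take over here.
    return None
```

Listing production ahead of staging means that staging is chosen only when production is marked unhealthy, which matches the "in a pinch" fallback described above.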