
6: Delivering Your Goods

This chapter discusses organizational issues pertaining to the delivery of applications and the maintenance and monitoring of the delivery environment.

The objectives of maintenance are to ensure the desired level of service (support), prevent problems that could disrupt service (preventative maintenance), and recover from any problems that may arise (troubleshooting).

Maintenance tasks and contingency plans should be developed as part of the planning process for any new system or component, and the funds to obtain (and maintain) the necessary personnel to support the systems should be built into the project budget as an ongoing cost.

Set Up the Delivery

The support for a delivery environment is discussed in terms of three factors: an organized support structure, centralized operations, and automation.

The author discusses the support structure in terms of "levels":

One of the main "pitfalls" of support is compartmentalization: different groups or teams are each responsible for a specific technology (network operations, database servers, etc.), which is a natural result of grouping individuals with similar skills. However, compartmentalization can result in infighting and blame-shifting that slow down the troubleshooting process (each group blames the other and refuses to investigate its own demesne). It is especially problematic when a problem falls into a gray area where ownership or responsibility is uncertain.

The author proposes centralizing operations on the model of naval operations, in which there is a "bridge" where the senior officers of each department can act cooperatively and collaboratively to get the members of their departments to lend a hand. This can be done with lower-level personnel (collaborative troubleshooting teams), but is often more effective when the orders come from a high level of authority.

Because of the complexity of systems, automation is necessary to monitor performance and call attention to problems (or potential problems). The automation is very similar to sensor-driven devices in the physical world (when a condition is detected, an action is taken - though the action may be to alert a person, such as in the case of a smoke alarm).

Some of these functions are built into the applications and systems themselves: a function that receives data generally checks it and reports or reacts if it does not meet expectations (a null where a value was expected) to prevent a malfunction. Others may need to be applied in the production environment (for example, when the number of user requests approaches the system's designed limits).
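
To illustrate, a minimal Python sketch of the two kinds of checks just described, one built into the application and one applied in the production environment; the names, fields, and limits are hypothetical rather than anything from the text:

    # A check built into the application and a check applied in production;
    # all names and thresholds here are illustrative assumptions.
    import logging

    logging.basicConfig(level=logging.INFO)
    log = logging.getLogger("delivery-monitor")

    DESIGNED_REQUEST_LIMIT = 10_000   # requests per minute the system was designed to handle
    WARNING_THRESHOLD = 0.8           # warn when load reaches 80% of that limit

    def handle_record(record: dict) -> None:
        """Application-level check: validate received data before acting on it."""
        if record.get("customer_id") is None:    # a null where a value was expected
            log.error("Rejected record with missing customer_id: %r", record)
            return                               # report and refuse, rather than malfunction
        # ... normal processing would continue here ...

    def check_request_volume(requests_last_minute: int) -> None:
        """Environment-level check applied outside the application itself."""
        if requests_last_minute >= DESIGNED_REQUEST_LIMIT * WARNING_THRESHOLD:
            log.warning("Request volume %d is approaching the designed limit of %d",
                        requests_last_minute, DESIGNED_REQUEST_LIMIT)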

While most IT departments view automated operations as a simple way to reduce personnel costs, there are some drawbacks to automation: it consumes system resources (some processing is consumed merely to monitor); it may fail to indicate a problem (it can report only what it was designed to detect); staff may become over-reliant on it (dismissing a report if the indicators are "all green," and over time troubleshooting skills may atrophy); and it can breed complacency in future development (the notion that something "has never been a problem" with existing systems).

Another problem with automation is setting tolerances to the correct level, such that systems aren't constantly throwing alarms over minor issues that drown out more serious ones, but at the same time provide some sort of alert when a seemingly minor problem occurs so that it's not ignored (and allowed to fester).
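
As a concrete illustration of the tolerance problem, a brief Python sketch of tiered thresholds (the metric names and numbers are assumptions for illustration only): minor readings are recorded without interrupting anyone, while genuinely serious readings raise an alarm.

    # Tiered tolerances: minor issues are logged so they are not forgotten,
    # serious ones go straight to a person. Metrics and thresholds are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Tolerance:
        warn_at: float    # log a warning, reviewed during routine checks
        page_at: float    # interrupt a person immediately

    TOLERANCES = {
        "disk_used_pct":   Tolerance(warn_at=70.0,  page_at=90.0),
        "error_rate_pct":  Tolerance(warn_at=1.0,   page_at=5.0),
        "response_ms_p95": Tolerance(warn_at=500.0, page_at=2000.0),
    }

    def classify(metric: str, value: float) -> str:
        t = TOLERANCES[metric]
        if value >= t.page_at:
            return "ALERT"   # serious enough to interrupt someone
        if value >= t.warn_at:
            return "WARN"    # minor, but recorded so it is not ignored
        return "OK"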

Actions to Take When Problems Arise

The speed of service recovery is critical when problems arise, and it should be monitored and measured with an eye toward improvement where necessary. A "recovery clock" illustrates the four major steps of crisis management: detection, diagnosis, resolution, and recovery. Measuring these four elements separately helps determine which part of the recovery takes the most time and is therefore assumed to be in the greatest need of attention.

(EN: I'm leery of analytics that attribute "success" to non-essential factors. While speed is important, an effective recovery is the goal, and if speed is put before effectiveness it becomes counterproductive: a staff gets its response time down to a few minutes, but only by shortcutting diagnosis, so the team is encouraged to fix a problem quickly, over and over, rather than investing the time to repair it in a way that will prevent recurrence.)
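
To make the measurement concrete, a minimal Python sketch of how the four phases of the recovery clock might be timestamped and compared; the class and method names are hypothetical:

    # Timestamp each phase of the "recovery clock" so the phases can be measured separately.
    from datetime import datetime, timezone

    PHASES = ["detected", "diagnosed", "resolved", "recovered"]

    class RecoveryClock:
        def __init__(self, opened_at: datetime) -> None:
            self.opened_at = opened_at
            self.marks: dict[str, datetime] = {}

        def mark(self, phase: str) -> None:
            """Record the moment a phase of the incident was completed."""
            if phase not in PHASES:
                raise ValueError(f"unknown phase: {phase}")
            self.marks[phase] = datetime.now(timezone.utc)

        def durations(self) -> dict[str, float]:
            """Minutes spent in each phase, showing where recovery time actually goes."""
            out, previous = {}, self.opened_at
            for phase in PHASES:
                if phase in self.marks:
                    out[phase] = (self.marks[phase] - previous).total_seconds() / 60
                    previous = self.marks[phase]
            return out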

Problem detection is generally the result of system monitoring, often an automated process that provides a constantly updated status report of system operations and benchmarks them against service-level objectives and system tolerances. There should also be periodic review of system logs and error messages to detect unusual conditions (though again, signal-to-noise must be considered when designing a system to chatter about its operations). Finally, first-level help desk personnel are a detection system of their own, weeding out user behavior from "real" problems and reporting the latter as they occur.
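
A small Python sketch of what the periodic log review might look like; the error patterns and "normal" baseline counts are assumptions for illustration, not anything prescribed by the text:

    # Count known error patterns in a log and surface only those that exceed their
    # usual background level, keeping the signal-to-noise ratio manageable.
    import re
    from collections import Counter

    BASELINE = {"timeout": 5, "connection refused": 0, "deadlock": 0}  # expected count per scan

    def scan_log(lines: list[str]) -> list[str]:
        counts = Counter()
        for line in lines:
            for pattern in BASELINE:
                if re.search(pattern, line, re.IGNORECASE):
                    counts[pattern] += 1
        # Report only the conditions that rise above their normal background chatter.
        return [
            f"{pattern}: {counts[pattern]} occurrences (baseline {BASELINE[pattern]})"
            for pattern in BASELINE
            if counts[pattern] > BASELINE[pattern]
        ]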

Once detected, information about the problem must be communicated to those who can act on it. Having too many intermediaries slows communication, and filtering poses its own problem: failing to filter produces a lot of noise, whereas filtering reports too aggressively results in problems being ignored.

The second phase, diagnosis (problem determination), depends on the information gathered in the detection phase to provide the "symptoms" of the problem. The more detailed this information, the more quickly and accurately the diagnosis can be made. Detail must be balanced against timeliness: taking too much time to gather detail stretches out the timeline to resolution. Determining the source of the problem is a matter of investigating the symptoms to arrive at the root cause.

A "resolution" to the problem is akin to first aid: it stops further damage from occurring, but does not fully repair the injury. In some instances, a single action taken to resolve a problem can result in a full recovery, but it is more common for the resolution to address only the immediate condition and generally does not restore full system functionality or prevent recursion.

The "recovery" occurs afterward, when the problem has been resolved and the system has been returned to its normal operating state. Arguably, "recovery" may also include actions taken to prevent the recurrence of the problem, but in most cases, a separate project is planned to implement a "permanent fix" to the problem, which is not considered to be part of the recovery effort.

Proactively Maintain the Delivery

Proactive maintenance is performed on production environments to ensure future operations (rather than to react to an existing problem). A good example of maintenance is installing a patch or software upgrade that improves system security: it is not a reaction to an attack in progress, but forfends against future attacks on certain vulnerabilities.

Where third-party solutions are employed (a purchased database system), preventative maintenance is generally a matter of reaction - when an upgrade or patch is released, it is installed. On custom solutions, a more proactive stance is required because you cannot rely on someone else to do it for you, which may require devoting staff to the task of analysis and prediction even in the absence of operational problems.