jim.shamlin.com

8: Conducting a Post-Mortem

Following any event, from a minor unplanned outage to a major disaster, there should be a review process so that the organization may learn from the incident, with the goal of preventing recurrence and/or preparing better for the next event so that recovery goes smoothly.

(EN: It is not mentioned by the author, but it seems important to note that a post-mortem should not be a drumhead trial that results in people being disciplined for their "mistakes" during a crisis. If people fear retribution, they will cover up any mistake or actions that were not by-the-book and the company will not have a realistic assessment of what worked or did not work with the plan. In many instances, the findings of a post-mortem are that everything went as well as could be expected)

The author suggests a three-step process to create a post-mortem report: discovery, analysis, and corrective action.

Discovery

The first step in the post-mortem is collecting information to determine what went wrong and what was done. Generally, this involves gathering data from the systems that were involved, interviewing the individuals who participated, looking over logs and problem reports, and arranging the sequence of events on a timeline. A "ticket" system for technical support provides a great deal of useful information.

The author notes that information should be gathered from employees, customers, and vendors to get a variety of perspectives. An interview can be subjective and memory unreliable, so it helps if interviewees can provide hard data to support their accounts.

When collecting data, any detail may be of importance: a log entry that seems trivial may turn out to be of great significance, so do not gather evidence with an eye toward what you expect the outcome to be: just collect the information for analysis at a later time.

The author suggests an "event timeline" that provides an identification number, date/time, event description, and additional details. Of specific importance are events that are obviously out of the ordinary - things that were operating as normal are generally not the cause of problems. It is also important to determine the contributing factors to a problem, as these merit more attention than others.
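The event timeline the author describes can be sketched as a simple data structure. This is a minimal illustration, not the author's own format: the field names (event_id, occurred_at, description, details) and the sort-by-time helper are my assumptions about a reasonable implementation.

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class TimelineEvent:
    """One entry in the post-mortem event timeline (fields are illustrative)."""
    event_id: int        # identification number
    occurred_at: datetime  # date/time the event occurred
    description: str     # brief event description
    details: str = ""    # additional details (log excerpts, ticket IDs, etc.)

def build_timeline(events):
    """Arrange collected events into chronological order for review."""
    return sorted(events, key=lambda e: e.occurred_at)
```

Keeping every collected event in one chronologically ordered list, regardless of how trivial it seems at collection time, supports the author's point that significance is judged later, during analysis.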

(EN: The author is getting ahead of himself - deciding what is "unusual" or a "contributing factor" requires judgment based on a conclusion about what caused the problem, which should be avoided in discovery so that the evidence-gathering process is not prejudiced to support a foregone conclusion).

Analysis

Once the data is gathered, you can begin the process of analysis to discover the root cause of a problem. To begin, consider the event timeline and categorize each event:

  1. Assign a significance as "problematic" (likely to have contributed to the disaster), "positive" (likely to have mitigated the damage), or "unsure."
  2. Indicate the conditions under which the event occurred (including the occurrence of earlier events)
  3. Determine if there is a causal connection between the conditions and the event

Once this is done, a chain of causation should become evident: some conditions caused problematic events, these events created conditions that caused other events, and the sum total was a disaster. The conclusion should be that if certain of these events had not occurred, the disaster would not have happened.
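The categorization and causal-chain steps above can be sketched in code. This is a hedged illustration of the idea, not the author's method: the catalog structure, the significance labels, and the backwards-walking helper are my own assumptions about how one might record and trace the chain.

```python
# The three significance labels the author proposes for each event.
SIGNIFICANCE = {"problematic", "positive", "unsure"}

def classify(event_id, significance, caused_by, catalog):
    """Record an event's significance and the earlier events believed
    to have created the conditions for it (step 1-3 of the analysis)."""
    if significance not in SIGNIFICANCE:
        raise ValueError(f"unknown significance: {significance}")
    catalog[event_id] = {
        "significance": significance,
        "caused_by": list(caused_by),  # IDs of earlier contributing events
    }

def causal_chain(event_id, catalog):
    """Walk backwards from a final event to its root causes,
    yielding the chain of causation the author describes."""
    chain, seen, stack = [], set(), [event_id]
    while stack:
        eid = stack.pop()
        if eid in seen:
            continue
        seen.add(eid)
        chain.append(eid)
        stack.extend(catalog[eid]["caused_by"])
    return chain
```

Tracing from the disaster back through the "caused_by" links surfaces the events that, had they not occurred, would have broken the chain - the candidates for corrective action.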

(EN: per my earlier comment, this presupposes the work of sorting out what is "significant" was done as part of evidence gathering - it should instead be the first step in analysis, and should be coupled with an assessment of whether each piece of evidence is good or reliable before relying upon it in analysis).

Corrective Action

After analysis is performed, measures should be taken to prevent the same disaster from happening again. Obviously, the root cause should be addressed, but you should also look at other factors that increased the severity of the problem or could have (but did not) decrease the severity. These should precipitate changes in procedures, systems, etc. that would improve the ability to detect, prevent, and respond to similar conditions in future.

(EN: The author omits a significant element: identifying the things that operated properly and the things that actually did mitigate the problem. Some analysts see these as irrelevant and find little merit in patting oneself on the back - but in my experience, these notes underscore the value of existing mechanisms, ensure they are not changed, and help to mitigate the political reaction to crisis, which tends to be to seek a scapegoat for what went wrong and fail to recognize what went right.)

The author refers to the Software Engineering Institute's Capability Maturity Model, which assesses an organization's IT support abilities with an eye toward identifying whether its processes are sound (sound, in their definition, being industry standard rather than unique, with built-in metrics and a history of being changed and improved), and results in a rating from 1 (a completely ad-hoc and chaotic process) to 5 (processes in place and continuously being improved).

(EN: I don't really buy off on "maturity" models. While it's generally true that it's better to have a process in place than to rely on people to figure things out on the spot, "maturity" does not equate to "effectiveness." Process can be stifling when carried too far, and especially when an organization is constantly changing its processes for the sake of "improvement," things get painful. When the alarm bells start ringing, it's best to have a simple procedure that everyone knows rather than a highly complex one that is changed so often that people don't know what it is from day to day.)