
7: D Is for Data

Data is the fundamental building block of a BI application. While the analysis and reporting provide the value, they must be based on reliable and relevant data in order to be successful.

Data Quality

The author refers to survey results as evidence that data sufficiency and data quality are essential to the success of BI deployments. This perspective is shared by both business and IT professionals, and the author suggests that both must take responsibility. The implication is that a BI system can "bandage" bad data by estimating unknown quantities, but ultimately, the data collection system must be designed and used to provide real data rather than approximations wherever possible.

Data quality is also cited as one of the largest problems in current business environments, as it leads to bad decision-making that can have negative consequences for the organization (one statistic: it is estimated that nearly 100,000 patients die each year as a result of bad decisions driven by poor data quality in the healthcare industry). On a less dramatic scale, it's noted that bad data undermines confidence in BI, tends to make executives dismissive of information systems in general, and that lost trust is hard to regain.

Data quality problems originate in the source systems. Largely, this is because systems are not designed to collect the necessary data, whether by omission or a desire for efficiency (it is not strictly necessary to the task at hand), or because of procedural problems (the information is "optional," so people under the pressure of performance metrics do not bother to provide it).

Regarding the accuracy of data, the author makes a brief aside about "six sigma," citing a source from the year 2000 that asserted that every level of improvement results in a 10% increase in income. (EN: This figure was later found to be erroneous, and the cost of progressing beyond the four-sigma level of process control often outweighs the benefits - not that quality is unimportant, but the "six sigma" standard is impractical and unreasonable.) The implementation of six sigma processes involves a great deal of information collection and performance metrics that can be exploited and built upon for BI purposes.

There is also the problem of data sourcing, in that data is collected from multiple, disparate systems, which makes it difficult to consolidate and derive meaningful information on the enterprise level.

Another problem is the lack of common business definitions. Within one company, there were 33 different business definitions of "customer churn," each with its own implicit metrics, so consolidating this information was impossible until the company negotiated a common definition that could be used across the board.
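
(EN: To illustrate why the definition matters, here is a minimal sketch under one hypothetical definition - a customer has churned in a period if they were active in the prior period but placed no order in the current one. A competing definition (a different window, counting only cancelled contracts, etc.) would produce a different figure from the same data.)

```python
# Minimal sketch: the churn figure depends entirely on the chosen definition.
# Assumed (hypothetical) definition: a customer has churned in a period if
# they were active in the prior period but placed no order in the current one.

def churn_rate(prior_period_customers: set, current_period_customers: set) -> float:
    """Fraction of prior-period customers who did not return this period."""
    if not prior_period_customers:
        return 0.0
    churned = prior_period_customers - current_period_customers
    return len(churned) / len(prior_period_customers)

# Example: 3 of 5 prior customers returned -> 40% churn under THIS definition.
prior = {"C001", "C002", "C003", "C004", "C005"}
current = {"C002", "C004", "C005", "C009"}
print(churn_rate(prior, current))  # 0.4
```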

Successful Data Architectures

In addition to the quality of the data itself, there must be an architecture that stores and models it in an efficient and meaningful manner. The author mentions a few different approaches, noting that success rates are notably lower for companies that operate multiple independent "data marts" (separate warehouses for separate areas of interest) as opposed to a single source system that feeds multiple presentation layers. Ideally, the data model should be optimized to support the analysis and reporting functions rather than an abstract notion derived from the data itself.

Master Data Management (MDM)

The concept of "master data" refers to the data warehouse format that organizes data from various information systems into a common set of tables that eliminates redundancies and draws relationships among the data as a whole. This is a difficult task that has received much attention, largely due to the recognition of the problem and the rash of vendors who have developed "innovative solutions" they wish to market.

Enterprise Resource Planning has attempted to address the problem, with some success, by developing common practices, such as using the same customer code across all systems. This has reduced complexity and facilitated information exchange, such that a single customer is represented by the same code in all systems. Data entry errors continue to exist (primarily from typographical errors), but having all data in a single table facilitates discovery and correction - e.g., the system can discover where two customers with similar names have the same address and flag the error for investigation.
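
(EN: A minimal sketch of the kind of check described above, using hypothetical field names and a hypothetical similarity threshold; it flags record pairs that share an address but have slightly different names, for human review rather than automatic merging.)

```python
# Minimal sketch: flag customer records that share an address but have
# similar (not identical) names, as candidates for manual investigation.
# Field names and the similarity threshold are assumptions for illustration.
from difflib import SequenceMatcher
from itertools import combinations

customers = [
    {"id": "C001", "name": "Acme Widgets", "address": "12 Main St"},
    {"id": "C002", "name": "Acme Widgets Inc", "address": "12 Main St"},
    {"id": "C003", "name": "Beta Supply", "address": "7 Oak Ave"},
]

def similar(a: str, b: str, threshold: float = 0.8) -> bool:
    """True if the two names are close enough to warrant a second look."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold

# Compare each pair of records; same address + similar name = possible duplicate.
for rec_a, rec_b in combinations(customers, 2):
    if rec_a["address"] == rec_b["address"] and similar(rec_a["name"], rec_b["name"]):
        print(f"Possible duplicate: {rec_a['id']} / {rec_b['id']}")
```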

Some changes in the architecture of business systems are required to consolidate data sources, enabling all transactional systems to access a central data source rather than keeping their own separate data tables. (EN: Taken literally, this may interfere with system efficiency - it may be necessary for a transactional system to utilize a separate data source optimized for its needs, but that data should be audited against master data, at least periodically.)

Right-Time Data

Information systems have been pressured to transition from processing data in weekly or daily batches to real-time transactions, and "real time" has come to be imposed on or expected of all data systems in general. But this seems an arbitrary standard: when business value is considered, there are few instances in which real-time data generates revenue or saves costs in excess of the cost of the resources required to provide it.

Tactical decisions may need to be made in real time, and real-time data is essential to troubleshooting, but BI focuses on making decisions of a more strategic nature, in which case periodic data updates are often sufficient.

Factors to consider when determining the need for "right-time" data include capture latency (the time between the occurrence of an event and the availability of data to the BI system), analysis latency (how long it takes to analyze data and disseminate results), and decision latency (the time it takes to make a decision based on the data).

(EN: I'd also suggest a fourth: implementation latency, the amount of time it takes to implement a decision, as there's little point in enabling real-time decisions if they cannot be effectively acted upon in real time.)
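
(EN: Expressed as simple arithmetic, the time from event to action is the sum of these latencies, so the benefit of real-time capture is bounded by the other terms. A minimal sketch with hypothetical figures:)

```python
# Minimal sketch: total data-to-action time is the sum of the latencies
# described above. All figures are hypothetical, expressed in minutes.
def total_latency(capture: float, analysis: float, decision: float,
                  implementation: float) -> float:
    return capture + analysis + decision + implementation

# Daily batch capture vs. real-time capture, with the other latencies fixed.
daily = total_latency(capture=1440, analysis=60, decision=480, implementation=960)
real_time = total_latency(capture=1, analysis=60, decision=480, implementation=960)
print(daily, real_time)  # 2940 vs. 1501 minutes
# Whether that reduction is worth the cost of real-time infrastructure is the
# business-value question the author raises.
```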

The author returns to research: among the most successful BI deployments, about a third updated data on a daily basis, only about a quarter updated in real time, and the remainder updated on a regular interval (hourly, or every 15 minutes).

In the end, the necessary level of frequency depends on the nature of the deployment and the phenomena monitored. The example is given of a business whose customers place orders twice a month; updating in the meantime is largely pointless (as it presents an analysis of a data set whose wide variances from the norm are due merely to its incompleteness).

Data Quality's Chicken and Egg

Where "messy" data exists, companies are faced with a choice: to fix the data problems as a prerequisite to a BI initiative, or to implement BI on top of imperfect data.

It would seem to be a no-brainer: since the value of BI depends on the value of the data, the results will be misleading if BI is implemented on top of unclean data. But cleaning first can be a death knell for BI initiatives, as the amount of work necessary to clean the data is considerable, and no value will be delivered until the task is done. It may also be possible to use BI as a method of discovery (you may not be aware of data problems until the system is attempted), and having "bad" BI tools may provide the justification to obtain resources to clean the data.

It's also asserted that one should not take an all-or-nothing approach, but implement and socialize the tools that can be provided, utilizing the data that is reliable, to exploit some of the value of BI, and delay other initiatives until the data is clean. In effect, this will give the organization a taste for BI, and a sense of what is possible, as an incentive to undertake the effort to obtain further functionality.