Retail And Data Analytics

This chapter examines the collection, storage, and uses of retailer data with a particular focus on point-of-sale systems.

Market Basket Data

Market basket data is typically recorded in a transaction log, which is captured at each register and stored in a central database. The individual transaction records closely resemble customer receipts and are divided into three sections:

  1. Header Data - Identifies the date, the store, and the employee.
  2. Detail Data - A list of the products sold, including price, quantity, and UPC.
  3. Tender Data - Indicates the method of payment (cash, check, card) and associated data.
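The three sections above can be pictured as a simple record structure. The following is a minimal sketch, not the layout of any particular point-of-sale system: the class names, field names, and sample values are all assumptions made for illustration, prices are held in cents, and tax is ignored.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class Header:
    """Identifies the date, the store, and the employee."""
    sale_date: date
    store_id: str
    employee_id: str


@dataclass
class DetailLine:
    """One product sold: UPC, quantity, and unit price (in cents)."""
    upc: str
    quantity: int
    unit_price_cents: int


@dataclass
class Tender:
    """Method of payment ("cash", "check", or "card") and amount paid."""
    method: str
    amount_cents: int


@dataclass
class Transaction:
    header: Header
    details: List[DetailLine]
    tender: Tender

    def total_cents(self) -> int:
        # Sum of quantity x unit price over all detail lines (tax ignored).
        return sum(line.quantity * line.unit_price_cents for line in self.details)


# Hypothetical receipt with made-up identifiers.
receipt = Transaction(
    header=Header(date(2024, 12, 20), store_id="0042", employee_id="E117"),
    details=[
        DetailLine(upc="000000000001", quantity=2, unit_price_cents=199),
        DetailLine(upc="000000000002", quantity=1, unit_price_cents=899),
    ],
    tender=Tender(method="card", amount_cents=1297),
)
```

Note that the detail lines carry only the UPC, quantity, and price; anything else about the product would be looked up elsewhere.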

A few other notes:

Finally, the data collected in the transaction log is minimal, as the UPC can be used to cross-reference information in other data systems such as the SKU, description, merchandise category, and other data pertaining to the items purchased for the purpose of analysis.

Data Storage Basics

A wealth of data is available to retailers in the present day. For example, grocery stores are using RFID tags to track the movement of inventory inside their stores - which tells them what physical path the customer takes through the store, the order in which items are added to their basket, items that are removed from the basket, and the like. In the online channel, it's possible to track every click the customer makes. All of this information is stored in databases.

Point-of-sale databases are typically very large, and since market basket data is fairly standardized, there are a number of vendors whose solutions are leveraged: NCR, IBM, and Oracle all have standard solutions for retailers of various sizes.

Because of the sheer volume of data, databases are generally based on relational models that use cross-referencing to minimize the amount of data per record (per the previous example in which the UPC cross-references more data).

The concept of "normalization" is mentioned, which requires having many small database tables instead of few large ones. Extensive cross-referencing is required to effect this. In addition to saving data space, it also reduces data contention (two queries "fighting" for the same data) and makes it less difficult and risky to update a single table.
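To make the idea of normalization and cross-referencing concrete, here is a small sketch using Python's built-in sqlite3 module. The table names, column names, and rows are hypothetical and far simpler than a real retail schema; the point is only that the transaction detail table stores the UPC, while the description and category live once in a separate product table and are joined in when needed.

```python
import sqlite3

# In-memory database with two small, normalized tables (hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE product (
    upc TEXT PRIMARY KEY,
    sku TEXT,
    description TEXT,
    category TEXT
);
CREATE TABLE txn_detail (
    txn_id INTEGER,
    upc TEXT REFERENCES product(upc),
    quantity INTEGER,
    price_cents INTEGER
);
INSERT INTO product VALUES ('000000000001', 'SKU-1', 'Paper towels', 'Household');
INSERT INTO product VALUES ('000000000002', 'SKU-2', 'AA batteries 12-pack', 'Electronics');
INSERT INTO txn_detail VALUES (1, '000000000001', 2, 199);
INSERT INTO txn_detail VALUES (1, '000000000002', 1, 899);
INSERT INTO txn_detail VALUES (2, '000000000001', 1, 199);
""")

# Cross-reference the UPC to recover the category for analysis.
rows = conn.execute("""
SELECT p.category, SUM(d.quantity * d.price_cents) AS revenue_cents
FROM txn_detail AS d
JOIN product AS p ON p.upc = d.upc
GROUP BY p.category
ORDER BY p.category
""").fetchall()

for category, revenue_cents in rows:
    print(category, revenue_cents)
```

If a product description changes, only one row in the product table needs updating, which is the "less difficult and risky to update" benefit described above.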

Data "mirroring" is also common - having multiple copies of data tables on different systems so that each system has its own resource. The redundancy provides for better responsiveness and security.

The author refers to his experience at Kmart, which uses as many as 35 data tables for marketing alone. Their market basket database has over 10 billion data records. Data from both online and B&M transactions were accessible to enable them to identify patterns that were common or unique and track a customer's entire purchase history across media.

A side note: the author advocates including the database administrators in any project that involves customer data. Their high degree of familiarity with the data and the systems is indispensable.

What Data To Collect

There are two sides to the argument over what data to collect:

The first argument maintains that data is overhead - any data you collect must be stored, and storage has a per-byte cost, such that any unnecessary data is an expense. There is also the matter of efficiency: a database that is stuffed with useless information is less responsive. As such, you should be cautious about data collection and gather only what you need, discarding the rest.

The second argument maintains that data is potential - you cannot perform an analysis if you do not have information to analyze, and when you discard data because you do not have a present need, you are discarding potential future uses. As such, you should collect everything you can because it is useful, even if you may not presently know how to use it.

Case Studies And Practical Examples

The author provides a handful of examples, with the forewarning that these are used to suggest the kinds of analysis that might be done. Each business is idiosyncratic in its operations and the customers it serves, and as such needs its own idiosyncratic tools - do not consider these examples as things that all businesses should seek to precisely emulate.

Trade Area Modeling

The trade area is a geographic region from which a physical store attracts the majority of its customers.

The size of the trade area varies by retailer: some have a small trade area, drawing many frequent buyers from a small area (e.g., an urban restaurant); others require a broader trade area (e.g., a suburban car dealership).

The data for this analysis is based on the ZIP code collected at the point of sale of an existing store.

The author mentions the outdated method of drawing rings around a store, which considers only distance in an even radius. It remains in use today but is far less accurate than a polygonal model.
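One simple way to approximate a trade area from point-of-sale ZIP codes is to take the ZIP codes with the most transactions until a target share of all transactions is covered. This is only an illustrative sketch, not the author's model: the 70% threshold, the function name, and the sample ZIP codes are assumptions, and a real polygonal model would also use the geographic boundaries of each ZIP code.

```python
from collections import Counter


def trade_area(zip_codes, target_share=0.7):
    """Return the ZIP codes that together cover at least target_share of
    transactions, taking the most frequent first, plus the share covered."""
    counts = Counter(zip_codes)
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    area, covered = [], 0
    for zip_code, n in counts.most_common():
        if covered / total >= target_share:
            break
        area.append(zip_code)
        covered += n
    return area, covered / total


# Hypothetical ZIP codes captured at the register of one store.
sample = ["48084"] * 50 + ["48083"] * 30 + ["48009"] * 15 + ["48301"] * 5
area, share = trade_area(sample)
print(area, share)
```

Unlike a ring, the resulting set of ZIP codes need not be symmetric around the store, which is why it can follow where customers actually come from.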

(EN: I have some discomfort with using the customer's home location as the basis of defining trade areas, even for brick and mortar. My sense is that this also depends on the product in question, but a better analysis might account not for a person's home so much as the routes they follow from one location to another, home being a common endpoint on that route. For example, a person might be more likely to shop a grocer that is five miles west of their home if it is on their route to work, even if there is another grocer two miles east of their home - it's closer to the home, but in the wrong direction given their typical routes.)

Real Estate Site Selection Modeling

Whereas trade areas are used to analyze the locations of customers for an existing store, site selection modeling considers areas of opportunity for a new store: the retail volume of a store depends largely on its physical location, which must be chosen carefully.

Existing trade areas are useful in determining the placement of a store, and it is reasonably easy to detect where there are "holes" on the map in which there is a sizable population that is not being served.

In addition to population, demographics can be overlaid to determine whether a given area that is not being served is a good fit - based on the assumption that the new store will attract the same kinds of customers (in terms of income, age, ethnicity, and the like).

Consideration must also be given to the competitors in the area (it is easier to attract un-served customers than to draw them away from an existing supplier) in addition to considering the impact of opening a new store on existing stores in the same area (cannibalization of sales).

The author indicates that for one retailer, he was able to cut the new store breakeven from six years to two, simply by avoiding the selection of bad sites. They were also able to tailor merchandise selection and store size to better suit the needs of smaller and more defined populations, which in some cases enabled new stores to be successful in locations where a traditional one was expected to be less appropriate.

Competitor Threat Analytics

Another location-based consideration is the presence of competition. In developed nations, there is less opportunity to move into an area where there is not already an established competitor in a given format. As such, opening a store requires consideration of competitors in the area - in some instances, intentionally locating a store where the competitor already has a strong presence with the intention of capturing their customers.

This, likewise, stems from trade area analysis, in that the location and reach of a competitor are plotted on a map to determine which competitor's stores a new store will compete with in an overlapping trade area.

Transfer sales analysis is also necessary when a given retailer has multiple stores in an existing area: some customers will prefer to shop at the new store, resulting in a corresponding decrease in business to the existing one. While transferring customers are a boon to the income of the new store, they represent no gain for the retailer.

The author gives the example of Wal-Mart, which is "master" of this model: it opens stores in areas that aggressively compete with existing chains, and considers the impact of customer transfers to ensure that, in spite of some cannibalization, the profit to the firm is improved by the opening of new stores.
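The arithmetic behind transfer sales can be sketched in a few lines. The function names and figures below are hypothetical; the only point taken from the text is that sales transferred from the chain's existing stores count for the new store but not for the retailer as a whole.

```python
def net_gain(new_store_sales, transferred_sales):
    """Sales the chain actually gains from a new store.

    transferred_sales is the part of new_store_sales that came from
    customers who previously shopped at the chain's existing stores.
    """
    if not 0 <= transferred_sales <= new_store_sales:
        raise ValueError("transferred sales must be between 0 and new store sales")
    return new_store_sales - transferred_sales


def cannibalization_rate(new_store_sales, transferred_sales):
    """Share of the new store's sales taken from sister stores."""
    if new_store_sales <= 0:
        raise ValueError("new store sales must be positive")
    return transferred_sales / new_store_sales


# Made-up annual figures, in dollars.
new_store = 10_000_000
transferred = 2_500_000
incremental = net_gain(new_store, transferred)
rate = cannibalization_rate(new_store, transferred)
print(incremental, rate)
```

A new site is attractive to the firm only if the incremental figure, not the new store's headline sales, justifies the cost of opening it.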

Merchandise Mix Modeling

Within a given store, retailers plan the merchandise mix to capture share of wallet from customers. There is potential for profit in selling additional goods to customers, and potential loss where customers must go to other retailers to purchase goods a store does not carry and, while there, also buy goods that they might otherwise have purchased at the operator's store.

In merchandizing, much attention is given to the placement of products on shelves: when two products are placed in the same shelving unit, each may boost the sale of the other. Merchandise mix modeling scales this concept to the level of a retail store. (EN: This is likely the reason many modern-day supermarkets devote an aisle to selling cookware and small appliances - you can sell more food items if you also offer the equipment to prepare them.)

Various sources of information, both external and internal, can define the affinities among products to help determine connections between merchandise. Market basket data, both internal and from research firms, demonstrates where products are purchased at the same time. More extensive research is necessary to determine products that might be compatible, even though they are not in present inventory.

The author provides an (intentionally) convoluted example of the ways in which various data and analyses are combined in planning the merchandise mix, which includes consideration of the shoppers in the present store with its present merchandise mix along with consideration of what might be attractive to shoppers who do not presently visit the store and products they do not presently purchase, but might. This includes analysis of the market basket, credit and loyalty data, shopper surveys, geographic and demographic data, seasonal patterns, etc.

Specific mention is given to turnover: a retailer operates most profitably when it can sell an item to a customer before the invoice is paid and avoid locking up capital in unsold inventory. Having the right products, in the right quantities, on a just-in-time basis is critical.

The same is a consideration for logistics: just as a store is profitable when its inventory turns quickly, so does a distribution center wish to turn inventory as fast as possible. However, there is also the consideration of transportation costs: logistics is most efficient when the trucks make as few trips as possible to a store to resupply it with merchandise.

Wal-Mart is named again as an expert in logistics management, building distribution centers around a geographic base, then extending stores around the DC, and managing the flow of inventory with impressive efficiency. This requires powerful and streamlined information systems to orchestrate the movement of goods.

Celebrity Marketing: Tracking Effectiveness

The author mentions Kmart specifically as a firm that is eager to have celebrities associated with lines of products - these may be celebrities who have their own line of merchandise, or celebrities for whom the store selects merchandise to endorse.

A specific incident is mentioned where one celebrity appeared at a store opening, doing product demonstrations and interacting with customers. This seldom fails to draw a large crowd, and the analytics department can attribute spikes in the demand for products to celebrity appearances.

Another example is a celebrity who periodically appears on talk shows and always mentions her line of clothing. Naturally, sales spiked after each such appearance, and the firm was even able to recognize that certain appearances had a more significant effect than others.

Another example surrounds sports marketing through NASCAR. Kmart can track increases in store sales as well as the increase in specific items that are mentioned on the vehicle, as a method of validating and refining the results.

This last one was particularly difficult because of the many factors involved: the size and placement of decals, the racetrack venue, the amount of time the vehicle was shown, and the like.

One particularly fortunate incident was when the pit crew was photographed using the store-branded tools and supplies - which was particularly beneficial to the sales of those items, and might have been disastrous if the photograph had shown them using a competitor's items.

House Brand Versus Name Brand

It's a proven principle that, in the great majority of instances, customers will not pay as much for house-branded merchandise as they will for name-brand merchandise.

While there is differentiation between the types of customers who prefer one or the other, the difference to the retailer is the same: it will make less revenue from house-branded items, but more profit overall as the cost of stocking these products is lower.

As such, the decision of how much of a discount is necessary is delicate and should be based on data analysis. The simplest analysis considers inventory levels (how much moves at a given price in aggregate), but it can also be coordinated with market basket analysis and, particularly, the analysis of individual customers' choices (at what price customers who normally buy a name brand switch to a house brand) and how this affects purchase frequency over time.

This is fairly straightforward for products that have a frequent purchase cycle (staple items in grocery stores), but the analysis becomes more erratic for less frequently purchased items.


There has been dramatic growth in the online channel, which has enabled retailers to reach a broader market than brick-and-mortar. In this channel, much more granular data can be easily gathered, such that each click is a traceable event. This results in an enormous amount of data, so efficiency of management is critical.

While ecommerce enables firms to more closely monitor the behavior of users in the online channel, it has been difficult to aggregate online and offline sales to determine, for example, the amount a given customer spends in each channel. As such, cross-channel analytics is an area of great interest.

The practice of aggregating online and offline transactions sounds simple, but can be extremely difficult. In addition to having the infrastructure (customer database) that will distinguish each shopper's identity, you need some method of identifying a customer, such as using the credit card number (which assumes a person has only one) or ship-to address (which doesn't distinguish the members of a given household).

The key to successful cross-channel analytics is a central repository of all customer information, against which the details collected in a transaction can be matched in order to attribute purchases to customers, as far as possible, across all channels.

There is a clear distinction in consumer preferences as to which items they choose to purchase online (items that are not needed immediately, are non-perishable, costly enough that shipping is seen as negligible) and those they continue to purchase offline (immediate need, perishable, low unit cost), but this can also be highly idiosyncratic (items such as apparel fit the criteria for online, but many customers still prefer to buy them offline).

Cross-channel analysis enables retailers to determine which channel is most effective and desirable to their customers and manage their marketing and operations accordingly.

An aside: the store environment gives retailers the ability to sell impulse items more effectively, whereas the online channel depends on intent to buy: hence knowing whether a given item sells due to intent or impulse is important to avoid false impressions.

Switch to ecommerce itself, some random bits:

Switch back to analytics: Web analytics was initially done to inform site operators of the technical performance of their site - to make decisions related to bandwidth and equipment to accommodate the traffic to a given site. To some degree, traffic was measured with an interest in user behavior, making a site more usable and more popular. It wasn't until ecommerce that there was a financial reason to attempt to understand the behavior of users and influence the design of a site to encourage purchasing behavior.

The author speaks of A/B testing (which he calls "champion/challenger") to measure click-through: the classic example is making a button red or green to determine whether color has an influence on customer behavior. He also strays across the topic of statistical regression analysis, which accommodates more factors and is more suited to real-world phenomena.
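A champion/challenger comparison of click-through rates is commonly evaluated with a two-proportion z-test. The source does not say which test the author used, so the following is a sketch under that assumption; the function name and the click counts are invented for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(clicks_a, views_a, clicks_b, views_b):
    """Compare click-through rates of champion (A) and challenger (B).

    Returns the z statistic (positive when B outperforms A) and the
    two-sided p-value under the pooled-proportion null hypothesis.
    """
    rate_a = clicks_a / views_a
    rate_b = clicks_b / views_b
    pooled = (clicks_a + clicks_b) / (views_a + views_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / views_a + 1 / views_b))
    z = (rate_b - rate_a) / std_err
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical test: red button (A) versus green button (B).
z, p_value = two_proportion_z_test(clicks_a=100, views_a=1000,
                                   clicks_b=130, views_b=1000)
print(round(z, 2), round(p_value, 3))
```

A small p-value suggests the difference in click-through is unlikely to be chance alone, though it says nothing about effects in other channels.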

It becomes increasingly difficult to track the behavior of real-world customers in multiple media - where a win for the online channel may be countered (or outweighed) by a loss to other channels. For example: an online loyalty program may transfer sales from the store channel and lose the impulse purchases made by brick-and-mortar shoppers who are led to prefer the online medium.

The author mentions discount card programs that use a customer ID number at the register and online as a way to track customers who choose to participate. This enables retailers to track customers through all channels, observing their behavior across channels and measuring the effect of changes in one channel to another.

He also speaks to a phased approach where the customer ID is introduced for limited purposes and later expanded to other uses. While firms may seek to roll out on a wider basis, an incremental approach enables them to do so more fluidly while gaining increasing benefits over time.

It is also adaptable to serve very specific needs. While the overall objective is to improve sales, an initiative cannot be taken on such a broad scale, but will likely be addressed on a project basis, with each project addressing a specific break point (getting more shoppers to the store, discovering why they bail out in the ordering process, etc.)

There is also the matter of change management for the customer: if a site's ordering mechanism is completely redesigned, its impact will be dramatic (and possibly dramatically negative) and there will be no indication of what aspect of the change was successful or unsuccessful. By running a long series of A/B tests and adjusting individual elements, the store can ease customers from the present design to a future one and make adjustments much less intrusively.

Affinity and Loyalty

Affinity merchandising identifies products that consumers are likely to purchase at the same time, often based on existing behavior (items that often appear in the same basket). With additional effort, you can also identify products that are purchased by the same customer over a longer period of time.

This data can be leveraged to identify opportunities to increase sales. A simple example is hanging candy in the cosmetics department (knowing that while mom shops for lipstick, junior will grab the candy), or it may involve the arrangement of departments within a store, determining the merchandise selection, or opening a store in a strip mall with other merchants.

Traditionally, there is very little diversity in the layout of stores: clothing stores and grocery stores have the same departments, arranged in a similar way, largely based on operational efficiency.

The author mentions an experiment he did that was designed to place affinity items closer together, and his results were a 20% increase in sales with a 22% reduction in the number of items being carried. He suggests another experiment in a hardware store where it was found that affinity between products was seasonal.

(EN: There is a counter-argument: A bit of inside information I picked up in the grocery industry indicates that forcing the customer to go to different aisles to purchase products that are commonly bought together makes the customer travel through the store, and that they will be more likely to purchase more items as they pass through more displays. It's purely anecdotal, but sounds reasonable.)

The author describes the basic technique, which is to review the detail data (items sold) from receipts to determine when items are purchased at the same time, and to extend this over time by combining receipts from individual shoppers to detect behaviors and patterns.
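The basic technique can be sketched by counting how often pairs of items appear on the same receipt and comparing that to what would be expected if the items sold independently (a ratio usually called "lift"). The source does not name a specific measure, so lift is an assumption here, as are the function name and the sample baskets.

```python
from collections import Counter
from itertools import combinations


def pair_lift(baskets):
    """Return {(item_a, item_b): lift} for every pair seen together.

    lift > 1 means the pair shares a basket more often than chance
    would predict; lift < 1 means less often.
    """
    n = len(baskets)
    item_counts = Counter()
    pair_counts = Counter()
    for basket in baskets:
        items = sorted(set(basket))  # ignore repeated items within a basket
        item_counts.update(items)
        pair_counts.update(combinations(items, 2))
    lifts = {}
    for (a, b), together in pair_counts.items():
        support = together / n
        expected = (item_counts[a] / n) * (item_counts[b] / n)
        lifts[(a, b)] = support / expected
    return lifts


# Hypothetical detail data from four receipts.
baskets = [
    ["paper towels", "cleaner"],
    ["paper towels", "cleaner", "soda"],
    ["soda", "chips"],
    ["paper towels"],
]
lifts = pair_lift(baskets)
for pair, lift in sorted(lifts.items()):
    print(pair, round(lift, 2))
```

Grouping receipts by a customer identifier before calling the function would extend the same count from single baskets to purchases over time.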

He crosses over a bit into loyalty marketing, which deals more with behavior over time rather than items purchased together, but it's largely the same: you can analyze past behavior to make a fairly accurate prediction of when a given customer will repurchase a given item, which enables more effective promotion. It also enables a firm to isolate and cater to the customers who generate the majority of store profits.
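A very simple form of repurchase prediction is to average the gaps between a customer's past purchases of an item and project that gap forward from the most recent purchase. This is a naive sketch for illustration, not the author's model; the function name and dates are hypothetical, and a production model would account for quantity purchased, seasonality, and promotions.

```python
from datetime import date, timedelta


def predict_next_purchase(purchase_dates):
    """Estimate the next purchase date from the mean gap between purchases.

    Returns None when there are fewer than two purchases to compare.
    """
    dates = sorted(purchase_dates)
    if len(dates) < 2:
        return None
    gaps = [(later - earlier).days for earlier, later in zip(dates, dates[1:])]
    mean_gap = sum(gaps) / len(gaps)
    return dates[-1] + timedelta(days=round(mean_gap))


# Hypothetical purchase history for one customer and one item.
history = [date(2024, 1, 1), date(2024, 1, 15), date(2024, 1, 29)]
next_date = predict_next_purchase(history)
print(next_date)
```

A promotion timed shortly before the predicted date is the kind of "more effective promotion" the text refers to.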

Market Basket Analysis Examples

The author provides a number of brief examples of analyses of affinity and loyalty done with market basket data:

Store Departmental Cross-Selling

On a broader scale, market basket analysis can be used to determine the arrangement of departments within a store. He suggests, by way of example, that it is considered less intensely than it should be: one department store manager had totally redesigned the apparel area at a cost of "several million dollars" and engaged him afterward to do an affinity analysis to explain why sales had not dramatically increased. The store had not engaged him beforehand and had never even seen an affinity analysis before. The lack of consideration harmed store revenue and required significant changes at additional cost to undo the damage done by an arbitrary and uninformed decision.

Single Category Affinity Analysis

The author looks to a study he did for Kmart, specifically investigating which items were sold most often with paper towels, to see if this staple low-margin product could be leveraged to increase the sale of other products.

His first observation was that the advertising department discounted a broad selection of merchandise in its sales promotions, and he considered that the effectiveness of the advertising would be increased by discounting items that are frequently purchased together.

He also observed that the merchandise group was divided into silos, each of which was myopic: a buyer was interested in the sales of the items he managed and entirely indifferent to the impact on sales of other categories. (EN: I sense a "blame the people" perspective here. I expect that the problem is that poor management provided incentives for individual performance and created rivalry among groups on the misguided principle that internal competition is healthy.)

For the analysis, the author pulled all basket data containing paper towel SKUs, distinguishing between when they were sold at regular or reduced price. The findings were:

The big win, which might seem obvious in retrospect, is that the store did not need to discount affinity items when the primary product, paper towels, was advertised - and doing so was giving away "millions in unnecessary markdowns" because people would likely purchase the related products at regular price.

Checkout Register Impulse Items

Another observation was that sales of the impulse items merchandized near the cash registers decrease sharply in the weeks leading up to Christmas. As Christmas is the busiest shopping period of the year, this was seen as a major loss, especially since impulse items tend to carry higher margins.

To investigate the issue, the author looked at market basket data during the season in question, with a keen eye for items that fit the physical requirements of register impulse displays (small items that fit the racks and shelves near a register) that had high volume but not a high degree of affinity to other items in the market basket (a defining characteristic of impulse buys).
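The screening described above amounts to a filter over three criteria. The sketch below assumes each item has already been summarized with a unit volume, a flag for whether it fits a register rack, and a single affinity score (for example, its strongest co-purchase ratio with any other item). All field names, thresholds, and figures are hypothetical and are not taken from the author's study.

```python
def impulse_candidates(items, min_units, max_affinity):
    """Return names of items that fit the register display, sell in high
    volume, and show low affinity to other items in the basket."""
    return sorted(
        name
        for name, info in items.items()
        if info["fits_register_rack"]
        and info["units_sold"] >= min_units
        and info["affinity"] <= max_affinity
    )


# Made-up seasonal summary; affinity is an assumed precomputed score.
seasonal_items = {
    "disposable camera": {"units_sold": 9000, "fits_register_rack": True, "affinity": 1.1},
    "cellophane tape 4-pack": {"units_sold": 12000, "fits_register_rack": True, "affinity": 1.0},
    "artificial tree": {"units_sold": 15000, "fits_register_rack": False, "affinity": 1.0},
    "gift wrap roll": {"units_sold": 20000, "fits_register_rack": True, "affinity": 3.5},
    "novelty keychain": {"units_sold": 300, "fits_register_rack": True, "affinity": 1.0},
}
candidates = impulse_candidates(seasonal_items, min_units=5000, max_affinity=1.2)
print(candidates)
```

Each rejected item fails exactly one criterion in this sample, which makes it easy to see how the three conditions narrow the list.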

By doing so, he was able to identify three items that were popular impulse buys during the holiday season: disposable cameras, four-pack cellophane tape, and 12-packs of double-A batteries. (EN: I see strong correlations between these items and seasonal behavior: people take photos during the holidays, need tape for hanging decorations and wrapping gifts, and need batteries for small electronic items given as gifts - so for the last two, at least, I think some affinity may have been overlooked.)

Placing these items near the register during the holiday season resulted in "millions of dollars" of additional sales (in a national chain), and the results were validated against control stores that did not stock the items near the registers.