12 - Analyze Your Test Results

The author refers to a few common fallacies that are held about statistics. For example, the gambler's fallacy leads many people to believe that something that has not occurred recently is "due" and therefore more likely than random chance to occur (the flip of a coin is still 50/50 regardless of whether it has come up heads five times in a row). This is only one of a plethora of superstitions and misconceptions among people who feel they are competent in statistics, but have never taken a single class in the subject.

(EN: This is a very serious problem in the modern age, as it's the collision of several cultural dysfunctions. People accept common practice as being right, and wish to be appreciated by others even when they are obviously wrong about something. Add to those a media that often uses statistics to aggrandize and misrepresent facts, and you have a culture that is obsessed with numbers and lacking in basic numeracy.)

The most thorough investigation can be undermined by poor statistical interpretation - and while it's beyond the scope of the present book to teach statistical analysis, the author feels that providing some basic information will help readers to recognize and avoid common problems in analyzing test results.

Embrace Not Knowing

Taking a scientific approach requires a high degree of humility. It's not about knowing all the answers - but about knowing how to find the answers. An analyst must be comfortable saying "I don't know" and "that theory should be tested" rather than delivering a fast answer with the appearance of confidence.

It's also necessary to set aside your preferences - the idea that seemed the best before testing is often surpassed by another option. Ultimately, what's "Good" is not a matter of opinion, but is demonstrated by functional results.

Once testing efforts have been done, there will again be the expectation that your experience gives you all the answers - that through testing you now know what works and what doesn't in every instance. And again, you'll need to control expectations and your own ego to answer accurately, admitting that even if you have tested something similar, the results may not apply in all instances, and testing is necessary to confirm that a finding is portable.

Monitoring Tests

While a test is ongoing, there is a constant threat of premature evaluations. An eager sponsor will see early results and wish to pull the test and implement a solution, whereas a timorous one will wish to pull the test if the needle twitches in the wrong direction. Ideally, you should wait for statistical significance before taking any action.
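
As a rough illustration of what "waiting for statistical significance" can mean in practice, the sketch below applies a standard two-proportion z-test to conversion counts. The book does not prescribe this particular test; the function names and all of the visitor and conversion counts are hypothetical, chosen only to show an early lead that looks large but is not yet significant.

```python
import math


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # nobody (or everybody) converted: nothing to compare
    z = (conv_b / n_b - conv_a / n_a) / std_err
    # Two-sided tail area of the standard normal distribution.
    return math.erfc(abs(z) / math.sqrt(2))


def is_significant(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """True when the difference clears the chosen level of confidence."""
    return two_proportion_p_value(conv_a, n_a, conv_b, n_b) < 1 - confidence


# Hypothetical counts: the same test peeked at early, then read at full length.
early = is_significant(12, 100, 20, 100)            # 12% vs 20% on 100 visitors each
final = is_significant(1200, 10_000, 1400, 10_000)  # 12% vs 14% on 10,000 each
print(early, final)
```

In these made-up numbers the early gap is wider than the final one, yet only the full-length result clears the 95% bar, which is the reason for not acting on early readings.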

Once a test has begun, the best practice might be to take a break and refrain from watching the pot. It won't boil any faster, and constant attention is constant temptation to fiddle with it. Set expectations of when results will be available, and report nothing until such time as they are.

In an impatient culture, it may be worthwhile to be more modest in testing - to plan tests that run for two weeks (so the proportion of weekend/weekday behavior is accurate) by testing fewer factors or using a lower level of confidence, which is a tradeoff of accuracy for the sake of speed.
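
The tradeoff between confidence and speed can be made concrete with a textbook sample-size approximation for comparing two proportions. This is a sketch under stated assumptions, not the author's method: the function name, the 80% statistical power, the 5% baseline conversion rate, and the 20% relative lift are all hypothetical inputs.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline, lift, confidence=0.95, power=0.80):
    """Approximate visitors needed per variation to detect a relative lift.

    baseline: current conversion rate (e.g. 0.05)
    lift:     relative improvement to detect (e.g. 0.20 for +20%)
    power:    assumed chance of detecting the lift if it is real
    """
    p1 = baseline
    p2 = baseline * (1 + lift)
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # two-sided
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# Same hypothetical test planned at two levels of confidence.
n_95 = visitors_per_variation(0.05, 0.20, confidence=0.95)
n_80 = visitors_per_variation(0.05, 0.20, confidence=0.80)
print(n_95, n_80)
```

Lowering the confidence level reduces the traffic required, so the test finishes sooner, at the cost of a greater chance that the reported winner is a fluke. Testing fewer variations helps in a different way: each remaining variation receives a larger share of the same traffic.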

It may also be necessary to design a test to include a kill switch - identifying criteria in advance (rather than as a gut-feel reaction to seeing data) that will lead to turning off the entire test or eliminating underperforming variations. But be aware that in doing so you may be smothering a potential winner in the cradle.

(EN: The author provides no guidance. But going by what he's said in general, it might be best to set a checkpoint at the 80% level of confidence to thin the herd, as a variation that's far behind the pack at that point is unlikely to pull ahead by the time 95% confidence is reached.)

Keep in mind that there is not always a winner, and the outcome of a test may well be that the change makes no difference. It is tempting to think that running the test for longer might compel the differences to appear, but statistically speaking that is not likely: the "confidence" in statistics reflects the confidence that the observed figures hold for the whole audience rather than just the sample, not a promise that a difference exists.

Evaluating Results

A change seldom has a singular result, and a winning option may have positive results in some regards but negative results in others - for example, a design change gets more people to click "check out" but those people purchase fewer items in total. Whether that constitutes "winning" depends on the financial outcomes (whether the increase in sales exceeds the decrease in revenue per customer) or the objectives of the company.
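
A small worked example may help here. The sketch below compares two variations on revenue per visitor rather than on checkout clicks alone; the metric choice, the function name, and every figure are hypothetical and serve only to show how a variation can win on orders and still lose on money.

```python
def revenue_per_visitor(visitors, orders, avg_order_value):
    """Total revenue divided by the number of visitors who saw the variation."""
    return orders * avg_order_value / visitors


# Hypothetical figures: the new design gets 20% more orders,
# but each order is smaller.
control = revenue_per_visitor(visitors=10_000, orders=300, avg_order_value=80.0)
variant = revenue_per_visitor(visitors=10_000, orders=360, avg_order_value=62.0)

print(f"control: {control:.3f}  variant: {variant:.3f}")
print("variant wins on revenue" if variant > control else "control wins on revenue")
```

With these numbers the control earns 2.40 per visitor and the variant 2.232, so the design that "won" on checkout clicks would reduce revenue. A company whose objective is order count or new customers might still prefer it.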

The author remarks that statistics are not infallible - primarily because they pertain to a specific span in time. What is "proven" by a test in the spring may not hold true in winter. A button that gets more clicks may no longer do so when the page template is changed around it. Neither the customers nor the site is frozen in time. What was true during the two weeks the test was performed will not be true forever.

There is also a limit to certainty: a 95% level of confidence suggests the results will be correct 19 out of 20 times. It will never be 100%, and there is no guarantee that a given result is not one of the 1-in-20 cases in which chance alone produced the difference. Ultimately, testing brings evidence to help us make better decisions, and cannot guarantee perfect decisions. Evidence at the 95% level of confidence is about as good as it gets, and testing to the 99.9% level of confidence will take considerable time.
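
One way to see the 19-out-of-20 figure at work is to simulate tests in which both variations are identical, so that any "significant" result is a false alarm. The sketch below is only an illustration: the two-proportion z-test, the 10% conversion rate, the group size, the number of trials, and the random seed are all assumptions made for the example.

```python
import math
import random


def p_value(conv_a, conv_b, n):
    """Two-sided two-proportion z-test p-value for equal group sizes."""
    pooled = (conv_a + conv_b) / (2 * n)
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    if std_err == 0:
        return 1.0
    z = (conv_b - conv_a) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def false_alarm_rate(trials=1000, visitors=500, true_rate=0.10, seed=1):
    """Share of identical-vs-identical tests that look significant at 95%."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    alarms = 0
    for _ in range(trials):
        # Both groups convert at the same true rate: there is no real winner.
        a = sum(rng.random() < true_rate for _ in range(visitors))
        b = sum(rng.random() < true_rate for _ in range(visitors))
        if p_value(a, b, visitors) < 0.05:
            alarms += 1
    return alarms / trials


rate = false_alarm_rate()
print(f"false alarms: {rate:.1%}")
```

The share of false alarms should land in the neighborhood of 5%, or about one test in twenty, even though nothing was different between the two groups.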

The author also notes that a test proves only what is tested. That is to say that if a test results in 20% more subscribers to a newsletter, it does not mean any of those individuals will become paying customers, nor even that a similar proportion of new subscribers will become buyers, if your study has not tracked them through the entire flow.

There is also the tendency to generalize results. If more products were sold by adding a photograph to a page, it does not mean that any photograph can be added to a page to increase product sales - the results were specific to the product and photograph in question.

The author lists a number of "research facts" that people commonly promote (always put buttons on the right side of the page, reduce the number of clicks, pictures of smiling people increase conversion rates, videos don't work, etc.) and asserts that principles such as these may be the result of a single test conducted on a very specific site and audience. The fact behind them might be that "pictures of smiling people caused a 1% increase in sales on a mobile application selling shoes to teenage girls in Finland" - which is to say, the results do not come from your industry, your market, your channel, or your customers.

Inconclusive Tests

A positive test outcome generally meets with great fanfare because it points the way for the firm to earn greater revenue. A negative outcome seldom gets fanfare but is considered a learning experience, and is celebrated in secret by the people who wanted things to stay as they are. But an inconclusive test seems to be universally unwelcome: it seems to have accomplished nothing at all for the money that was spent. The author's take is that learning when things do not matter is valuable as a process of elimination, enabling you to focus future efforts on other elements.

(EN: From a design perspective, what it also does is enable you to clear out clutter - if having something on the page does nothing for achieving your results, it can be removed ... and given that many Web sites have snowballed over the years with little additions that people assumed would be an improvement, there's often quite a lot to cut.)