The one line split-test, or how to A/B all the time

Split-testing is a core lean startup discipline, and it's one of those rare topics that comes up just as often in a technical context as in a business-oriented one when I'm talking to startups. In this post I hope to talk about how to do it well, in terms appropriate for both audiences.

First of all, why split-test? In my experience, the majority of changes we make to products have no effect at all on customer behavior. This can be hard news to accept, and it's one of the major reasons people don't split-test. Who among us really wants to find out that our hard work is for nothing? Yet building something nobody wants is the ultimate form of waste, and the only way to get better at avoiding it is to get regular feedback. Split-testing is the best way I know to get that feedback.

My approach to split-testing is to try to make it easy in two ways: incredibly easy for the implementers to create the tests and incredibly easy for everyone to understand the results. The goal is to have split-testing be a continuous part of our development process, so much so that it is considered a completely routine part of developing a new feature. In fact, I've seen this approach work so well that it would be considered weird and kind of silly for anyone to ship a new feature without subjecting it to a split-test. That's when this approach can pay huge dividends.

Reports
Let's start with the reporting side of the equation. We want a simple report format that anyone can understand, and that is generic enough that the same report can be used for many different tests. I usually use a "funnel report" that looks like this:


             Control        Hypothesis A   Hypothesis B
Registered   1000 (100%)    1000 (100%)    500 (100%)
Downloaded    650 (65%)      750 (75%)     200 (40%)
Chatted       350 (35%)      350 (35%)     100 (20%)
Purchased     100 (10%)      100 (10%)      25 (5%)


In this case, you could run the report for any time period. The report shows what happened to the customers who registered in that period (a so-called cohort analysis): for each cohort, what percentage went on to take each action we care about. As written, it tells you about new customers specifically, but you can build the same kind of report for any sequence of actions, not just ones relating to new customers.
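
For the technically inclined, here is a minimal sketch in PHP of how a report like this might be computed from a flat event log. The log format, the field names, and the funnel_report helper are assumptions I am making purely for illustration; your own analytics storage will differ:

// A sketch of turning a flat event log into the funnel report above.
function funnel_report(array $events, array $steps) {
    // Group distinct users by hypothesis and by the actions they took.
    $seen = array();
    foreach ($events as $e) {
        $seen[$e['hypothesis']][$e['action']][$e['user_id']] = true;
    }

    $report = array();
    foreach ($seen as $hypothesis => $actions) {
        // The first step ("Registered") defines the cohort for this column.
        $cohort = isset($actions[$steps[0]]) ? count($actions[$steps[0]]) : 0;
        foreach ($steps as $step) {
            $n = isset($actions[$step]) ? count($actions[$step]) : 0;
            $pct = $cohort > 0 ? round(100 * $n / $cohort) : 0;
            $report[$hypothesis][$step] = "$n ($pct%)";
        }
    }
    return $report;
}

// One row per (user, action) observed during the reporting period, e.g.:
$events = array(
    array('user_id' => 1, 'hypothesis' => 'control', 'action' => 'Registered'),
    array('user_id' => 1, 'hypothesis' => 'control', 'action' => 'Downloaded'),
    // ... and so on for every user in the cohort
);
print_r(funnel_report($events, array('Registered', 'Downloaded', 'Chatted', 'Purchased')));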

If you take a look at the dummy data above, you'll see that Hypothesis A is clearly better than Hypothesis B, because it beats B at every stage of the funnel. But compared to the control, it only wins at the "Downloaded" stage; by "Chatted" the two are even, and they stay even through "Purchased." This kind of result is typical when you ship a redesign of some part of your product. The new design improved on the old one in several ways, but those improvements didn't translate all the way through the funnel. Usually, I think that means you've lost some good aspect of the old design. In other words, you're not done with your redesign yet. The designers might be telling you that the new design looks much better than the old one, and that's probably true. But it's worth conducting some more experiments to find a new design that beats the old one all the way through. In my previous job, this led us to confront the disappointing reality that sometimes customers actually prefer an uglier design to a prettier one. Without split-testing, your product tends to get prettier over time. With split-testing, it tends to get more effective.

One last note on reporting. Sometimes it makes sense to measure the micro-impact of a micro-change. For example, by making this button green, did more people click on it? But in my experience this is not useful most of the time. That green button was part of a customer flow, a series of actions you want customers to complete for some business reason. If it's part of a viral loop, it's probably trying to get them to invite more friends (on average). If it's part of an e-commerce site, it's probably trying to get them to buy more things. Whatever its purpose, try measuring it only at the level you actually care about. Focus on the output metrics of that part of the product, and the problem becomes a lot clearer. It's one of those situations where more data can impede learning.

I had the opportunity to pioneer this approach to funnel analysis at IMVU, where it became a core part of our customer development process. To promote this metrics discipline, we would present the full funnel to our board (and advisers) at the end of every development cycle. It was actually my co-founder Will Harvey who taught me to present this data in the simple format we've discussed in this post. And we were fortunate to have Steve Blank, the originator of customer development, on our board to keep us honest.

Code
To make split-testing pervasive, it has to be incredibly easy. With an online service, we can make doing a split-test as easy as not doing one. Whenever you are developing a new feature, or modifying an existing feature, you already have a split-test situation: the product as it will exist (in your mind), and the product as it exists already. The only change you have to get used to as you start to code in this style is to wrap your changes in a simple one-line condition. Here's what the one-line split-test looks like in pseudocode:


if( setup_experiment(...) == "control" ) {
    // do it the old way
} else {
    // do it the new way
}


The call to setup_experiment has to do all of the work, which for a web application involves a sequence something like this:
  1. Check if this experiment exists. If not, make an entry in the experiments list that includes the hypotheses included in the parameters of this call.
  2. Check if the currently logged-in user is part of this experiment already. If she is, return the name of the hypothesis she was exposed to before.
  3. If the user is not part of this experiment yet, pick a hypothesis using the weightings passed in as parameters.
  4. Make a note of which hypothesis this user was exposed to. In the case of a registered user, this could be part of their permanent data. In the case of a not-yet-registered user, you could record it in their session state (and translate it to their permanent state when they do register).
  5. Return the name of the hypothesis chosen or assigned.
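
To make those steps concrete, here is a minimal sketch of what such a setup_experiment function might look like. It is an illustration under simplifying assumptions (PHP sessions for per-user storage, integer weights), not a finished implementation; the function name matches the pseudocode above, but everything inside it is a stand-in you would adapt to your own stack:

// A sketch of setup_experiment following the five steps above. It assumes
// PHP sessions for per-user storage; a real system would keep the experiment
// registry and per-user assignments in a database so they survive across
// sessions and can feed the funnel report.
function setup_experiment($experiment, array $hypotheses) {
    if (session_status() !== PHP_SESSION_ACTIVE) {
        session_start();
    }

    // Step 1: register the experiment and its hypotheses if it's new.
    if (!isset($_SESSION['experiments'][$experiment])) {
        $_SESSION['experiments'][$experiment] = $hypotheses;
    }

    // Step 2: if this user is already in the experiment, return the same answer.
    if (isset($_SESSION['assignments'][$experiment])) {
        return $_SESSION['assignments'][$experiment];
    }

    // Step 3: otherwise pick a hypothesis using the supplied weightings.
    $total = 0;
    foreach ($hypotheses as $h) {
        $total += $h[1];
    }
    $roll = mt_rand(1, $total);
    $chosen = $hypotheses[0][0];
    foreach ($hypotheses as $h) {
        $roll -= $h[1];
        if ($roll <= 0) {
            $chosen = $h[0];
            break;
        }
    }

    // Step 4: remember the assignment so the user gets a consistent experience.
    $_SESSION['assignments'][$experiment] = $chosen;

    // Step 5: return the name of the hypothesis chosen.
    return $chosen;
}
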
From the caller's point of view, you just pass in the name of the experiment and its various hypotheses. You don't have to worry about reporting, or assignment, or weighting, or, well, anything else. You just ask "which hypothesis should I show?" and get the answer back as a string. Here's what a more fleshed-out example might look like in PHP:



$hypothesis = setup_experiment("FancyNewDesign1.2",
                               array(array("control", 50),
                                     array("design1", 50)));
if( $hypothesis == "control" ) {
    // do it the old way
} elseif( $hypothesis == "design1" ) {
    // do it the fancy new way
}

In this example, we have a simple even 50-50 split test between the way it was (called "control") and a new design (called "design1").
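
The weights don't have to be even, either. One pattern I find useful (my own example, using the same setup_experiment call, not something the technique requires) is to expose a riskier change to a small slice of users first, then widen the split once the funnel holds up:

// An uneven split: expose the new design to only 10% of users at first.
$hypothesis = setup_experiment("RiskyCheckoutRewrite",
                               array(array("control", 90),
                                     array("rewrite", 10)));
if( $hypothesis == "control" ) {
    // existing checkout flow, seen by 90% of users
} else {
    // new checkout flow, seen by roughly 1 in 10 users
}

Because each user's assignment is remembered, you can raise the weighting later without disturbing the experience of users who have already been assigned.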

Now, it may be that these code examples have scared off our non-technical friends. But for those who persevere, I hope this will prove helpful as an example you can show to your technical team. Most of the time when I am talking to a mixed team with both technical and business backgrounds, the technical people start worrying that this approach will mean massive amounts of new work for them. But the discipline of split-testing should be just the opposite: a way to save massive amounts of time. (See "Ideas. Code. Data. Implement. Measure. Learn." for more on why those savings are so valuable.)


Hypothesis testing vs hypothesis generation
I have sometimes opined that split-testing is the "gold standard" of customer feedback. This gets me into trouble, because it conjures up for some people the idea that product development is simply a rote mechanical exercise of linear optimization: you just constantly test little micro-changes and follow a hill-climbing algorithm to build your product. This is not what I have in mind. Split-testing is ideal when you want to put your ideas to the test, to find out whether what you think is really what customers want. But where do those ideas come from in the first place? You need to make sure you don't drift away from trying bold new things, using some combination of your vision and in-depth customer conversations to come up with the next idea to try. Split-testing doesn't have to be limited to micro-optimizations, either. You can use it to test large changes as well as small ones. That's why it's important to keep the reporting focused on the macro statistics that you care about. Sometimes, small changes make a big difference. Other times, large changes make no difference at all. Split-testing can help you tell which is which.

Further reading
The best paper I have read on split-testing is "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO" - it describes the techniques and rationale used for experiments at Amazon. One of the key lessons it emphasizes is that, in the absence of data about what customers want, companies generally revert to the Highest Paid Person's Opinion (hence, HiPPO). An even more important idea is the discipline to insist that any product change that doesn't move the metrics in a positive direction should be reverted. Even if the change is "only neutral" and you really, really, really like it better, force yourself (and your team) to go back to the drawing board and try again. When you started working on that change, surely you had some idea of what it would accomplish for your business. Check your assumptions: what went wrong? Why did customers like your change so much that they didn't change their behavior one iota?
