Learning is better than optimization (the local maximum problem)
Lean startups don't optimize. At least, not in the traditional sense of trying to squeeze every tenth of a point out of a conversion metric or landing page. Instead, we try to accelerate validated learning about customers.
For example, I'm a big believer in split-testing. Many optimizers are in favor of split-testing, too: direct marketers, landing page and SEO experts, heck, even the Google Website Optimizer team. But our interest in the tactic of split-testing is only superficially similar to theirs.
Take the infamous “41 shades of blue” split-test. I understand and respect why optimizers want to do tests like that. There are often counter-intuitive changes in customer behavior that depend on little details. In fact, the curse of product development is that sometimes small things make a huge difference and sometimes huge things make no difference. Split-testing is great for figuring out which is which.
But what do you learn from the "41 shades of blue" test? You only learn which specific shade of blue customers are more likely to click on. And, in most such tests, the differences are quite small, which is why sample sizes have to be very large; in Google's case, often in the millions of people. When people (OK, engineers) who have been trained in this model join a typical startup, they quickly get confused. How can we do split-testing when we have only a pathetically small number of customers? What's the point when the tests aren't going to be statistically significant?
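To get a feel for why those sample sizes balloon, here is a back-of-the-envelope sketch using Lehr's rule of thumb for an A/B test (roughly 80% power at a 5% significance level). The 2% baseline click-through rate and the 0.1-point lift are made-up numbers, purely for illustration:

```python
# Lehr's rule of thumb: per-variant sample size ~= 16 * variance / delta^2
# for ~80% power at a 5% two-sided significance level. The baseline rate and
# the lift below are hypothetical, just to show the order of magnitude.
baseline_rate = 0.02     # assume a 2% click-through rate
lift = 0.001             # trying to detect a 0.1-percentage-point improvement

variance = baseline_rate * (1 - baseline_rate)   # Bernoulli variance
n_per_variant = 16 * variance / lift ** 2

print(round(n_per_variant))   # ~313,600 people *per shade* of blue
```

With dozens of variants, you are quickly into the millions of visitors, which most startups simply do not have.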
And they’re not the only ones. Some designers also hate optimizing (which is why the “41 shades of blue” test is so famous – a famous designer claims to have quit over it). I understand and respect that feeling, too. After you’ve spent months on a painstaking new design, who wants to be told what color blue to use? Split-testing a single element in an overall coherent design seems ludicrous. Even if it shows improvement in some micro metric, does that invalidate the overall design? After all, most coherent designs have a gestalt that is more than the sum of the parts – at least, that’s the theory. Split-testing seems fundamentally at odds with that approach.
But I'm not done with the complaints yet. Optimizing also sounds bad for visionary thinking. That's why you hear so many people proclaim proudly that they never listen to customers. Customers can only tell you what they think they want, and they tend to have a very near-term perspective. If you just build what they tell you, you generally wind up with a giant, incoherent mess. Our job as entrepreneurs is to invent the future, and any optimization technique – including split-testing, many design techniques, or even usability testing – can lead us astray. Sure, customers think they want something, but how do they know what they will want in the future?
You can always tell who has a math background in a startup, because they call this the local maximum problem. Those of us with a computer science background call it the hill-climbing algorithm. I’m sure other disciplines have their own names for it; even protozoans exhibit this behavior (it's called taxis). It goes like this: whenever you’re not sure what to do, try something small, at random, and see if that makes things a little bit better. If it does, keep doing more of that, and if it doesn’t, try something else random and start over. Imagine climbing a hill this way; it’d work with your eyes closed. Just keep seeking higher and higher terrain, and rotate a bit whenever you feel yourself going down. But what if you’re climbing a hill that is in front of a mountain? When you get to the top of the hill, there’s no small step you can take that will get you on the right path up the mountain. That’s the local maximum. All optimization techniques get stuck in this position.
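For the programmers in the audience, here is a minimal Python sketch of that behavior. The terrain function is a toy of my own invention (a small hill in front of a taller mountain), not anything from a real product:

```python
import math
import random

# A toy 1-D "terrain": a small hill near x = 2 sitting in front of a much
# taller mountain near x = 8. (Hypothetical function, purely for illustration.)
def terrain(x):
    hill = 3 * math.exp(-(x - 2) ** 2)       # local maximum, height ~3
    mountain = 10 * math.exp(-(x - 8) ** 2)  # global maximum, height ~10
    return hill + mountain

def hill_climb(x, steps=5000, step_size=0.1):
    """Take small random steps; keep a step only if it makes things better."""
    for _ in range(steps):
        candidate = x + random.uniform(-step_size, step_size)
        if terrain(candidate) > terrain(x):
            x = candidate
    return x

# Starting near the small hill, the climber settles at x ~ 2 (height ~3) and
# never reaches the mountain at x ~ 8 (height ~10): a local maximum.
x_final = hill_climb(0.0)
print(round(x_final, 2), round(terrain(x_final), 2))
```

No amount of extra iterations fixes this; only a bigger jump (a pivot) can get you off the hill.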
Because this causes a lot of confusion, let me state this as unequivocally as I can. The Lean Startup methodology does not advocate using optimization techniques to make startup decisions. That’s right. You don’t have to listen to customers, you don’t have to split-test, and you are free to ignore any data you want. This isn’t kindergarten. You don’t get a gold star for listening to what customers say. You only get a gold star for achieving results.
What should you do instead? The general pattern is: have a strong vision, test that vision against reality, and then decide whether to pivot or persevere. Each part of that answer is complicated, and I’ve written extensively on the details of how to do each. What I want to convey here is how to respond to the objections I mentioned at the start. Each of those objections is wise, in its own way, and the common reaction – to just reject that thinking outright – is a bad idea. Instead, the Lean Startup offers ways to incorporate those people into an overall feedback loop of learning and discovery.
So when should we split-test? There's nothing wrong with using split-testing, as part of the solution team, to do optimization. But that is not a substitute for testing big hypotheses. The right split-tests to run are ones that put big ideas to the test. For example, we could split-test what color to make the "Register Now" button. But how much do we learn from that? Say customers prefer one color over another. Then what? Instead, how about a test where we completely change the value proposition on the landing page?
I remember the first time we changed the landing page at IMVU from offering “avatar chat” to “3D instant messaging.” We didn’t expect much of a difference, but it dramatically changed customer behavior. That was evident in the metrics and in the in-person usability tests. It taught us some important things about our customers: that they had no idea what an avatar was, they had no idea why they would want one, and they thought “avatar chat” was something weird people would do. When we started using “3D instant messaging,” we validated our hypothesis that IM was an activity our customers understood and were interested in “doing better.” But we also invalidated a hypothesis that customers wanted an avatar; we had to learn a whole new way of explaining the benefits of avatar-mediated communication because our audience didn’t know what that word meant.
However, that is not the end of the story. If you go to IMVU's website today, you won't find any mention of "3D instant messaging." That's because those hypotheses have since been replaced by newer ones, each of which was subjected to the same kind of macro-level testing. Over many years, we've learned a lot about what customers want. And we've validated that learning by demonstrating that when we change the product as a result of it, the key macro metrics improve.
A good rule of thumb for split-testing is that even when we're running micro-level split-tests, we should always measure the macro. So even if you want to test a new button color, don't measure the click-through rate on that button! Instead, ask yourself: "why do we care that customers click that button?" If it's a "Register Now" button, it's because we want customers to sign up and try the product. So let's measure the percentage of customers who go on to try the product. If the button color change doesn't have an impact there, it's too small and should be reverted. Over time, this discipline helps us ignore the minor stuff and focus our energies on learning what will make a significant impact. (It also just so happens that this style of reporting is easier to implement; you can read more here.)
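Here is a minimal sketch of what that reporting discipline might look like in code. The variants, event names, and data are hypothetical; the point is simply that each variant is scored on the downstream macro metric (did the visitor try the product?) rather than on button clicks:

```python
# A minimal sketch of "measure the macro": score each split-test variant on the
# share of visitors who went on to try the product, not on button clicks.
# The variants, event names, and data below are hypothetical.
events = [
    # (visitor_id, variant, event)
    (1, "blue_button", "visit"),
    (1, "blue_button", "click_register"),
    (1, "blue_button", "tried_product"),
    (2, "blue_button", "visit"),
    (2, "blue_button", "click_register"),
    (3, "green_button", "visit"),
    (3, "green_button", "tried_product"),
    (4, "green_button", "visit"),
]

def macro_conversion(events, macro_event="tried_product"):
    """Return, per variant, the fraction of visitors who reached the macro event."""
    visitors, converted = {}, {}
    for visitor, variant, event in events:
        if event == "visit":
            visitors.setdefault(variant, set()).add(visitor)
        elif event == macro_event:
            converted.setdefault(variant, set()).add(visitor)
    return {
        variant: len(converted.get(variant, set())) / len(ids)
        for variant, ids in visitors.items()
    }

# {'blue_button': 0.5, 'green_button': 0.5}: the color change moved clicks but
# not the macro metric, so by the rule of thumb above it should be reverted.
print(macro_conversion(events))
```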
Next, let's take on the sample-size issue. Most of us learn about sample sizes from things like political polling. In a large country, in order to figure out who will win an election with any kind of accuracy, you need to sample a large number of people. What most of us forget is that statistical significance is a function of both sample size and the magnitude of the underlying signal. Presidential elections are often decided by a few percentage points or less. When optimizing, product development teams encounter similarly small effects all the time. But when we're learning, small effects are the rare exception. Recall that the biggest source of waste in product development is building something nobody wants. To detect that, you don't need a very large sample.
Let me illustrate. I've previously documented that early on in IMVU's life, we made the mistake of building an IM add-on product instead of a standalone network. Believe me, I had to be dragged kicking and screaming to the realization that we'd made a mistake. Here's how it went down. We would bring customers in for a usability test and ask them to use the IM add-on functionality. The first one flat-out refused. I mean, here we are, paying them to be there, and they won't use the product! (For now, I won't go into the reasons why – if you want that level of detail, you can watch this interview.) I was the head of product development, so can you guess what my reaction was? It certainly wasn't "ooh, let's listen to this customer." Hell no; "fire that customer! Get me a new one" was closer. After all, what is a sample size of one customer? Too small. Second customer: same result. Third, fourth, fifth: same. Now, what are the odds that five customers in a row refuse to use my product and it's just a matter of chance or small sample size? Vanishingly small. The product sucks – and that is a statistically significant result.
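As a quick sanity check on that claim, here is the back-of-the-envelope version (the 50% figure is my own assumption, not data from IMVU): if the product were actually acceptable, so that a typical customer would agree to try it at least half the time, five refusals in a row would happen less than 5% of the time by chance alone:

```python
# Sign-test style back-of-the-envelope: assume (hypothetically) that if the
# product were fine, each customer would refuse with probability at most 0.5.
p_refuse = 0.5       # assumed worst-case refusal rate for an acceptable product
n_customers = 5      # five usability-test participants in a row refused

p_all_refuse = p_refuse ** n_customers
print(p_all_refuse)  # 0.03125 -> about 3%, below the conventional 5% threshold
```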
When we switch from an optimization mindset to a learning mindset, design gets more fun, too. It takes some getting used to for most designers, though. They are not generally used to having their designs evaluated by their real-world impact. Remember that plenty of design organizations and design schools give out awards for designing products that never get built. So don't hold it against a classically trained designer if they find split-testing a little off-putting at first. The key is to get new designers integrated into a split-testing regimen as soon as possible. It's a good deal: by testing to make sure (I often say "double-check") that each design actually improves customers' lives, startups can free designers to take much bigger risks. Want to try out a wacky, radical, highly simplified design? In a non-data-driven environment, this is usually impossible. There's always that engineer in the back of the room with all the corner cases: "but how will customers find Feature X? What happens if we don't explain in graphic detail how to use Feature Y?" Now these questions have an easy answer: we'll measure and see. If the new design performs worse than the current design, we'll iterate and try again. But if it performs better, we don't need to keep arguing. We just keep iterating and learning. This kind of setup leads to a much less political and much less arbitrary design culture.
This same approach can also lead us out of the big-incoherent-mess problem. Teams that focus on optimizing can get stuck bolting on feature after feature until the product becomes unusable. No one feature is to blame. I've made this mistake many times in my career, especially early on, when I first began to understand the power of metrics. When that happens, the solution is to do a whole product pivot. "Whole product" is a term I learned from Bill Davidow's classic Marketing High Technology. A whole product is one that works for mainstream customers. Sometimes a whole product is much bigger than a simple device: witness Apple's mastery of creating a whole ecosystem around each of its devices, which makes them much more useful than their competitors'. But sometimes a whole product is much less: it requires removing unnecessary features and focusing on a single overriding value proposition. These kinds of pivots are great opportunities for learning-style tests. It only requires the courage to test the beautiful new whole-product design against the old crufty one head-to-head.
By now, I hope you're already anticipating how to answer the visionary's objections. We don't split-test or talk to customers to decide if we should abandon our vision. Instead, we test to find out how to achieve the vision in the best possible way. Startup success requires getting many things right all at once: building a product that solves a customer problem, having that problem be an important one to a sufficient number of customers, having those customers be willing to pay for it (in one of the four customer currencies), being able to reach those customers through one of the fundamental growth strategies, and so on. When you read stories of successful startups in the popular and business press, you usually hear about how the founders anticipated several of these challenges in their initial vision. Unfortunately, startup success requires getting them all right. What the PR stories tend to leave out is that we can get attached to every part of our vision, even the dumb parts. Testing the parts simply gives us information that can help us refine the vision – like a sculptor removing just the right pieces of marble. There is tremendous art to knowing which pieces of the vision to test first. It is highly context-dependent, which is why different startups take dramatically different paths to success. Should you charge from day one, testing the revenue model first? Or should you focus on user engagement or virality? What about companies, like Siebel, that started with partner distribution first? There are no universally right answers to such questions. (For more on how to figure out which question applies in which context, see Business ecology and the four customer currencies.)
Systematically testing the assumptions that support the vision is called customer development, and it's a parallel process to product development. And therein lies the most common source of confusion about whether startups should listen to customers. Even if a startup is doing user-centered design, optimizing its product through split-testing, or conducting tons of surveys and usability tests, that's no substitute for also doing customer development. It's the difference between asking "how should we best solve this problem for these customers?" and "what problem should we be solving, and for which customers?" These two activities have to happen in parallel, forming a company-wide feedback loop. We call such companies built to learn. Their speed should be measured in validated learning about customers, not milestones, features, revenue, or even beautiful design. Again, not because those things aren't important, but because their role in a startup is subservient to the company's fundamental purpose: piercing the veil of extreme uncertainty that accompanies any disruptive innovation.
The Lean Startup methodology can’t guarantee you won’t find yourself in a local maximum. But it can guarantee that you’ll know about it when it happens. Even better, when it is time to pivot, you’ll have actual data that can help inform where you want to head next. The data doesn’t tell you what to do – that’s your job. The bad news: entrepreneurship requires judgment. The good news: when you make data-based decisions, you are training your judgment to get better over time.