
Case Study: UX, Design, and Food on the Table

(One of the common questions I hear is how to reconcile design and user experience (UX) methods with the Lean Startup. To answer, I asked one of my favorite designers to write a case study illustrating the way they work, taking us step-by-step through a real-life redesign.

This is something of an IMVU reunion. The attendees at sllconf 2010 were wowed by Food on the Table's presentation. If you weren't there, be sure to watch the video. Manuel Rosso was IMVU's first VP of Marketing, and is now CEO of Food on the Table, one of the leading lean startups in Austin. I first met Laura Klein when we had the good fortune of hiring her at IMVU to join our interaction design team. Since then, she's gone on to become one of the leading experts implementing UX and design in lean startups. 

In this case study, Laura takes us inside the design process in a real live startup. I hope you'll find it illuminating. -Eric)

A lot of people ask me whether design fits into the lean startup process. They're concerned that if they do any research or design up front, they will end up in a waterfall environment.

This is simply not true. Even the leanest of startups can benefit from design and user research. The following is a great example of how they can work together.

A couple of months ago, Manuel Rosso, the CEO of Food on the Table, came to me with a problem. He had a product with a great value proposition and thousands of passionate customers. That wasn't the problem. The problem was activation.

As a bit of background, Food on the Table helps people plan meals for their families around what is on sale in their local grocery stores. The team defined an activated user as someone who made it through all the steps of the first-time user experience: selecting a grocery store, indicating food preferences, picking recipes, and printing a grocery list.

Users who made it through activation loved the product, but too many first-time users were getting lost and never getting all the way to the end.

Identifying The Problem

More than any startup I've worked with, Food on the Table embraces the lean startup methodology. They release early and often. They get tons of feedback from their users. And, most importantly, they measure and A/B test absolutely everything.

Because of their dedication to metrics, they knew all the details of their registration funnel and subsequent user journey. This meant that they knew exactly how many people weren't finishing activation, and they knew that number was higher than they wanted.

Unfortunately, they fell into a trap that far too many startups fall into at some point: they tried to measure their way out of the problem. They would look at a metric, spot a problem, come up with an idea for how to fix it, release a change, and test it. But the needle wasn't moving.

After a couple of months, Manuel had a realization. The team had always been dedicated to listening to users. But as they added new features, their conversations with users had changed - they became more narrowly focused on new features and whether each individual change was usable and useful. Somewhere along the way, they'd stopped observing the entire user experience, from end to end. This didn't last very long - maybe a month or two, but it was long enough to cause problems.

As soon as he realized what had happened, Manuel went back to talking directly to users about their overall experiences rather than just doing targeted usability tests, and within a few hours he knew what had gone wrong. Even though the new features were great in isolation, they were making the overall interface too complicated. New users were simply getting lost on their way to activation.

Now that they knew generally why they were having the problem, Manuel decided he needed a designer to identify the exact pain points and come up with a way to simplify the interface without losing any of the features.

Key Takeaways:
  • Don't try to measure your way out of a problem. Metrics do a great job of telling you what your problem is, but only listening to and observing your users can tell you why they're having trouble.
  • When you're moving fast enough, a product can become confusing in a surprisingly short amount of time. Make sure you're regularly observing the user experience.
  • Adding a new feature can be useful, but it can also clutter up an interface. Good design helps you offer more functionality with less complexity.

Getting an Overview of the Product

When I first came on board, the team had several different experiments going, including a couple of different competing flows. I needed to get a quick overview of the entire user experience in order to understand what was working and what wasn't.

Of course, the best way to do that is to watch new and current customers use the product. In the old days, I would have recruited test participants, brought them into an office, and run usability sessions. It would have taken a couple of weeks.

Not anymore! I scheduled UserTesting.com sessions, making sure that I got participants in all the main branches of the experiments. Within a few hours, I had a dozen 15-minute videos of people using the product. The entire process, including analysis, took about one full day.

Meanwhile, we set up several remote sessions with current users and used GoToMeeting to run fast observational sessions in order to understand the experience of active users. That took another day.

Key Takeaway: Get feedback fast. Online tools like GoToMeeting and UserTesting.com (and about a hundred others) can help you understand the real user experience quickly and cheaply.

Low Hanging Fruit

Once we had a good idea of the major pain points, we decided to split the design changes into two parts: fixing low-hanging fruit and making larger, structural changes to the flow. Obviously, we weren't going to let engineering sit on their hands while we made major design changes.

The most important reason to do this was that some of the biggest problems for users were easy to fix technically and could be accomplished with almost no design input whatsoever.

For example, in one unsuccessful branch of a test, users saw a button that would allow them to add a recipe to a meal plan. When user test participants within the office pressed the button, it would very quickly add the recipe to the meal plan, and users had no problem understanding it. When we observed users pressing the button on their own computers with normal home broadband connections, the button took a few seconds to register the click.

Of course, this meant that users would click the button over and over, since they were getting no feedback. When the script returned, the user would often have added the recipe to their meal plan several times, which wasn't what they meant to do.

This was, by all accounts, a bad user experience. Why wasn't it caught earlier?

Well, as is the case with all software companies, the computers and bandwidth in the office were much better than the typical user's setup, so nobody saw the problem until we watched actual users in their natural environments.

What was the fix? We put in a "wait" spinner and disabled the button while the script was processing. It took literally minutes to implement and delivered a statistically significant improvement in the performance of that branch of the experiment.

(Image: Giving immediate feedback drastically reduced user error.)

Manuel told me that, immediately after that experience, the team added a very old, slow computer to the office and recently caught a nasty problem that could add 40 seconds to page load times. Needless to say, all usability testing within the office is now done on the slowest machine.

Key Takeaways:
  • Sometimes big user problems don't require big solutions.
  • To truly understand what your user is experiencing, you have to understand the user's environment.
  • Sometimes an entire branch of an experiment can be killed by one tiny bug. If your metrics are surprising, do some qualitative research to figure out why!

A Redesign

While the engineering team worked on the low-hanging fruit, we started the redesign. But we didn't just chuck everything out. We started from the current design and iterated. We identified a few critical areas that were making the experience confusing and fixed those.

For example, we started with the observation that people were doing ok for the first couple of screens, but then they were getting confused about what they were supposed to do next. A simple "Step" counter at the top of each page and very clear, obvious "Next" and "Back" buttons told users where they were and what they should do next.

Users also claimed to want more freedom to select their recipes, but they were quickly overwhelmed by the enormous number of options, so we put in a simple and engaging way to select from recommended recipes while still allowing users to access the full collection with the click of one button.

(Image: Users were confused by how to change their meal plan.)

(Image: Recommended recipe carousels made choosing a meal plan fun and easy to understand.)

One common problem was that users asked for a couple of features that were actually already in the product. The features themselves were very useful and well-designed; they just weren't discoverable enough. By changing the location of these features, we made them more obvious to people.

Most importantly, we didn't just jump to Photoshop mockups of the design. Instead, we created several early sketches before moving to interactive wireframes, which we tested and iterated on with current users. In this case, I created the interactive wireframes in HTML and JavaScript. While they were all grayscale with no visual design, they worked. Users could perform the most important actions in them, like stepping through the application, adding meals to their meal plan, and editing recipes. This made participants feel like they were using an actual product so that they could comment not just on the look and feel but on the actual interactions.

By the end of the iterations and tests, every single one of the users liked the new version better than the old, and we had a very good idea why.

Did we make it perfect? No. Perfection takes an awful lot of time and too often fails to be perfect for the intended users.

Instead, we identified several areas we'd like to optimize and iterate on going forward. But we also decided that it was better to release a very good version and continue improving it, rather than aim for absolute perfection and never get it out the door.

The redesign removed all of the major pain points that we'd identified in the testing and created a much simpler, more engaging interface that would allow the team to add features going forward. It improved the user experience and set the stage for lots more iteration and experimentation in the future. In fact, the team currently has several more exciting experiments running!

Key Takeaways:
  • Interactive prototypes and iterative testing let you improve the design quickly before you ever get to the coding stage.
  • Targeting only the confusing parts of the interface for redesign reduces the number of things you need to rebuild and helps make both design and development faster.
  • Lean design is about improving the user experience iteratively! Fixing the biggest user problems first means getting an improved experience to users quickly and optimizing later based on feedback and metrics.

The Metrics

Like any good lean startup, we released the new design in an A/B test with new users. We had a feeling it would be better, but we needed to know whether we were right. We also wanted to make sure there weren't any small problems we'd overlooked that might have big consequences.

After running for about 6 weeks, with a few thousand new users in the test, we had our statistically significant answer: a 77% increase in the number of new users who made it all the way through activation.

My entire involvement with the project - research, design, and usability testing - came to just under 90 hours, spread over about 6 weeks.

Key Takeaway: Design - even major redesigns - can be part of an agile, lean startup environment, if done in an efficient way with a lot of iteration and customer involvement.



Laura Klein has been working in Silicon Valley as both an engineer and a UX professional for the last 15 years. She currently consults with lean startups to help them make their products easier to use. She frequently blogs about design, usability, metrics, and product management at Users Know. You can follow her on Twitter at @lauraklein.


Learning is better than optimization (the local maximum problem)

Lean startups don’t optimize. At least, not in the traditional sense of trying to squeeze every tenth of a point out of a conversion metric or landing page. Instead, we try to accelerate with respect to validated learning about customers.

For example, I’m a big believer in split-testing. Many optimizers are in favor of split-testing, too: direct marketers, landing page and SEO experts -- heck even the Google Website Optimizer team. But our interest in the tactic of split-testing is only superficially similar.

Take the infamous “41 shades of blue” split-test. I understand and respect why optimizers want to do tests like that. There are often counter-intuitive changes in customer behavior that depend on little details. In fact, the curse of product development is that sometimes small things make a huge difference and sometimes huge things make no difference. Split-testing is great for figuring out which is which.

But what do you learn from the “41 shades of blue” test? You only learn which specific shade of blue customers are more likely to click on. And, in most such tests, the differences are quite small, which is why sample sizes have to be very large. In Google’s case, often in the millions of people. When people (ok, engineers) who have been trained in this model enter most startups, they quickly get confused. How can we do split-testing when we have only a pathetically small number of customers? What’s the point when the tests aren’t going to be statistically significant?

And they’re not the only ones. Some designers also hate optimizing (which is why the “41 shades of blue” test is so famous – a famous designer claims to have quit over it). I understand and respect that feeling, too. After you’ve spent months on a painstaking new design, who wants to be told what color blue to use? Split-testing a single element in an overall coherent design seems ludicrous. Even if it shows improvement in some micro metric, does that invalidate the overall design? After all, most coherent designs have a gestalt that is more than the sum of the parts – at least, that’s the theory. Split-testing seems fundamentally at odds with that approach.

But I’m not done with the complaints, yet. Optimizing sounds bad for visionary thinking. That’s why you hear so many people proclaim proudly that they never listen to customers. Customers can only tell you what they think they want, and tend to have a very near-term perspective. If you just build what they tell you, you generally wind up with a giant, incoherent mess. Our job as entrepreneurs is to invent the future, and any optimization technique – including split-testing, many design techniques, or even usability testing – can lead us astray. Sure, customers think they want something, but how do they know what they will want in the future?

You can always tell who has a math background in a startup, because they call this the local maximum problem. Those of us with a computer science background call it the hill-climbing algorithm. I’m sure other disciplines have their own names for it; even protozoans exhibit this behavior (it's called taxis). It goes like this: whenever you’re not sure what to do, try something small, at random, and see if that makes things a little bit better. If it does, keep doing more of that, and if it doesn’t, try something else random and start over. Imagine climbing a hill this way; it’d work with your eyes closed. Just keep seeking higher and higher terrain, and rotate a bit whenever you feel yourself going down. But what if you’re climbing a hill that is in front of a mountain? When you get to the top of the hill, there’s no small step you can take that will get you on the right path up the mountain. That’s the local maximum. All optimization techniques get stuck in this position.
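
To make the analogy concrete, here is a toy sketch of hill-climbing getting stuck (in PHP, to match the code samples later in this collection; the terrain function and step size are invented purely for illustration):

// Toy "terrain": a small hill at x=2 (height ~5) in front of a mountain at x=10 (height ~50).
function height($x) {
    return 5 * exp(-pow($x - 2, 2)) + 50 * exp(-pow($x - 10, 2) / 4);
}

$x = 0.0;     // start at the bottom
$step = 0.1;  // only small moves are allowed
for ($i = 0; $i < 1000; $i++) {
    if (height($x + $step) > height($x)) {
        $x += $step;   // a small step up: keep going
    } elseif (height($x - $step) > height($x)) {
        $x -= $step;
    } else {
        break;         // no small step improves things: a local maximum
    }
}
echo "Stopped at x=$x, height=" . height($x) . "\n";
// Stops at roughly x=2 (height ~5) and never discovers the mountain at x=10.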

Because this causes a lot of confusion, let me state this as unequivocally as I can. The Lean Startup methodology does not advocate using optimization techniques to make startup decisions. That’s right. You don’t have to listen to customers, you don’t have to split-test, and you are free to ignore any data you want. This isn’t kindergarten. You don’t get a gold star for listening to what customers say. You only get a gold star for achieving results.

What should you do instead? The general pattern is: have a strong vision, test that vision against reality, and then decide whether to pivot or persevere. Each part of that answer is complicated, and I’ve written extensively on the details of how to do each. What I want to convey here is how to respond to the objections I mentioned at the start. Each of those objections is wise, in its own way, and the common reaction – to just reject that thinking outright – is a bad idea. Instead, the Lean Startup offers ways to incorporate those people into an overall feedback loop of learning and discovery.

So when should we split-test? There’s nothing wrong with using split-testing, as part of the solution team, to do optimization. But that is not a substitute for testing big hypotheses. The right split-tests to run are ones that put big ideas to the test. For example, we could split-test what color to make the “Register Now” button. But how much do we learn from that? Suppose we learn that customers prefer one color over another. Then what? Instead, how about a test where we completely change the value proposition on the landing page?

I remember the first time we changed the landing page at IMVU from offering “avatar chat” to “3D instant messaging.” We didn’t expect much of a difference, but it dramatically changed customer behavior. That was evident in the metrics and in the in-person usability tests. It taught us some important things about our customers: that they had no idea what an avatar was, they had no idea why they would want one, and they thought “avatar chat” was something weird people would do. When we started using “3D instant messaging,” we validated our hypothesis that IM was an activity our customers understood and were interested in “doing better.” But we also invalidated a hypothesis that customers wanted an avatar; we had to learn a whole new way of explaining the benefits of avatar-mediated communication because our audience didn’t know what that word meant.

However, that is not the end of the story. If you go to IMVU’s website today, you won’t find any mention of “3D instant messaging.” That’s because those hypotheses were replaced by yet more, each of which was subject to this kind of macro-level testing. Over many years, we’ve learned a lot about what customers want. And we’ve validated that learning by being able to demonstrate that when we change the product as a result of that learning, the key macro metrics improve.

A good rule of thumb for split-testing is that even when we’re doing micro-level split-tests, we should always measure the macro. So even if you want to test a new button color, don’t measure the click-through rate on that button! Instead, ask yourself: “why do we care that customers click that button?” If it’s a “Register Now” button, it’s because we want customers to sign up and try the product. So let’s measure the percentage of customers who try the product. If the button color change doesn’t have an impact there, it’s too small and should be reverted. Over time, this discipline helps us ignore the minor stuff and focus our energies on learning what will make a significant impact. (It also just so happens that this style of reporting is easier to implement; you can read more here.)
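
As a rough sketch of what that discipline looks like in code (the hypothesis names and counts below are invented for illustration), the report for a button-color test compares the macro conversion rate per hypothesis, not the button’s click-through rate:

// Illustrative counts only: visitors assigned to each hypothesis, and how many
// of them reached the macro outcome we actually care about (trying the product).
$visitors = array("control" => 5000, "green_button" => 5000);
$tried    = array("control" => 400,  "green_button" => 410);

foreach ($visitors as $hypothesis => $total) {
    printf("%s: %.1f%% tried the product\n", $hypothesis, 100 * $tried[$hypothesis] / $total);
}
// control:      8.0% tried the product
// green_button: 8.2% tried the product
// A difference this small on this sample is indistinguishable from noise, so the
// button change gets reverted and we move on to experiments that might matter.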

Next, let’s take on the sample-size issue. Most of us learn about sample sizes from things like political polling. In a large country, in order to figure out who will win an election with any kind of accuracy, you need to sample a large number of people. What most of us forget is that statistical significance is a function of both sample size and the magnitude of the underlying signal. Presidential elections are often decided by a few percentage points or less. Product development teams encounter similar situations when they’re optimizing. But when we’re learning, that’s the rare exception. Recall that the biggest source of waste in product development is building something nobody wants. In that case, you don’t need a very large sample.

Let me illustrate. I’ve previously documented that early on in IMVU’s life, we made the mistake of building an IM add-on product instead of a standalone network. Believe me, I had to be dragged kicking and screaming to the realization that we’d made a mistake. Here’s how it went down. We would bring customers in for a usability test, and ask them to use the IM add-on functionality. The first one flat-out refused. I mean, here we are, paying them to be there, and they won’t use the product! (For now, I won’t go into the reasons why – if you want that level of detail, you can watch this interview.) I was the head of product development, so can you guess what my reaction was? It certainly wasn’t “ooh, let’s listen to this customer.” Hell no, “fire that customer! Get me a new one” was closer. After all, what is a sample size of one customer? Too small. Second customer: same result. Third, fourth, fifth: same. Now, what are the odds that five customers in a row refuse to use my product, and it’s just a matter of chance or small sample size? No chance. The product sucks – and that is a statistically significant result.
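
To put rough numbers on that intuition (the 50% figure is only an assumption for illustration): suppose a typical target customer had even a 50/50 chance of agreeing to try the product. The probability that five independent customers in a row would all refuse purely by chance is 0.5^5 = 1/32, or about 3%. If the true acceptance rate were any higher, the odds would be even longer. Five refusals out of five is a tiny sample, but it is a loud signal.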

When we switch from an optimization mindset to a learning mindset, design gets more fun, too. It takes some getting used to for most designers, though. They are not generally used to having their designs evaluated by their real-world impact. Remember that plenty of design organizations and design schools give out awards for designing products that never get built. So don’t hold it against a classically trained designer if they find split-testing a little off-putting at first. The key is to get new designers integrated with a split-testing regimen as soon as possible. It’s a good deal: by testing to make sure (I often say “double check”) each design actually improves customers’ lives, startups can free designers to take much bigger risks. Want to try out a wacky, radical, highly simplified design? In a non-data-driven environment, this is usually impossible. There’s always that engineer in the back of the room with all the corner cases: “but how will customers find Feature X? What happens if we don’t explain in graphic detail how to use Feature Y?” Now these questions have an easy answer: we’ll measure and see. If the new design performs worse than the current design, we’ll iterate and try again. But if it performs better, we don’t need to keep arguing. We just keep iterating and learning. This kind of setup leads to a much less political and much less arbitrary design culture.

This same approach can also lead us out of the big incoherent mess problem. Teams that focus on optimizing can get stuck bolting on feature upon feature until the product becomes unusable. No one feature is to blame. I've made this mistake many times in my career, especially early on when I first began to understand the power of metrics. When that happens, the solution is to do a whole product pivot. "Whole product" is a term I learned from Bill Davidow's classic Marketing High Technology. A whole product is one that works for mainstream customers. Sometimes, a whole product is much bigger than a simple device - witness Apple's mastery of creating a whole ecosystem around each of their devices that makes them much more useful than their competitors'. But sometimes a whole product is much less - it requires removing unnecessary features and focusing on a single overriding value proposition. And these kinds of pivots are great opportunities for learning-style tests. It only requires the courage to test the new beautiful whole product design against the old crufty one head-to-head.

By now, I hope you’re already anticipating how to answer the visionary’s objections. We don’t split-test or talk to customers to decide if we should abandon our vision. Instead, we test to find out how to achieve the vision in the best possible way. Startup success requires getting many things right all at once: building a product that solves a customer problem, having that problem be an important one to a sufficient number of customers, having those customers be willing to pay for it (in one of the four customer currencies), being able to reach those customers through one of the fundamental growth strategies, etc. When you read stories of successful startups in the popular and business press, you usually hear about how the founders anticipated several of these challenges in their initial vision. Unfortunately, startup success requires getting them all right. What the PR stories tend to leave out is that we can get attached to every part of our vision, even the dumb parts. Testing the parts simply gives us information that can help us refine the vision – like a sculptor removing just the right pieces of marble. There is tremendous art to knowing which pieces of the vision to test first. It is highly context-dependent, which is why different startups take dramatically different paths to success. Should you charge from day one, testing the revenue model first? Or should you focus on user engagement or virality? What about companies, like Siebel, that started with partner distribution first? There are no universally right answers to such questions. (For more on how to figure out which question applies in which context, see Business ecology and the four customer currencies.)

Systematically testing the assumptions that support the vision is called customer development, and it’s a parallel process to product development. And therein lies the most common source of confusion about whether startups should listen to customers. Even if a startup is doing user-centered design, or optimizing their product through split-testing, or conducting tons of surveys and usability tests, that’s no substitute for also doing customer development. It’s the difference between asking “how should we best solve this problem for these customers?” and “what problem should we be solving? and for which customer?” These two activities have to happen in parallel, forming a company-wide feedback loop. We call such companies built to learn. Their speed should be measured in validated learning about customers, not milestones, features, revenue, or even beautiful design. Again, not because those things aren’t important, but because their role in a startup is subservient to the company’s fundamental purpose: piercing the veil of extreme uncertainty that accompanies any disruptive innovation.

The Lean Startup methodology can’t guarantee you won’t find yourself in a local maximum. But it can guarantee that you’ll know about it when it happens. Even better, when it is time to pivot, you’ll have actual data that can help inform where you want to head next. The data doesn’t tell you what to do – that’s your job. The bad news: entrepreneurship requires judgment. The good news: when you make data-based decisions, you are training your judgment to get better over time.


Innovation inside the box

I was recently privy to a product prioritization meeting in a relatively large company. It was fascinating. The team spent an hour trying to decide on a new pricing strategy for their main product line. One of the divisions, responsible for the company’s large accounts, was requesting data about a recent experiment that had been conducted by another division. They were upset because this other team had changed the prices for small accounts to make the product more affordable. The larger-account division wanted to move the pricing in just the other direction – making the low-end products more expensive, so their large customers would have an increased incentive to upgrade.

Almost the entire meeting was taken up with interpreting data. The problem was that nobody could quite agree what the data meant. Many custom reports had been created for this meeting, and the data warehouse team was in the meeting, too. The more they were asked to explain the details of each row on the spreadsheet, the more evident it became that nobody understood how those numbers had been derived.

Worse, nobody was quite sure exactly which customers had been exposed to the experiment. Different teams had been responsible for implementing different parts of it, and so different parts of the product had been updated at different times. The whole process had dragged on for so long that, by now, the people who had originally conceived the experiment were in a separate division from the people who had executed it.

Listening in, I assumed this would be the end of the meeting. With no agreed-upon facts to help make the decision, I assumed nobody would have any basis for making the case for any particular action. Boy, was I wrong. The meeting was just getting started. Each team simply took whatever interpretation of the data supported their position best, and started advocating. Other teams would chime in with alternate interpretations that supported their positions, and so on. In the end, decisions were made – but not based on any actual data. Instead, the executive running the meeting was forced to make decisions based on the best arguments.

The funny thing to me was how much of the meeting had been spent debating the data, when in the end, the arguments that carried the day could have been made right at the start of the meeting. It was as if each advocate sensed that they were about to be ambushed; if another team had managed to bring clarity to the situation, that might have benefited them – so the rational response was to obfuscate as much as possible. What a waste.

Ironically, meetings like this had given data and experimentation a bad name inside this company. And who can blame them? The data warehousing team was producing classic waste – reports that nobody read (or understood). The project teams felt these experiments were a waste of time, since they involved building features halfway, which meant they were never quite any good. And since nobody could agree on each outcome, it seemed like “running an experiment” was just code for postponing a hard decision. Worst of all, the executive team was getting chronic headaches. Their old product prioritization meetings may have been a battle of opinions, but at least they understood what was going on. Now they first had to go through a ritual that involved complex math, reached no definite outcome, and then proceeded to have a battle of opinions anyway!

When a company gets wedged like this, the solution is often surprisingly simple. In fact, I call this class of solutions “too simple to possibly work” because the people inside the situation can’t conceive that their complex problem could have a simple solution. When I’m asked to work with companies like this as a consultant, 99% of my job is to find a way to get the team to get started with a simple – but correct – solution.

Here was my prescription for this situation. I asked the team to consider creating what I call a sandbox for experimentation. The sandbox is an area of the product where the following rules are strictly enforced:
  1. Any team can create a true split-test experiment that affects only the sandboxed parts of the product, however:
  2. One team must see the whole experiment through end-to-end.
  3. No experiment can run longer than a specified amount of time (usually a few weeks).
  4. No experiment can affect more than a specified number of customers (usually expressed as a % of total).
  5. Every experiment has to be evaluated based on a single standard report of 5-10 (no more) key metrics.
  6. Any team that creates an experiment must monitor the metrics and customer reactions (support calls, forum threads, etc) while the experiment is in-progress, and abort if something catastrophic happens.
Putting a system like this in place is relatively easy, especially for any kind of online service. I advocate starting small; usually, the parts of the product that start inside the sandbox are low-effort, high-impact aspects like pricing, initial landing pages, or registration flows. These may not sound very exciting, but because they control the product’s positioning for new customers, they often allow minor changes to have a big impact.
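
To make that concrete, here is a minimal sketch of how the sandbox rules might be encoded and enforced. The field names and thresholds below are invented for illustration, not drawn from any particular product:

// Hypothetical sandbox configuration, checked by the experiment framework.
$sandbox = array(
    "areas"             => array("pricing", "landing_pages", "registration_flow"),
    "max_duration_days" => 21,   // no experiment runs longer than a few weeks
    "max_customer_pct"  => 5,    // no experiment touches more than 5% of customers
    "standard_metrics"  => array("registered", "activated", "purchased",
                                 "retained_30d", "support_contacts"),
);

function can_launch($experiment, $sandbox) {
    return in_array($experiment["area"], $sandbox["areas"])
        && $experiment["duration_days"] <= $sandbox["max_duration_days"]
        && $experiment["customer_pct"]  <= $sandbox["max_customer_pct"]
        && !empty($experiment["owner_team"]);   // one team sees it through end-to-end
}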

Over time, additional parts of the product can be added to the sandbox, until eventually it becomes routine for the company to conduct these rigorous split-tests for even very large new features. But that’s getting ahead of ourselves. The benefits of this approach are manifest immediately. Right from the beginning, the sandbox achieves three key goals simultaneously:

  1. It forces teams to work cross-functionally. The first few changes, like a price change, may not require a lot of engineering effort. But they require coordination across departments – engineering, marketing, and customer service. Teams that work this way are more productive, as long as productivity is measured by their ability to create customer value (and not just stay busy).
       
  2. Everyone understands the results. True split-test experiments are easy to classify as successes or failures, because top-level metrics either move or they don’t. Either way, the team learns immediately whether their assumptions about how customers would behave were correct. By using the same metrics each time, the team builds literacy across the whole company about those key metrics.
     
  3. It promotes rapid iteration. When people have a chance to see a project through end-to-end, the work is done in small batches, and a clear verdict is delivered quickly, they benefit from the power of feedback. Each time they fail to move the numbers, they have a real opportunity for introspection and, even more importantly, a chance to act on their findings immediately. Thus, these teams tend to converge on optimal solutions rapidly, even if they start out with really bad ideas.
Putting it all together, let me illustrate with an example from another company. This team had been working for many months in a standard agile configuration: a disciplined engineering team taking direction from a product owner who would prioritize the features they should work on. The team was adept at responding to changes in direction from the product owner, and always delivered quality code.

But there was a problem. The team rarely received any feedback about whether the features they were building actually mattered to customers. Whatever learning took place happened with the product owner; the rest of the team was just heads-down implementing features.

This led to a tremendous amount of waste, of the worst kind: building features nobody wants. We discovered this reality when the team started working inside a sandbox like the one I described above.

When new customers would try this product, they weren't required to register at first. They could simply come to the website and start using it. Only after they started to have some success would the system prompt them to register – and after that, start to offer them premium features to pay for. It was a slick example of lazy registration and a freemium model. The underlying assumption was that making it seamless for customers to ease into the product was optimal. In order to support that assumption, the team had written a lot of very clever code to create this “tri-mode” experience (every part of the product had to treat guests, registered users and paying users somewhat differently).

One day, the team decided to put that assumption to the test. The experiment was easy to build (although hard to decide to do): simply remove the “guest” experience, and make everyone register right at the start.  To their surprise, the metrics didn’t move at all. Customers who were given the guest experience were not any more likely to register, and they were actually less likely to pay. In other words, all that tri-mode code was complete waste.

By discovering this unpleasant fact, the team had an opportunity to learn. They discovered, as is true of many freemium and lazy registration systems, that easy is not always optimal. When registration is too easy, customers can get confused about what they are registering for. (This is similar to the problem that viral loop companies have with the engagement loop: by making it too easy to join, they actually give away the positioning that allows for longer-term engagement.) More importantly, the experience led to some soul-searching. Why was a team this smart, this disciplined, and this committed to waste-free product development creating so much waste?

That’s the power of the sandbox approach.


Getting started with split-testing

One of the startup founders I work with asked me a smart question recently, and I thought I'd share it. Unlike most of the people who've endured my one-line split-testing talk, this team has taken it to heart. They're getting started creating their first A/B tests, and asked "Should we split-test EVERYTHING?" In other words, how do you know what to split-test, and what to just ship as-is? After all, isn't it a form of waste to split-test something like SSL certs that you know you have to do?

I love questions like this, because there is absolutely no right answer. Split-testing, like almost everything in a lean startup, requires judgment. It's an art, not a science. When I was just starting out with practices like split-testing, I too sought out hard and fast rules. (You can read about some of the pitfalls I ran into with split-testing in "When NOT to listen to your users; when NOT to rely on split-tests"). That said, I think it's important to get your split-testing initiative off to a good start, and that means being selective about what features you tackle with it.

The goal in the early days of split-testing is to produce unequivocal results. It's a form of waste to generate reports that people don't understand, and it impedes learning if there is a lot of disagreement about the facts. As you get better at it, you can start to apply it to pretty much everything.

In the meantime, here are three guidelines to get you started:
  1. Start simple. Teams that have been split-testing for a long time get pretty good at constructing tests for complex features, but this is a learned skill. In the short term, tackling something too complex is more likely to lead to a lot of wasted time arguing about the validity of the test. A good place to start is to try moving UI elements around. My favorite is to rearrange the steps of your registration process for new customers. That almost always has an effect, and is usually pretty easy to change.

  2. Make a firm prediction. Later, you'll want to use split-tests as exploratory probes, trying things you truly don't understand well. Don't start there. It's too easy to engage in after-the-fact rationalization when you don't go in with a strong opinion about what's going to happen. Split-testing is most powerful when it causes you to make your assumptions explicit, and then challenge them. So, before you launch the test, write down your belief about what's going to happen. Try to be specific; it's OK to be wrong. This can work for a change you are sure is going to have no effect, too, like changing the color of a button or some minor wording. Either way, make sure you can be wrong. If you're a founder or top-level executive, have the courage to be wrong in a very public way. You send the signal that learning always trumps opinion, even for you.

  3. Don't give up. If the first test shows that your feature has no effect, avoid these two common extreme reactions: abandoning the feature altogether, or abandoning split-testing forever. Set the expectation ahead of time that you will probably have to iterate the feature a few times before you know if it's any good. Use the data you collect from each test to affect the next iteration. If you don't get any effect after a few tries, then you can safely conclude that you're not on the right track.

Most importantly, have fun with split-testing. Each experiment is like a little mystery, and if you can get into a mindset of open-mindedness about the answer, the answers will continually surprise and amaze you. Once you get the hang of it, I promise you'll learn a great deal.


When NOT to listen to your users; when NOT to rely on split-tests

There are three legs to the lean startup concept: agile product development, low-cost (fast to market) platforms, and rapid-iteration customer development. When I have the opportunity to meet startups, they usually have one of these aspects down, and need help with one or two of the others. The most common need is becoming more customer-centric. They need to incorporate customer feedback into the product development and business planning process. I usually recommend two things: try to get the whole team to start talking to customers ("just go meet a few") and get them to use split-testing in their feature release process ("try it, you'll like it").

However, that can't be the end of the story. If all we do is mechanically embrace these tactics, we can wind up with a disaster. Here are two specific ways it can go horribly wrong. Both are related to a common brain defect we engineers and entrepreneurs seem to be especially prone to. I call it "if some is good, more is better" and it can cause us to swing wildly from one extreme of belief to another.

What's needed is a disciplined methodology for understanding the needs of customers and how they combine to form a viable business model. In this post, I'll discuss two particular examples, but for a full treatment, I recommend Steve Blank's The Four Steps to the Epiphany.




Let's start with the "do whatever customers say, no matter what" problem. I'll borrow this example from randomwalker's journal - Lessons from the failure of Livejournal: when NOT to listen to your users.
"The opportunity was just mind-bogglingly huge. But none of that happened. The site hung on to its design philosophy of being an island cut off from the rest of the Web, and paid the price. ... The site is now a sad footnote in the history of Social Networking Services. How did they do it? By listening to their users."
randomwalker identifies four specific ways in which LJ's listening caused them problems, and they are all variations on a theme: listening to the wrong users. The early adopters of LiveJournal didn't want to see the site become mainstream, and the team didn't find a way to stand up for their business or vision.

I remember having this problem when I first got the "listening to customers" religion. I felt we should just talk to as many customers as possible, and do whatever they say. But that is a bad idea. It confuses the tactic, which is listening, with the strategy, which is learning. Talking to customers is important because it helps us deal in facts about the world as it is today. If we're going to build a product, we need to have a sense of who will use it. If we're going to change a feature, we need to know how our existing customers will react. If we're working on positioning for our product, we need to know what is in the mind of our prospects today.

If your team is struggling with customer feedback, you may find this mantra helpful. Seek out a synthesis that incorporates both the feedback you are hearing and your own vision. Any path that leaves out one aspect or the other is probably wrong. Have faith that this synthesis is greater than the sum of its parts. If you can't find a synthesis position that works for your customers and for your business, it either means you're not trying hard enough or your business is in trouble. Figure out which one it is, have a heart-to-heart with your team, and make some serious changes.




Especially for us introverted engineering types, there is one major drawback to talking to customers: it's messy. Customers are living, breathing, complex people, with their own drama and issues. When they talk to you, it can be overwhelming to sort through all that irrelevant data to capture the nuggets of wisdom that are key to learning. In a perfect world, we'd all have the courage and stamina to persevere, and implement a complete Ideas-Code-Data rapid learning loop. But in reality, we sometimes fall back on inadequate shortcuts. One of those is an over-emphasis on split-testing.

Split-testing provides objective facts about our product and customers, and this has strong appeal to the science-oriented among us. But the thing to remember about split-testing is that it is always retrospective - it can only give you facts about the past. Split-testing is completely useless in telling you what to do next. Now, to make good decisions, it's helpful to have historical data about what has and hasn't worked in the past. If you take it too far, though, you can lose the creative spark that is also key to learning.

For example, I have often fallen into the trap of wanting to optimize the heck out of one single variable in our business. One time, I became completely enamored with Influence: The Psychology of Persuasion (which is a great book, but that's for another post). I managed to convince myself that the solution to all of our company's problems was contained in that book, and that if we just faithfully executed a marketing campaign around the principles therein, we'd solve everything. I convinced a team to give this a try, and they tried dozens of split-test experiments, each around a different principle or combination of principles. We tried and tried to boost our conversion numbers, each time analyzing what worked and what didn't, and iterating. We were excited by each new discovery, and with each iteration we managed to move the conversion needle a little bit more. Here was the problem: the total impact we were having was minuscule. It turns out that we were not really addressing the core problem (which had nothing to do with persuasion). So although we felt we were making progress, and even though we were moving numbers on a spreadsheet, it was all for nothing. Only when someone hit me over the head and said "this isn't working, let's try a radically new direction" did I realize what had happened. We'd forgotten to use all the tools in our toolbox, and lost sight of our overarching goal.

It's important to be open to hearing new ideas, especially when the ideas you're working on are split-testing poorly. That's not to say you should give up right away, but always take a moment to step back and ask yourself if your current path is making progress. It might be time to reshuffle the deck and try again.

Just don't forget to subject the radical new idea to split-testing too. It might be even worse than what you're doing right now.




So, both split-testing and customer feedback have their drawbacks. What can you do about it? There are a few ideas I have found generally helpful:
  • Identify where the "learning block" is. For example, think of the phases of the synthesis framework: collecting feedback, processing and understanding it, choosing a new course of action. If you're not getting the results you want, probably it's because one of those phases is blocked. For example, I've had the opportunity to work with a brilliant product person who had an incredible talent at rationalization. Once he got the "customer feedback" religion, I noticed this pattern: "Guys! I've just conducted three customer focus groups, and, incredibly, the customers really want us to build the feature I've been telling you about for a month." No matter what the input, he'd come around to the same conclusion as before.

    Or maybe you have someone on your team that's just not processing: "Customers say they want X, so that's what we're building." Each new customer that walks in the door wants a different X, so we keep changing direction.

    Or consider my favorite of all: the "we have no choice but to stay the course" pessimist. For this person, there's always some reason why what we're learning about customers can't help. We're doomed! For example, we simply cannot make the changes we need because we've already promised something to partners. Or the press. Or to some passionate customers. Or to our team. Whoever it is, we just can't go back on our promise, it'd be too painful. So we have to roll the dice with what we're working on now, even if we all agree it's not our best shot at success.

    Wherever the blockage is happening, by identifying it you can work on fixing it.

  • Focus on "minimum feature set" whenever processing feedback. It's all too easy to put together a spec that contains every feature that every customer has ever asked for. That's not a challenge. The hard part is to figure out the fewest possible features that could possibly accomplish your company's goals. If you ever have the opportunity to remove a feature without impacting the customer experience or business metrics - do it. If you need help determining what features are truly essential, pay special attention to the Customer Validation phase of Customer Development.

  • Consider whether the company is experiencing a phase-change that might make what's made you successful in the past obsolete. The most famous of these phase-change theories is Crossing the Chasm, which gives very clear guidance about what to do in a situation where you can't seem to make any more progress with the early-adopter customers you have. That's a good time to change course. One possibility: try segmenting your customers into a few archetypes, and see if any of those sounds more promising than another. Even if one archetype currently dominates your customer base, would it be more promising to pursue a different one?
As much as we try to incorporate scientific product development into our work, the fact remains that business is not a science. I think Drucker said it best. It's pretty easy to deliver results in the short term or the long term. It's pretty easy to optimize our business to serve just one of our stakeholders: employees, customers, or shareholders. But it's incredibly hard to balance the needs of all three stakeholders over both the short- and long-term time horizons. That's what business is designed to do. By learning to find a synthesis between our customers and our vision, we can make a meaningful contribution to that goal.


The one line split-test, or how to A/B all the time

Split-testing is a core lean startup discipline, and it's one of those rare topics that comes up just as often in a technical context as in a business-oriented one when I'm talking to startups. In this post I hope to talk about how to do it well, in terms appropriate for both audiences.

First of all, why split-test? In my experience, the majority of changes we make to products have no effect at all on customer behavior. This can be hard news to accept, and it's one of the major reasons people avoid split-testing. Who among us really wants to find out that our hard work is for nothing? Yet building something nobody wants is the ultimate form of waste, and the only way to get better at avoiding it is to get regular feedback. Split-testing is the best way I know to get that feedback.

My approach to split-testing is to try to make it easy in two ways: incredibly easy for the implementers to create the tests and incredibly easy for everyone to understand the results. The goal is to have split-testing be a continuous part of our development process, so much so that it is considered a completely routine part of developing a new feature. In fact, I've seen this approach work so well that it would be considered weird and kind of silly for anyone to ship a new feature without subjecting it to a split-test. That's when this approach can pay huge dividends.

Reports
Let's start with the reporting side of the equation. We want a simple report format that anyone can understand, and that is generic enough that the same report can be used for many different tests. I usually use a "funnel report" that looks like this:


             Control         Hypothesis A    Hypothesis B
Registered   1000 (100%)     1000 (100%)     500 (100%)
Downloaded    650 (65%)       750 (75%)      200 (40%)
Chatted       350 (35%)       350 (35%)      100 (20%)
Purchased     100 (10%)       100 (10%)       25 (5%)


In this case, you could run the report for any time period. The report is set up to show you what happened to customers who registered in that period (a so-called cohort analysis). For each cohort, we can learn what percentage of them did each action we care about. This report is set up to tell you about new customers specifically. You can do this for any sequence of actions, not just ones relating to new customers.
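
For the technically inclined, here is a rough sketch of how such a report can be assembled (the data layout is invented for illustration, not a specific schema): start from the cohort of users who registered in the period, then count how many of them, per hypothesis, reached each later stage.

// $events: one row per (user_id, hypothesis, event) for users who registered
// during the reporting period.
function funnel_report($events, $stages) {
    $seen = array();
    foreach ($events as $e) {
        $seen[$e["hypothesis"]][$e["event"]][$e["user_id"]] = true;
    }
    $report = array();
    foreach ($seen as $hypothesis => $by_event) {
        $cohort = count($by_event["Registered"]);   // everyone in the cohort registered
        foreach ($stages as $stage) {
            $n = isset($by_event[$stage]) ? count($by_event[$stage]) : 0;
            $report[$hypothesis][$stage] = sprintf("%d (%d%%)", $n, round(100 * $n / $cohort));
        }
    }
    return $report;
}

$report = funnel_report($events, array("Registered", "Downloaded", "Chatted", "Purchased"));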

If you take a look at the dummy data above, you'll see that Hypothesis A is clearly better than Hypothesis B, because it beats out B in each stage of the funnel. But compared to control, it only beats it up through the "Chatted" stage. This kind of result is typical when you ship a redesign of some part of your product. The new design improved on the old one in several ways, but these improvements didn't translate all the way through the funnel. Usually, I think that means you've lost some good aspect of the old design. In other words, you're not done with your redesign yet. The designers might be telling you that the new design looks much better than the old one, and that's probably true. But it's worth conducting some more experiments to find a new design that beats the old one all the way through. In my previous job, this led us to confront the disappointing reality that sometimes customers actually prefer an uglier design to a pretty one. Without split-testing, your product tends to get prettier over time. With split-testing, it tends to get more effective.

One last note on reporting. Sometimes it makes sense to measure the micro-impact of a micro-change. For example, by making this button green, did more people click on it? But in my experience this is not useful most of the time. That green button was part of a customer flow, a series of actions you want customers to complete for some business reason. If it's part of a viral loop, it's probably trying to get them to invite more friends (on average). If it's part of an e-commerce site, it's probably trying to get them to buy more things. Whatever its purpose, try measuring it only at the level that you care about. Focus on the output metrics of that part of the product, and you make the problem a lot more clear. It's one of those situations where more data can impede learning.

I had the opportunity to pioneer this approach to funnel analysis at IMVU, where it became a core part of our customer development process. To promote this metrics discipline, we would present the full funnel to our board (and advisers) at the end of every development cycle. It was actually my co-founder Will Harvey who taught me to present this data in the simple format we've discussed in this post. And we were fortunate to have Steve Blank, the originator of customer development, on our board to keep us honest.

Code
To make split-testing pervasive, it has to be incredibly easy. With an online service, we can make it as easy to do a split-test as to not do one. Whenever you are developing a new feature, or modifying an existing feature, you already have a split-test situation. You have the product as it will exist (in your mind), and the product as it exists already. The only change you have to get used to as you start to code in this style is to wrap your changes in a simple one-line condition. Here's what the one-line split-test looks like in pseudocode:


if( setup_experiment(...) == "control" ) {
    // do it the old way
} else {
    // do it the new way
}


The call to setup_experiment has to do all of the work, which for a web application involves a sequence something like this:
  1. Check if this experiment exists. If not, make an entry in the experiments list that includes the hypotheses included in the parameters of this call.
  2. Check if the currently logged-in user is part of this experiment already. If she is, return the name of the hypothesis she was exposed to before.
  3. If the user is not part of this experiment yet, pick a hypothesis using the weightings passed in as parameters.
  4. Make a note of which hypothesis this user was exposed to. In the case of a registered user, this could be part of their permanent data. In the case of a not-yet-registered user, you could record it in their session state (and translate it to their permanent state when they do register).
  5. Return the name of the hypothesis chosen or assigned.
From the point of view of the caller of the function, they just pass in the name of the experiment and its various hypotheses. They don't have to worry about reporting, or assignment, or weighting, or, well, anything else. They just ask "which hypothesis should I show?" and get the answer back as a string. Here's what a more fleshed out example might look like in PHP:



$hypothesis = setup_experiment("FancyNewDesign1.2",
                               array(array("control", 50),
                                     array("design1", 50)));
if( $hypothesis == "control" ) {
    // do it the old way
} elseif( $hypothesis == "design1" ) {
    // do it the fancy new way
}
In this example, we have a simple even 50-50 split test between the way it was (called "control") and a new design (called "design1").
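
For the curious, here is one possible sketch of setup_experiment itself, following the five-step sequence described above. The persistence helpers (register_experiment_if_new, get_assignment, save_assignment, current_user_id) are placeholders for whatever storage you already have, not a specific framework's API:

function setup_experiment($name, $hypotheses) {
    // 1. Record the experiment and its hypotheses if we haven't seen it before.
    register_experiment_if_new($name, $hypotheses);

    // 2. If this user was already assigned a hypothesis, stick with it.
    $existing = get_assignment($name, current_user_id());
    if ($existing !== null) {
        return $existing;
    }

    // 3. Otherwise pick a hypothesis according to the weightings passed in.
    $total = 0;
    foreach ($hypotheses as $h) { $total += $h[1]; }
    $roll = mt_rand(1, $total);
    foreach ($hypotheses as $h) {
        list($hypothesis, $weight) = $h;
        $roll -= $weight;
        if ($roll <= 0) { break; }
    }

    // 4. Remember the assignment: session state for guests, permanent data for
    //    registered users, so the reports can group this user correctly.
    save_assignment($name, current_user_id(), $hypothesis);

    // 5. Tell the caller which hypothesis to show.
    return $hypothesis;
}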

Now, it may be that these code examples have scared off our non-technical friends. But for those who persevere, I hope this will prove helpful as an example you can show to your technical team. Most of the time when I am talking to a mixed team with both technical and business backgrounds, the technical people start worrying that this approach will mean massive amounts of new work for them. But the discipline of split-testing should be just the opposite: a way to save massive amounts of time. (See Ideas. Code. Data. Implement. Measure. Learn for more on why this savings is so valuable.)


Hypothesis testing vs hypothesis generation
I have sometimes opined that split-testing is the "gold standard" of customer feedback. This gets me into trouble, because it conjures up for some the idea that product development is simply a rote mechanical exercise of linear optimization. You just constantly test little micro-changes and follow a hill-climbing algorithm to build your product. This is not what I have in mind. Split-testing is ideal when you want to put your ideas to the test, to find out whether what you think is really what customers want. But where do those ideas come from in the first place? You need to make sure you don't drift away from trying bold new things, using some combination of your vision and in-depth customer conversations to come up with the next idea to try.

Split-testing doesn't have to be limited to micro-optimizations, either. You can use it to test out large changes as well as small. That's why it's important to keep the reporting focused on the macro statistics that you care about. Sometimes, small changes make a big difference. Other times, large changes make no difference at all. Split-testing can help you tell which is which.

Further reading
The best paper I have read on split-testing is "Practical Guide to Controlled Experiments on the Web: Listen to Your Customers not to the HiPPO" - it describes the techniques and rationale used for experiments at Amazon. One of the key lessons they emphasize is that, in the absence of data about what customers want, companies generally revert to the Highest Paid Person's Opinion (hence, HiPPO). An even more important idea is the discipline to insist that any product change that doesn't move metrics in a positive direction should be reverted. Even if the change is "only neutral" and you really, really, really like it better, force yourself (and your team) to go back to the drawing board and try again. When you started working on that change, surely you had some idea in mind of what it would accomplish for your business. Check your assumptions: what went wrong? Why did customers like your change so much that they didn't change their behavior one iota?
