Continuous deployment with downloads

One of my goals in writing posts about topics like continuous deployment is the hope that people will take those ideas and apply them to new situations - and then share what they learn with the rest of us. So I was excited to read a recent post about applying the concept of continuous deployment to that thickest-of-all-clients, the MMOG. Joe Ludwig walks through his release process and determines that making and deploying a release takes about seven and a half hours, which is why his product ships about once a month. While that's actually quite speedy for an MMOG, Joe goes through the thought experiment of what it would take to do it much faster:
Programmer Joe » Continuous Deployment with Thick Clients
If it takes seven and a half hours to deploy a new build you obviously aren’t going to get more than one of them out in an 8 hour work day. Let’s forget for a moment that IMVU is able to do this in 15 minutes and pretend that our target is an hour. For now let’s assume that we will spend 10 minutes on building, 10 minutes on automated testing, 30 minutes on manual testing ...

In fact, if those 30 minutes of manual testing are your bottleneck and you can keep the pipeline full, you can push a fresh build every 30 minutes or 16 times a day. Forget entirely about pushing to live for a moment and consider what kind of impact that would have on your test server. Your team could focus on fixing issues that players on the test server find while those players are still online. Assuming a small change that you can make in half an hour, it would take only an hour from the start of the work on that fix to when it is visible to players. That pace is fast enough that it would be possible to run experiments with tuning values, prices of items, or even algorithms.

Of course for any of this to work the entire organization needs to be arranged around responding to player feedback multiple times per day. The real advantage of a rapid deployment system is to make your change -> test -> respond loop faster.
This is a great example of lean startup thinking. Joe is getting clear about which steps in the current process actually deliver value to the company, then imagining a world in which those steps are emphasized and others minimized. Of course, as soon as you do that, you start to reap other benefits, too.

I'd like to add one extra thought to Joe's thought experiment. Let's start with a distinction between shipping new software to the customer, and changing the customer's experience. The idea is that often you can change the customer's experience without shipping them new software at all. This is one of the most powerful aspects of web architecture, and it often gets lost in other client-server programming paradigms.

From one point of view, web browsers are a horribly inefficient platform. We often send down complete instructions for rendering a whole page (or even a series of pages) in response to every single click. Worse, those instructions often kick off additional requests back and forth. It would be a lot more efficient to send down a compressed packet with the entire site's data and presentation in an optimized format. Then we could render the whole site with much less latency, bandwidth usage, and server cost.

Of course, the web doesn't work this way for good reasons. Its design goals aren't geared towards efficiency in terms of technical costs. Instead, it's focused on flexibility, readability, and ease of interoperability. For example, it's quite common that we don't know the exact set of assets a given customer is going to want to use. By deferring their selection until later in the process, we can give up a lot of bookkeeping (again trading off for considerable costs). As a nice side-effect, it's also an ideal platform for rapid changes, because you can "update the software" in real time without the end-user even needing to be aware of the changes.

Some conclude that this phenomenon is made possible because the web browser is a fully general-purpose rendering platform, and assume that it'd be impossible to do this in their app without creating that same level of generality. But I think it's more productive to think of this as a spectrum. You can always move logic changes a little further "upstream" closer to the source of the code that is flowing to customers. Incidentally, this is especially important for iPhone developers, who are barred by Apple Decree from embedding a programming language or interpreter in their app (but who are allowed to request structured data from the server).
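To make the "upstream" idea concrete, here is a minimal sketch of the structured-data approach: the client ships with baked-in defaults but asks the server for overrides at startup. The endpoint URL and flag names are hypothetical, purely for illustration.

```python
import json
import urllib.request

# Hypothetical endpoint and flag names, invented for this sketch.
CONFIG_URL = "https://example.com/client-config.json"
DEFAULTS = {"welcome_text": "Welcome!", "show_new_feature": False}

def fetch_remote_config(url=CONFIG_URL, timeout=5):
    """Pull behavior overrides from the server at startup, falling
    back to the baked-in defaults if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            overrides = json.load(resp)
    except (OSError, ValueError):
        overrides = {}  # offline or malformed response: use safe defaults
    config = dict(DEFAULTS)
    config.update(overrides)
    return config
```

The client never embeds an interpreter; it just treats some of its behavior as data that happens to arrive late.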

For example, at IMVU we would often run split-test experiments that affected the behavior of our downloadable client. Although we had the ability to do new releases of the client on a daily basis (more on this in a moment), this was actually too slow for most of the experiments we wanted to run. Plus, having the customer be aware that a new feature is part of a new release actually affects the validity of the experiment. So we would often ship a client that had multiple versions of a feature baked into it, and have the client call home to find out which version to show to any given customer. This added to the code complexity, latency, and server cost of a given release, but it was more than paid back by our ability to tune and tweak the experimental branches in near-real-time.
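The post doesn't spell out IMVU's actual mechanics, but a sketch of the call-home pattern might look like this: every branch of the feature ships in the build, and a deterministic bucketing function (which would live on the server, so it can be retuned without a client release) decides which branch any given customer sees. The experiment and variant names below are made up.

```python
import hashlib

def assign_variant(user_id, experiment, variants):
    """Deterministically bucket a user so the same customer always
    sees the same branch of the experiment."""
    digest = hashlib.sha1(f"{experiment}:{user_id}".encode()).hexdigest()
    return variants[int(digest, 16) % len(variants)]

# Every branch is baked into the client; only the selection is remote.
variant = assign_variant(user_id=42, experiment="new_signup_flow",
                         variants=["control", "one_click", "wizard"])
if variant == "one_click":
    pass  # render the one-click version of the feature
```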

Further upstream on the spectrum are features that can be parameterized. Common examples are random events that have some numeric weighting associated with them (and which can be tuned), or user interface elements that are composed of text or graphics. We tacked on a feature to the IMVU client that worked like this: whenever the client called home to report a data-warehousing event, the response included a return field the client ignored (it doesn't care whether the event was successfully recorded). We repurposed that field to optionally include some XML describing an on-screen dialog box. That meant we could notify some percentage of customers of something at any time, which was great for split-testing. A new feature would often be first "implemented" by a dialog box shown to a few percent of the userbase. We'd pay attention to how many clicked the embedded link to get an early read on how much they cared about the feature at all. Often, we'd do this before the feature existed at all, apologizing all the way.
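Here's a hedged sketch of what the client side of that trick could look like. The XML schema is invented for illustration, since the actual format isn't described:

```python
import xml.etree.ElementTree as ET

def handle_event_response(body):
    """Parse the optional dialog description piggybacked on the
    metrics call's previously ignored response."""
    if not body.strip():
        return None  # the common case: nothing to show
    root = ET.fromstring(body)
    if root.tag != "dialog":
        return None
    return {
        "title": root.findtext("title", ""),
        "body": root.findtext("body", ""),
        "link": root.findtext("link"),  # may be absent
    }

# Example: the server tells some fraction of clients to show a teaser.
reply = ("<dialog><title>Try 3D rooms!</title>"
         "<body>Click to learn more.</body>"
         "<link>https://example.com/rooms</link></dialog>")
print(handle_event_response(reply))
```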

There are plenty more techniques even further upstream. Eventually, you wind up with specialized state machines, interpreters, or a full-fledged embedded platform. At IMVU, we ended up embedding the Flash interpreter into our process so we could experiment with our UI more quickly.
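A specialized state machine can be as small as a transition table, and because the table is plain data, it's exactly the kind of thing that can be fetched from the server rather than compiled in. This sketch, with an invented onboarding flow, shows the shape of the idea:

```python
# A minimal data-driven state machine: the transition table itself
# could arrive from the server as JSON, so flow changes need no release.
TRANSITIONS = {  # hypothetical onboarding flow, made up for this example
    ("welcome", "next"): "pick_avatar",
    ("pick_avatar", "next"): "first_chat",
    ("pick_avatar", "skip"): "first_chat",
}

def step(state, event):
    """Advance the flow, staying put on events the table doesn't know."""
    return TRANSITIONS.get((state, event), state)

state = "welcome"
state = step(state, "next")   # -> "pick_avatar"
state = step(state, "skip")   # -> "first_chat"
```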

In fact, we considered releases themselves to be a special case of this more general system. We had a structured, automated release process; after all, the release itself was just a static file checked into our website source control. Every new candidate release was automatically shown to a small number of volunteers, who would be prompted to upgrade. The system would monitor their data and, if it looked within norms, gradually offer the release to a small number of new users (who had no prior expectation of how the product should work). It would carefully monitor their behavior, and especially their technical metrics, like crashes and freezes. If their data looked OK, we'd have the option to ramp up the number of customers bit by bit until finally all new users were given the new release and all existing users were prompted to upgrade. Although we'd generally do a prerelease every day, we wouldn't pull the trigger on a full release that often, because our upgrade path for existing users wasn't (yet) without cost. Waiting also gave our manual QA team a chance to inspect the client before it was widely deployed, since we had a lower level of test coverage on that part of the product (it's much harder to test).
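The ramping logic can be sketched as a simple gate: compare the candidate's health metrics against the current release's, hold if anything regressed, otherwise widen the audience. The metric names, tolerance, and step size below are illustrative, not IMVU's actual values:

```python
def next_rollout_fraction(current, candidate, baseline,
                          tolerance=0.10, step=0.05):
    """Widen the candidate's audience only while its health metrics
    stay within tolerance of the current release; otherwise hold."""
    for name, base in baseline.items():
        if candidate.get(name, float("inf")) > base * (1 + tolerance):
            return current  # regression on this metric: hold the ramp
    return min(1.0, current + step)

# Candidate's crash rate is no worse than baseline, so ramp 5% -> 10%.
print(next_rollout_fraction(0.05,
                            candidate={"crash_rate": 0.011, "freeze_rate": 0.004},
                            baseline={"crash_rate": 0.012, "freeze_rate": 0.004}))
```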

In effect, every time we check in code to our client code base, we are kicking off another split-test experiment that asks: "is the business better off with this change in it than without it?" Because of the mechanics of our download process, this is answered a little slower than on the web. But that doesn't make it any less important to answer.

To return to the case of the thick-client MMOG, there are some additional design constraints. It's more risky to have different players using different versions of the software, because that might introduce gameplay fairness issues. I think Joe's idea of deploying to the test server is a great way around this, especially if there is a regular crew of players subjecting the test server to near-normal behavior. But I also think this could work in a lot of production scenarios. Take the case of determining the optimal spawn rate for a certain monster or treasure. Since this is a parameterized scenario, it should be possible to do time-based experiments. Change the value periodically, and have a process for measuring the subsequent behavior of customers who were subjected to each unique value. If you want to be really fancy, you can segregate the data for customers who saw multiple variations. I bet you'd be able to write a simple linear optimizer to answer a lot of design questions that way.
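A time-based experiment like this can be surprisingly little code. The sketch below, with made-up rates and a made-up rotation period, cycles the live value on a fixed schedule and keys each measurement by the value in effect:

```python
import time
from collections import defaultdict

SPAWN_RATES = [0.5, 1.0, 2.0]   # candidate tuning values (invented)
ROTATION_SECONDS = 3600         # change the live value every hour

def current_spawn_rate(now=None):
    """Rotate the tuning value on a fixed schedule, so everyone
    online in a given window is exposed to the same value."""
    now = time.time() if now is None else now
    return SPAWN_RATES[int(now // ROTATION_SECONDS) % len(SPAWN_RATES)]

# Key each measurement by the value that was in effect, so the
# variations can be compared afterwards (or fed to a simple optimizer).
minutes_by_rate = defaultdict(list)

def record_session(minutes_played, now=None):
    minutes_by_rate[current_spawn_rate(now)].append(minutes_played)
```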

My experience is that once you have a tool like this, you start to use it more and more. Even better, you start to migrate your designs to be able to take ever-increasing advantage of it. At IMVU, we found ourselves constantly migrating functionality upstream, in order to get faster iteration. It was a nearly unconscious process; we just felt that much more productive with rapid feedback.

So thanks for sharing Joe! Good luck with your thought experiment, and let us know if you ever decide to make those changes a reality.