How a Structured Testing Plan Leads to Quick & Stable Wins

If you were Amazon.com CEO Jeff Bezos, how would you structure your testing and experimentation process to drive growth?

Let’s look at what Bezos says about experimenting (emphasis mine):

“One area where I think we are especially distinctive is failure. I believe we are the best place in the world to fail (we have plenty of practice!), and failure and invention are inseparable twins. To invent you have to experiment, and if you know in advance that it’s going to work, it’s not an experiment. Most large organizations embrace the idea of invention, but are not willing to suffer the string of failed experiments necessary to get there.
Outsized returns often come from betting against conventional wisdom, and conventional wisdom is usually right. Given a 10% chance of a 100-times payoff, you should take that bet every time. But you’re still going to be wrong nine times out of 10. We all know that if you swing for the fences, you’re going to strike out a lot, but you’re also going to hit some home runs. The difference between baseball and business, however, is that baseball has a truncated outcome distribution. When you swing, no matter how well you connect with the ball, the most runs you can get is four. In business, every once in a while, when you step up to the plate, you can score 1,000 runs. This long-tailed distribution of returns is why it’s important to be bold. Big winners pay for so many experiments.”

As CEO of Amazon.com, if not the world’s first, then certainly the largest and most successful e-commerce business (which by now is involved in industries far beyond retail), Bezos convincingly makes the case for adopting a test culture in any e-commerce environment.

In this post, we’ll look at how you can structure your in-house e-commerce CRO program and create a testing plan that grows with your organization.

You might not be Amazon… but why not swing for the fences?

Plan to Fail (and Learn From it)

The process of conversion rate optimization, or CRO, aims to make e-commerce companies more profitable by increasing the proportion of purchasers to total visitors.

A structured process — encompassing research and hypothesis creation, testing itself, and the prioritization and documentation of those tests — is crucial to creating a testing culture that produces sustainable long-term results.

In most of these steps, the need for a plan is obvious. But most people don’t plan for the testing phase. In fact, testing is frequently regarded as an end in itself.

However, testing is just the culmination of the entire process that stands behind it. Its real end goal is to increase revenue.

In the same way that it’s not possible to formulate and create tests without prior research, it’s also not possible to run tests without planning. And moving from conducting individual tests or a sequence of tests to full-scale, constantly active testing is what separates a one-off CRO sprint from a thought-out, deliberate CRO program.

Guess which approach is better for establishing a testing culture that enables companies to grow while absorbing their mistakes?

Making mistakes and failures an integral part of growth means embracing the main components of any learning process. Each experiment, no matter how successful or unsuccessful, is a learning opportunity for you and your organization. Implementing and integrating the knowledge that results from your tests is one of the primary tasks of a viable CRO testing program.

Just a few reasons you should structure and document your testing program…

Testing every aspect of your website enables you to challenge your prior assumptions by grounding alternative assumptions in data instead of opinions or wild guesses.
Experimentation allows you to estimate the results of all improvements in real time, without having to wait for the end of the quarter to see improvement (or lack thereof).
By applying deliberate structure to the testing process, you make it easier to follow, teach, and repeat.

All of this makes conversion optimization testing a pivotal consideration for any business with ambitions of growth. One of the most efficient ways to set yourself up for e-commerce CRO success is to establish an ongoing process within your organization, with a specific, dedicated team.

This requires you to consider CRO not as an a la carte service provided by an agency, but as an opportunity to institutionalize and embrace the CRO process. And it requires that you learn to conduct tests yourself.

Why is a Testing Program a Necessity?

Note: If you want to test one hypothesis at a time, you can go ahead and skip this section.

Why? If you’re running one test at a time, your testing plan and program will be the same as the hypothesis prioritization list (which we’ll talk about below). There’s just one small issue that may bother you — the time required to put all your hypotheses to the test.

If you choose to go the one-test-at-a-time route, be prepared to spend some time on the journey. If you have 25 hypotheses to test, the best-case scenario is just over two years of testing. Why so long? The recommended practice is to run each experiment for at least a month (or until the test reaches significance and/or covers a few buying cycles) to ensure valid test results.

“Significance” is a statistical concept that tells you how unlikely it is that the result of an experiment was produced by random chance rather than by the changes made in the variation. It’s key to ensuring that test results are valid, sustainable, and repeatable.
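As a rough illustration of what a significance check computes, here is a minimal sketch of a two-sided, two-proportion z-test using only Python's standard library. The function name and the visitor and conversion counts are hypothetical, and commercial testing tools may rely on different statistical methods.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for the difference between two conversion rates."""
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both variants convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical counts: control converts 200 of 10,000 visitors,
# the variation converts 260 of 10,000.
p_value = two_proportion_p_value(200, 10_000, 260, 10_000)
print(f"p-value: {p_value:.4f}")
print("significant at 5%" if p_value < 0.05 else "not significant at 5%")
```

A small p-value only says the difference is unlikely to be noise. It does not by itself guarantee that the sample was representative, which is why test duration still matters.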

Alex Birkett, Content Editor for Conversion XL, explains the concept of significance in a bit more depth:

“What we’re worried about is the representativeness of our sample. How can we do that in basic terms? Your test should run for two business cycles, so it includes everything external that’s going on:
– Every day of the week (and tested one week at a time as your daily traffic can vary a lot)
– Various different traffic sources (unless you want to personalize the experience for a dedicated source)
– Your blog post and newsletter publishing schedule
– People who visited your site, thought about it, and then came back 10 days later to buy [your product]
– Any external event that might affect purchasing (e.g. payday)”

The one-month rule above holds true for most websites. Those with exceptionally high traffic (ranging into millions of unique visits) will be able to reach significance in shorter periods. Still, to limit outside influences, it is best to let tests run for at least a full week or two.

Say you have 37 different hypotheses to test. Your ideal aim is probably to create all 37 tests and conduct them all at once, as an alternative to going through the process of testing one by one.

Sadly, this isn’t possible either, for a different reason. Sometimes the experiments themselves will conflict with one another, limiting their usefulness or even invalidating each other’s results.

Since none of us want to be old men when our conversion optimization efforts reach fruition, we need an alternative. That’s where the concept of testing velocity comes in. Testing velocity is an indicator of how many tests you conduct in a given time frame, such as a month. It is one of the metrics of testing program efficiency: the higher the velocity you achieve, the quicker your program will bring increased revenue. Provided, of course, you do everything right.
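To make the effect of testing velocity concrete, here is a minimal sketch that estimates how long a backlog of hypotheses takes to clear. It assumes every test runs for one month and that concurrent tests do not conflict. The function name and the velocity values are hypothetical.

```python
from math import ceil


def program_duration_months(num_hypotheses, tests_per_month):
    """Months needed to work through a backlog at a given testing velocity."""
    return ceil(num_hypotheses / tests_per_month)


backlog = 25  # number of hypotheses waiting to be tested
for velocity in (1, 2, 4):  # hypothetical tests completed per month
    months = program_duration_months(backlog, velocity)
    print(f"{velocity} test(s)/month -> {months} months")
```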

This is the simplified process of creating a testing program

The Building Blocks of Your Testing Program

The main elements that will determine the dynamics of your testing program are:

Traffic volume
Interdependency of tests
The ability to support the design and implementation of multiple tests at once (operational constraint)

Let’s quickly go through what each of these elements means.

Traffic Volume

Traffic volume is an obvious constraint, since your website traffic will influence not only what types of tests you can run, but also how many tests you can run concurrently and which pages draw enough traffic to support tests.

Traffic volume is the reason to prioritize tests that have the greatest projected effect. Tests with higher expected lift have much lower requirements in terms of the sample size/traffic volume needed to reach statistical significance.

In practice, this means that if we expect a test to result in an increase in conversions of, for example, more than 25%, we will need fewer observations to confirm this expectation than if we were expecting a 10% increase. This is a general property of the statistical tests (such as the t-test) that power experiments: the smaller the effect of a change, the larger the sample needs to be to distinguish that effect from random variation and reach statistical significance.
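The relationship between expected lift and sample size can be sketched with the standard normal-approximation formula for comparing two proportions. This is an illustration under assumptions that do not come from the article: a 2% baseline conversion rate, a 5% significance level, and 80% power. Your testing tool's calculator may use a different formula.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline_rate, relative_lift, alpha=0.05, power=0.8):
    """Approximate visitors needed per variant to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_beta = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


baseline = 0.02  # hypothetical 2% conversion rate
for lift in (0.10, 0.25):
    n = sample_size_per_variant(baseline, lift)
    print(f"{lift:.0%} expected lift -> about {n:,} visitors per variant")
```

Because the required sample grows with the inverse square of the difference between the two rates, a modest drop in expected lift translates into a much larger traffic requirement.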

Interdependency of Tests

The ability to run experiments concurrently is a function of each experiment’s dependency on the others. What does this mean?

The basic principle is that we want to test a new page treatment on the maximum available number of visitors. If you happen to set up an experiment that will filter people out of the next experiment, then you will not be abiding by this basic principle.

If your visitors are split 50% on an initial page, meaning that half do not get to see the next page that’s also being experimented on, you will not have a valid test result.

For example, you may want to improve your funnel. So you create experimental treatments (variations) that will run on two different steps of the funnel. This may mean that the visitors that are shown one page do not get to see the other — because the experiment’s outcome has influenced how many people get to see the other experiment you are running.

Your sample will automatically be 50% smaller, meaning the test will have to run twice as long as it otherwise would have needed to achieve significance.
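The arithmetic behind that doubling can be sketched as follows. The sample size and daily traffic figures are hypothetical, and the function assumes traffic is spread evenly across days.

```python
from math import ceil


def days_to_reach_sample(required_visitors, daily_visitors, share_reaching_test=1.0):
    """Days needed to collect the required sample on the tested page."""
    return ceil(required_visitors / (daily_visitors * share_reaching_test))


required = 20_000  # hypothetical total sample size for the downstream test
daily = 1_000      # hypothetical daily visitors entering the funnel

full_traffic_days = days_to_reach_sample(required, daily)
half_traffic_days = days_to_reach_sample(required, daily, share_reaching_test=0.5)
print(f"All visitors reach the test:  {full_traffic_days} days")
print(f"Only half reach the test:     {half_traffic_days} days")
```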

Running concurrent experiments can cause interdependency issues

To prevent this issue, estimate the interdependency risk prior to creating an experiment, and run interdependent experiments separately. You can sometimes solve this issue by using multivariate tests (MVTs), but sometimes your traffic volume will preclude this. Additionally, too many variants in MVTs can invalidate the experiment results.

Operational Ability — How Many Tests Can You Design and Actively Run?

In an ideal world, we would all be testing all the hypotheses we’ve created just as soon as the research is complete!

However, creating and running an experiment is hard work. It takes effort from multiple people to create a viable and functional test. Once the research results are in and you have framed your hypothesis, the experiment won’t just spring into existence.

Making an experiment requires preparation. At minimum, you need to:

Sketch out an updated visual design, which you’ll use to create a mockup or high-fidelity wireframe
Create an actual design based on the mockup
Code the design/copy changes
Perform a quality assurance check and do a dry run before the test is live

All this requires time and effort by a team of people, and some of the steps cannot even begin before the previous ones are complete. This is your operational limitation.

You can overcome operational limitations by either hiring more people or limiting the number of tests you run.

Adjust Testing for Outside Influences

While it would be great if every experiment happened in a vacuum, this just isn’t the case. Website experiments performed for the purposes of conversion optimization will never enjoy the controlled environment of scientific experiments, where the experimenter can maintain control over all influences other than the one being intentionally changed.

However, we can at least account for obvious or expected test influences, such as holidays that affect the shopping habits of our customers or other predictable events that may change buyer behavior. By taking these factors into account when framing your plan, you can adjust for this and run the experiments at a time when the risk of outside influence is smaller.
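One simple way to account for predictable events is to check each planned test window against a calendar of dates you expect to distort buyer behavior. The sketch below is only an illustration. The event names, dates, and function name are hypothetical.

```python
from datetime import date, timedelta


def overlapping_events(start, end, events):
    """Return the events whose date range intersects the test window."""
    return [name for name, (ev_start, ev_end) in events.items()
            if start <= ev_end and ev_start <= end]


# Hypothetical events that are expected to change buyer behavior.
events = {
    "Black Friday weekend": (date(2025, 11, 28), date(2025, 12, 1)),
    "Christmas": (date(2025, 12, 20), date(2025, 12, 26)),
}

test_start = date(2025, 11, 10)
test_end = test_start + timedelta(days=27)  # a four-week test window
conflicts = overlapping_events(test_start, test_end, events)
print(conflicts or "No known events in this window")
```

If a window overlaps an event, you can move the test or, when the event itself is what you want to learn about, keep it and note the overlap in the test plan.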

Even More Benefits of Creating a Testing Plan

Having a testing plan not only makes your CRO process faster and more effective — it has a number of important additional benefits.

Let’s start with the benefit that’s most important in the long run. A test plan structures and standardizes your approach, making it repeatable and predictable.

An active, structured testing process with no expiry date essentially creates a positive feedback loop, so that even when your testing plan reaches its conclusion, you’ll feel encouraged to seek new challenges and run more tests.

In the long run, this leads to the establishment of a bona fide testing culture within your organization.

A structured process also allows for better feedback on the results. At each phase’s conclusion, you can review the results, update your expectations for the next phase, or adjust experiments that failed in the previous phase. In effect, you’re “learning as you go”.

Finally, a testing plan just plain-and-simple allows for better reporting and makes a more persuasive case for conversion optimization as an organizational must. If you are able to report progress in monthly increments, with results clearly attributed to experiments (which were built on hypotheses, which were derived from research), you’re much more likely to gain organizational support for your CRO program.

A testing plan creates clear milestones and enables the research team to accurately track progress, plan future activities, and remove potential bottlenecks in deploying and implementing experiments. That way, the risk that the testing process spirals out of control is greatly reduced, and each team member’s role is clear.

How to Structure Your Testing Plan

We’ve just explored why you need to make a testing plan prior to actual testing — let’s call that step zero, if you will. Now let’s talk about the nuts and bolts of creating that plan.

First, figure out what type of test(s) (A/B test, MVT, or bandit) you’ll run. Test type determines how much traffic you need, as well as the development effort necessary to deploy experiments.

Next, you need to carefully estimate the interdependency of your tests and make adjustments to your priority list if any tests clash with each other.

Finally, to determine the number of experiments you can run, estimate how many you can effectively support with available staff. Take into account that you need researchers to frame hypotheses, and designers and front-end developers to create variations and set up the experiment itself. Since each of these groups will have a number of other tasks to attend to, make sure you run only as many tests as your staff can support.

To ensure this, start by going through your list of hypotheses. If you prioritize tests accurately according to the effort necessary to deploy them, you’ll already have many of the inputs for your test plan.

Ultimately, your testing plan should take the form of Gantt charts, which are very helpful in indicating the time frame for each test phase.

A test program is usually presented in the form of a Gantt chart

A “test phase” contains all the tests that can be run simultaneously. For example, if you discover you can run four tests simultaneously, and you have 22 tests to run based on your hypotheses, you’ll have 6 test phases: five full phases and a final phase with the remaining two tests.
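The number of phases is the number of tests divided by the number of concurrent slots, rounded up, because a partly filled final phase still counts as a phase. Here is a minimal sketch of that grouping. It assumes the tests are already prioritized, take equally long, and do not conflict with each other. The test names are placeholders.

```python
from math import ceil


def split_into_phases(tests, max_concurrent):
    """Group a prioritized list of tests into consecutive phases."""
    return [tests[i:i + max_concurrent]
            for i in range(0, len(tests), max_concurrent)]


tests = [f"Test {n}" for n in range(1, 23)]  # 22 prioritized tests
phases = split_into_phases(tests, max_concurrent=4)

# The phase count matches the rounded-up division.
assert len(phases) == ceil(len(tests) / 4)
for number, phase in enumerate(phases, start=1):
    print(f"Phase {number}: {', '.join(phase)}")
```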

Your test plan should also list every proposed test and provide the following concise information for each:

Related hypothesis (the “why” of the test)
Required sample size
Expected effect
Who will be the subject (target segment or audience)
Where it will run (URL of the page)
When (the time period in which it will run)
Rough description of changes (the “what” of the test)
How to measure success (what metrics the experiment should improve/affect to be considered a success)
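If you keep the plan in code or a spreadsheet export, each test can be captured as one structured record. The sketch below mirrors the fields listed above. The class name, field names, and example values are all illustrative assumptions rather than a prescribed format.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PlannedTest:
    """One row of the test plan; field names are illustrative."""
    hypothesis: str          # the "why" of the test
    required_sample: int     # visitors needed to reach significance
    expected_effect: float   # expected relative lift, e.g. 0.15 for 15%
    audience: str            # target segment
    url: str                 # page where the test runs
    start: date              # first day of the test
    end: date                # last day of the test
    changes: str             # the "what" of the test
    success_metric: str      # metric the test should improve


example = PlannedTest(
    hypothesis="Showing shipping costs earlier reduces cart abandonment",
    required_sample=30_000,
    expected_effect=0.15,
    audience="New visitors on mobile",
    url="https://example.com/cart",
    start=date(2025, 3, 3),
    end=date(2025, 3, 30),
    changes="Add a shipping cost estimate to the cart page",
    success_metric="Checkout completion rate",
)
print(example)
```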

If you structure your testing plan this way, you will maximize your test velocity and allow for maximum efficiency of your optimization program.

How to Prioritize and Assign Testing Tasks

Once you create and structure a plan, the only remaining ingredient necessary for success is to actually run through the process.

Obviously, both to secure the greatest possible revenue and to create initial confidence, the first tests you run should be those you expect to have the greatest effect. Select the hypotheses that have high importance (for example, issues that affect your users’ movement through the funnel); that you are most confident will work; and that require the least effort to implement.

You can choose a prioritization model to apply to hypotheses during the research process. If you apply the model properly and your estimates are correct, you are far more likely to get the results you’re looking for.
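As one possible illustration of such a model, the sketch below scores each hypothesis on the three criteria mentioned above: importance, confidence, and effort. The scoring formula, the 1 to 10 ratings, and the hypothesis names are assumptions made for this example, not a model the article prescribes.

```python
def priority_score(importance, confidence, effort):
    """Higher is better: important, likely to work, and cheap to implement."""
    return importance * confidence / effort


# Hypothetical hypotheses with 1-10 ratings for (importance, confidence, effort).
hypotheses = {
    "Simplify checkout form": (9, 7, 3),
    "Redesign homepage hero": (6, 4, 8),
    "Add trust badges to cart": (5, 6, 1),
}

ranked = sorted(hypotheses, key=lambda name: priority_score(*hypotheses[name]),
                reverse=True)
for name in ranked:
    print(f"{priority_score(*hypotheses[name]):5.1f}  {name}")
```

Dividing by effort rewards quick wins. If you would rather favor high-impact tests regardless of cost, a weighted sum is a reasonable alternative.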

For each experiment to succeed, you need to translate hypothetical solutions into practical web page designs as accurately as you can.

When you have a mental image of the variation you want to test, translate that into a visual image using a wireframe or mockup. Hand that off to your designers, who can turn it into an actual web page.

While the visual design is being prepared, your front-end developers need to check if any additional coding will be necessary to implement the variation.

The most important part of implementing an experiment is to ensure that it’s set up free of any technical issues. Do this by making quality-assurance protocols and checks part of your testing program.

Once a given step in the experiment development cycle is complete, staff involved with that step can immediately start working on the following experiment. Having a plan enables them to advance further without any delay, and adds to the efficiency of your conversion optimization effort.

Establishing a Culture of Experimentation

Building a testing culture is the main objective of a structured CRO process. A testing culture requires the company to make a switch from a risk-averse and slow-decision-making mindset to a faster, risk-taking approach. This is possible because testing enables you to make decisions based on measurable, known quantities — in effect reducing your risk.

Extensive research is a necessary prerequisite of successful A/B testing, which is something that, hopefully, a majority of people involved in testing already understand. Suffice it to say that the role of research is well publicized, and there are a number of articles about it.

We will also assume that by now, you know how to frame a hypothesis from this research. The hypothesis creation process is just as important to the ultimate success of your CRO effort as running the tests themselves. Only properly framed, strong hypotheses will result in conclusive A/B tests.

In a structured CRO effort, no element should be left to chance. Extend the same careful treatment to actual testing as you afford to research and hypothesis creation. Once you’ve properly prioritized your hypotheses by the effort each will take, their importance, and their expected effect, you need to prepare your tests with the same forethought.

How you approach setting up your testing program will greatly impact your end results. The aim of every good testing program is to attain the maximum test velocity and see meaningful test results in the shortest possible time.

About the Author: Edin Šabanović is a senior CRO consultant working for Objeqt. He helps e-commerce stores improve their conversion rates through analytics, scientific research, and A/B testing. Edin is passionate about analytics and conversion rate optimization, but for fun, he likes reading history books. He can help you grow your e-commerce business using Objeqt’s tailored, data-driven CRO methodology. Get in touch if you want someone to take care of your CRO efforts.
