43 Tips For Running Better Experiments

Estimate sample size upfront for more confidence

Use a sample size calculator to figure out how many visitors you need based on conversion rate, desired outcome, and risk tolerance. A planned sample size goal will give you a measure of confidence in your data and an objective criterion for stopping the experiment.

Example: Your conversion rate is 20% and you want to see if your variations managed to lift that by at least 10%. The calculator will show that you can expect to detect a statistically significant lift with 6,000 visitors per variation (including control) 80% of the time. This is called "power analysis". If you ran the experiment with 1,000 visitors and saw a statistically significant 60% lift, you would know you're still far from your original plan of 6,000. You might be sceptical of this and decide to run your experiment longer, by which time the lift might drop to 20%.
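
The arithmetic behind such a calculator can be sketched roughly as follows. This is a minimal sketch, assuming a two-sided two-proportion z-test at 5% significance and 80% power; the function name `visitorsPerVariation` and the hard-coded z-values are illustrative assumptions, and a real calculator may use a somewhat different formula and give somewhat different numbers.

```javascript
// Rough sample-size estimate for comparing two conversion rates.
// Assumptions (not taken from any particular calculator): two-sided z-test,
// alpha = 0.05 and power = 0.80, with the z-values hard-coded below.
const Z_ALPHA = 1.959964; // two-sided 5% significance
const Z_POWER = 0.841621; // 80% power

function visitorsPerVariation(baselineRate, relativeLift) {
  const p1 = baselineRate;
  const p2 = baselineRate * (1 + relativeLift);
  const variance = p1 * (1 - p1) + p2 * (1 - p2);
  const delta = p2 - p1;
  return Math.ceil(((Z_ALPHA + Z_POWER) ** 2 * variance) / (delta * delta));
}

// 20% baseline, 10% relative lift (20% -> 22%)
console.log(visitorsPerVariation(0.2, 0.1));
```

Under these assumptions the estimate comes out at roughly 6,500 visitors per variation, the same order of magnitude as the figure in the example. Lowering the baseline rate or the lift you want to detect makes the required number grow quickly.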

Be realistic. Analysis can show that an experiment has no chance of success. In that case, adjust your tactics and avoid running a futile experiment.

To detect smaller effects or have a higher chance of success, you'll need more visitors.

If time is constrained, you'll need to increase risk of failure or aim for less definitive results.

Delay the redirect for more accurate tracking

Browsers lose many click or submit goals when redirecting to another page. It depends on factors like page size, other scripts, and the browser. Adding even an imperceptible delay can give enough time for your goal to fire. This means stronger results and/or a shorter experiment.

For clicks on links and AJAX submit buttons, we recommend a Custom Conversion trigger with a setTimeout() delay of zero to 250ms:


						$("a.button").mousedown(function(e) {
						  e.preventDefault(); // stop default behavior
						  var href = $(this).attr("href"); // capture now: "this" is not the link inside setTimeout
						  $(".progress-indicator").show(); // show gif spinner
						  yourCustomConversionTrigger();
						  setTimeout(function() { // after a short delay, manually redirect
						    window.location.href = href;
						  }, 250);
						});

For submits on non-AJAX forms, you can delay and submit the form manually using a flag:


						var allowSubmit = false; // block submit the first time
						$("form").submit(function(e) { // e must be declared to call preventDefault()
						  if (!allowSubmit) {
						    e.preventDefault(); // stop default behavior
						    allowSubmit = true; // now allow form to submit
						    $(".progress-indicator").show(); // show gif spinner
						    yourCustomConversionTrigger();
						    setTimeout(function() { // after a short delay, trigger submit
						      $("form").submit();
						    }, 250);
						  }
						});

Try using existing delays. For example, if your site already pauses for a second while it makes a server call to validate the form, insert your Custom Conversion trigger after JavaScript validates the form but before confirmation from the server.

Use the zero delay trick to force the redirect to the bottom of the browser's event queue. If you add a delay of zero with setTimeout(), that can by itself increase your goal tracking accuracy.
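
The effect of a zero delay can be seen in a small stand-alone sketch. Here `trackGoal` and `redirect` are hypothetical stand-ins for your conversion trigger and the page redirect; the point is only that a callback passed to setTimeout() with a delay of zero runs after the current synchronous code has finished.

```javascript
// Shows that setTimeout(fn, 0) defers fn until the current synchronous
// code has finished. trackGoal and redirect are hypothetical stand-ins.
const order = [];

function trackGoal() { order.push("goal tracked"); }
function redirect() { order.push("redirect"); }

function onClick() {
  setTimeout(redirect, 0); // queued: runs after the current call stack empties
  trackGoal();             // runs first, even though it is written later
}

onClick();
console.log(order); // the redirect has not run yet at this point
setTimeout(() => console.log(order), 10); // by now both entries are present
```

Note that a real tracking call usually sends a network request, so a zero delay only ensures the request is issued before the redirect starts, not that it completes.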

Track page visits for more accuracy

Make page visits your primary goal. In our tests, we've seen visit goals up to 50% more accurate than clicks or form submits. Ensure there is only one way to get to the unique goal page URL once a visitor is part of the experiment.

Add a URL parameter that uniquely ties your goal page URL to the page you're testing. For example, setting visits to /goalpage?from=home as your primary goal and directing experiment participants there ensures that visitors who bypass your experiment and land on /goalpage don't count as conversions.
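
The matching logic can be illustrated with a short sketch. The path and parameter follow the example above; the function name `isExperimentGoalVisit` and the base origin are assumptions made for illustration, and your testing tool will normally do this matching for you through its URL settings.

```javascript
// Decide whether a visited URL counts as the experiment's goal page.
// The path and parameter follow the example above; the function name and
// the base origin are illustrative assumptions.
function isExperimentGoalVisit(href) {
  const url = new URL(href, "http://example.com"); // base resolves relative URLs
  return url.pathname === "/goalpage" && url.searchParams.get("from") === "home";
}

console.log(isExperimentGoalVisit("/goalpage?from=home")); // experiment participant
console.log(isExperimentGoalVisit("/goalpage"));           // visitor who bypassed the experiment
```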

Clicks as a secondary metric can be useful. For example, higher clicks could mean your form validation design is stopping users from getting to the goal page (e.g., captchas and inflexible format requirements).

If a user is likely to immediately close the goal page or to click a link to another page, it is possible the page visit won't get tracked. Just in case, ensure your goal page can keep a visitor for 1-2 sec.

Break down the funnel to identify bottlenecks

Track conversions at each step in the funnel. That way you can confirm you've raised conversions to the immediate next step and see whether the effect persisted to the next pages or got lost in a bottleneck. You could also see if your seemingly beneficial change at Step 1 created a bottleneck at Step 2.

Example: You have a 3-step sign-up process from "1. Home Page" to "2. Additional Details" to "3. Confirmation". You run an experiment on Home and see a significant 20% increase in traffic to Step 2, but traffic from Step 2 to Step 3 only went up 1%. It means Step 2 is a bottleneck.

If traffic to Step 3 were to fall by 50%, it could mean your change to the Home page (e.g., removing a key piece of information) created a problem at Step 2. Either way, tracking the entire funnel shows you cannot conclude that you've raised conversions through the funnel with this one experiment.

For AJAX websites, where several steps of the funnel are actually on the same page, use Custom Event code provided by your tool vendor to track visits to each step.

Conversion rate drops across the funnel as more people drop out from the total visitors. At the same time visitors get farther removed from the changes you are testing on Step 1. This means your metric lower in the funnel has a higher margin of error and is more contaminated. In other words, any effect you detect at Step 3 is less likely to be caused by your changes to Step 1.
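
A funnel breakdown like this comes down to step-to-step rates per variation. The sketch below uses made-up visitor counts that loosely mirror the 3-step example (a 20% rise in traffic to Step 2, a 1% rise in arrivals at Step 3); the function name `stepRates` is hypothetical.

```javascript
// Step-to-step conversion rates for each variation of a 3-step funnel.
// The visitor counts are made-up illustration data.
function stepRates(visitorsPerStep) {
  const rates = [];
  for (let i = 1; i < visitorsPerStep.length; i++) {
    rates.push(visitorsPerStep[i] / visitorsPerStep[i - 1]);
  }
  return rates;
}

// Visitors at: Home, Additional Details, Confirmation
const control = [10000, 2000, 1000];
const variation = [10000, 2400, 1010];

console.log(stepRates(control));   // share of visitors reaching each next step
console.log(stepRates(variation));
```

In this made-up data the variation sends more visitors to Step 2, but a smaller share of them continues to Step 3, which is the pattern that points to a bottleneck at Step 2.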

Set Immediate Goal as Primary for more accuracy

The primary goal should be on the page you are testing or the immediate next page. If you have 5 steps in your funnel, and you're testing a change in Step 1, you should track visits to Step 2 as your primary goal. This gives you the greatest chance of detecting a significant change.

If you track visitors from Step 1 to conversions at Step 5, for example, you'll run your experiment longer to get reliable data. One reason is that you'd be measuring the smallest conversion rate, since visitors drop off at each step. There is more noise from the intervening steps that can distort what you're trying to measure.

When improving Step 1, track other steps in the funnel as secondary metrics and be sceptical.

Target each step with a separate experiment. So to improve Step 3, start your test on Step 3 and track visits to Step 4. This raises your conversion rate and gives reliable data sooner.


Include Redundant Goals for more accuracy

Track the same primary metric in more than one way. For instance, always track clicks on Step 1 and Visits to Step 2. Even page visit goals are not fool-proof, and in testing you may find situations where visits do not get tracked as often as they should.

Plain clicks are not equivalent to visits (because someone can click a button without filling in a form for example), though they are still useful if there is no other alternative. If you are testing a form, the better option is to set up a smarter Custom click goal that fires after validation on Step 1. This metric would be equivalent to visits to Step 2.

Once or twice, we've found a smart click goal outperformed the visit goal on one variation but not the other by a few conversions. In these situations, we take the higher number from either goal, since it's the more accurate (assuming the goals are truly equivalent).

Use Naming Conventions for easy identification

Instead of naming a variation "B" or "Variation 2", try "B: Larger button", so you can identify it at a glance. Instead of naming a goal "Goal 1", try "1: Clicks primary", "1: Clicks secondary", "2: Arrives on payment", where 1/2 designate page number in the funnel and clicks/arrives designates the type of goal.

Bind to mousedown for more accuracy

A mousedown event fires slightly faster than the click event. If you are binding Custom goals to your event, use the mousedown event. A few milliseconds can ensure the event fires in time before the browser redirects and can prevent dropped events in Chrome and Safari, especially with events bound to links.

Test your experiment to catch errors early

Test each variation to make sure it's behaving properly. Submit each form and make sure the data is in the database and your 3rd-party analytics look right. Then launch the experiment internally (so only your team sees it) and check that each type of goal is being tracked.

Some issues are transient, so test at least twice, clearing cookies and cache in between.

If you catch errors and have to stop the experiment, it is best to duplicate the experiment and start a clean version.

You can QA an experiment safely in production by restricting it to your IP or a URL parameter like ?include=true.

Exclude Existing Users for more confidence

Existing customers can add noise. If they are not potential buyers, you should exclude their visits from your sample size. Implement a cookie that remains even if the customer is logged off. You can then set your test segmentation criteria to target visitors who do not have this cookie. Since this reduces visitor count while sales stay the same, this reduces the margin of error in your results (increases confidence).

Cookies are not fool-proof. If existing customers clear cookies or share their login or use multiple devices, their visits will still count when they come to your page. You can exclude these visitors post-hoc by firing a custom goal when the customer logs in. This way, you can manually subtract the number of logins from total visits. Make sure your primary metric cannot be fired by existing customers visiting the page. For example, existing customers must not be able to visit the goal page URL you're using to track new customers.

Is it worth it? The larger your sample size, the more noise existing customers contribute. However, this will only impact your results if you have a huge sample size (tens of thousands), a high conversion rate (e.g., 10%), or a high return rate (e.g., 30%). There are two reasons you should implement this: (1) A high login rate represents an active user base, so it's a useful metric to keep and (2) It's best to confirm the impact for existing customers instead of guessing. For example, say you get 100 sales and 1000 total visitors on your Control (10% rate) and were after a 20% lift. If you tracked 300 logins, the base purchase rate becomes 100 out of 700 (14% rate). As a result, you would need 1600 fewer visitors for this test.
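
The rate adjustment in this example is simple to reproduce. A minimal sketch, assuming every tracked login is an existing customer who did not contribute to the sales count; the function name `adjustedRate` is hypothetical.

```javascript
// Recompute the baseline conversion rate after subtracting tracked logins
// (existing customers) from total visits. Assumes none of the logged-in
// visits produced a sale. The function name is hypothetical.
function adjustedRate(sales, totalVisits, logins) {
  return sales / (totalVisits - logins);
}

const rawRate = 100 / 1000;                  // 10% before adjustment
const newRate = adjustedRate(100, 1000, 300); // 100 out of 700
console.log((rawRate * 100).toFixed(1) + "%", (newRate * 100).toFixed(1) + "%");
```

Feeding the adjusted rate of about 14% into a sample size calculator, instead of 10%, is what lowers the number of visitors the test needs.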

If you have enough traffic, you may want to exclude visitors who were exposed to previous tests. You can tell VWO to target new visitors only. You can additionally set up a separate, parallel test to target the returning visitor segment to see if they respond differently to your new page.

If you are testing internal pages, you'll want to do the reverse - target existing customers and exclude everyone else.

If you have a cookie to track this already, you may need to modify its values. For example, VWO will not work with cookies that contain multiple values. Keep your cookies focused on one piece of information.

Target New Visitors for more confidence

If you have run many experiments on your site, you may not want returning visitors to join your current experiment. Visitors who notice a change will behave differently than new visitors. We have seen conversion rates differ by up to 400% depending on whether returning visitors are included or not.

If you keep targeting new visitors in experiments, you will notice your traffic and conversion rate change as you run out of new visitors (in VWO, a new visitor has not been part of any experiment). At that point, you may reopen experiments to all traffic.


Exclude Browsers for more accuracy

Look through your visitor stats and target the most common browsers. Test your design thoroughly on each browser to avoid contaminating your experiment with browser effects. Unless your visitors are heavy users of IE and you test thoroughly, you should target IE users in a separate experiment run in parallel (unless your A/B tool has segmenting features).

Use CSS3 and JavaScript features that your target browsers support.

Separate Mobile and Desktop for more confidence

Make sure each visitor to your experiment has an equal chance to see the same page. A mobile visitor will essentially see a different page than a desktop visitor. So, you should always target an experiment to only mobile or only desktop. Your tool may exclude mobile by default.

If your Control page is responsive, make sure your variations are responsive as well. All your visitors should have the same experience, except for the effect you are testing.

Separate Competing Goals to reveal relationships

If you are tracking user choices that compete with each other, use a deeper goal to ensure the metrics are mutually exclusive. If you were to track clicks or intermediate page visits, you could be double-counting, since nothing stops a user from going back and triggering a different goal too.

For shallower metrics, try tracking how often users go back and change their mind as a secondary metric. You can also add goals that track just the user's first choice and separately their final choice. Identifying why people change their minds can itself be a valuable insight and can help you interpret the effect of double-counting in your shallower metrics.

All A/B testing tools we know of won't count multiple activations of the same goal, so there is no risk of double-counting there.

Track time on site for greater insight

It can be useful to know not merely that an event happened but how long it took for a user. You can add goals to your experiment to track things like duration of the page visit or how long it takes a user to start or complete a form. To do this, frame the time as a binary goal, such as "User has been on the page for 2 minutes". You can then set a timer and fire the conversion once the target time has been reached:


					setTimeout(function() { trigger_2min_goal(); }, 120000);

Have a good reason for tracking the time. Your hypothesis might be: "If people stay on the page longer, they are more likely to read the content and make a purchase" or "If people are not able to complete the form in 1 min, they will quit".

Try Value Instead of Revenue for greater insight

You can track value using a revenue goal. For example, a $100 plan might actually be more valuable to your business than a $200 plan, because the $100/month plan leads to greater Customer Lifetime Value. You can also assign a value to a free product other than zero. If 5% of your free users upgrade to a $100 plan, then a free plan is really worth $5. This way you can see if a statistically significant trade-off between free and paid plans will benefit your business.

Track Both Revenue & Choices for greater insight

For a purchase, use a revenue goal on the purchase confirmation page, where a dollar value has been assigned. This brings together competing goals into one handy metric that tells you if the change benefits your business or not. It can also reveal situations where the total sales volume stays unaffected but revenue changes, because the user's choice has shifted to a higher or lower value choice.

However, revenue is not always appropriate and does not always tell the whole story (e.g., if multiple products are purchased and you need to track which, or if the dollar value can be zero). Add a URL parameter to the goal page to track not just the dollar value but track changes in users' choices (e.g., ?plan=free).

If you are tracking a URL based on multiple parameters, make sure the order of parameters is fixed. For instance, if you want to track Plan A purchases of Product B with parameter ?plan=A&product=B, make sure the parameters always go in that order or the URL won't match.

Eliminate flicker in A/B tests for more accuracy

When you set up an A/B experiment, your A/B tool will inject changes into the existing page, turning it into one of the variations (this doesn't apply to back-end tools). Most tools will prevent displaying a page until it is ready, but this is not always fool-proof. The page can momentarily flicker or briefly show original page elements. If users notice this and behave differently, your experiment will be invalid. Test for the flickering effect and ascertain if any remaining effect might skew your experiment results (clear cache in between and try different browsers).

Using your tool's built-in editor, while less flexible, should reduce flicker better than injecting custom CSS or JavaScript. In VWO, a workaround is to use the built-in editor to hide and then show an element and only then inject custom CSS and JavaScript to modify it. This tells VWO to hold rendering this element until it is ready.

Another way to minimize flicker is to optimize your code and reduce the amount of injected code. If possible, insert Variation Content directly into your production site, tag it with an id or class like "variation1", and hide it with CSS. Then tag the Control content with class = "control". Now, instead of injecting lots of code, your A/B experiment just shows the "variation1" class and hides the "control" class.

Ignore inconclusive results

A test of statistical significance (p-value) will tell you if your result is real or likely to be the result of chance. If the difference is not statistically significant, you can't say a variation won, lost, or is the same. A p-value of 0.5 does not mean it's a 50/50 chance of winning. A p-value close to 1 does not mean the variations are the same. Any high p-value simply means there is not enough evidence to make any determination.

An inconclusive negative result is not a loser. The only time you should say that a variation lost is when you detect a negative effect that is statistically significant.

A p-value can be low enough to be "suggestive". It is evidence but very weak. You should definitely note suggestive results and run an experiment to confirm them. For example, if you're aiming for 95% confidence, a p-value of 0.1 is suggestive. Remember that even a p-value of 0.01 is not proof, just strong evidence.

With a sample that shows Control and Variation performing about the same, it is tempting to conclude they are the same. However, even with a large sample, there is a small probability that you won't detect a true effect just by chance. A sample size calculator allows you to mitigate this risk, called beta, by running your experiment longer. To say that variations truly perform about the same, you should have a large sample with confidence intervals that are narrow and almost completely overlapping.

Test single pages or flows for more confidence

Identify the specific page or sequence of pages you are testing. Avoid broad experiments where the landing page can be any of multiple pages that display substantial differences. Each visitor to your experiment should have an equal chance to see the exact same thing except for the differences you are testing. If random visitors see slightly different things, your metrics could be distorted by differences other than the ones you are measuring. It is the same rationale for targeting the experiment to desktop or mobile visitors but not both.

You can test a series of pages that share a template. You should run a template experiment longer than a single-page experiment, as it can show more variation.

You can test the outcome of a close sequence of pages, as long as a visitor has an equal chance to see either one version of the sequence or the other and never both. For example, you can test alternative versions of an entire funnel to see which generates more revenue.

To test multiple pages as a whole, you'll need to tie the pages together with cookies. For example, set up test #123 on Page 1 of your checkout. If a visitor is assigned to variation B, they get the cookie _vis_opt_exp_123_combi equal to 1. These visitors would then also see variation B of Page 2 of your checkout. The best way to implement this is to check the _vis_opt_exp_123_combi cookie server-side on the checkout page. If the cookie equals 1, serve them version B of Page 2. Another way that is sometimes convenient is to set up a separate test in VWO on Page 2 and set it to show the B variation 100% of the time but only to visitors who match the cookie criterion. In this case, if B won, it would mean the pages together performed better than the existing site, but you won't know which page individually did better than its control version.
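
The server-side check can be sketched as below. Only the cookie name and the value 1 come from the example above; the helper names `readCookie` and `page2Version`, and the idea of passing in the raw Cookie header string, are assumptions for illustration rather than anything VWO provides.

```javascript
// Pick which version of Page 2 to serve, based on the experiment cookie
// described above. The helper and function names are illustrative; only
// the cookie name and the value 1 come from the example.
function readCookie(cookieHeader, name) {
  for (const part of cookieHeader.split(";")) {
    const [key, ...rest] = part.trim().split("=");
    if (key === name) return rest.join("="); // keep any "=" inside the value
  }
  return null; // cookie not present
}

function page2Version(cookieHeader) {
  return readCookie(cookieHeader, "_vis_opt_exp_123_combi") === "1" ? "B" : "control";
}

console.log(page2Version("sid=abc; _vis_opt_exp_123_combi=1")); // visitor assigned to B
console.log(page2Version("sid=abc"));                           // visitor not in variation B
```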

Test fewer variations to avoid false positives

Cut non-essential variations and retest results that match what is expected by chance. A/B tests are ideal. We recommend that you avoid multivariate tests.

For each comparison you make, there is a risk that the winner is a false positive (called alpha or significance level). If you make multiple comparisons, whether by adding more variations or more experiments, be aware that the overall likelihood of finding a false positive is inflated.

Example: You use a sample size calculator to estimate the number of visitors you need for a 5% chance of a false positive. You run 20 variations over 2 tests and find a winner. You should retest this result, because one false positive was to be expected just by chance (a 5% chance multiplied by 20 comparisons is 1 expected false positive, and the chance of at least one is roughly 64%, assuming independent comparisons). If you condensed 10 variations into 5 in each experiment, you would cut the expected number of false positives in half.
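
The inflation can be put into numbers with a short sketch. It assumes the comparisons are statistically independent, which real variations sharing one control are not exactly; the function names are hypothetical.

```javascript
// Chance of at least one false positive across several comparisons, each
// run at significance level alpha. Independence between comparisons is an
// assumption made for simplicity.
function familyWiseErrorRate(alpha, comparisons) {
  return 1 - Math.pow(1 - alpha, comparisons);
}

// Average number of false positives to expect across the comparisons.
function expectedFalsePositives(alpha, comparisons) {
  return alpha * comparisons;
}

console.log(familyWiseErrorRate(0.05, 20).toFixed(2)); // 20 comparisons
console.log(familyWiseErrorRate(0.05, 10).toFixed(2)); // 10 comparisons
console.log(expectedFalsePositives(0.05, 20));
```

With 20 comparisons at a 5% level, the chance of at least one false positive is about 64%. Halving the number of comparisons brings it down to about 40%.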

If you include numerous variations in the spirit of experimentation, that is valid. Just keep in mind the increased probability of finding a winner or loser just by chance. We have found that experiments with more than 4 variations (including the baseline) are more likely to run long, end up underpowered, and produce results that are hard to interpret.

A false positive can be any statistically significant effect. A losing variation may also be a false positive.

Track Shallow Goals to get data faster

If your site has low traffic or a low conversion rate, a sample size calculator will tell you you don't have a good chance of measuring the ideal metrics, like revenue. However, you can increase your conversion rate by measuring more shallow metrics instead, like clicks, scrolls, or searches, which are nonetheless solid indicators of desired behaviour.

Example: If people search products, they can find them and purchase them. Therefore, by increasing searches, we are likely to help sales. To raise your 1% sales conversion rate by 10%, you'd need 150K+ visitors. However, to raise your 50% search conversion rate by 10%, you'd only need 1-2K visitors.

Have a hypothesis for more confidence

If the outcome of your experiment confirms a hypothesis that you had stated upfront, it makes the outcome more trustworthy. In contrast, if you test many variations hoping to hit on something by chance or if you discover something by accident, there is a greater risk of a false positive.

If you did not state a hypothesis upfront or got some unexpected results, come up with a post-hoc hypothesis that explains the results. If your sample size is still inadequate, but the overall pattern among variations makes sense, it makes the result more trustworthy.

Example: Say you're running an ABCD experiment. A and B are visually similar, minor changes. C is a bigger redesign, which tests a hypothesis about what motivates your visitors. You therefore expect C to perform better. If A and B indeed perform similarly and C outperforms both, this result is trustworthy. On the other hand, if the results are unexpected, and B did best, then your scepticism would lead you to seek an alternative hypothesis to explain this.

A good hypothesis is a theory about the motivations, goals, and behavior of visitors. It does not describe what you plan to do but explains the "why" you are doing it. For example, this is not a hypothesis: "We can increase conversions by adding a security badge". A good formula is: IF [we remove dollar signs before our prices], THEN [people will spend longer on the page and be more likely to purchase], BECAUSE [it may be that dollar signs trigger negative associations for people].

Agree on drop rules for more confidence

Decide ahead of time what constitutes strong enough evidence against a variation that you can drop it. For example: "drop a variation if the p-value is at most 0.2, and we have at least 50% of our planned sample size". An early stopping rule like this allows you to make decisions rationally and consistently.
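
Writing the rule down as code is one way to make sure it is applied the same way every time. The sketch below encodes the example rule only; the function and parameter names are hypothetical, and the p-value passed in is assumed to be for a variation that is performing worse than the control.

```javascript
// Encodes the example drop rule above as a predicate. The thresholds come
// from the example; the function and parameter names are hypothetical.
// pValue is assumed to describe a variation that is trailing the control.
function shouldDropVariation(pValue, visitorsSoFar, plannedVisitors) {
  const enoughEvidence = pValue <= 0.2;                        // "p-value is at most 0.2"
  const enoughSample = visitorsSoFar >= 0.5 * plannedVisitors; // "at least 50% of planned sample"
  return enoughEvidence && enoughSample;
}

console.log(shouldDropVariation(0.15, 3500, 6000)); // both conditions met
console.log(shouldDropVariation(0.15, 1000, 6000)); // too early to drop
```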

Stopping a losing variation mitigates the risk of further losses, but it comes at the risk of dropping a winner, wasting effort, missing insights, and even reaching the wrong conclusion. If you drop a variation when the evidence against it is still weak, you are saying "I’m not willing to find out for sure, because of perceived risk". As a result, you will always have lingering uncertainty about whether it really did better or worse.

We do not recommend you drop variations or stop experiments early, unless the evidence is reasonably strong, and the cost of ongoing losses is high.

Refine the audience for more accuracy

Include only visitors who are likely to see and benefit from your variation instead of all visitors. Your tool likely provides segmentation options to do things like exclude existing users or target specific browsers. A more powerful method is deciding at what point users enter your experiment AFTER they have arrived on your site.

One technique is refining your target page. For example, say you have a single page AJAX site with multiple pseudo-pages or content hidden under different tabs on the same page. If you are making changes to pseudo-page 2, then only track visitors once they land there. Your tool should have the option to load the tracking code conditionally instead of on page visit. This may give you much more accurate data.

A similar technique is to track visitors based on scrolling, visit duration, or other behaviour. For instance, if you are testing a footer on a long page, you might want to only track people who have scrolled down enough to see it. Or say you are testing a pop-up that shows up after 10 seconds on the page. You could exclude anyone who visits the page and leaves within 10 seconds, because they are not your target audience.

If you do not refine your audience, it just means you'll need to run your experiment longer to see results at the same level of statistical confidence. For low traffic sites, refining your audience may allow you to run an experiment you might not be able to run otherwise.

Add !important for more accuracy

If you are injecting code in an A/B experiment to create your variations, add !important to all your CSS. This ensures that existing styles don't override the styles you are injecting if they happen to load last.

Know URL parameter order for more accuracy

If you are tracking page visits based on a combination of URL parameters, make sure you get the order of the parameters correct. Sometimes, the order of the parameters is not fixed. For instance, if you track visits to http://example.com/?plan=free&success=1, this won't match the URL http://example.com/?success=1&plan=free.
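
If you control the code that builds the goal URL, one possible safeguard is to sort the parameters before redirecting, so the order is always the same. This is a minimal sketch of that idea, not a feature of any testing tool; the function name `withSortedParams` is hypothetical.

```javascript
// Normalize a URL so its query parameters always appear in the same
// (alphabetical) order. The function name is hypothetical.
function withSortedParams(href) {
  const url = new URL(href);
  url.searchParams.sort(); // orders parameters by name, in place
  return url.toString();
}

const a = "http://example.com/?plan=free&success=1";
const b = "http://example.com/?success=1&plan=free";

console.log(a === b);                                     // plain string comparison
console.log(withSortedParams(a) === withSortedParams(b)); // comparison after sorting
```

The two URLs differ as plain strings but become identical once their parameters are sorted.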

Track HTTPS and HTTP for more accuracy

If your site can be visited through both http and https, you need to track each one explicitly. In VWO, use the wildcard http* to ensure both http and https visitors are included.

Check Significance Yourself for more confidence

Most tool vendors show optimistic results. Winners are declared too early, and confidence intervals (margin of error) are shown at 80% level. Seeing stronger results and more winners keeps you motivated about testing. It also reduces your risk of missing true effects. The problem is you'll see many false positives and inflated effect sizes. To get a truer measure of confidence, use a tool like Abba to see 95% confidence intervals.

Even if 80% Confidence is sufficient to support your decisions, we recommend checking 95-99% Confidence Intervals to see the full extent of the margin of error. For example, an 80% confidence interval might show an effect in the 5% to 15% range (a winning variation), but a 95% confidence interval would show this effect is really in the -15% to 35% range, meaning there is some possibility of it being a losing variation.
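
How much the interval widens with the confidence level can be checked with a small sketch. It uses a simple Wald interval for the absolute difference between two conversion rates, which is an assumption and not necessarily what your tool or Abba computes; the data and the function name are made up.

```javascript
// Wald confidence interval for the absolute difference between two
// conversion rates. The z multipliers are standard normal quantiles;
// the data below is made up for illustration.
const Z = { 80: 1.2816, 95: 1.96, 99: 2.5758 };

function diffInterval(convA, visitsA, convB, visitsB, level) {
  const pA = convA / visitsA;
  const pB = convB / visitsB;
  const se = Math.sqrt((pA * (1 - pA)) / visitsA + (pB * (1 - pB)) / visitsB);
  const diff = pB - pA;
  return [diff - Z[level] * se, diff + Z[level] * se];
}

// Control: 100 of 1000 convert. Variation: 120 of 1000 convert.
console.log(diffInterval(100, 1000, 120, 1000, 80)); // narrower interval
console.log(diffInterval(100, 1000, 120, 1000, 95)); // wider interval
```

With this made-up data the 80% interval lies entirely above zero, while the 95% interval includes zero, so the same result looks like a winner at one level and inconclusive at the other.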

Segment with caution for more confidence

Segmenting is very likely to produce false or exaggerated effects and to understate true effects. Each additional analysis you do on the data increases the chances of finding an effect by chance (inflates your alpha). At the same time, segmenting reduces sample size, which also increases false positive risk, tends to suggest exaggerated effects, and makes it harder to distinguish true effects (reduces power). These factors distort comparisons, especially of unequal segments.

Have a hypothesis before you segment to reduce these risks. Avoid the practice of segment-until-you-find.

Consider any significant differences you find between segments only "suggestive", especially if you did not have a hypothesis. Run a separate experiment to confirm any such effect.

If you find a sub-segment in your sample with a different conversion rate, you should ignore that if the degree of the effect is similar and points in the same direction, especially if the sample sizes are greatly unequal. The important thing is that all your segments are choosing the same variation.

Open links in new tab for greater accuracy

If your goal page is on a 3rd party server, you likely can't inject your tracking code on that page to track visits. An example is a link to PayPal checkout. Similarly, if a link points to a PDF document or a direct download, you can't track page visits. Your only option is to track clicks on the link.

When tracking clicks, you want to make sure the event can register before the browser redirects. One way is to open the link in a new window (see Tip 2 for an alternative). On modern browsers, this will open the 3rd party resource in a new tab instead of redirecting, so the first tab has plenty of time to finish tracking the click.

To force a link to open in a new tab, add the target=_blank attribute to the link. You can do this directly in the HTML or dynamically using JavaScript:

$("a#download").attr("target", "_blank") // Using jQuery

Another way that will affect ALL links on the page is to add the <base> tag to the head of the page:

<base target="_blank">

If the link has a custom handler attached, you'll need to modify that:

$("a#download").click(function(e) { e.preventDefault(); window.open('http://anothersite.com', '_blank'); }); // use open()'s _blank parameter

You must make this change on the Control as well, so tracking accuracy is the same as for all the variations. This is a temporary measure. Once the test is over, you can remove these changes.

Check weekly performance for greater confidence

Always check if the pattern in performance holds day to day, week to week, month to month. A statistically significant, sizeable lift that holds over time allows you to make the most confident predictions about the future. Statistical significance itself is not a reliable predictor, unless you've run your test over an extended period of time. If your sample size is large but you only ran the test for 1 hour, there's a high risk that your predictions won't hold for months (see Tip 1 for more on sample sizes).

In VWO, you can change the date filter to segment the test into days, weeks, and months. Look for variations that perform consistently better. Since the sub-sample sizes will be much smaller than your total, the degree of the effect will vary. Week 1 might show a 50% lift with low margin of error, week 2 show a 5% lift with huge margin of error, and week 3 show a 20% lift with moderate margin of error. The most important thing is that it's a consistently positive effect. In contrast, you might see week 1 with 50% lift, then week 2 with a 20% drop, and week 3 with a 1% lift. Overall, that might still be a statistically significant lift, because the lifts outweigh the drops, but it's too variable to be trustworthy (assuming the sub-sample size is decent).
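
The consistency check described here can be expressed as a simple test on the direction of each week's lift. A minimal sketch using the two scenarios above; the function name `isConsistent` is hypothetical, and it deliberately ignores margin of error and sub-sample size, which you still need to weigh yourself.

```javascript
// Checks whether the weekly lifts all point in the same direction.
// Lifts are in percent and come from the two scenarios described above.
// Margin of error and sub-sample size are not taken into account.
function isConsistent(weeklyLifts) {
  const allUp = weeklyLifts.every((lift) => lift > 0);
  const allDown = weeklyLifts.every((lift) => lift < 0);
  return allUp || allDown;
}

console.log(isConsistent([50, 5, 20]));  // lift every week
console.log(isConsistent([50, -20, 1])); // lift, then drop, then small lift
```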

The smaller your sub-samples, the less reliable they will be in a comparison (see Tip 30 on segmenting data). You want a subsample that has at least 40 conversions, assuming a larger lift. If your site has high daily traffic, you can check performance day-to-day and week-to-week. If your site has low traffic, your daily or even weekly sample might be too small for comparison. On low traffic sites, you'll typically run your test over several weeks and check performance week-to-week and month-to-month.
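As a rough illustration of the week-by-week check described above, the sketch below computes the lift for each week and reports whether it is consistently positive. The weekly totals are made-up numbers, the function names are hypothetical, and applying the 40-conversion minimum to each arm separately is an assumption, not something your testing tool does for you.

```javascript
// Made-up weekly totals, as you might copy them from your testing tool.
const exampleWeeks = [
  { control: { visitors: 1000, conversions: 100 }, variation: { visitors: 1000, conversions: 150 } },
  { control: { visitors: 1000, conversions: 100 }, variation: { visitors: 1000, conversions: 105 } },
  { control: { visitors: 1000, conversions: 100 }, variation: { visitors: 1000, conversions: 120 } }
];

// Assumption: the 40-conversion minimum is applied to each arm of each week.
const MIN_CONVERSIONS = 40;

function conversionRate(arm) {
  return arm.conversions / arm.visitors;
}

// Relative lift of the variation over the control, one entry per week.
function weeklyLifts(weeks) {
  return weeks.map(function (week, i) {
    const base = conversionRate(week.control);
    return {
      week: i + 1,
      lift: (conversionRate(week.variation) - base) / base,
      bigEnough: week.control.conversions >= MIN_CONVERSIONS &&
                 week.variation.conversions >= MIN_CONVERSIONS
    };
  });
}

// True only if every week with a usable sub-sample shows a positive lift.
function isConsistentlyPositive(lifts) {
  const usable = lifts.filter(function (l) { return l.bigEnough; });
  return usable.length > 0 && usable.every(function (l) { return l.lift > 0; });
}

console.log(weeklyLifts(exampleWeeks), isConsistentlyPositive(weeklyLifts(exampleWeeks)));
```

This only checks the direction of the effect. You still need to weigh the margin of error for each week, which the sketch does not calculate.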

Check the daily performance graphs, assuming you get a few dozen conversions per day. The daily graphs will identify variations that are doing better day-to-day, though usually they are too variable to make sense of. The cumulative graphs reveal larger shifts in performance over time, concealing variability.

Ignore any test results that are completely inconsistent while weighing other factors like p-value, duration, sample size, and effect size. You can act on results that are not statistically significant but highly consistent. For example, you might see a smaller 8% lift over several weeks with a large margin of error. With your site traffic, you know you will never have enough visitors to confirm a lift that small. However, if you see that the lift holds day in and day out, that gives you stronger evidence in favour of that variation than a borderline significant result that is highly variable.

Test for whole weeks for greater confidence

Run tests for at least 1 full week (or retest on different days at different times), even if your traffic is high. You should not run tests for hours nor aim to run several tests per day.

Sample size is important but so is duration. Anything can affect user behavior - day of the week, time of day, holidays, weather, a surge of traffic from an unanticipated source. If you want your data to have greater predictive power, results need to hold over time.

If your testing service charges by the visitor, throttle your traffic to 50% or less to make sure you don't exceed your limit. If that is not an issue, you can run a concurrent test on the remaining traffic by adding mutual exclusion criteria to the two tests, to ensure that a visitor may enter one but not both tests.

If you need to run your test longer, make it 2 weeks, 3 weeks, and so on. Try to start and end your tests at the same time on the same day. If you start at 5pm on Tuesday, end at 5pm on Tuesday. Think in whole weeks.
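To make "think in whole weeks" concrete, here is a small sketch that works out when a test should end, given its start time and a number of weeks. The function name and the example date are illustrative assumptions. It works in the local time zone of whoever runs it.

```javascript
// Returns the moment `weeks` whole weeks after `start`,
// on the same local weekday and at the same local clock time.
function testEndDate(start, weeks) {
  const end = new Date(start.getTime()); // copy, so `start` is not modified
  end.setDate(end.getDate() + 7 * weeks);
  return end;
}

// Example: start on Tuesday 9 January 2024 at 5pm, run for 2 weeks.
const exampleStart = new Date(2024, 0, 9, 17, 0, 0);
const exampleEnd = testEndDate(exampleStart, 2);
console.log(exampleEnd.toString()); // a Tuesday at 5pm local time, two weeks later
```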

Send variation with form for greater accuracy

If you have access to your database, submit the variation ID with your sign-up or payment form data to increase the accuracy of conversion rate tracking by 10%.

Tracking visits to the post-submit goal page is good enough in most cases but not fool-proof. The tracking on the goal page can fail to fire. If that happens, it will look like the visitor came but didn't convert. In contrast, if the variation is being submitted with the form data, you don't have to rely just on the goal page. Do set up goal page tracking, but also add up the submits in the database for each variation so you can compare. It matters particularly if your traffic is on the low end and each conversion matters.

Here's what you need to do:

1. Create a new database field like variation_id

2. Duplicate your control. For example, say you want to test A vs. B, where A is your existing page. Duplicate A into A1 and A2.

3. Set up your test as A2 vs B. Not all visitors to your site are entered into a test. If someone is part of the test, they will see A2. If they are excluded for any reason (e.g., mobile traffic or tracking code times out), they will see A1, your normal page. That way you'll be able to ignore the A1's and count only A2 and B submits in your database.

4. Add a hidden form field to all pages. On A1, leave it blank (since A1 is not part of our test). On A2, set its value to "Test001-A2". On B, set its value to "Test001-B".

5. Add JavaScript to each page to store the variation ID in the field. For VWO, the variation the user was assigned to is stored in the _vis_opt_exp_*_combi cookie.

If you're running an A/B test, insert this value directly for each variation:

						
						$(function() {

						  // Replace "Test001-B" with the ID of the variation this code runs in
	
						  $("#variation_id").val("Test001-B");
								
						});

If you're running a split test, then you'll need to add it to the page directly:

						
						<input type="hidden" id="variation_id" value="Test001-B">

If the page you're testing is not the page that has the form, add code to the page that does contain the form to read the variation ID from the cookie:

						
						$(function() { // Remember to always wait for DOM ready

						  // Replace 111 with your test ID
	
						  $("#variation_id").val(yourGetCookieFunction("_vis_opt_exp_111_combi"));
								
						});
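The snippet above leaves yourGetCookieFunction up to you. One possible version is sketched below. The name readCookie is hypothetical, and the function takes the cookie string as an argument so it can be tried outside the browser. It returns whatever value the cookie holds, without interpreting it.

```javascript
// Hypothetical helper: returns the value of the cookie called `name`
// from a string in document.cookie format, or "" if it is not set.
function readCookie(cookieString, name) {
  const pairs = cookieString.split(";");
  for (let i = 0; i < pairs.length; i++) {
    const pair = pairs[i].trim();
    const eq = pair.indexOf("=");
    if (eq === -1) continue; // not a name=value pair
    if (pair.slice(0, eq) === name) {
      return decodeURIComponent(pair.slice(eq + 1));
    }
  }
  return "";
}

// In the browser you would call it like this (111 stands for your test ID):
// $("#variation_id").val(readCookie(document.cookie, "_vis_opt_exp_111_combi"));
```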

Remember to update the field values and the cookie number when you start a different test.
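Once the variation ID is stored with each submit, adding up the submits per variation is a simple tally. The sketch below assumes the rows have already been fetched from your database. The row shape, the countSubmitsByVariation name, and the sample data are all made up for illustration.

```javascript
// Made-up rows, as they might come back from your sign-up table.
const exampleSubmits = [
  { email: "a@example.com", variation_id: "Test001-A2" },
  { email: "b@example.com", variation_id: "Test001-B" },
  { email: "c@example.com", variation_id: "" },          // A1: not part of the test
  { email: "d@example.com", variation_id: "Test001-B" }
];

// Counts submits per variation, skipping blanks (A1) and IDs from other tests.
function countSubmitsByVariation(rows, testPrefix) {
  const counts = {};
  rows.forEach(function (row) {
    const id = row.variation_id || "";
    if (id.indexOf(testPrefix) !== 0) return;
    counts[id] = (counts[id] || 0) + 1;
  });
  return counts;
}

console.log(countSubmitsByVariation(exampleSubmits, "Test001-"));
```

You can then compare these counts against the conversions your testing tool reports for the goal page.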

Show same variation when visitors return

Each visitor must only ever see one branch of your test. Tools like VWO take care of showing a random variation to each visitor and showing them the same variation each time they return. If you implement an A/B test manually using JavaScript or some other tool, make sure you show a random page on first visit, store that variation ID in a cookie, and show the same variation when the visitor returns.
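If you do implement this by hand, the logic might look roughly like the sketch below. The function name, the cookie name, and the 30-day lifetime are assumptions. The random number source is passed in so the behaviour can be checked with fixed values.

```javascript
// Returns the visitor's stored variation if there is one,
// otherwise picks one at random with equal probability.
function assignVariation(cookieString, cookieName, variations, random) {
  const match = cookieString.split(";")
    .map(function (s) { return s.trim(); })
    .find(function (s) { return s.indexOf(cookieName + "=") === 0; });
  const stored = match ? match.slice(cookieName.length + 1) : "";
  if (variations.indexOf(stored) !== -1) {
    return { variation: stored, isNew: false }; // returning visitor
  }
  const index = Math.floor(random() * variations.length);
  return { variation: variations[index], isNew: true }; // first visit
}

// Browser usage (cookie name and lifetime are assumptions):
// var result = assignVariation(document.cookie, "my_test_variation", ["A", "B"], Math.random);
// if (result.isNew) {
//   document.cookie = "my_test_variation=" + result.variation + "; max-age=2592000; path=/";
// }
// ...then show the page for result.variation.
```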

Run split tests even on URLs you can't change

Sometimes you need to run a split test, but you can't create a new page URL for your variations. For example, you might have a dynamically generated or CMS-based site where a page must resolve to a specific URL like www.mysite.com/checkout.

To run a split test, use the same URL for all variations but add a URL parameter to each variation. For example, your variation B will point to www.mysite.com/checkout?v=b. On your back end, check for the URL parameter and then display the right version of the page. If the changes are minor, consider setting up a dynamic A/B test instead.
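The back-end check can be very small. Here is a minimal sketch, assuming a JavaScript (Node.js) back end and the v parameter from the example above. The function name is hypothetical, and the same check is easy to write in any server language.

```javascript
// Decides which version of the page to render from the request URL.
// Anything other than v=b falls back to the control (A).
function versionFromUrl(requestUrl) {
  const params = new URL(requestUrl).searchParams;
  const v = (params.get("v") || "").toLowerCase();
  return v === "b" ? "B" : "A";
}

console.log(versionFromUrl("https://www.mysite.com/checkout?v=b")); // "B"
console.log(versionFromUrl("https://www.mysite.com/checkout"));     // "A"
```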

[Update: With VWO's recent update, asterisks are no longer needed.] In VWO, you'll need to add asterisks. You'll want to make your Control URL *www.mysite.com/checkout and your variation URL www.mysite.com/checkout?v=b* (see this VWO article for details).

Do 1 test at a time to keep it valid

A visitor must never be subject to more than one test at a time. That's like testing the effectiveness of a drug, while randomly giving patients other drugs. Either test one page on your entire site at a time or take steps to make sure tests do not overlap.

A valid A/B comparison assumes that visitors have an equal chance to see either A or B. In order for A and B to serve as controls for each other, there can be no other changes, no matter how minor. With two A/B tests, suddenly you have 4 possible variations of the page, while each A/B test thinks there's just 2. If you're testing multiple pages in your funnel, test one page at a time to avoid contaminating each test. If you're A/B testing Step 2 of your funnel, this assumes all visitors to A and B come from exactly the same Step 1. You can't have a test running on Step 1.

If you have enough traffic to run multiple tests, set up mutual exclusions so visitors to one test are excluded from any other tests (and then triple-check to be sure). Alternatively, test pages that are not directly linked, such as internal pages and public pages.
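How you configure mutual exclusion depends on your testing tool. For a hand-rolled setup, one way to picture it is to give each visitor a stable ID and derive a single test from it, so nobody can land in two tests. The sketch below is only an illustration of that idea. The visitor ID, the test names, and both function names are hypothetical.

```javascript
// Simple deterministic string hash (a djb2 variant). Not cryptographic.
function hashString(str) {
  let hash = 5381;
  for (let i = 0; i < str.length; i++) {
    hash = ((hash * 33) ^ str.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

// Maps a visitor to exactly one of the running tests.
// The same visitor ID always maps to the same test.
function pickTest(visitorId, testNames) {
  return testNames[hashString(visitorId) % testNames.length];
}

// Example: this visitor is entered into only one of the two tests.
console.log(pickTest("visitor-42", ["homepage-test", "checkout-test"]));
```

Each test would then only accept visitors for whom pickTest returns its own name.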

You can test multiple pages at the same time by testing them as a whole in one test (see Tip 20). So, a visitor would see either all changes on the site or none.

A multi-variate test is a valid way to test multiple factors simultaneously but only if you suspect there is a relationship between them. It's the same thing as testing and tracking each permutation as a separate variation of a SINGLE test. That said, we discourage multi-variate tests, because they inflate the number of variations, typically lead to small effects, and often are not grounded in a good hypothesis (a reason for doing it). All of that increases your risk of finding false positives dramatically and requires much longer duration.
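To see how quickly a multi-variate test inflates the number of variations, the sketch below lists every permutation for a set of factors. The factor names and options are made-up examples, and allCombinations is a hypothetical helper.

```javascript
// Builds every combination of options across all factors.
function allCombinations(factors) {
  return Object.keys(factors).reduce(function (combos, name) {
    const next = [];
    combos.forEach(function (combo) {
      factors[name].forEach(function (option) {
        const copy = Object.assign({}, combo);
        copy[name] = option;
        next.push(copy);
      });
    });
    return next;
  }, [{}]);
}

// Three factors with 2, 2, and 3 options give 2 x 2 x 3 = 12 variations,
// each of which needs enough visitors on its own.
const exampleFactors = {
  headline: ["short", "long"],
  button: ["green", "orange"],
  image: ["photo", "drawing", "none"]
};
console.log(allCombinations(exampleFactors).length); // 12
```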

Change form fields without back end dev

You can test changes to your form without making back end changes. A typical change we make is to merge First and Last Name fields into one Full Name field. We add the new field and simply hide First and Last Name fields. We then split the full name in the background.

Here's a code snippet that does this:


						// When the user enters a full name

						$("#fullname").change(function() {

						    // Split up and populate the First and Last Name fields

						    var name = $(this).val().split(" "); // split the field's value, not the jQuery object

						    $("#firstname").val(name[0]);

						    $("#lastname").val(name[1]);

						});

If you do this, don't forget to consider cases where the Full Name, First, and Last name are pre-populated by the browser:


						// On load

						$(function() {

						  var firstname = $("#firstname");

						  var lastname = $("#lastname");

						  var fullname = $("#fullname");

						  // In case full name is already populated

						  if(fullname.val()) {

						      // Split up and populate the First and Last Name fields

						      var name = fullname.val().split(" ");

						      $("#firstname").val(name[0]);

						      $("#lastname").val(name[1]);

						  // If First and Last Name fields are not blank, populate Full Name
						
						  } else if(firstname.val() && lastname.val()) {

						      fullname.val(firstname.val() + " " + lastname.val());

						  }

						});
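Splitting on a single space and keeping only the first two pieces drops middle names and multi-word last names. A slightly more forgiving split is sketched below. splitFullName is a hypothetical helper, and treating the first word as the first name is a simplifying assumption that will not suit every naming convention.

```javascript
// Splits a full name into first and last parts.
// The first word becomes the first name; everything else becomes the last name.
function splitFullName(fullName) {
  const parts = fullName.trim().split(/\s+/).filter(Boolean);
  return {
    first: parts.length ? parts[0] : "",
    last: parts.slice(1).join(" ")
  };
}

// With jQuery, inside the change handler:
// var name = splitFullName($("#fullname").val());
// $("#firstname").val(name.first);
// $("#lastname").val(name.last);
```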

Using the same technique, you can change fields to other types of controls, you can hide fields, or display information in different ways. We also typically remove the "Confirm Password" field using the same method, adding a small link that lets visitors mask or show the password characters.

Develop in Greasemonkey to speed up dev

If necessary, you can build complex, dynamic A/B tests in VWO (insert code dynamically into the page instead of doing back-end coding). When doing so, use Greasemonkey or another user script tool, then move code into the in-browser editor provided by VWO. VWO's editor has advantages but can be finicky and lacks version control.

Develop the JavaScript and CSS in separate files using your current favourite coding tool. Then install Greasemonkey and set it to load your JavaScript and CSS dynamically when you visit the test page:


						$("head").append("<script src='http://yoursite.com/dev/seating.js'></script>");

						$("head").append($("<link>").attr({"rel" : "stylesheet" , "href" : "http://yoursite.com/dev/seating.css"}));

If you're inserting a lot of HTML using JavaScript, use a tool like this HTML to JS converter to create a JavaScript-friendly string you can insert into the DOM.
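If you don't have a converter handy, a few lines of JavaScript can do a similar job. This is a minimal sketch, not the converter mentioned above: it uses JSON.stringify to escape quotes and line breaks, and htmlToJsString is a hypothetical name.

```javascript
// Turns an HTML snippet into a JavaScript string literal (with its quotes).
// Escaping "</" as "<\/" keeps the literal safe inside an inline <script> tag.
function htmlToJsString(html) {
  return JSON.stringify(html).replace(/<\//g, "<\\/");
}

const exampleHtml = '<div class="note">\n  It\'s <b>ready</b>\n</div>';
console.log("var snippet = " + htmlToJsString(exampleHtml) + ";");
```

You can paste the printed line into your script and then insert it with something like $("#target").append(snippet).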

Once your script and CSS are ready, you can usually copy it straight into VWO's editor.

If you find flickering on page load, then use VWO's "Edit HTML" feature instead of JavaScript for the affected elements, because this tells VWO to hold off showing those elements until they are ready. The disadvantage is the HTML is then locked into VWO's environment instead of residing in one .js file.

Use async JavaScript to build complex tests

Your main goal is not to build perfect code but launch the test and see if your ideas were worthwhile. Sometimes it's faster to overlay new functionality on top of an existing UI, rather than build it at the back end properly. If it pans out, you rebuild it properly. If it fails, you throw it away and test something else.

The pattern is to add asynchronous event listeners and modify the UI when needed. For example, we recently needed to add "available inventory" to a site's dynamic event listings. There was no such functionality. Rather than wait for the dev team to build new backend queries, we set up front-end event listeners to wait until the list is populated. As soon as the list was ready, we opened up an iframe in the background, loaded a different page where the inventory was available, scraped the inventory from the HTML, and then showed a status on the test page like "Sold Out" or "Confirmed". It looked seamless to users and involved no backend development.

The basic pattern we use is:


						// PART 1: Listen for first page load or user action that modifies the list

						waitForList();
	
						$("#buttonRefreshList").on("click", waitForList);
						
						

						// PART 2: Function that waits and tells me when list has loaded

						function waitForList() {

						  var checkPage = setInterval(function() {

						    if($("#some-list").children().length && $(".list-loaded").length == 0) {

						      $("body").trigger("listLoaded");

						      // Stop waiting

						      clearInterval(checkPage);

						    }

						  }, 150);

						  // Add tag that is wiped out when list is refreshed, so I know when it's refreshed

						  $("#some-list").append("<div class='list-loaded hidden'></div>");

						}

						

						// PART 3: Call UI-modifying function whenever the list is modified

						$("body").on("listLoaded", modifyList);

						

						// PART 4: Define the UI-modifying function

						function modifyList() {

						  $("#some-list").find("button").text("Get it now");

						}

Use heat maps to corroborate results

There is other evidence out there besides your primary metric for positive changes in user behavior. Sometimes a change in behavior can show up on a heat map, which tells you where users click or don't click. For one thing, a heat map can tell you what visitors were interacting with that had an effect. It can also tell you how focused or distracted visitors were.

In one test, featured in Data Story #9, we used the heat map to corroborate our statistically strong finding. We tested a new home page with a simple gradual engagement element: a clear question with several buttons to choose from. We also tested a very minimal and a more complicated variation.

Our hypothesis was that gradual engagement would guide visitors better toward signing up. The heat maps for the other variations showed clicks on top menu links and all over the page. In contrast, our winner showed clicks on just the component we were testing, with fewer menu clicks and virtually no distracted clicking elsewhere. This was reassuring. We also saw a pattern in the choices visitors were clicking.
