The Fatal Assumption I Made Assessing Multiple A/B Tests

Freddie Lancia
Permutable Analytics
4 min read · Nov 1, 2021


A/B tests are awesome: they are one of the most effective methods for supporting decision-making without confusing correlation with causation. Even better, it's totally reasonable to run multiple A/B tests at once. In the best case, running multiple A/B tests simultaneously requires only the same amount of data (and time) as a single test would. In the worst case (and the case I want to put on your radar), they will take as much data as running each test in isolation would.


For simplicity, let’s consider running exactly two A/B tests at once. Say, for instance, that you want to A/B test which of two possible color palettes encourages users to read more articles (conversions) in your app. At the same time, you want to compare adding a section for articles about food against one for articles about shopping. You therefore create four unique versions of the app and randomly assign each new user to one of them: ‘yellow and food’, ‘blue and food’, ‘yellow and shopping’, and ‘blue and shopping’.

You then want to show, with 95% confidence, which color palette and which new article category causes the greatest number of articles to be read. You’ve waited for 2,000 new app downloads and you’re finally ready to assess. First, looking at color palette, you compare the 1,000 users given one of the two ‘yellow’ versions to the 1,000 given a ‘blue’ version, regardless of which new article category they were offered. After plugging the results into the appropriate statistical significance calculator (perhaps a t-test for parametric data or a Mann-Whitney U-test for non-parametric data), you find that the blue versions of the app encouraged more conversions with more than 95% confidence. Similarly, you find with great confidence that the ‘food’ versions performed better. Just as you confidently claim that ‘blue and food’ must be the best version of the app, you notice something (extremely) worrisome: the individual app version ‘blue and food’ doesn’t even have the highest mean article reads of the four versions. In fact, both ‘yellow and food’ and ‘blue and shopping’ produce, on average, more article reads.

Sample Interfering A/B Test Results
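To make the trap concrete, here is a minimal sketch in Python of the marginal analysis described above. The per-version article-read counts are made up for illustration (drawn from Poisson distributions whose means I chose so that ‘yellow and shopping’ underperforms), not real data:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(0)

# Hypothetical per-user article reads for each of the four versions (500 users each).
# The 'yellow and shopping' version is assumed to interact badly, dragging down
# both the 'yellow' and 'shopping' marginal groups.
reads = {
    ("yellow", "food"):     rng.poisson(5.5, 500),
    ("blue",   "food"):     rng.poisson(5.0, 500),
    ("yellow", "shopping"): rng.poisson(3.0, 500),
    ("blue",   "shopping"): rng.poisson(5.2, 500),
}

# Marginal comparison 1: all 'yellow' users vs all 'blue' users (1,000 each).
yellow = np.concatenate([v for (c, _), v in reads.items() if c == "yellow"])
blue   = np.concatenate([v for (c, _), v in reads.items() if c == "blue"])
print("yellow vs blue:", mannwhitneyu(yellow, blue))

# Marginal comparison 2: all 'food' users vs all 'shopping' users.
food     = np.concatenate([v for (_, s), v in reads.items() if s == "food"])
shopping = np.concatenate([v for (_, s), v in reads.items() if s == "shopping"])
print("food vs shopping:", mannwhitneyu(food, shopping))

# The trap: 'blue and food' wins both marginal tests, yet other versions
# still have higher per-version means.
for version, v in reads.items():
    print(version, round(v.mean(), 2))
```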

But how could this have happened? The short answer: some features work better or worse together. You may have been unaware that studies have shown yellow makes stomachs grumble and blue makes people shop till they drop. In this case, the ‘yellow and shopping’ version may have performed so poorly compared with the other versions that the ‘yellow’ versions were statistically worse than the ‘blue’ versions and the ‘shopping’ versions worse than the ‘food’ versions: an example of interfering tests. Because these tests interfere, each app version must be compared against the other three using an ANOVA or a Kruskal-Wallis test, which will require more data or more extreme results to reach significance.
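A minimal sketch of that four-way comparison, using the same hypothetical numbers as the snippet above. scipy’s kruskal and f_oneway are the Kruskal-Wallis and one-way ANOVA implementations; the Bonferroni-corrected pairwise follow-ups are my addition for illustration, not something the story above prescribes:

```python
import numpy as np
from itertools import combinations
from scipy.stats import kruskal, f_oneway, mannwhitneyu

rng = np.random.default_rng(0)

# Same hypothetical four versions as the previous snippet (illustrative numbers).
reads = {
    ("yellow", "food"):     rng.poisson(5.5, 500),
    ("blue",   "food"):     rng.poisson(5.0, 500),
    ("yellow", "shopping"): rng.poisson(3.0, 500),
    ("blue",   "shopping"): rng.poisson(5.2, 500),
}

# Omnibus test across all four versions at once.
print("Kruskal-Wallis:", kruskal(*reads.values()))
print("one-way ANOVA: ", f_oneway(*reads.values()))

# A significant omnibus result only says at least one version differs;
# pairwise follow-ups (with a multiple-comparison correction) say which.
alpha = 0.05 / 6  # Bonferroni correction for the 6 pairwise comparisons
for a, b in combinations(reads, 2):
    p = mannwhitneyu(reads[a], reads[b]).pvalue
    print(a, "vs", b, "significant:", p < alpha)
```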

You may, on the other hand, be testing two features that do not affect one another (non-interfering). That is, individual features do not work better or worse depending on the other features they are paired with. A possible example might be simultaneously testing whether or not to allow comments on articles and whether to send push notifications; these two decisions should, in theory, not affect each other. In that case, the best way to test for significant results is to do exactly what failed in the example above. With only 2,000 users to sample from, four (overlapping) groups of 1,000 users can be extracted to draw conclusions from, and both design questions are answered with less data, as in the sketch below.
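Here is a minimal sketch of that non-interfering case, again with invented numbers: the effects of comments and push notifications are assumed to be purely additive (no interaction), so each marginal comparison can safely reuse all 2,000 users:

```python
import numpy as np
from scipy.stats import mannwhitneyu

rng = np.random.default_rng(1)

# Hypothetical non-interfering setup: comments add about 0.5 reads per user,
# push notifications add about 0.8, and the effects simply add up (no interaction).
def simulate(comments, push, n=500, base=4.0):
    return rng.poisson(base + 0.5 * comments + 0.8 * push, n)

cells = {(c, p): simulate(c, p) for c in (0, 1) for p in (0, 1)}

# Each marginal test reuses all 2,000 users, split 1,000 vs 1,000.
comments_on  = np.concatenate([v for (c, _), v in cells.items() if c == 1])
comments_off = np.concatenate([v for (c, _), v in cells.items() if c == 0])
push_on      = np.concatenate([v for (_, p), v in cells.items() if p == 1])
push_off     = np.concatenate([v for (_, p), v in cells.items() if p == 0])

print("comments on vs off:", mannwhitneyu(comments_on, comments_off))
print("push on vs off:    ", mannwhitneyu(push_on, push_off))
```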

How can you know for certain whether two tests are interfering? It’s a fuzzy line that should be walked with caution. Once you have theorized, or gathered evidence, that your tests are non-interfering, a decent double check is to make sure that the statistically significant combination of features also has the highest mean, although this check is not itself a test of statistical significance.
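That double check is easy to automate. A tiny sketch, with hypothetical per-version means chosen to match the story above:

```python
# Per-version mean article reads (hypothetical numbers matching the story:
# the marginal winner 'blue and food' does NOT have the highest mean).
version_means = {
    ("yellow", "food"):     5.5,
    ("blue",   "food"):     5.0,
    ("yellow", "shopping"): 3.0,
    ("blue",   "shopping"): 5.2,
}

marginal_winner = ("blue", "food")  # the combination that won both marginal tests
best_by_mean = max(version_means, key=version_means.get)

if best_by_mean != marginal_winner:
    print("Possible interference:", best_by_mean,
          "has the highest mean, not", marginal_winner)
```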

If you have strong reason to suspect that certain features pair well together, then consider testing them as a single A/B test. In the above case, for instance, this would look like A/B testing a new yellow food section against a blue shopping section.

As long as you are well aware of this nuance, it is never detrimental to run multiple A/B tests simultaneously, unless you need results for one test faster than the other. That is, running two interfering tests simultaneously will produce statistically significant results in roughly twice the time a single test would take on its own, which is the same total time as running the two tests one after the other.

You are now ready to start running multiple A/B tests at once, whether they interfere or not! If you have unanswered analytics questions or could use some help with your project, please don’t hesitate to reach out! My agency, Permutable, specializes in helping you get the most out of your analytics setup, interpret your data, and make decisions that lead to growth.
