How to conduct robust Product experiments
Understanding the effect of chance in test results
👋 This post appeared in my weekly newsletter on Product Management.
This is a follow-up to my previous post on AB tests. Since that post, I have received quite a few questions about AB tests, and one question came up again and again.
How many samples do we need to reach Statistical Significance? What’s the intuition?
For Product Managers, this question in many ways is the most important one when testing. Inadequate sample sizes can really throw off the results of a test and lead us to draw the wrong conclusions. So getting sample sizes right is important.
But there is more to statistical significance than sample size: it depends on several other factors as well. We rarely hear about those factors because modern AB testing software largely abstracts them away.
This is both good and bad.
Good because, as product builders, our time is best spent building products that customers love. Why not let the AB testing software do the stats heavy lifting for us?
Bad because, if we don't understand the high-level concepts the AB testing software relies on, we end up thinking that sample size is the only thing that matters. The balance we need lies somewhere in between.
We need to make sure that we are using the software — and not the other way round.
This is going to be a long article covering some statistical concepts that are good to know for PMs and marketers. If you are looking for quick takeaways, here they are:
- Statistical significance depends not just on sample size but also on the minimum effect size you want to detect, the variability in your samples, and the power of your test. All of these are interrelated.
- Statistical significance is relative, not absolute. You need to decide which significance threshold (the p-value cutoff) makes sense for your tests, your product, and your industry.
- Just because you get statistically significant results doesn't mean they are right. Your test can reject the null hypothesis when it should not. This is called a False Positive, and it can happen when there is a lot of variability in your samples. Remember that even at a 95% significance level, a change with no real effect will still look significant about 5% of the time.
- Just because you did NOT get statistically significant results doesn't mean there is no effect. This is called a False Negative: your test fails to reject the null hypothesis when it really should. It happens when you are trying to detect very small changes in the treatment group and your test is underpowered.
- Several sample size calculators will tell you, given the magnitude of effect you want to detect, the sample size you need to reach a given significance level. For instance, take a look at Optimizely's sample size calculator.
- Use these calculators to determine sample size. The caveat is that their pre-set defaults differ from one calculator to the next, so the recommended sample size may differ as well.
Sounds like something you want to learn more about? Feel free to continue reading.
Significance is Relative
Let us first acknowledge that significance is not a destination. It's not some absolute value or even a milestone. What is statistically significant for a marketing campaign may not be significant enough for a drug trial.
If you are a VP of Marketing, 90% significance on A/B test results may be enough to attribute the lift in sales to that website rebrand. Of course, at that threshold a rebrand with no real effect would still show a "significant" lift about 10% of the time, purely because of the randomness of the sample tested. But for a marketer that may be palatable.
But if you are a researcher trying to find out whether a certain drug cures a certain disease, that 10% chance might be too high. You might want a significance level of 99.95%, or else you cannot confidently say that the drug really works!
So you see, statistical significance is relative. It depends on what you are testing and, more importantly, on the implications of that test. Ultimately the only important question is this: how accurate do you want to be when you are establishing causality? How much chance are you comfortable with?
Statistical Significance is not just about Sample Size
Sample size does play a big part in helping a test reach significance, but it's not all there is to it. Hypothesis testing takes additional factors into account, all of which come together to help a test reach significance. Sample size takes most of the limelight, but each of the others is almost as important.
1. Sample Size
Let's start with the obvious. The role of sample size in statistical significance is quite widely known. We know that the larger the sample size, the more likely the test is to return statistically significant results when there is a real effect. This is because with larger sample sizes it's easier to differentiate small causal effects from pure chance.
But you cannot simply select a sample size in isolation without defining the size of the effect you want to detect.
2. Size of the Effect you want to detect
This one is less widely known, but it is just as important.
Say the baseline conversion rate of your landing page averages 20%: about 20 conversions for every 100 visitors per day. You know this because you've measured it across several months or even years.
You want to test whether updated marketing copy increases your conversion rate by at least 5% in relative terms, which is 21 conversions per 100 visitors instead of 20. In other words, you want to test whether 1 more customer converted because of the marketing copy.
As a rule of thumb, the smaller the change you are trying to detect, the harder it is for the hypothesis test to reach significance. This is because it's harder to tell whether one extra customer converted due to the marketing copy or due to random chance.
To detect such a small effect, you'd have to run the test for many days (which basically means you are increasing your sample size).
On the other hand, if you wanted to detect a 50% increase in conversion, i.e. 10 extra customers a day, it would be much easier: an effect of this size is quite unlikely to occur by pure chance, so a smaller sample size would be just fine.
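To make the link between effect size and sample size concrete, here is a minimal sketch using the standard normal-approximation formula for comparing two proportions. The function name `sample_size_per_variant` and its defaults (a 5% two-sided significance level and 80% power) are assumptions chosen for illustration. A real calculator may use different defaults or a different statistical method, so its numbers can differ.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline, relative_lift, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant for a two-sided two-proportion z-test."""
    p1 = baseline                          # conversion rate of the control
    p2 = baseline * (1 + relative_lift)    # conversion rate we hope to detect
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff for the significance level
    z_beta = NormalDist().inv_cdf(power)           # cutoff for the desired power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


small_lift = sample_size_per_variant(0.20, 0.05)  # 20% -> 21%
large_lift = sample_size_per_variant(0.20, 0.50)  # 20% -> 30%
print(f"5% relative lift:  {small_lift} visitors per variant")
print(f"50% relative lift: {large_lift} visitors per variant")
```

Under these assumptions the 5% lift needs tens of thousands of visitors per variant (around 25,000), while the 50% lift needs only a few hundred (around 300).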
3. Variability in Samples
It's important to remember that when we conduct an AB hypothesis test we are working with samples. No two samples are exactly the same; they will always have some variability. As much as possible, we must ensure that the samples we are testing have minimal variability. Too much variability can throw up False Positives, which means your test shows statistical significance where there is no real effect!
A/A testing is an excellent way to test any natural variability in samples. For instance, here’s what a company noticed when they A/A tested an exact copy of the landing page with their testing software. The software returned a 15% lift in conversion and the results were statistically significant! How can that be? It's the same page!
Such results are false positives — but if you are not careful — they might lead you to make mistakes and arrive at the wrong conclusions. False positives are often a result of variability in traffic/samples. The software splitting traffic to A/B versions of the page perhaps created a sampling error. This can happen if your target test segment is too broad with high variability.
The message is this: just because something is statistically significant doesn't mean it's right. An A/A test should almost always be inconclusive (at a 95% significance level, about 1 in 20 will still look significant by chance), which makes it a good way to sanity check that your testing infrastructure is behaving as it should.
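The A/A idea can be sketched as a small simulation: split identical traffic into two groups many times and count how often a standard test calls the difference significant. Everything here is an illustrative assumption: the 20% conversion rate, the 1,000 visitors per group, the pooled two-proportion z-test, and the helper names. It is not a description of how any particular testing tool works.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_proportion_p_value(conv_a, conv_b, n):
    """Two-sided p-value of a pooled two-proportion z-test with equal group sizes."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # nobody (or everybody) converted in both groups: no evidence of a difference
    z = (conv_a - conv_b) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def simulate_aa_tests(rate=0.20, visitors=1000, runs=1000, alpha=0.05, seed=42):
    """Fraction of A/A tests (same true rate in both groups) that look significant."""
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(runs):
        conv_a = sum(rng.random() < rate for _ in range(visitors))
        conv_b = sum(rng.random() < rate for _ in range(visitors))
        if two_proportion_p_value(conv_a, conv_b, visitors) < alpha:
            false_positives += 1
    return false_positives / runs


false_positive_rate = simulate_aa_tests()
print(f"Share of A/A tests flagged as significant: {false_positive_rate:.1%}")
```

With a 5% threshold the share should land near 5%, even though both groups are identical by construction. If your own A/A runs are flagged far more often than your threshold implies, that hints at a problem with traffic splitting or tracking rather than a real effect.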
4. Power of a Statistical Test
Suppose you wanted to know whether the planet Saturn has any moons orbiting it. If you looked at the night sky through your bird-watching binoculars, you would not see any moons.
But if you looked through a research-grade telescope, you would see that Saturn does indeed have moons (more than 80 of them). The moons were always there, but only a telescope with the right power can detect them.
Statistical tests are the same way. They must be suitably powered to detect the changes you want to detect. The concept of Power of Statistical Test is so important, and yet so easily misunderstood and misused.
The power of a statistical test is the probability of the test detecting an effect IF there is truly an effect. In other words, it's the probability that the test correctly rejects the null hypothesis when the null hypothesis is false.
If you concluded that Saturn had no moons simply based on what you observed (or didn't observe) through your bird-watching binoculars, you made an error. More specifically, a Type II error in statistics. You got a False Negative: you did not reject the null hypothesis when in fact it should have been rejected. Your test was underpowered.
Typically one should ensure that hypothesis tests have at least 80% power, which means that when a real effect exists, the test correctly rejects the null hypothesis at least 80% of the time.
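In symbols, with H_0 the null hypothesis (no effect) and H_1 the alternative (a real effect of the size you care about), the two error rates and power are usually written as below. This is the conventional textbook notation, not something testing tools necessarily expose under these names.

```latex
\begin{aligned}
\alpha &= P(\text{reject } H_0 \mid H_0 \text{ is true}) && \text{(false positive rate, Type I error)} \\
\beta &= P(\text{fail to reject } H_0 \mid H_1 \text{ is true}) && \text{(false negative rate, Type II error)} \\
\text{power} &= 1 - \beta = P(\text{reject } H_0 \mid H_1 \text{ is true})
\end{aligned}
```

A test with 80% power therefore has a beta of 0.20: it misses a real effect of the target size about one time in five.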
The good news is that with most testing software and sample size calculators you do not need to work these details out by hand. The software will tell you what sample size you need to detect a difference of the size you specify. Most of these software packages and calculators have a default power pre-defined.
For instance, Optimizely's Stats Engine defaults to a power of 1.
The takeaway: you need a suitably powered test to confidently detect the presence or absence of an effect.
The above Concepts in Action
Simple question: what's the probability of getting heads in a coin toss? Most of us would say it's 50%. Sure, it's 50%, but the share of heads you actually observe only settles at 50% if you flip the coin many, many times.
The graph above shows that if you keep flipping a coin many, many times, the observed share of heads (or tails) ultimately approaches 50%.
Now suppose we have two coins and one of them is biased (meaning the probabilities of heads and tails are not equal). How would we tell which one is biased?
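Here is a minimal simulation of this convergence, assuming Python's pseudo-random generator is a good enough stand-in for a fair coin. The function name, the seed, and the checkpoint values are illustrative choices.

```python
import random


def running_heads_share(flips, checkpoints, seed=7):
    """Return {number of flips: share of heads so far} at each checkpoint for a simulated fair coin."""
    rng = random.Random(seed)
    heads = 0
    shares = {}
    for i in range(1, flips + 1):
        heads += rng.random() < 0.5  # True counts as 1, False as 0
        if i in checkpoints:
            shares[i] = heads / i
    return shares


shares = running_heads_share(100_000, {10, 100, 1_000, 10_000, 100_000})
for n, share in shares.items():
    print(f"after {n:>7,} flips: {share:.3f} heads")
```

The early shares can sit well away from 0.5, while the later ones stay close to it. That gap between small and large samples is what the rest of this section builds on.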
The graph below shows what a fair coin toss looks like after 100 trials.
You can see that even a perfectly fair coin doesn't always return exactly 50 heads. It is also quite likely to return 51, 52, or even 57 heads.
This means that if you flip a fair coin and a biased coin (say, biased to give heads 60% of the time), it will be hard to tell which is which in 100 trials. One of the coins is clearly biased, but the bias is small enough that a test can easily miss it if you flip both coins only 100 times. Trying to detect the bias in 100 trials is akin to trying to detect Saturn's moons with bird-watching binoculars. A test with 100 trials is underpowered to detect a coin biased 60% towards heads.
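The overlap between the two coins over 100 flips can be computed exactly from the binomial distribution. This is a small sketch; `binomial_pmf` and the specific head counts checked are illustrative choices.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips when each flip lands heads with probability p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


n = 100
fair_exactly_50 = binomial_pmf(50, n, 0.5)
fair_57_or_more = sum(binomial_pmf(k, n, 0.5) for k in range(57, n + 1))
biased_55_or_fewer = sum(binomial_pmf(k, n, 0.6) for k in range(0, 56))

print(f"fair coin, exactly 50 heads:     {fair_exactly_50:.3f}")
print(f"fair coin, 57 or more heads:     {fair_57_or_more:.3f}")
print(f"60% coin, 55 or fewer heads:     {biased_55_or_fewer:.3f}")
```

A fair coin lands on exactly 50 heads only about 8% of the time and shows 57 or more heads roughly 10% of the time. The 60% coin, meanwhile, has a real chance of producing a count in the mid-50s that looks perfectly fair. With only 100 flips, the two coins' plausible results overlap considerably.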
Now say we increase the sample size to 1000 trials.
It's highly unlikely that a fair coin will return 600 heads in 1000 tosses. If you flip a coin 1000 times and observe 600 heads, it's very likely a biased coin. Your test won't miss it: the power of the test increased because you increased the sample size.
But what if you cannot increase the sample size? After all, there are direct and indirect costs to increasing sample sizes. What if all we had was 100 trials?
The only other way is to increase the Minimum Detectable Effect. What if we could bias the coin some more so that it gives heads 90% of the time?
Now our 100 trial test is comparing a fair coin against a VERY biased coin. As you can see from the graph, it's highly unlikely that a fair coin will return heads in 90 out of 100 flips. If you observe 90 heads in 100 coin tosses, something is surely wrong with that coin.
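The three coin scenarios can be compared by computing the power of a simple test in each case. The sketch below assumes a one-sided exact binomial test at a 5% significance level: the coin is flagged as biased when it shows at least as many heads as a fair coin would reach only 5% of the time or less. The function names and the choice of a one-sided test are assumptions for illustration. A two-sided test would give somewhat lower power.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k heads in n flips with P(heads) = p."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(c, n, p):
    """P(heads >= c) in n flips."""
    return sum(binomial_pmf(k, n, p) for k in range(c, n + 1))


def critical_heads(n, alpha=0.05):
    """Smallest head count c such that a fair coin shows c or more heads with probability <= alpha."""
    tail = 0.0
    for c in range(n, -1, -1):
        tail += binomial_pmf(c, n, 0.5)  # tail now equals P(heads >= c) for a fair coin
        if tail > alpha:
            return c + 1
    return 0


def power(n, biased_p, alpha=0.05):
    """Chance that the test flags a coin with P(heads) = biased_p as unfair."""
    return upper_tail(critical_heads(n, alpha), n, biased_p)


for flips, bias in [(100, 0.6), (1000, 0.6), (100, 0.9)]:
    print(f"{flips:>4} flips, {bias:.0%} heads coin: power = {power(flips, bias):.3f}")
```

Under these assumptions, the 100-flip test against the 60% coin falls well short of the usual 80% power target. Both raising the sample size to 1000 flips and raising the effect size to a 90% coin push power close to 1.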
Of course, all of this can be proven through complicated math, but it's unnecessary.
As long as we product managers and builders understand these concepts and ensure that our hypothesis tests are adequately powered, our statistical significance expectations are reasonable, and our sample sizes are right, we will conduct solid experiments and can expect definitive results.