Statistical significance is probably the most confusing and most misunderstood concept in all of A/B testing, both what it is and why you need it. When you're done reading this, you'll understand both.
Let's assume that you want to know whether the average height of people in the United States is higher than the average height of people in France. How would you determine that? Easy. You just measure the height of everyone in the US and everyone in France to compare the average height.
Or do you?
In a utopian world, you do. But in the real world, it's impossible to measure the height of each and every person in the US and France. Nobody has the resources, or the time, to measure everyone. So instead, you rely on smaller samples where you measure, e.g., 5000 people in the US and 5000 people in France.
In eCommerce, this is the exact same problem you face when running A/B tests. Imagine that you have two variants, A and B. Like in the height example, you can't show each variant to the whole world to see which one has the higher conversion rate or sells more. So you have to rely on sampling, and you collect your two samples by showing variant A to 50% of your traffic and variant B to the other 50%.
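To make the 50/50 split concrete, here is a minimal Python sketch of how traffic could be divided into two samples. It is an illustration, not how any particular A/B testing tool works: the function names `assign_variant`, `run_split`, and `conversion_rate` are made up for this example, and the "true" conversion rates are invented numbers, since in a real test you never get to see them.

```python
import random


def assign_variant(rng):
    """Send a visitor to variant A or B with equal probability."""
    return "A" if rng.random() < 0.5 else "B"


def run_split(n_visitors, true_rates, seed=0):
    """Simulate a 50/50 split and count visitors and conversions per variant.

    true_rates holds hypothetical conversion rates, e.g. {"A": 0.030, "B": 0.035}.
    """
    rng = random.Random(seed)
    stats = {
        "A": {"visitors": 0, "conversions": 0},
        "B": {"visitors": 0, "conversions": 0},
    }
    for _ in range(n_visitors):
        variant = assign_variant(rng)
        stats[variant]["visitors"] += 1
        # Each visitor converts with the (assumed) true rate of their variant.
        if rng.random() < true_rates[variant]:
            stats[variant]["conversions"] += 1
    return stats


def conversion_rate(variant_stats):
    """Observed conversion rate of one sample (0.0 if it has no visitors)."""
    if variant_stats["visitors"] == 0:
        return 0.0
    return variant_stats["conversions"] / variant_stats["visitors"]


# Invented rates, for illustration only.
result = run_split(10_000, {"A": 0.030, "B": 0.035})
for name in ("A", "B"):
    print(name, result[name], round(conversion_rate(result[name]), 4))
```

Each variant ends up with roughly, but not exactly, half of the visitors, and the observed conversion rates are only estimates of the rates you fed in.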
The challenge is that samples come with randomness. To understand what I mean by randomness, imagine that you ask only 20 people on the street. Let's say that the average height of those 20 people is 177 cm (about 5'10").
Now ask another 20 people. Do you think the average height will end up being 177 cm (5'10") again? Maybe. But it's not difficult to imagine that if you ask another 20 random people on the street, their average height could easily be very different from the 177 cm (5'10") of the first 20. Maybe you, by chance, meet 20 tall people, so their average height is 192 cm (6'4"). Or you meet 20 shorter people, so their average height is 169 cm (5'7").
If you repeat this experiment of asking 20 people 10 times, you may end up with widely different average heights, e.g., 177 cm (5'10"), 192 cm (6'4"), 169 cm (5'7"), 170 cm (5'7"), etc.
This is what I mean by randomness. When you rely on samples, not the entire population, you don't get a single average height that you can trust to be representative of the entire population. Instead, you get different results across samples, whether that's a sample of 20 people or 5000 people. You get data with a random factor in it.
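You can see this randomness for yourself with a small simulation. The sketch below assumes, purely for illustration, that heights follow a normal distribution with a mean of 175 cm and a standard deviation of 7 cm. Those numbers are made up and are not real data for any country. The helper names `sample_means` and `spread` are also just for this example.

```python
import random
import statistics


def sample_means(sample_size, n_samples, mean_cm=175.0, sd_cm=7.0, seed=1):
    """Draw n_samples random samples and return the average height of each.

    mean_cm and sd_cm describe an assumed, made-up population.
    """
    rng = random.Random(seed)
    means = []
    for _ in range(n_samples):
        heights = [rng.gauss(mean_cm, sd_cm) for _ in range(sample_size)]
        means.append(statistics.fmean(heights))
    return means


def spread(values):
    """Gap between the largest and the smallest value."""
    return max(values) - min(values)


# Ask 20 people, 10 times over, then do the same with 5000 people.
small = sample_means(sample_size=20, n_samples=10)
large = sample_means(sample_size=5000, n_samples=10)

print("20 people:  ", [round(m, 1) for m in small])
print("5000 people:", [round(m, 1) for m in large])
print("spread:", round(spread(small), 2), "vs", round(spread(large), 2))
```

Every run of 20 people gives a different average. The runs of 5000 people sit much closer together, but they still don't agree exactly, which is the point: a bigger sample shrinks the randomness without removing it.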
This randomness is why you need the concept of statistical significance. Here's why.
Let's go back to our example and ask a sample of 5000 people in the US and 5000 in France about their height. When we're done, we get an average height of 174 cm (about 5'9") in the US sample and 172 cm (about 5'8") in the French sample.
Does this mean that you now know that the US population is taller than the French population?
No, unfortunately not. As you just learned, there's a random aspect to samples. So the US sample might just be taller due to randomness and not because the US population really is taller than the French.
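To see how a gap can appear from randomness alone, here is a small sketch that draws two samples from the very same made-up population and compares their averages. As before, the population (normal, mean 175 cm, standard deviation 7 cm) is an assumption for illustration only, and `mean_difference` and `typical_gap` are hypothetical helper names.

```python
import random
import statistics


def mean_difference(sample_size, rng, mean_cm=175.0, sd_cm=7.0):
    """Difference in average height between two samples of ONE population."""
    first = [rng.gauss(mean_cm, sd_cm) for _ in range(sample_size)]
    second = [rng.gauss(mean_cm, sd_cm) for _ in range(sample_size)]
    return statistics.fmean(first) - statistics.fmean(second)


def typical_gap(sample_size, repeats=50, seed=2):
    """Average absolute gap between two same-population samples."""
    rng = random.Random(seed)
    gaps = [abs(mean_difference(sample_size, rng)) for _ in range(repeats)]
    return statistics.fmean(gaps)


# Both samples come from the same population, so any gap is pure chance.
gap_small = typical_gap(sample_size=20)
gap_large = typical_gap(sample_size=5000)

print("typical gap with 20 people per sample:  ", round(gap_small, 2), "cm")
print("typical gap with 5000 people per sample:", round(gap_large, 2), "cm")
```

Even though there is no real difference between the two groups here, the sample averages never match exactly. How big a gap chance alone can produce depends on the sample size, and that is why a raw difference between two samples can't be taken at face value.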
So you need a measure to tell you when one sample is different enough from another sample to conclude that it's bigger, better, higher or whatever you're comparing. And that's exactly what statistical significance is.
Let's recap. We want to know whether the average height of the US is higher than the average height of France. Remember that this is the exact same problem as comparing two variants in an A/B test.
We've sampled 5000 people from the US and 5000 people from France. The average height of the US sample was 174 cm (about 5'9") and the average height of the French sample was 172 cm (about 5'8").