The art of A/B testing

A walk through the beautiful math of statistical significance

Sylvain Truong
Towards Data Science
10 min read · Oct 3, 2018


A/B testing, a.k.a. split testing, is an experimental technique for determining whether a new design brings an improvement, according to a chosen metric.

In web analytics, the idea is to challenge an existing version of a website (A) with a new one (B), by randomly splitting traffic and comparing metrics on each of the splits.

Discussing interface layouts

For the sake of example, let us look at beauty company Sephora South-East Asia (Sephora SEA, where I work), and in particular, its e-commerce platform.

The use case

Let us assume Sephora SEA is considering a landing page rearrangement for its Gold members.

Landing page pane arrangement: version A (current) and version B (alternative to be tested)

The metrics that matter to the company are:

  • Average time spent on the landing page per session
  • Conversion rate (CR), defined as the proportion of sessions that end with a transaction

An A/B test can be used to challenge the current arrangement.

Note that the traffic split does not have to be 50–50: you can allocate more traffic to version A if you are concerned about losses due to version B.

However, keep in mind that a very skewed split often means waiting longer before the A/B test reaches statistical significance.

A/B testing as the ultimate test

The thing with developing new features for a website is that you hardly ever know beforehand how they will perform: they may hurt your business or make your profits skyrocket.

The knowledge of a User Experience (UX) designer is crucial in singling out feature suggestions that are likely to work. Such suggestions often follow UX best practices or design patterns that have proved successful in similar contexts.

However, no prior assumption can beat the real live test that is the A/B test.

The A/B test measures performance live, with real clients. Provided it is well executed, with no bias when sampling populations A and B, it gives you the best estimate of what would happen if you were to deploy version B.

Need for statistical formalism

After some market study, Sephora SEA decides it would be interesting to live-test version B, with the following traffic split:

Let us assume that after 7 days of A/B testing, the tracking metrics of the experiment are

Just from looking at these outcomes, some questions arise:

  • Since version B exhibits a higher CR, does that mean version B brings an improvement? Similarly, can we conclude anything about the influence on the average time spent?
  • If so, with what level of confidence?
  • Could the higher CR and higher average time spent observed for version B have happened by chance?

Before jumping to conclusions, what you need to keep in mind is that

The raw results we have are only samples of bigger populations. Their statistical properties vary around those of the populations they come from.

Therefore, statistically modelling these outputs is necessary, and so is introducing the concept of statistical significance in hypothesis testing.

A primer on Hypothesis Testing

For an in-depth explanation of hypothesis testing, I would recommend this great post from William Koehrsen.

As for people looking for a quick statistics refresher, you may want to look at this article from Cassie Kozyrkov.

Here, I will go through the main steps of hypothesis testing: a tool to compare the distributions of 2 populations, based on samples from them.

What can be compared are either parameters of their distributions (e.g. the mean time spent) or the distributions themselves (e.g. the binary distribution behind the conversion rate).

The process starts with stating a null hypothesis H₀ about the populations. In general, it is the equality hypothesis: e.g. “the two populations have the same mean”.

The alternative hypothesis H₁ negates the null hypothesis: e.g. “the mean in the second population is higher than in the first”.

The test can be summarised in two steps:

  • 1. Model H₀ as a distribution on a single real-valued random variable (called the test statistic)
  • 2. Assess how likely it is that the samples, or more extreme ones, could have been generated under H₀. This probability is the famous p-value. The lower it is, the more confident we can be in rejecting H₀.

In the aforementioned post, William Koehrsen tests whether the mean hours of sleep at a university is lower than the national average, thus comparing an estimated value to a theoretical one. He uses a Z-test to do so.

Here, I suggest extending that framework to A/B testing in the case of Sephora SEA.

In particular, I will show:

  • how the Z-test can be applied to testing whether the clients experiencing B spend more time on average
  • how the χ² test can be used to decide whether or not version B leads to a higher conversion rate
  • how the Z-test can be adapted to test the conversion rate of version B, and whether it yields the same conclusion as the χ² test

Let’s get testing!

1 | Z-test for average time spent

The hypotheses to test are:

  • H₀: “the average time spent is the same for the two versions”
  • H₁: “the average time spent is higher for version B”

The first step is to model H₀

The Z-test uses the Central Limit Theorem (CLT) to do so.

Illustration of the CLT (from Wikipedia)

The CLT establishes that:
given a random variable (rv) X with expectation μ and finite variance σ², and X₁, …, Xₙ ∼ X, n independent and identically distributed (iid) rvs, the following approximation can be made about their average (also an rv):
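X̄ₙ = (X₁ + … + Xₙ)/n approximately follows N(μ, σ²/n), the approximation improving as n grows.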

In our context, we model the time spent for each client session i as a realisation:

  • aᵢ of rv Aᵢ ∼ A, if the client session belongs to the version A split
  • bᵢ of rv Bᵢ ∼ B, otherwise

We use the approximation provided by the CLT to derive that
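Ā approximately follows N(μ(A), σ²(A)/n_A) and B̄ approximately follows N(μ(B), σ²(B)/n_B), where Ā and B̄ are the average times spent over the n_A sessions of split A and the n_B sessions of split B, and μ(·) and σ²(·) denote the true expectations and variances.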

Hence the approximation for the rv of the difference (see Wikipedia: sum of normally distributed random variables):
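B̄ − Ā approximately follows N(μ(B) − μ(A), σ²(A)/n_A + σ²(B)/n_B).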

Under H₀, we have equality of the true means, and therefore the model:
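B̄ − Ā approximately follows N(0, σ²(A)/n_A + σ²(B)/n_B).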

Curves for N(0,1), the standard normal distribution: probability density function (pdf) and associated p-values

The second step is to see how likely our samples are under H₀

Note that the true expectations and variances of A and B are unknown. We introduce their respective empirical estimators:
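μ̂(A) = ā = (1/n_A)·Σᵢ aᵢ and σ̂²(A) = (1/(n_A − 1))·Σᵢ (aᵢ − ā)², and likewise μ̂(B) = b̄ and σ̂²(B) over the n_B sessions of version B.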

Our samples generate the following test statistic Z, which needs to be tested against the standard normal distribution:
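Z = (b̄ − ā) / √(σ̂²(A)/n_A + σ̂²(B)/n_B)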

Conceptually, Z represents the number of standard deviations the observed difference of means lies away from 0. The higher this number, the less likely it is that such a difference would arise under H₀.

Also notice that, when the estimated expectations actually differ, Z grows as the number of samples grows.

From the formula of Z, you can also get the intuition that the smaller the difference to prove is, the more samples you need.

In Python, the p-value calculation looks like the following.
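Here is a minimal sketch, assuming the per-session times of each split are stored in two NumPy arrays, time_spent_a and time_spent_b. The arrays below are filled with placeholder data, not the experiment’s actual observations.

import numpy as np
from scipy.stats import norm

# Placeholder per-session times in seconds (stand-ins for the real tracking data)
rng = np.random.default_rng(0)
time_spent_a = rng.exponential(scale=60.0, size=5000)
time_spent_b = rng.exponential(scale=63.0, size=5000)

n_a, n_b = len(time_spent_a), len(time_spent_b)
mean_a, mean_b = time_spent_a.mean(), time_spent_b.mean()
var_a, var_b = time_spent_a.var(ddof=1), time_spent_b.var(ddof=1)  # unbiased sample variances

# Z-score: the observed difference of means, in units of its estimated standard deviation
z = (mean_b - mean_a) / np.sqrt(var_a / n_a + var_b / n_b)

# Right-tailed p-value under H0 (standard normal distribution)
p_value = norm.sf(z)  # equivalent to 1 - norm.cdf(z)
print(f"Z = {z:.3f}, p-value = {p_value:.4f}")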

There is a p-value chance that a result as extreme as the one we observed could have happened under H₀. With a common go-to α criterion of 5%, we have p-value < α and H₀ can be rejected with confidence.

In cases where the sample size is not as big (fewer than 30 per version) and the CLT approximation does not hold, one may take a look at Student’s t-test.

2 | χ² test for conversion rate

The hypotheses to test are:

  • H₀: “the conversion rate is the same for the two versions”
  • H₁: “the conversion rate is higher for version B”

Unlike the previous case, the outcome for each client session is not continuous but binary: either “not converted” or “converted”.

The summary of the observed outcomes is the following

The χ² test can compare distributions of multinomial outcomes, but we will stick to the binary case in this example.

As before, we will tackle the problem in two steps:

The first step is to model H₀

Under H₀, conversions in version A and version B follow the same binomial distribution B(1, p). We pool the observations from both versions and derive the estimator of the CR:
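p̂ = (x_A + x_B) / (n_A + n_B), where x_A and x_B are the numbers of converted sessions and n_A and n_B the total numbers of sessions in each split,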

and get p̂ = 0.0170

Thus, under H₀, the theoretical outcome table is
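In general terms, it contains n_A·p̂ and n_B·p̂ expected conversions, and n_A·(1 − p̂) and n_B·(1 − p̂) expected non-conversions, for versions A and B respectively.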

Let us look at the rv D, defined by
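D = Σ (observed count − theoretical count)² / theoretical count, where the sum runs over the four cells of the outcome table (converted / not converted, for version A and version B).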

D represents a squared relative distance between the theoretical and the observed distributions.

According to Pearson’s theorem, under H₀, D follows a χ² probability law with 1 degree of freedom (df).

Curves for the χ² law (df = 1)

The second step is to see how likely our samples are under H₀

It consists of computing the observed D and deriving its corresponding p-value according to the χ² law.

This is how the p-value can be computed in Python.
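A minimal sketch, assuming we only have the aggregated counts of each split: n_a and n_b sessions, with x_a and x_b conversions. The numbers below are hypothetical, not the ones from the article’s table.

import numpy as np
from scipy.stats import chi2

# Hypothetical counts per split (not the article's actual figures)
n_a, x_a = 12_000, 198   # sessions and conversions for version A
n_b, x_b = 8_000, 160    # sessions and conversions for version B

# Pooled conversion-rate estimator under H0
p_hat = (x_a + x_b) / (n_a + n_b)

# Observed and theoretical (expected under H0) outcome tables:
# rows = versions A and B, columns = [converted, not converted]
observed = np.array([[x_a, n_a - x_a],
                     [x_b, n_b - x_b]], dtype=float)
expected = np.array([[n_a * p_hat, n_a * (1 - p_hat)],
                     [n_b * p_hat, n_b * (1 - p_hat)]])

# Pearson's statistic D and its right-tailed p-value under the chi2 law with 1 df
d = ((observed - expected) ** 2 / expected).sum()
p_value = chi2.sf(d, 1)
print(f"D = {d:.3f}, p-value = {p_value:.4f}")

The same statistic and p-value can also be obtained directly with scipy.stats.chi2_contingency(observed, correction=False).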

There is a p-value chance that a result at least as distant from the theoretical distribution as our observation would have happened under H₀. With a common go-to α criterion of 5%, we have p-value > α and H₀ cannot be rejected.

3 | Z-test for conversion rate

The Z-test can be adapted to the conversion rate by modelling conversion as an rv whose realisations are in {0, 1}:

  • 1 for a conversion
  • 0 else

We keep the same notations as before and model conversion for each client session i as a realisation:

  • aᵢ ∈ {0,1} of rv Aᵢ ∼ A, if the client session belongs to the version A split
  • bᵢ ∈ {0,1} of rv Bᵢ ∼ B, otherwise

The first step is to model H₀

Under H₀, μ(A) = μ(B) and we have
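B̄ − Ā approximately follows N(0, σ²(A)/n_A + σ²(B)/n_B), exactly as in the first Z-test.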

The corresponding test statistic is:
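Z = (b̄ − ā) / √(σ̂²(A)/n_A + σ̂²(B)/n_B), where ā and b̄ are now the observed conversion rates of each split.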

This time, with binary rvs, it can be shown that the estimators for the standard deviations are functions of the expectations:
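σ̂²(A) = ā·(1 − ā) and σ̂²(B) = b̄·(1 − b̄), since a binary rv with expectation μ has variance μ·(1 − μ).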

The second step is to see how likely our samples are under H₀

To this end, we compute the Z-score and the corresponding right-tailed p-value.
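A minimal sketch, reusing the same hypothetical counts as in the χ² example above.

import numpy as np
from scipy.stats import norm

# Same hypothetical counts as in the chi2 sketch (not the article's actual figures)
n_a, x_a = 12_000, 198
n_b, x_b = 8_000, 160

# Observed conversion rates, i.e. the sample means of the binary outcomes
p_a, p_b = x_a / n_a, x_b / n_b

# For a binary rv, the variance estimate is a function of the mean: p * (1 - p)
se = np.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
z = (p_b - p_a) / se

# Right-tailed p-value under the standard normal distribution
p_value = norm.sf(z)
print(f"Z = {z:.3f}, p-value = {p_value:.4f}")

With these hypothetical counts, the right-tailed Z-test lands just below the α = 0.05 threshold while the χ² test above stays just over it, which mirrors the situation described next.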

With this modelling, the p-value output is slightly lower than with the χ² test. With the same α=0.05 criterion, we would have rejected the null hypothesis (!!!).

This difference is mainly explained by the sidedness of the tests: the χ² test, as used here, is two-sided (it detects any difference between the two distributions), whereas our Z-test is right-tailed and only looks for an increase, so the same observation is attributed a lower p-value. The variance estimates also differ slightly (pooled under H₀ for the χ² test, unpooled for the Z-test).

Always question your tests

and never make assumptions. A/B testing is indeed a great way to alleviate human bias when deciding on the relevance of new features. However, do not forget that A/B testing still relies on a model of the truth: as we have seen, there are different possible models.

With large samples, these models tend to converge to similar conclusions; in particular, the CLT approximation holds better than with small sample sizes.

With small samples, one may explore Student’s t-test, Welch’s t-test and Fisher’s exact test. You may also explore the realm of Reinforcement Learning in order to maximise gains while testing (multi-armed bandits and the exploration vs. exploitation dilemma).

Not only should you be strict in your interpretations of results but also be aware of contextual effects of your A/B test:

  • time of the year/month/week, the weather, the economic context can affect the nature of your audience
  • even if after two days of A/B testing your results are significant, they may not be over the course of a week

Main take-home messages

  • Hypothesis testing is about modelling a null hypothesis H₀ and assessing how likely your A/B-test samples (or more extreme ones) are under it
  • The key is in the H₀ model: as we have seen, it can be derived from the CLT (Z-test) or from Pearson’s theorem (χ² test)

This concludes our statistics tour: I hope you enjoyed the ride as much as I enjoyed writing it! Comments, suggestions, corrections are much appreciated.

Credits:
Photos by
rawpixel and Louis Reed on Unsplash
