Udacity A/B Testing-Lesson 3: Choosing and Characterizing Metrics

Variability: Analytical vs. Empirical

regularizer
Aug 21, 2019

Use A/A tests to

  • Compute variance and confidence interval based on the assumption of the distribution (usually normal distribution)
  • Directly compute the confidence interval without any assumption of the distribution
  • Compare empirical results to analytical results (sanity check)

For example, suppose there are 20 A/A experiments with 50 users per group, and each experiment yields one click-through probability (CTP) per group. The following table shows the 20 experiments (20 rows). Take the first row: based on the clicks and pageviews of the 50 users in each of Group 1 and Group 2, the CTPs are 0.1 and 0.04, so the difference is diff = 0.1 - 0.04 = 0.06.

20 experiments (20 rows)

The pooled standard error (SE) of each experiment is computed from the analytical formula. The formula assumes that the clicks follow a binomial distribution, i.e. each pageview is an independent trial with the same click probability.

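SE_pooled = sqrt( p_pooled * (1 - p_pooled) * (1/Ncont + 1/Nexp) ), where p_pooled = (Xcont + Xexp) / (Ncont + Nexp) and Xcont, Xexp are the click counts in the two groups.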
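As a minimal sketch of the analytical pooled SE for the first row of the example: the function name `pooled_se` is hypothetical, and the click counts (5 and 2 out of 50) are an assumption chosen to match the CTPs of 0.1 and 0.04.

```python
import math

def pooled_se(clicks_cont, n_cont, clicks_exp, n_exp):
    """Analytical pooled standard error of the difference of two proportions."""
    # Pooled click probability across both groups.
    p_pooled = (clicks_cont + clicks_exp) / (n_cont + n_exp)
    # Binomial-based standard error of the difference.
    return math.sqrt(p_pooled * (1 - p_pooled) * (1 / n_cont + 1 / n_exp))

# Assumed counts: 5/50 = 0.1 and 2/50 = 0.04 (not taken from the original table).
se = pooled_se(clicks_cont=5, n_cont=50, clicks_exp=2, n_exp=50)
print("pooled SE:", round(se, 4))

# Larger groups give a smaller SE for the same click rates.
se_large = pooled_se(clicks_cont=50, n_cont=500, clicks_exp=20, n_exp=500)
print("pooled SE with 10x the users:", round(se_large, 4))
```

The second call only illustrates that the SE shrinks as the group sizes grow.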

The empirical standard deviation, 0.059, is computed from the 20 diff values. Note that these 20 experiments yield only one empirical variance of the diff, but 20 analytical pooled SEs (one per experiment). The same holds for the confidence interval. If the diff is assumed to be normally distributed, the confidence interval can be computed from the SE and a z-score. The histograms do not look normal, but that may be due to the small sample size (only 20 or 10 data points per histogram). Note also that the pooled SE of each experiment depends on Ncont and Nexp, not on the number of experiments: as N increases, the SE decreases.

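margin of error = empirical_SD * z
Here the empirical SD of the 20 diffs is used directly as the standard error of the diff, because it already measures how much the diff varies from one experiment to the next. It should not be divided by the number of experiments. With empirical_SD = 0.059, z = 1.645 (90% confidence) gives a margin of about 0.097, while z = 1.96 (95% confidence) would give about 0.116.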
The distribution of 20, 20 and 10 experiments with 0.5%, 1.0% and 5% traffic per group.
Compare the empirical SD with the analytical SE. Under the normal assumption, the confidence interval is computed with the same formula in both cases.
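The comparison between the empirical SD and the analytical SE can be sketched with simulated A/A experiments. Everything here is an assumption for illustration: the helper `simulate_aa_diffs`, the true click probability of 0.07, and the random seed. The numbers it prints are not the ones from the course.

```python
import math
import random
import statistics

def simulate_aa_diffs(n_experiments, n_per_group, p_click, seed=0):
    """Simulate A/A experiments and return the CTP difference of each one."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_experiments):
        # Both groups share the same click probability (that is what A/A means).
        clicks_a = sum(rng.random() < p_click for _ in range(n_per_group))
        clicks_b = sum(rng.random() < p_click for _ in range(n_per_group))
        diffs.append(clicks_a / n_per_group - clicks_b / n_per_group)
    return diffs

P_CLICK = 0.07  # hypothetical true click-through probability
N = 50          # units per group, as in the example

diffs = simulate_aa_diffs(n_experiments=20, n_per_group=N, p_click=P_CLICK)

# One empirical SD from the 20 diffs (sample standard deviation).
empirical_sd = statistics.stdev(diffs)
# Analytical SE using the known simulation probability instead of a pooled estimate.
analytical_se = math.sqrt(P_CLICK * (1 - P_CLICK) * (1 / N + 1 / N))

Z_90 = 1.645  # two-sided 90% z-score
print("empirical SD :", round(empirical_sd, 4))
print("analytical SE:", round(analytical_se, 4))
print("margin of error from empirical SD:", round(Z_90 * empirical_sd, 4))
```

With only 20 experiments the empirical SD is itself noisy, so it will rarely match the analytical SE exactly.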

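If the diff cannot be assumed to be normal, estimate the confidence interval directly from the observed diffs, which gives [-0.1, 0.06]. Using the normal assumption together with the empirical variance, the confidence interval is [-0.097, 0.097]. The two are somewhat different because there are only 20 experiments (data points). For the same reason, the empirical interval cannot distinguish a 95% from a 90% confidence level in this case: each data point is 5% of the sample, so dropping one point from each end gives a 90% interval, and there is no way to drop half a point.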

Directly compute empirical confidence interval.
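A minimal sketch of the direct approach: sort the diffs, drop the most extreme value on each side, and report the range of what is left. The function `empirical_interval` is hypothetical, and the 20 diff values below are made up for illustration; they are not the values from the table.

```python
def empirical_interval(diffs, drop_each_side=1):
    """Trim extreme values from each end and return (low, high, coverage)."""
    if 2 * drop_each_side >= len(diffs):
        raise ValueError("not enough data points to trim")
    ordered = sorted(diffs)
    kept = ordered[drop_each_side:len(ordered) - drop_each_side]
    # Fraction of the data points that remain inside the interval.
    coverage = len(kept) / len(ordered)
    return kept[0], kept[-1], coverage

# Hypothetical diffs from 20 A/A experiments (illustrative values only).
example_diffs = [
    0.06, -0.02, 0.04, -0.10, 0.00, 0.02, -0.04, 0.08, -0.06, 0.02,
    0.00, -0.02, 0.04, -0.12, 0.10, -0.08, 0.06, 0.00, -0.04, 0.02,
]

low, high, coverage = empirical_interval(example_diffs, drop_each_side=1)
print("interval:", (low, high), "coverage:", coverage)
```

Dropping one point per side out of 20 keeps 18 points, which is why this interval corresponds to 90% rather than 95%.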

If there is not enough traffic to perform many A/A experiments, bootstrapping (sampling with replacement) can be used to simulate the experiments from just two groups of users.

Bootstrapping
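Bootstrapping can be sketched as follows, assuming each user is reduced to a 0/1 click indicator. The helpers `bootstrap_diffs` and `percentile_interval`, the group data, and the number of resamples are all assumptions made for illustration.

```python
import random

def bootstrap_diffs(group_a, group_b, n_resamples=1000, seed=0):
    """Resample each group with replacement and collect the CTP differences."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_resamples):
        # Each resample has the same size as the original group.
        resample_a = rng.choices(group_a, k=len(group_a))
        resample_b = rng.choices(group_b, k=len(group_b))
        ctp_a = sum(resample_a) / len(resample_a)
        ctp_b = sum(resample_b) / len(resample_b)
        diffs.append(ctp_a - ctp_b)
    return diffs

def percentile_interval(values, confidence=0.95):
    """Simple percentile interval: cut an equal share from each tail."""
    ordered = sorted(values)
    cut = int((1 - confidence) / 2 * len(ordered))
    return ordered[cut], ordered[len(ordered) - 1 - cut]

# Hypothetical data: 1 = clicked, 0 = did not click, 50 users per group.
group_a = [1] * 5 + [0] * 45
group_b = [1] * 2 + [0] * 48

boot = bootstrap_diffs(group_a, group_b, n_resamples=1000)
low, high = percentile_interval(boot, confidence=0.95)
print("bootstrap 95% interval for the diff:", (round(low, 3), round(high, 3)))
```

Because 1000 resampled diffs are available instead of 20, a 95% interval can now be told apart from a 90% one.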

Variability Summary

Choosing metrics is a tricky thing.

  • Although some metrics make great sense to the business, they may have such large variability that experiments fail to show statistical significance.
  • Some metrics have an analytical form for the variance and confidence interval, based on an assumption about the distribution of the metric.
  • If that assumption is not valid in practice, the empirical variance and confidence interval should be computed instead.
