Udacity A/B Testing Lesson 4: Designing an Experiment

Overview

regularizer · Aug 22, 2019
  • Choose “subject” — units of diversion
  • Choose “population” — equivalent population
  • Size
  • Duration and Exposure

Designing an experiment is an iterative process: try out some decisions for the unit of diversion and the population, see what the implications are for both the size and the duration of the experiment, and then revisit the decisions and iterate as needed.

Unit of Diversion

The unit of diversion answers the question of how to assign events to either the control or the experiment group. Even though the metric is computed from events (e.g., page views), the unit of diversion decides how these page views are split between the two groups: randomly, or grouped by a particular user. For example, if the unit of diversion is a page view (an event-based diversion), each page view is assigned to one group at random, so the views of a single user can be mixed across the two groups. If the unit of diversion is user_id based, each user is assigned to exactly one of the two groups: if user A is diverted to control and user B to experiment, the control group only contains page views from user A and the experiment group only page views from user B.

  • user_id based: user consistency if they log in
  • cookie-based: user consistency if they use the same browsers and devices
  • event-based: least user consistency, because a single user's events may be split across the two groups

Both user and cookie diversion are proxies for users: one group of users is on the A side and the other group is on the B side. More precisely, for user_id based diversion, the user accounts are assigned to the control and experiment groups, although multiple people may share the same account. For cookie based diversion, the cookie ids are assigned to the two groups, although the same user may have multiple cookies (devices/browsers). From there, the page views are generated by the two groups of accounts or cookie ids. For event-based diversion, the page views themselves are randomly assigned to the two groups, which may put the same user on both sides. If the metric is CTR, then the page views under user_id based diversion can be correlated, because multiple views may come from the same user.
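To make the mechanics concrete, here is a minimal sketch of hash-based diversion (the function and experiment names are hypothetical, not from the course): hashing the unit id makes the assignment deterministic, so a user_id-diverted user always lands on the same side, while hashing a per-event id splits one user's page views across both sides.

```python
import hashlib

def assign_group(unit_id: str, experiment: str = "homepage_test") -> str:
    """Deterministically divert a unit (user_id, cookie, or event id)
    into control or experiment by hashing its id."""
    digest = hashlib.md5(f"{experiment}:{unit_id}".encode()).hexdigest()
    return "experiment" if int(digest, 16) % 2 == 0 else "control"

# user_id diversion: every page view from user_42 lands in the same group
print(assign_group("user_42"))
# event diversion: each page view is hashed on its own id, so one user's
# views can be mixed across the two groups
print(assign_group("user_42:pageview_17"))
print(assign_group("user_42:pageview_18"))
```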

Unit of Analysis vs. Unit of Diversion

When the unit of analysis is different from the unit of diversion, the analytical variability may underestimate the empirical variability (empirical > analytical), because the analytical form assumes the samples (page views) are independent, while in practice page views from the same user can be correlated. Look at the example in the following figure.

Figure: the variability for different units of diversion, based on this paper.

The above figure shows that (1) the SE increases linearly with 1/sqrt(N), which is not surprising, and (2) the SE of cookie-based diversion is higher than that of event-based diversion, because queries coming from the same cookie can be correlated (e.g., a user queries something first, then issues follow-up queries from there).

Questions about the curves in the paper: are the SEs computed from the empirical variability, assuming the coverage follows a binomial distribution? If so, the variance is computed from the coverage p. For (1), the linearity suggests the variance is almost the same across different samples, so the coverage is the same too. For (2), the variance for cookie diversion is higher than that for query diversion; is the coverage higher as well?

Quiz: do the unit of analysis and the unit of diversion match?

For the second question in the above quiz, the unit of analysis (cookie) is "larger" than the unit of diversion (page view), and the metric is not well-defined because the same cookie can be assigned to both groups. For a concrete example, suppose we make some change to the homepage: (1) if we randomly assign cookie ids to Group A (no change) and Group B (changed homepage), a cookie m is in Group A and a cookie n is in Group B, and we can measure what percentage of cookie ids within each group view the homepage during a time period; (2) if we instead randomly assign page views to the two groups, the page views of the same cookie can go to different groups, so there is no way to say which group that cookie belongs to.
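A quick simulation (with made-up numbers) shows why a per-cookie metric is ambiguous under page-view diversion: many cookies end up with page views in both groups.

```python
import numpy as np

rng = np.random.default_rng(0)

# 1,000 page views generated by 200 cookies, diverted per page view
cookie_of_view = rng.integers(0, 200, size=1_000)
group_of_view = rng.integers(0, 2, size=1_000)  # 0 = control, 1 = experiment

in_control = set(cookie_of_view[group_of_view == 0])
in_experiment = set(cookie_of_view[group_of_view == 1])
overlap = in_control & in_experiment
print(f"{len(overlap)} of 200 cookies have page views in BOTH groups")
```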

Choose “population”

  • Intra-user experiment: expose the same user to the feature being on and off over time, and analyze how they behave in the different time windows.
  • Interleaved experiment: for ranking-type applications (e.g., rankings or preferences), expose the same user to the A and B sides at the same time.
  • Inter-user experiment: classic A/B testing, with two different groups of users on the A and B sides. A refinement is to use two cohorts of users whose parameters are matched across the two groups.

A cohort includes the people who enter the experiment at the same time. Targeting a cohort within a population avoids diluting the effect size with unrelated users, and it also reduces the traffic required.

Sizing

The power analysis discussed in Lesson 1 provides the analytical solution. If in an experiment the unit of diversion is different from the unit of analysis, the analytical variance will underestimate the empirical variance, so the required sample size will be larger (even four times larger, based on Google's paper).
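As a reminder of the analytical route, here is a minimal sketch of the standard two-proportion sample-size formula (the baseline rate and minimum detectable effect below are made-up numbers):

```python
from scipy.stats import norm

def sample_size_per_group(p_base: float, d_min: float,
                          alpha: float = 0.05, beta: float = 0.20) -> int:
    """Analytical sample size per group for detecting an absolute change
    d_min in a proportion, two-sided test at level alpha, power 1 - beta."""
    z_alpha = norm.ppf(1 - alpha / 2)
    z_beta = norm.ppf(1 - beta)
    p_exp = p_base + d_min
    p_bar = (p_base + p_exp) / 2                 # pooled rate under the null
    se_null = (2 * p_bar * (1 - p_bar)) ** 0.5
    se_alt = (p_base * (1 - p_base) + p_exp * (1 - p_exp)) ** 0.5
    n = ((z_alpha * se_null + z_beta * se_alt) / d_min) ** 2
    return int(n) + 1

# e.g. baseline CTR of 10%, smallest effect worth detecting of 2% absolute
print(sample_size_per_group(0.10, 0.02))
```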

The analytic variability underestimates the empirical variability for cookie diversion. For example, take 5,000 page views in each group to estimate the empirical variance: randomly sample page views by page view to simulate the unit of diversion being a page view, and randomly sample page views by cookie to simulate the unit of diversion being a cookie. As a result, the required sample size for cookie diversion is significantly larger than the one for event diversion.
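A minimal simulation of this resampling procedure (all numbers made up): each cookie gets its own click propensity, so clicks within a cookie are correlated, and bootstrapping by cookie yields a noticeably larger SE than bootstrapping by page view.

```python
import numpy as np

rng = np.random.default_rng(42)

# 5,000 page views from 500 cookies, 10 views each; per-cookie click
# propensities make the views within a cookie correlated
n_cookies, views_each = 500, 10
propensity = rng.beta(2, 18, size=n_cookies)           # mean CTR ~ 0.10
cookie_of_view = np.repeat(np.arange(n_cookies), views_each)
clicks = rng.random(n_cookies * views_each) < propensity[cookie_of_view]

def bootstrap_se(by_cookie: bool, n_boot: int = 2_000) -> float:
    """Bootstrap SE of CTR, resampling by cookie or by individual page view."""
    ctrs = []
    for _ in range(n_boot):
        if by_cookie:   # resample whole cookies with all of their views
            picked = rng.integers(0, n_cookies, size=n_cookies)
            idx = (picked[:, None] * views_each + np.arange(views_each)).ravel()
        else:           # resample individual page views
            idx = rng.integers(0, clicks.size, size=clicks.size)
        ctrs.append(clicks[idx].mean())
    return float(np.std(ctrs))

print("SE, page-view diversion:", bootstrap_se(by_cookie=False))
print("SE, cookie diversion:   ", bootstrap_se(by_cookie=True))
```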

As the standard error for cookie-based diversion cannot be directly calculated by the analytic formula, the above figure can be used to find the empirical SE. There are a few ways to reduce the size of an experiment, as follows.

Figure: the strategies to reduce the size of an experiment.

Increasing the effect size or the acceptable error rates (Type I or Type II) will surely reduce the size. Choosing the proper unit of diversion can reduce the variability and therefore the size. "Targeting the experiment to specific traffic" is an interesting one. For example, target the experiment only to English traffic: since the non-English traffic is not affected, including it would dilute the results of the experiment (shrink the measured effect in the experiment group) and increase the required size. Targeting could also shift the choice of the practical significance boundary, since a bigger change is needed to matter for the business (increasing the effect size). And since the variability will probably go down, the experiment can be smaller or can target smaller changes.

Changing the metric from a click-through rate to a click-through probability is unlikely to make a significant difference. If it does, though, it would reduce the number of page views needed, because a user-based unit of diversion then matches the unit of analysis and the variability goes down.

Duration vs. Exposure

Once the size of the experiment is calculated, the next step is to decide the duration (how long to run the experiment) and the exposure (what percentage of the traffic to include in the experiment). Given the average traffic per day, these two together determine how long the experiment must run. Prefer a longer duration to reduce day-of-week and seasonality effects, and plot the data over time to understand any seasonal or periodic pattern. If the change is highly risky, one may expose it to only a subset of the traffic, which means running the experiment for a longer time.
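A back-of-the-envelope sketch of this duration/exposure trade-off (the traffic and sample-size numbers are made up):

```python
import math

def days_needed(samples_per_group: int, daily_traffic: int,
                exposure: float) -> int:
    """Days to collect enough traffic for both groups when only a
    fraction `exposure` of daily traffic is diverted into the experiment."""
    diverted_per_day = daily_traffic * exposure   # split between A and B
    return math.ceil(2 * samples_per_group / diverted_per_day)

# e.g. 30,000 samples per group out of 100,000 daily page views
print(days_needed(30_000, 100_000, exposure=1.0))   # 1 day at full exposure
print(days_needed(30_000, 100_000, exposure=0.10))  # 6 days at 10% exposure
```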

Figure: things that need to be considered for duration and exposure.

Learning Effects

The learning effect includes change aversion and the novelty effect; over time, either effect converges to a plateau. The key issue in measuring a learning effect is time: it takes time for users to adapt to a change, but often you don't have that much time before a decision is needed. If time is not a problem, a few other things need to be considered when measuring learning effects:

  • Choose the unit of diversion correctly: you need a stateful unit of diversion, like a cookie or a user ID.
  • Learning is based not just on time (duration) but on how often users see the change (dosage). So it is better to choose cohorts in both control and experiment based on how long they have been exposed to the change or how many times they have seen it.
  • Risk and duration: you don't want to put a lot of users through a change under test for a long time, both because it ties up traffic you could use to test other changes and because the change is probably high-risk. It is better to run it on a small proportion of users for a longer period of time.
  • Pre-periods and post-periods: both are uniformity trials, i.e., A/A tests on the exact same populations that will serve as experiment and control. In the pre-period, before the A/B test starts, both groups receive the exact same treatment, so any difference measured between them is due to something else, such as system variability or user variability (useful not just for measuring user learning). In the post-period, after the A/B test ends, both groups again receive the same treatment; if a difference between them persists, it can be attributed to user learning that happened during the experiment. A minimal A/A sanity-check sketch follows this list.
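Here is a minimal sketch of such an A/A sanity check (the rates and sizes are made up): both groups receive the identical experience, so a significant difference flags a problem with the setup, or, in a post-period, a residual learning effect.

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(7)

# A/A test: both groups see the same experience, so the true rates match
control = rng.binomial(1, 0.10, size=10_000)
experiment = rng.binomial(1, 0.10, size=10_000)

stat, p_value = ttest_ind(control, experiment)
print(f"A/A p-value: {p_value:.3f}")  # should usually be non-significant
```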

Summary of Lesson 4

Choose a proper unit of diversion and unit of analysis. If they are not consistent, the analytical variability may underestimate the empirical variability, and as a result the sample size may need to increase. Targeting the proper cohort may reduce the size. To run the experiment, the duration and the traffic exposure need to be decided together.

Some thoughts:

This article from Zhihu analyzes an A/B testing interview question from Facebook: how to perform A/B testing when the unit of diversion is user-based and the users in the A and B groups have network effects. A network effect means the users in the two groups may not be independent (e.g., they may know each other), so their behaviors may be correlated. Thus, the difference between the two groups may be under- or overestimated. How can the network effect be reduced in order to get more accurate results from A/B testing?

This follow-up article gives one approach, difference in differences, with five steps (a small numerical sketch follows the list):

  • Find two groups of users that don’t know each other. A common and reliable way is to choose users from different locations to reduce the network effect. (reducing network effect)
  • Before running the experiment, measure the metrics as an evaluation of the default behaviors or baseline of the two groups of users.
  • Perform the experiment and measure the metrics again.
  • Calculate the difference before and after the experiment for each of the two groups. This difference eliminates the differences in their default behaviors; what remains may be caused by the change or just by the passage of time. (reducing the effect due to differences between the users themselves)
  • Calculate the difference of the differences from step 4 between the two groups. Now the final difference represents the difference caused by the change. (this is the real effect from the change)
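A small numerical sketch of the difference-in-differences calculation (the metric values are made up):

```python
def difference_in_differences(ctrl_before: float, ctrl_after: float,
                              exp_before: float, exp_after: float) -> float:
    """DiD estimate: change in the experiment group minus change in the
    control group, removing baseline differences and pure time effects."""
    time_effect = ctrl_after - ctrl_before          # step 4, control group
    total_effect = exp_after - exp_before           # step 4, experiment group
    return total_effect - time_effect               # step 5

# Two geographically separated groups; metric measured before and after
print(difference_in_differences(0.100, 0.103, 0.102, 0.115))  # ~ 0.010
```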

This idea is similar to the review paper about propensity scores that I read before. I should read more about causal inference.
