This is the third article in a three-part series that aims to add clarity and transparency around the way we work at ResearchGate. The first article covers the work that is required before starting an experiment, while the second article discusses the setup of an experiment.
At ResearchGate, we run a lot of experiments to improve our product for our users. Our experiment design guidelines for product analysts establish the process for setting up those experiments from an analytical and statistical perspective to ensure we can evaluate each experiment as intended. These guidelines give some hints, but do not fully cover the product management, user research, and design perspective, i.e. what to experiment on. In this third part of the series, we focus on sampling.
We are interested in your thoughts on these guidelines. Please send any feedback to elisabeth.reitmayr@researchgate.net.
Sampling is an important element of experiment design. As we use the sample to make inferences about the population we are interested in, it is important to choose the right target group using an appropriate sampling mechanism to avoid bias. Bias in an experiment means that the sample does not adequately represent the population we are studying (read more about statistical bias here). To draw valid conclusions from the experiment, the sample also needs to be large enough for the effect we want to detect (see Part 1).
The target group should be representative of the population we want to make an inference about. This means that if we run a test on a feature that can only be used by users who fulfil a certain condition (e.g. they have a publication, or they are new to ResearchGate), both the experimental and the control group should consist only of users who fulfil this condition. Otherwise, we introduce selection bias (e.g. because users who have publications tend to be more active than users who do not).
Sometimes we want to expose the experiment only to a certain segment of users. Let’s say we want to make it easier for Principal Investigators (leaders of a scientific lab) to add their lab to ResearchGate. In this case, all Principal Investigators on ResearchGate represent our population, and we should draw both samples randomly from the population of Principal Investigators.
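To make this concrete, here is a minimal sketch of the idea in Python: first filter down to the eligible population, then assign eligible users randomly to control and treatment, e.g. via a deterministic hash. The eligibility criterion, experiment name, and user data are hypothetical and only serve as an illustration.

```python
import hashlib

# Hypothetical eligibility criterion: the feature only applies to users
# who have at least one publication.
def is_eligible(user: dict) -> bool:
    return user["publication_count"] > 0

# Deterministic 50/50 split based on a hash of the experiment name and user id,
# so a user always lands in the same group.
def assign(user_id: int, experiment: str = "lab-creation-test") -> str:
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 2
    return "control" if bucket == 0 else "treatment"

users = [
    {"id": 1, "publication_count": 3},
    {"id": 2, "publication_count": 0},   # not eligible, never enters the experiment
    {"id": 3, "publication_count": 1},
]

assignments = {u["id"]: assign(u["id"]) for u in users if is_eligible(u)}
print(assignments)
```

The important point is that ineligible users never enter either group, so the comparison between control and treatment is not distorted by users who could not use the feature in the first place.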
In rare cases, a stratified sample might help if you have a very small population or if your sample was already drawn in a biased way. For example, if you want to expose a new feature only to a small group of beta testers, you should be aware that this group will not be representative of the population, as more engaged users tend to be overrepresented in beta testing groups. (They tend to be more likely to volunteer for beta groups.) Therefore, you can draw a stratified sample from the beta group to make sure the distribution of engagement levels in your sample mirrors the distribution of engagement levels in your population. Read more here about how to recover from selection bias using a Bayesian approach.
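As an illustration, here is a small pandas sketch of drawing a stratified sample from a beta group so that its engagement-level distribution mirrors the population's. The data and the engagement levels are made up for demonstration purposes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: highly engaged users are overrepresented among beta
# testers compared to the overall population.
population = pd.DataFrame({
    "engagement_level": rng.choice(["low", "medium", "high"], size=100_000, p=[0.6, 0.3, 0.1]),
})
beta_users = pd.DataFrame({
    "user_id": np.arange(5_000),
    "engagement_level": rng.choice(["low", "medium", "high"], size=5_000, p=[0.2, 0.4, 0.4]),
})

# Target distribution: share of each engagement level in the population.
target_shares = population["engagement_level"].value_counts(normalize=True)

# Draw a stratified sample of 1,000 beta testers whose engagement-level
# distribution mirrors the population.
sample_size = 1_000
stratified_sample = pd.concat([
    beta_users[beta_users["engagement_level"] == level].sample(
        n=int(round(share * sample_size)), random_state=42
    )
    for level, share in target_shares.items()
])

print(stratified_sample["engagement_level"].value_counts(normalize=True))
```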
Here are a few questions to check whether the sampling mechanism is appropriate:
The required sample size for an experiment depends on several factors:
Sample size always has to be calculated upfront, i.e. before implementing the experiment. If the required sample size is too large, we might not even want to run the experiment, but choose another research method instead. We use a Frequentist approach to evaluate the test and can use third-party sample size calculators for this purpose:
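For illustration, the sketch below shows the kind of calculation such a calculator performs, assuming a two-sided two-proportion z-test. The baseline and target conversion rates are hypothetical; it uses statsmodels rather than any specific online calculator.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate
expected_rate = 0.12   # minimum improvement we want to be able to detect

# Cohen's h effect size for the difference between two proportions.
effect_size = proportion_effectsize(expected_rate, baseline_rate)

# Solve for the required number of observations per variant.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level
    power=0.8,               # 1 - beta
    ratio=1.0,               # equal group sizes
    alternative="two-sided",
)
print(f"Required sample size per variant: {n_per_group:.0f}")
```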
In case your experiment analysis requires hypothesis testing on breakdowns or multiple comparisons, this changes your sample size requirements because of alpha inflation: with m independent tests at significance level α, the probability of at least one false positive is 1 − (1 − α)^m, which grows quickly with the number of hypotheses you are testing. This has to be reflected in your sample size calculation (you can apply a p-value correction in your analysis; more on this in the next blog post about experiment evaluation). Read more here.
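As a small illustration of such a correction, a Holm (or Bonferroni) adjustment can be applied to the p-values from several breakdowns to keep the family-wise error rate at the chosen alpha. The p-values below are made up for demonstration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from hypothesis tests on several breakdowns.
p_values = [0.012, 0.034, 0.041, 0.20]

# Holm correction controls the family-wise error rate at alpha = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(list(zip(p_adjusted.round(3), reject)))
```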
The multi-armed bandit (MAB) is an algorithm that automatically chooses the variant that scored highest according to the goal that was set. Once either a pre-defined period of time has elapsed or a pre-defined sample size threshold has been reached, the MAB defaults the experiment to the more successful variant for a large proportion of our users (read more here). This helps to decrease the “cost” of the experiment, as we use the better-performing variant for the larger part of our user base. To set the right threshold, we need to define either the minimum sample size or the minimum run time:
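To illustrate the idea, here is a simplified epsilon-greedy sketch, not our production implementation; the threshold and exploration share are made-up values. Until the minimum sample size is reached, traffic is split evenly; afterwards, most traffic is routed to the better-performing variant while a small share keeps exploring.

```python
import random

MIN_SAMPLE_PER_VARIANT = 5_000   # hypothetical minimum sample size threshold
EPSILON = 0.1                    # share of traffic that keeps exploring

stats = {
    "control": {"users": 0, "conversions": 0},
    "variant": {"users": 0, "conversions": 0},
}

def assign_variant() -> str:
    # Before the threshold is reached, split traffic 50/50.
    if any(s["users"] < MIN_SAMPLE_PER_VARIANT for s in stats.values()):
        return random.choice(list(stats))
    # After the threshold: keep a small exploration share...
    if random.random() < EPSILON:
        return random.choice(list(stats))
    # ...and route the rest to the variant with the higher conversion rate so far.
    return max(stats, key=lambda k: stats[k]["conversions"] / stats[k]["users"])
```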
There will be edge cases where:
Despite the low traffic, we might still want to get a quantitative understanding of the change to limit the risk of introducing it (ensuring “we don’t break things”). In this case, we suggest running a “soft experiment”: we let the experiment run for e.g. two weeks, knowing we will not reach the required sample size to run a Frequentist hypothesis test on it, in order to observe how the new variant performs. We consider this a pragmatic solution to the situation we are facing, as having some data to inform a decision on a risky problem is preferable to having no data at all.
Here, we are talking about the art rather than the science of experimentation: judgment is required. You should discuss these cases with your team (PM, Design, and User Research) to decide whether a “soft experiment” is the best solution to the problem at hand. You should also make sure everyone is aware of the limitations of running the experiment only “halfway”.
If you decide to run a “soft experiment”, do not run a Frequentist hypothesis test to evaluate its results. Still, we would set it up as an A/B test so we can compare the new variant to the control variant.
Based on the typology suggested in this blog post, we can add the “soft” experiment to the grey area between high and low risk:
Do not run more than one experiment on the same component concurrently, unless you have a full-factorial design (see Part 2).
Another important consideration for experimental design is the definition of the unit of observation. In principle, the unit of observation could be the user, the session, the user login day, etc. For example, if you were comparing which email variant made users more likely to log into RG, the unit of observation would be the user. Here it is important to consider the statistical requirement of independent observations.
For evaluating experiments, it is important that observations are independent. Two observations are independent if the occurrence of one provides no information about the occurrence of the other. The statistical models we use to evaluate experiments are based on the assumption that the observations in the sample are independent. If we violate this assumption, our conclusions from the experiment might be flawed.
This implies that an experiment should in most cases be user-based, not session-based (i.e. one row in the data set to be evaluated corresponds to a user, not to a user session). If we have multiple observations per user in our sample, the second observation for that user will not be independent of the first, so session-level data should be collapsed to user level before evaluation, as in the sketch below.
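Here is a minimal pandas sketch with made-up session-level data, aggregating to one row per user before evaluating the experiment:

```python
import pandas as pd

# Hypothetical session-level log: one row per session.
sessions = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 3],
    "variant":   ["A", "A", "B", "A", "A", "A"],
    "converted": [0, 1, 1, 0, 0, 1],
})

# Collapse to one row per user so that observations are independent:
# a user counts as converted if they converted in at least one session.
users = (
    sessions.groupby(["user_id", "variant"], as_index=False)
    .agg(converted=("converted", "max"))
)
print(users)
```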
This also implies that each user should participate in the experiment only once; otherwise, we have the following problems: