This is the third article in a three-part series that aims to add clarity and transparency around the way we work at ResearchGate. The first article covers the work that is required before starting an experiment, while the second article discusses the setup of an experiment.
At ResearchGate, we run a lot of experiments to improve our product for our users. Our experiment design guidelines for product analysts establish the process for setting up those experiments from an analytical and statistical perspective to ensure we can evaluate each experiment as intended. These guidelines give some hints, but do not fully cover the product management, user research, and design perspective, i.e. what to experiment on. In this third part of the series, we focus on sampling.
We are interested in your thoughts on these guidelines. Please send any feedback to elisabeth.reitmayr@researchgate.net.
Sampling is an important element of experiment design. As we use the sample to make inferences about the population we are interested in, it is important to choose the right target group using an appropriate sampling mechanism to avoid bias. Bias in an experiment means that the sample does not adequately represent the population we are studying (read more about statistical bias here). To draw valid conclusions from the experiment, the sample also needs to be large enough for the effect we want to detect (see Part 1).
The target group should be representative of the population we want to make an inference about. This means that if we run a test on a feature that can only be used by users who fulfil a certain condition (e.g. they have a publication, or they are new to ResearchGate), both the experimental and the control group should consist only of users who fulfil this condition. Otherwise, we introduce selection bias (e.g. because users who have publications tend to be more active than users who do not).
Sometimes we want to expose the experiment only to a certain segment of users. Let’s say we want to make it easier for Principal Investigators (leaders of a scientific lab) to add their lab to ResearchGate. In this case, all Principal Investigators on ResearchGate represent our population, and we should draw both samples randomly from the population of Principal Investigators.
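To make this concrete, here is a minimal sketch of the idea in Python: first filter down to the eligible population, then assign eligible users randomly to control and treatment, e.g. via a deterministic hash. The eligibility criterion, experiment name, and user data are hypothetical and only serve as an illustration.

```python
import hashlib

# Hypothetical eligibility criterion: the feature only applies to users
# who have at least one publication.
def is_eligible(user: dict) -> bool:
    return user["publication_count"] > 0

# Deterministic 50/50 split based on a hash of the experiment name and user id,
# so a user always lands in the same group.
def assign(user_id: int, experiment: str = "lab-creation-test") -> str:
    bucket = int(hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest(), 16) % 2
    return "control" if bucket == 0 else "treatment"

users = [
    {"id": 1, "publication_count": 3},
    {"id": 2, "publication_count": 0},   # not eligible, never enters the experiment
    {"id": 3, "publication_count": 1},
]

assignments = {u["id"]: assign(u["id"]) for u in users if is_eligible(u)}
print(assignments)
```

The important point is that ineligible users never enter either group, so the comparison between control and treatment is not distorted by users who could not use the feature in the first place.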
In rare cases, a stratified sample might help if you have a very small population or if your sample was already drawn in a biased way. For example, if you want to expose a new feature only to a small group of beta testers, you should be aware that this group will not be representative of the population, as more engaged users tend to be overrepresented in beta testing groups. (They tend to be more likely to volunteer for beta groups.) Therefore, you can draw a stratified sample from the beta group to make sure the distribution of engagement levels in your sample mirrors the distribution of engagement levels in your population. Read more here about how to recover from selection bias using a Bayesian approach.
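As an illustration, here is a small pandas sketch of drawing a stratified sample from a beta group so that its engagement-level distribution mirrors the population's. The data and the engagement levels are made up for demonstration purposes.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Hypothetical data: highly engaged users are overrepresented among beta
# testers compared to the overall population.
population = pd.DataFrame({
    "engagement_level": rng.choice(["low", "medium", "high"], size=100_000, p=[0.6, 0.3, 0.1]),
})
beta_users = pd.DataFrame({
    "user_id": np.arange(5_000),
    "engagement_level": rng.choice(["low", "medium", "high"], size=5_000, p=[0.2, 0.4, 0.4]),
})

# Target distribution: share of each engagement level in the population.
target_shares = population["engagement_level"].value_counts(normalize=True)

# Draw a stratified sample of 1,000 beta testers whose engagement-level
# distribution mirrors the population.
sample_size = 1_000
stratified_sample = pd.concat([
    beta_users[beta_users["engagement_level"] == level].sample(
        n=int(round(share * sample_size)), random_state=42
    )
    for level, share in target_shares.items()
])

print(stratified_sample["engagement_level"].value_counts(normalize=True))
```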
Here are a few questions to check whether the sampling mechanism is appropriate:
The required sample size for an experiment depends on several factors:
Sample size always has to be calculated upfront, i.e. before implementing the experiment. If the required sample size is too large, we might not even want to run the experiment, but choose another research method instead. We use a Frequentist approach to evaluate the test and can use third-party sample size calculators for this purpose:
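For illustration, the sketch below shows the kind of calculation such a calculator performs, assuming a two-sided two-proportion z-test. The baseline and target conversion rates are hypothetical; it uses statsmodels rather than any specific online calculator.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline_rate = 0.10   # assumed current conversion rate
expected_rate = 0.12   # minimum improvement we want to be able to detect

# Cohen's h effect size for the difference between two proportions.
effect_size = proportion_effectsize(expected_rate, baseline_rate)

# Solve for the required number of observations per variant.
n_per_group = NormalIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,              # significance level
    power=0.8,               # 1 - beta
    ratio=1.0,               # equal group sizes
    alternative="two-sided",
)
print(f"Required sample size per variant: {n_per_group:.0f}")
```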
In case your experiment analysis requires hypothesis testing on breakdowns or multiple comparisons, this changes your sample size requirements because of alpha inflation: with m independent tests at significance level α, the probability of at least one false positive is 1 − (1 − α)^m, which grows quickly with the number of hypotheses you are testing. This has to be reflected in your sample size calculation (you can apply a p-value correction in your analysis; more on this in the next blog post about experiment evaluation). Read more here.
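As a small illustration of such a correction, a Holm (or Bonferroni) adjustment can be applied to the p-values from several breakdowns to keep the family-wise error rate at the chosen alpha. The p-values below are made up for demonstration.

```python
from statsmodels.stats.multitest import multipletests

# Hypothetical p-values from hypothesis tests on several breakdowns.
p_values = [0.012, 0.034, 0.041, 0.20]

# Holm correction controls the family-wise error rate at alpha = 0.05.
reject, p_adjusted, _, _ = multipletests(p_values, alpha=0.05, method="holm")
print(list(zip(p_adjusted.round(3), reject)))
```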
The multi-armed bandit (MAB) is an algorithm that automatically chooses the variant that scored highest according to the goal that was set. Once either a pre-defined period of time has elapsed or a pre-defined sample size threshold has been reached, the MAB defaults the experiment to the more successful variant for a large proportion of our users (read more here). This helps to decrease the “cost” of the experiment, as we use the better-performing variant for the larger part of our user base. To set the right threshold, we need to define either the minimum sample size or the minimum run time:
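To illustrate the idea, here is a simplified epsilon-greedy sketch, not our production implementation; the threshold and exploration share are made-up values. Until the minimum sample size is reached, traffic is split evenly; afterwards, most traffic is routed to the better-performing variant while a small share keeps exploring.

```python
import random

MIN_SAMPLE_PER_VARIANT = 5_000   # hypothetical minimum sample size threshold
EPSILON = 0.1                    # share of traffic that keeps exploring

stats = {
    "control": {"users": 0, "conversions": 0},
    "variant": {"users": 0, "conversions": 0},
}

def assign_variant() -> str:
    # Before the threshold is reached, split traffic 50/50.
    if any(s["users"] < MIN_SAMPLE_PER_VARIANT for s in stats.values()):
        return random.choice(list(stats))
    # After the threshold: keep a small exploration share...
    if random.random() < EPSILON:
        return random.choice(list(stats))
    # ...and route the rest to the variant with the higher conversion rate so far.
    return max(stats, key=lambda k: stats[k]["conversions"] / stats[k]["users"])
```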
There will be edge cases where:
Despite the low traffic, we might still want to get a quantitative understanding of the change to limit the risk of introducing it (ensuring “we don’t break things”). In this case, we suggest running a “soft experiment”: we let the experiment run for e.g. two weeks, knowing we will not reach the required sample size to run a Frequentist hypothesis test on it, in order to observe how the new variant performs. We consider this a pragmatic solution to the situation we are facing, as having some data to inform a decision on a risky problem is preferable to having no data at all.
Here, we are talking about the art rather than the science of experimentation: judgment is required. You should discuss these cases with your team (PM, Design, and User Research) to decide whether a “soft experiment” is the best solution to the problem at hand. You should also make sure everyone is aware of the limitations of running the experiment only “halfway”.
If you decide to run a “soft experiment”, do not run a Frequentist hypothesis test to evaluate its results. Still, we would set it up as an A/B test so we can compare the new variant to the control variant.
Based on the typology suggested in this blog post, we can add the “soft” experiment to the grey area between high and low risk:
Do not run more than one experiment on the same component concurrently, unless you have a full-factorial design (see Part 2).
Another important consideration for experimental design is the definition of the unit of observation. In principle, the unit of observation could be the user, the session, the user login day, etc. For example, if you were comparing which email variant made users more likely to log into RG, the unit of observation would be the user. Here it is important to consider the statistical requirement of independent observations.
For evaluating experiments, it is important that observations are independent. Two observations are independent if the occurrence of one provides no information about the occurrence of the other. The statistical models we use to evaluate experiments are based on the assumption that the observations in the sample are independent. If we violate this assumption, our conclusions from the experiment might be flawed.
This implies that an experiment should in most cases be user-based, not session-based (i.e. one row in the data set to be evaluated corresponds to a user, not to a user session). If we have multiple observations per user in our sample, the second observation for that user will not be independent of the first, so session-level data should be collapsed to user level before evaluation, as in the sketch below.
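Here is a minimal pandas sketch with made-up session-level data, aggregating to one row per user before evaluating the experiment:

```python
import pandas as pd

# Hypothetical session-level log: one row per session.
sessions = pd.DataFrame({
    "user_id":   [1, 1, 2, 3, 3, 3],
    "variant":   ["A", "A", "B", "A", "A", "A"],
    "converted": [0, 1, 1, 0, 0, 1],
})

# Collapse to one row per user so that observations are independent:
# a user counts as converted if they converted in at least one session.
users = (
    sessions.groupby(["user_id", "variant"], as_index=False)
    .agg(converted=("converted", "max"))
)
print(users)
```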
This also implies that each user should participate in the experiment only once; otherwise, we have the following problems: