Crack A/B Testing Interviews
As a data science enthusiast, I have been asked about A/B testing in nearly every interview. During this job hunt, I interviewed with TikTok (Product Marketing Analyst), Whole Foods (Digital Marketing Campaign Result Analyst), and Sega of America (Data Analyst). I am writing this article to organize my thoughts on the interview questions I encountered.
A/B testing, also known as split testing, conversion rate optimization, multivariate testing, digital optimization, online experimentation, or growth hacking, has long been part of the data analyst's job. In this article, I will generalize and illustrate A/B testing concepts and tricks through interview situations. There will be three topics: “Where & What to A/B Test”, “How to A/B Test”, and “Result Analysis”. I have also attached a learning roadmap for A/B testing at the end of the article if you want to dive deeper.
1. Where & What to A/B Test?
1) Website Experience
2) Paid Search
3) Mobile App
4) Social Media Content Design
5) Email & Push Notifications
6) Marketing Campaigns
Each of these topics can be expanded based on your understanding of customer journey maps under different circumstances. There are four types of elements to test:
1) Design Related: Site Page, Flow, Elements, etc.
2) Model Related: Internal KPI and Business Models
3) Algorithms Related: Backend Functionality & Algorithms
4) New Product/Feature Launches
Ideally, you can spot the A/B testing objectives right in the job description. If you have figured out “what to A/B test” before the interview, you are already halfway there. From here, the lovely HR interviewers have two questions for you:
1. What will be the metrics to deliver results?
a. Based on “what to test”, you already have sets of metrics to answer with. E.g., CTR for web design, open rate for email, ROI for ad design, etc. If you cannot list five KPIs for each situation, brush up on the essential digital marketing concepts. (Common metric formulas are sketched after this list.)
b. Clarify the business objective with your interviewer. E.g., impressions and traffic are great metrics for a new product launch; DAU is great for product development; ROAS and CTR are great for monetizing a “cash cow”; subscription rate and churn are great for building customer loyalty.
c. Do not forget to mention guardrail metrics like bounce rate or churn rate. Even after the data analyst gets a significant p-value, the launch decision still takes judgment; guardrail metrics help managers balance short-term and long-term goals.
d. I usually make sure the interviewer can follow my thought process instead of throwing out an answer in 15 seconds; the reasoning, not the speed, is what decides the performance of a project.
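For quick reference, here is a minimal sketch of a few of these formulas in R (the language this article's analysis uses); all the campaign numbers below are invented for illustration.

```r
# Common digital-marketing metrics on made-up campaign numbers.
impressions <- 50000   # times the ad or page element was shown
clicks      <- 1200    # clicks it received
emails_sent <- 8000
emails_open <- 1900
ad_spend    <- 3000    # campaign cost ($)
ad_revenue  <- 8400    # revenue attributed to the campaign ($)

ctr       <- clicks / impressions      # click-through rate
open_rate <- emails_open / emails_sent # email open rate
roas      <- ad_revenue / ad_spend     # return on ad spend

round(c(CTR = ctr, OpenRate = open_rate, ROAS = roas), 3)
```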
2. How would you design this experiment?
Tons of questions can derive from each subtopic in each step. A general answer is often good enough to move the interviewer on to the next question. From here, I want to pick out several high-frequency A/B testing questions:
1. How do you decide on sample size and experiment length?
· Large-scale firms have mature A/B testing documentation. You can either do research or make a reasonable guess to deliver more specific answers. For example, to test a button color on an Amazon page, the traffic is so large that you probably only need a very short run; to test a promotion strategy, you need to consider a lot more: time of day, day of the week, holidays, trends, news, platforms, spillovers, and competitor behaviors. To avoid going off topic, it is always better to research the company before the interview.
· There are three factors influencing sample size: the choice of metric, the choice of unit of diversion (e.g., page view vs. user ID, which differ in latency), and the choice of population.
2. How do you do power analysis?
First question: what is power? Power is the probability that we correctly reject the null hypothesis. It equals 1 - β, where β is the Type II error rate (the false negative rate).
Welcome to Statistics 101. If you want to figure out the math behind it, check my StatQuest reference at the end of the article. If you just want the sample size, google “statistical power calculator”. Say I set the power to 0.8 and the calculator gives me a sample size of 9: if I collect 9 measurements per group, I will have an 80% chance of correctly rejecting the null hypothesis.
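If you prefer code to an online calculator, base R's stats functions solve for the sample size directly. A minimal sketch, with an invented effect size and invented conversion rates:

```r
# Sample size for comparing two means (all effect-size numbers are invented).
power.t.test(delta = 0.5,       # minimum detectable difference in means
             sd = 1,            # assumed standard deviation
             sig.level = 0.05,  # alpha, the Type I error rate
             power = 0.8)       # 1 - beta; returns n per group

# For conversion-rate metrics, the proportion version is more common.
power.prop.test(p1 = 0.10,      # baseline conversion rate
                p2 = 0.12,      # conversion rate we hope to detect
                sig.level = 0.05,
                power = 0.8)
```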
Key statistical parameters to keep straight: the significance level α (Type I error rate), the Type II error rate β, power = 1 - β, the minimum detectable effect (effect size), and the sample size n.
T-Test
Three types of t-tests (sketched in R after this list):
a. Independent t-test: compares the means of two independent groups.
b. Paired-sample t-test: compares the mean of one group before and after a treatment.
c. One-sample t-test: compares a group mean to a specific value.
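A minimal sketch of all three variants on simulated data (every number below is made up):

```r
set.seed(42)
control   <- rnorm(100, mean = 10.0, sd = 2)
treatment <- rnorm(100, mean = 10.5, sd = 2)

# a. Independent t-test: two separate groups.
t.test(treatment, control)

# b. Paired-sample t-test: the same units before vs. after treatment.
before <- rnorm(50, mean = 10, sd = 2)
after  <- before + rnorm(50, mean = 0.3, sd = 1)
t.test(after, before, paired = TRUE)

# c. One-sample t-test: one group's mean against a fixed value.
t.test(control, mu = 10)
```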
3. What could be the interference between the experiment and control groups?
For the US internet industry, you can classify this problem into two types: O2O/two-sided markets like Airbnb, Uber, and eBay, and social network markets like Facebook, Instagram, and LinkedIn.
For two-sided markets, the ATE is overestimated. For example, suppose you launch a new Airbnb feature that lets customers get a coupon from the host, with the goal of helping hosts get attention from customers. Hosts in the control group will spill over into the experiment group to participate, since both groups compete in the same shared market; this widens the measured gap between groups and leads to an overestimated ATE.
How to solve it?
You can combine difference-in-differences (DiD) analysis with geo- or time-based randomization to solve this issue, as in the sketch below.
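Here is a minimal DiD sketch on simulated geo-level data, where the geo counts, effect sizes, and noise are all invented: randomize whole geos, observe each one before and after launch, and read the ATE off the interaction term.

```r
set.seed(1)
did <- data.frame(
  geo     = rep(1:20, each = 2),      # 20 geos, one pre and one post row each
  treated = rep(c(0, 1), each = 20),  # geos 1-10 control, 11-20 treated
  post    = rep(c(0, 1), times = 20)  # 0 = pre-period, 1 = post-period
)
did$y <- 10 + 1.5 * did$treated + 0.5 * did$post +
         2.0 * did$treated * did$post + rnorm(40)  # true ATE = 2.0

# The coefficient on treated:post is the DiD estimate of the ATE.
summary(lm(y ~ treated * post, data = did))
```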
For social network markets, the ATE is underestimated. For example, you launch a feature to drive customer engagement. After we randomize the samples, users in the control group can still be exposed to the feature through spillover across the social network, which shrinks the measured gap between groups and leads to an underestimated ATE.
How to solve it?
1. You can create clusters to isolate users on the social network.
2. You can utilize the ego-network randomization strategy, a cluster-based approach developed at LinkedIn in which each cluster is a user (the “ego”) plus their immediate connections. (A minimal cluster-randomization sketch follows.)
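Here is a minimal sketch of cluster-level assignment. The cluster IDs are hypothetical; in practice they would come from community detection or ego networks on the social graph, and the user and cluster counts below are invented.

```r
set.seed(7)
users <- data.frame(user_id = 1:1000,
                    cluster = sample(1:50, 1000, replace = TRUE))

# Randomize whole clusters, not individual users, so connected users
# land in the same variant and spillover stays inside each group.
assignment <- data.frame(cluster = 1:50,
                         group = sample(c("control", "treatment"),
                                        50, replace = TRUE))
users <- merge(users, assignment, by = "cluster")
table(users$group)
```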
4. What else might cause an incorrect result?
1) Effects:
a) Novelty Effect: People welcome the change now, but they might lose interest later. In this case, the ATE is overestimated.
b) Primacy Effect: People resist the change now, but they might grow more interested in later stages. In this case, the ATE is underestimated.
How to address this issue:
a) Estimate the novelty effect by comparing new users’ results across the two groups, since new users have never experienced the old version.
b) DiD: Estimate the novelty and primacy effects by first calculating the difference between new users and existing users. (A sketch follows this list.)
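A minimal sketch of the new-user comparison on simulated data, where the baseline conversion rate, the true effect, and the “novelty bump” for existing users are all invented:

```r
set.seed(3)
ab <- data.frame(group  = rep(c("control", "treatment"), each = 1000),
                 is_new = rbinom(2000, 1, 0.3))
treat <- ab$group == "treatment"

# Existing users get a novelty bump (+0.05) on top of the true effect (+0.02).
ab$converted <- rbinom(2000, 1,
                       0.10 + 0.02 * treat + 0.05 * treat * (1 - ab$is_new))

# New users approximate the long-run effect; existing users are inflated.
t.test(converted ~ group, data = subset(ab, is_new == 1))
t.test(converted ~ group, data = subset(ab, is_new == 0))
```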
2) Multiple Testing Problem (p-hacking):
If we have more than two variants, the per-test p-value threshold should be smaller than 0.05: with more treatment groups, the chance that at least one comparison rejects the null hypothesis by luck alone grows well above 5%.
How to address this issue (see the sketch after this list):
a). Bonferroni correction: Divide the P-Value threshold by the number of tests.
b). Control FDR (False Discovery Rate).
c). Bootstrap.
d). FWER: Familywise Error Rate analysis.
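A minimal sketch of the first two corrections using base R's p.adjust; the p-values are made up, and “BH” is the Benjamini-Hochberg FDR procedure:

```r
# Raw p-values from several variant-vs-control comparisons (invented).
p_values <- c(0.012, 0.030, 0.041, 0.20)

# a) Bonferroni: equivalent to dividing the 0.05 threshold by the test count.
p.adjust(p_values, method = "bonferroni")

# b) Benjamini-Hochberg: controls the false discovery rate instead.
p.adjust(p_values, method = "BH")

# Compare the adjusted p-values to 0.05 as usual.
p.adjust(p_values, method = "BH") < 0.05
```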
3) Back to the obstacles in causality analysis (confounding relationships):
a) Sample Selection Bias
b) Reverse Causality / Simultaneity
c) Omitted Variable Bias
Any “obstacle” you find at a later stage might require you to redesign the entire A/B test. There are techniques to address these problems, like DiD (difference-in-differences) analysis, RDD (regression discontinuity design), blocking, and clustering; an RDD sketch follows.
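DiD was sketched earlier, so here is a minimal RDD sketch on simulated data (the cutoff, slopes, and jump size are all invented): units above a cutoff on a running variable get the treatment, and the jump at the cutoff estimates the effect.

```r
set.seed(5)
x <- runif(500, -1, 1)         # running variable, cutoff at 0
treated <- as.numeric(x >= 0)  # treatment assigned by the cutoff rule
y <- 2 + 1.5 * x + 1.0 * treated + rnorm(500, sd = 0.5)  # true jump = 1.0

# Fit separate slopes on each side of the cutoff; the coefficient on
# 'treated' estimates the effect right at the cutoff.
summary(lm(y ~ treated * x))
```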
2. How do you interpret the result?
1. What if the result is not significant?
1). Break the test down into groups: different platforms, regions, times of day or days of the week, and customer segments if applicable.
2). Extend the test duration: repeatedly peeking can turn into p-hacking, but extending is feasible if false positives/negatives carry barely any cost.
3). Cross-check: use a blocking strategy and non-parametric sign tests (a sign-test sketch follows this list). Simpson’s paradox reminds us that when groups are combined, the estimated ATE can shrink, disappear, or even reverse.
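A minimal sign-test sketch on made-up daily conversion rates: count the days the treatment beat the control, then test that count against a fair coin.

```r
control_daily   <- c(0.101, 0.098, 0.105, 0.097, 0.102, 0.099, 0.103)
treatment_daily <- c(0.104, 0.101, 0.103, 0.100, 0.106, 0.102, 0.105)

# Days on which treatment outperformed control.
wins <- sum(treatment_daily > control_daily)

# Under the null of no effect, wins ~ Binomial(n, 0.5).
binom.test(wins, length(treatment_daily), p = 0.5)
```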
2. Launch or not?
This is more of a product-sense interview question. Here are four aspects you can dive into when answering it:
1). What will be the positive business impact? Suppose we conclude that CTR will increase by 20%: what gets influenced positively (revenue, profit, brand awareness, brand image)?
2). Conflicting results: when CTR increased in the experiment group, were there conflicting results like a decrease in payment amount or an increase in M&S costs? A typical example comes from website design: session duration is usually not a great metric, because sometimes customers stay longer out of interest, but sometimes they are just spending extra time looking for what they need, which comes with a decrease in CTR.
3). Measure the risk: does this launch touch on business ethics topics like customer privacy or other physical, emotional, or social concerns?
4). Balance short-term and long-term goals. A short-term impression boost can conflict with the brand image or the company’s mission in the long run.
Study Guide to Dive Deeper
This is the roadmap I followed when learning to perform A/B testing analysis in R. My sample code is in my GitHub repo.
References
[1] Georgiev, G. Z. (2019). Statistical Methods in Online A/B Testing: Statistics for data-driven business decisions and risk management in e-commerce (1st ed.). Independently published.
[2] Amy Gallo. (2017/06/28). A Refresher on A/B Testing. Harvard Business Review. Retrieved 2021/06/22, from https://hbr.org/2017/06/a-refresher-on-ab-testing
[3] Kelly Peng. (2017/11/12). A Summary of Udacity A/B Testing Course. Towards Data Science. Retrieved 2021/06/22, from https://towardsdatascience.com/a-summary-of-udacity-a-b-testing-course-9ecc32dedbb1
[4] Emma Ding. (2021/01/17). 7 A/B Testing Questions and Answers in Data Science Interviews. Towards Data Science. Retrieved 2021/06/22.
[5] Shanelle Mullin. (2020/04/10). The Complete Guide to A/B Testing: Expert Tips from Google, HubSpot and More. Shopify. Retrieved 2021/06/22, from https://www.shopify.com/blog/the-complete-guide-to-ab-testing
[6] Testing Theory. (2019/01/03). A/B Testing Intro: Why, What, Where, & How to A/B Test [Video]. YouTube. Retrieved 2021/06/22, from https://www.youtube.com/watch?v=CH89jd4haRE
[7] DataSciencePro. (2021/01/13). A/B Testing Problems in Data Science Interviews | Product Sense | Case Interview [Video]. YouTube. Retrieved 2021/06/22, from https://www.youtube.com/watch?v=X8u6kr4fxXc