library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)
library(flextable)
library(janitor)
7 Experiment Analysis
7.1 Strengths and Limits of Experiment Analysis with SQL
I would reframe this discussion. We might be using a database because the results of an experiment are stored there.
An alternative approach might have the results of an experiment are extracted and exported to a CSV or Excel file and attached to an email and sent to you for analysis.1 But extracted from where? If the data are in a database, it would be better to cut out the middleman and just get the data directly.
The limitations of SQL essentially vanish when we access the data using dbplyr
. R can do any statistical calculation we can think of, so we have no need of an online statistical calculator (though such calculators can help us frame our analyses and check our results).
Cathy does say that “many databases allow developers to extend SQL functionality with user-defined functions (UDFs) … but they are beyond the scope of this book.” For PostgreSQL there are PL/Python and PL/R, which allow creation of functions in Python and R, respectively. When I first started to use PostgreSQL, these extensions seemed pretty exciting and, because I was maintaining my own databases, I could use them. But over time, I found maintaining UDFs to be more effort than could be justified and I no longer use them. Instead, if I need to do analysis in Python or R, I will extract data from the database, do the analysis, then write data back to the database. While this likely partly reflects the kinds of analysis I do, I think that UDFs are likely to be off limits to most users because of the additional complexity of turning code into UDFs and because many users would lack sufficient database privileges to create these.
7.2 The Data Set
As discussed in Tanimura (2021), there are four tables used in this chapter. We can get these data from the GitHub site associated with Tanimura (2021).2
<- dbConnect(duckdb::duckdb())
db
<-
game_users tbl(db, "read_csv_auto('data/game_users.csv')") |>
compute(name = "game_users")
<-
game_actions tbl(db, "read_csv_auto('data/game_actions.csv')") |>
compute(name = "game_actions")
<- tbl(db, "read_csv_auto('data/game_purchases.csv')")
game_purchases <- tbl(db, "read_csv_auto('data/exp_assignment.csv')") exp_assignment
7.3 Types of Experiments
I would reword the first paragraph here to the following for clarity (edits in italics):
There is a wide range of experiments, If you can change something that a user, customer, constituent, or other entity experiences, you can in theory test the effect of that change on some outcome.
7.3.1 Experiments with Binary Outcomes: The Chi-Squared Test
To better match the approach of the book, I essentially create the contingency table in the database. An alternative approach would have been to collect()
after summarize()
and then do the statistical analysis in R. In fact, many of the functions in R are better set-up for this approach. However, this means bring more data into R and doing more calculation in R. If the experiment is very large or you would rather the database server do more of the work, then the approach below may be preferred.
<-
cont_tbl |>
exp_assignment left_join(game_actions, by = "user_id") |>
group_by(variant, user_id) |>
summarize(completed =
coalesce(any(action == "onboarding complete", na.rm = TRUE),
FALSE),
.groups = "drop") |>
count(variant, completed) |>
mutate(completed = if_else(completed, "Yes", "No")) |>
pivot_wider(names_from = completed, values_from = n) |>
collect()
Using the packages janitor
and flextable
, I mimic the nicely formatted output shown in the book:
|>
cont_tbl adorn_totals(c("row", "col")) |>
mutate(`% Complete` = prettyNum(Yes/Total * 100, digits = 4)) |>
flextable() |>
add_header_row(values=c("", "Completed onboarding", ""),
colwidths = c(1, 2, 2))
Completed onboarding | ||||
---|---|---|---|---|
variant | Yes | No | Total | % Complete |
variant 1 | 38,280 | 11,995 | 50,275 | 76.14 |
control | 36,268 | 13,629 | 49,897 | 72.69 |
Total | 74,548 | 25,624 | 100,172 | 74.42 |
Now I can do the Chi-squared test. I need to turn variant
into the row names of the contingency table so that we want a simple \(2 \times 2\) numeric table as the input to our statistical test and I use column_to_rownames()
to this end. I then pipe the result into the base R function chisq.test()
. I specified correct = FALSE
so that my result matched what I got from the online calculator I found. I then display the Chi-squared statistic and the \(p\)-value.
<-
res |>
cont_tbl column_to_rownames(var = "variant") |>
chisq.test(correct = FALSE)
$statistic res
X-squared
157.0758
$p.value res
[1] 4.926953e-36
7.3.2 Experiments with Continuous Outcomes: The t-Test
<-
amounts |>
exp_assignment filter(exp_name == 'Onboarding') |>
left_join(game_purchases, by = "user_id") |>
group_by(variant, user_id) |>
summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
.groups = "drop")
<-
t_test_stats |>
amounts group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
|>
t_test_stats kable(digits = 3)
variant | n | mean | sd |
---|---|---|---|
variant 1 | 50275 | 3.688 | 19.22 |
control | 49897 | 3.781 | 18.94 |
We can make a small function that we can pass a data frame to as df
. The calculations assume that df
contains two rows (one for each group) and columns named mean
, sd
, and n
for the mean, standard deviation, and number of observations, respectively, in each group.
<- function(df) {
t_test = abs(df$mean[1] - df$mean[2])
mean_diff <- sqrt(sum(df$sd^2 / df$n))
se_diff <- mean_diff / se_diff
t_stat <- pt(t_stat, df = sum(df$n))
p <- 2 * min(p, 1 - p)
p_val return(list("statistic" = t_stat, "p-value" = p_val))
}
t_test(t_test_stats)
$statistic
[1] 0.7765458
$`p-value`
[1] 0.4374286
These values line up with those obtained from the online calculator I found.
An alternative approach would be to collect()
the underlying data and do the \(t\)-test in R.
<-
t_test_data |>
amounts select(variant, amount) |>
collect()
t.test(formula = amount ~ variant, data = t_test_data)
Welch Two Sample t-test
data: amount by variant
t = 0.77655, df = 100165, p-value = 0.4374
alternative hypothesis: true difference in means between group control and group variant 1 is not equal to 0
95 percent confidence interval:
-0.1426893 0.3299478
sample estimates:
mean in group control mean in group variant 1
3.781218 3.687589
<-
t_test_stats_2 |>
amounts inner_join(game_actions, by = "user_id") |>
filter(action == "onboarding complete") |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
|>
t_test_stats_2 kable(digits = 3)
variant | n | mean | sd |
---|---|---|---|
variant 1 | 38280 | 4.843 | 21.899 |
control | 36268 | 5.202 | 22.049 |
t_test(t_test_stats_2)
$statistic
[1] 2.229649
$`p-value`
[1] 0.02577371
7.4 Challenges with Experiments and Options for Rescuing Flawed Experiments
7.4.1 Variant Assignment
7.4.2 Outliers
|>
exp_assignment left_join(game_purchases, by = "user_id", keep = TRUE,
suffix = c("", ".y")) |>
inner_join(game_actions, by = "user_id") |>
filter(action == "onboarding complete",
== 'Onboarding') |>
exp_name group_by(variant) |>
summarize(total_cohorted = n_distinct(user_id),
purchasers = n_distinct(user_id.y),
.groups = "drop") |>
mutate(pct_purchased = purchasers * 100.0 / total_cohorted) |>
kable(digits = 2)
variant | total_cohorted | purchasers | pct_purchased |
---|---|---|---|
control | 36268 | 4988 | 13.75 |
variant 1 | 38280 | 4981 | 13.01 |
7.4.3 Time Boxing
<-
amounts_boxed |>
exp_assignment filter(exp_name == 'Onboarding') |>
mutate(exp_end = exp_date + days(7)) |>
left_join(game_purchases,
by = join_by(user_id, exp_end >= purch_date)) |>
group_by(variant, user_id) |>
summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
.groups = "drop")
<-
t_test_stats_boxed |>
amounts_boxed group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
|>
t_test_stats_boxed kable(digits = 3)
variant | n | mean | sd |
---|---|---|---|
variant 1 | 50275 | 1.352 | 5.613 |
control | 49897 | 1.369 | 5.766 |
t_test(t_test_stats_boxed)
$statistic
[1] 0.4920715
$`p-value`
[1] 0.6226699
7.4.4 Pre-Post Analysis
The original SQL query is something like this, but we make some tweaks to get this to work more naturally using dbplyr
.
SELECT
CASE WHEN a.created BETWEEN '2020-01-13' AND '2020-01-26' THEN 'pre'
WHEN a.created BETWEEN '2020-01-27' AND '2020-02-09' THEN 'post'
END AS variant,
count(distinct a.user_id) AS cohorted,
count(distinct b.user_id) AS opted_in,
count(distinct b.user_id) * 1.0 / count(DISTINCT a.user_id) AS pct_optin,
count(distinct a.created) AS days
FROM game_users a
LEFT JOIN game_actions b ON a.user_id = b.user_id
AND b.action = 'email_optin'
WHERE a.created BETWEEN '2020-01-13' AND '2020-02-09'
GROUP BY 1
;
The first tweak is to translate the b.action = 'email_optin'
part of the join into a filter, which we embed in the left_join()
portion of our code (mirroring how it works in the SQL). The second tweak is to move the pct_optin
out of the grouped aggregation portion of the query, as in the SQL it is referring to what we will later have as opted_in
and cohorted
. (In fact, I don’t bother with `pct_optin
at all, as it’s easy to calculate when we use it in making a table.)
Note that COUNT(DISTINCT col)
becomes n_distinct(col)
. We use keep = TRUE
and suffix = c("", ".y")
to store what is b.user_id
in the SQL as user_id.y
.
We calculate not_opted_in
so that our table is better prepared for the chisq.test()
we will pass it to later on.
<-
opt_in_df |>
game_users filter(between(created,
as.Date('2020-01-13'),
as.Date('2020-02-09'))) |>
left_join(game_actions |>
filter(action == 'email_optin'), by = "user_id",
keep = TRUE,
suffix = c("", ".y")) |>
mutate(variant = if_else(created <= '2020-01-26', 'pre', 'post')) |>
group_by(variant) |>
summarize(cohorted = n_distinct(user_id),
opted_in = n_distinct(user_id.y),
days = n_distinct(created),
.groups = "drop") |>
mutate(not_opted_in = cohorted - opted_in) |>
collect()
|>
opt_in_df select(variant, days, opted_in, not_opted_in, cohorted) |>
adorn_totals(c("row")) |>
mutate(`% opted in` = prettyNum(100.0 * opted_in/cohorted, digits = 4)) |>
rename(Yes = opted_in, No = not_opted_in, Total = cohorted) |>
flextable() |>
add_header_row(values=c("", "Opted in", ""),
colwidths = c(2, 2, 2))
Opted in | |||||
---|---|---|---|---|---|
variant | days | Yes | No | Total | % opted in |
pre | 14 | 14,489 | 10,173 | 24,662 | 58.75 |
post | 14 | 11,220 | 16,397 | 27,617 | 40.63 |
Total | 28 | 25,709 | 26,570 | 52,279 | 49.18 |
<-
res |>
opt_in_df select(-cohorted, -days) |>
column_to_rownames(var = "variant") |>
chisq.test(correct = FALSE)
$statistic res
X-squared
1712.075
$p.value res
[1] 0
7.4.5 Natural experiment analysis
<-
by_country |>
game_users filter(country %in% c('United States', 'Canada')) |>
left_join(game_purchases, by = "user_id",
keep = TRUE,
suffix = c("", ".y")) |>
group_by(country) |>
summarize(cohorted = n_distinct(user_id),
opted_in = n_distinct(user_id.y),
.groups = "drop") |>
mutate(pct_purchased = 1.0 * opted_in/cohorted) |>
collect()
|> kable() by_country
country | cohorted | opted_in | pct_purchased |
---|---|---|---|
United States | 45012 | 4958 | 0.1101484 |
Canada | 20179 | 5011 | 0.2483275 |
<-
res |>
by_country select(-pct_purchased) |>
column_to_rownames(var = "country") |>
chisq.test(correct = FALSE)
$statistic res
X-squared
1447.273
$p.value res
[1] 1.122178e-316
To avoid a warning, we disconnect from the database once we’re done with it.
dbDisconnect(db, shutdown = TRUE)