library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)
library(flextable)
library(janitor)
7 Experiment Analysis
7.1 Strengths and Limits of Experiment Analysis with SQL
I would reframe this discussion. We might be using a database because the results of an experiment are stored there.
An alternative approach might have the results of an experiment extracted, exported to a CSV or Excel file, attached to an email, and sent to you for analysis.1 But extracted from where? If the data are in a database, it would be better to cut out the middleman and just get the data directly.
The limitations of SQL essentially vanish when we access the data using dbplyr. R can do any statistical calculation we can think of, so we have no need of an online statistical calculator (though such calculators can help us frame our analyses and check our results).
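For example, the tail probabilities that an online calculator reports are one-liners in base R (the statistic values below are placeholders, not results from this chapter):

# p-value for a chi-squared statistic of 3.84 with 1 degree of freedom
pchisq(3.84, df = 1, lower.tail = FALSE)
# two-sided p-value for a t statistic of 1.96 with 100 degrees of freedom
2 * pt(-1.96, df = 100)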
Cathy does say that “many databases allow developers to extend SQL functionality with user-defined functions (UDFs) … but they are beyond the scope of this book.” For PostgreSQL there are PL/Python and PL/R, which allow creation of functions in Python and R, respectively. When I first started to use PostgreSQL, these extensions seemed pretty exciting and, because I was maintaining my own databases, I could use them. But over time, I found maintaining UDFs to be more effort than could be justified and I no longer use them. Instead, if I need to do analysis in Python or R, I will extract data from the database, do the analysis, then write data back to the database. While this likely partly reflects the kinds of analysis I do, I think that UDFs are likely to be off limits to most users because of the additional complexity of turning code into UDFs and because many users would lack sufficient database privileges to create these.
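As a concrete illustration of that round trip, here is a minimal sketch; the table and column names are hypothetical:

library(DBI)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "scores", data.frame(id = 1:5, x = c(2, 4, 4, 4, 5)))

# Extract the data from the database ...
scores <- dbReadTable(con, "scores")

# ... do the analysis in R (here, standardizing a column) ...
scores$z <- as.vector(scale(scores$x))

# ... then write the results back to the database.
dbWriteTable(con, "scores_scaled", scores, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)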
7.2 The Data Set
As discussed in Tanimura (2021), there are four tables used in this chapter. We can get these data from the GitHub site associated with Tanimura (2021).2
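If the files are not already in a local data directory, something like the following could fetch them first; the base_url here is a placeholder to be pointed at the book's GitHub site:

# Sketch: download the four CSVs into data/ before loading them.
# base_url is a placeholder, not the actual location of the book's data.
base_url <- "https://example.com/path/to/data/"
files <- c("game_users.csv", "game_actions.csv",
           "game_purchases.csv", "exp_assignment.csv")
dir.create("data", showWarnings = FALSE)
for (f in files) {
  download.file(paste0(base_url, f), file.path("data", f), mode = "wb")
}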
db <- dbConnect(duckdb::duckdb())
game_users <-
tbl(db, "read_csv_auto('data/game_users.csv')") |>
compute(name = "game_users")
game_actions <-
tbl(db, "read_csv_auto('data/game_actions.csv')") |>
compute(name = "game_actions")
game_purchases <- tbl(db, "read_csv_auto('data/game_purchases.csv')")
exp_assignment <- tbl(db, "read_csv_auto('data/exp_assignment.csv')")
7.3 Types of Experiments
I would reword the first paragraph here to the following for clarity (edits in italics):
There is a wide range of experiments. If you can change something that a user, customer, constituent, or other entity experiences, you can in theory test the effect of that change on some outcome.
7.3.1 Experiments with Binary Outcomes: The Chi-Squared Test
To better match the approach of the book, I essentially create the contingency table in the database. An alternative approach would have been to collect() after summarize() and then do the statistical analysis in R; in fact, many of the functions in R are better set up for this approach. However, it means bringing more data into R and doing more of the calculation there. If the experiment is very large, or you would rather the database server do more of the work, then the approach below may be preferred. (I show a sketch of the collect()-based alternative after the test results below.)
cont_tbl <-
exp_assignment |>
left_join(game_actions, by = "user_id") |>
group_by(variant, user_id) |>
summarize(completed =
coalesce(any(action == "onboarding complete", na.rm = TRUE),
FALSE),
.groups = "drop") |>
count(variant, completed) |>
mutate(completed = if_else(completed, "Yes", "No")) |>
pivot_wider(names_from = completed, values_from = n) |>
collect()
Using the packages janitor and flextable, I mimic the nicely formatted output shown in the book:
cont_tbl |>
adorn_totals(c("row", "col")) |>
mutate(`% Complete` = prettyNum(Yes/Total * 100, digits = 4)) |>
flextable() |>
add_header_row(values=c("", "Completed onboarding", ""),
colwidths = c(1, 2, 2))

|           | Completed onboarding |        |         |            |
|-----------|----------------------|--------|---------|------------|
| variant   | Yes                  | No     | Total   | % Complete |
| variant 1 | 38,280               | 11,995 | 50,275  | 76.14      |
| control   | 36,268               | 13,629 | 49,897  | 72.69      |
| Total     | 74,548               | 25,624 | 100,172 | 74.42      |
Now I can do the Chi-squared test. I need to turn variant into the row names of the contingency table so that we have a simple \(2 \times 2\) numeric table as the input to our statistical test, and I use column_to_rownames() to this end. I then pipe the result into the base R function chisq.test(). I specify correct = FALSE so that my result matches what I got from the online calculator I found. I then display the Chi-squared statistic and the \(p\)-value.
res <-
cont_tbl |>
column_to_rownames(var = "variant") |>
chisq.test(correct = FALSE)
res$statistic
X-squared
157.0758
res$p.value
[1] 4.926953e-36
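As noted earlier, the alternative is to collect() the per-user outcomes and let R build the contingency table itself. A minimal sketch of that approach, reusing the same pipeline as above but stopping before count():

completions <-
  exp_assignment |>
  left_join(game_actions, by = "user_id") |>
  group_by(variant, user_id) |>
  summarize(completed =
              coalesce(any(action == "onboarding complete", na.rm = TRUE),
                       FALSE),
            .groups = "drop") |>
  collect()

# table() builds the 2 x 2 contingency table that chisq.test() expects
chisq.test(table(completions$variant, completions$completed),
           correct = FALSE)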
7.3.2 Experiments with Continuous Outcomes: The t-Test
amounts <-
exp_assignment |>
filter(exp_name == 'Onboarding') |>
left_join(game_purchases, by = "user_id") |>
group_by(variant, user_id) |>
summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
.groups = "drop")
t_test_stats <-
amounts |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
t_test_stats |>
kable(digits = 3)

| variant | n | mean | sd |
|---|---|---|---|
| variant 1 | 50275 | 3.688 | 19.22 |
| control | 49897 | 3.781 | 18.94 |
We can write a small function to which we pass a data frame as df. The calculations assume that df contains two rows (one for each group) and columns named mean, sd, and n giving the mean, standard deviation, and number of observations, respectively, for each group.
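In other words, the function computes

\[
t = \frac{\left|\bar{x}_1 - \bar{x}_2\right|}{\sqrt{s_1^2/n_1 + s_2^2/n_2}},
\]

with the two-sided \(p\)-value taken from a \(t\) distribution with \(n_1 + n_2\) degrees of freedom, a simplification of the Welch formula that makes no practical difference with samples of this size.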
t_test <- function(df) {
  # Absolute difference between the two group means
  mean_diff <- abs(df$mean[1] - df$mean[2])
  # Standard error of the difference in means
  se_diff <- sqrt(sum(df$sd^2 / df$n))
  t_stat <- mean_diff / se_diff
  # Two-sided p-value from the t distribution
  p <- pt(t_stat, df = sum(df$n))
  p_val <- 2 * min(p, 1 - p)
  return(list("statistic" = t_stat, "p-value" = p_val))
}
t_test(t_test_stats)
$statistic
[1] 0.7765458
$`p-value`
[1] 0.4374286
These values line up with those obtained from the online calculator I found.
An alternative approach would be to collect() the underlying data and do the \(t\)-test in R.
t_test_data <-
amounts |>
select(variant, amount) |>
collect()
t.test(formula = amount ~ variant, data = t_test_data)
Welch Two Sample t-test
data: amount by variant
t = 0.77655, df = 100165, p-value = 0.4374
alternative hypothesis: true difference in means between group control and group variant 1 is not equal to 0
95 percent confidence interval:
-0.1426893 0.3299478
sample estimates:
mean in group control mean in group variant 1
3.781218 3.687589
Reassuringly, the \(t\)-statistic and \(p\)-value match those from our t_test() function above. We can repeat the analysis restricted to users who completed onboarding:

t_test_stats_2 <-
amounts |>
inner_join(game_actions, by = "user_id") |>
filter(action == "onboarding complete") |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
t_test_stats_2 |>
kable(digits = 3)

| variant | n | mean | sd |
|---|---|---|---|
| variant 1 | 38280 | 4.843 | 21.899 |
| control | 36268 | 5.202 | 22.049 |
t_test(t_test_stats_2)
$statistic
[1] 2.229649
$`p-value`
[1] 0.02577371
7.4 Challenges with Experiments and Options for Rescuing Flawed Experiments
7.4.1 Variant Assignment
7.4.2 Outliers
For each variant, we calculate the share of users who completed onboarding and made at least one purchase:

exp_assignment |>
left_join(game_purchases, by = "user_id", keep = TRUE,
suffix = c("", ".y")) |>
inner_join(game_actions, by = "user_id") |>
filter(action == "onboarding complete",
exp_name == 'Onboarding') |>
group_by(variant) |>
summarize(total_cohorted = n_distinct(user_id),
purchasers = n_distinct(user_id.y),
.groups = "drop") |>
mutate(pct_purchased = purchasers * 100.0 / total_cohorted) |>
kable(digits = 2)

| variant | total_cohorted | purchasers | pct_purchased |
|---|---|---|---|
| control | 36268 | 4988 | 13.75 |
| variant 1 | 38280 | 4981 | 13.01 |
7.4.3 Time Boxing
Here a non-equi join via join_by() keeps only purchases made no later than seven days after each user's assignment date:

amounts_boxed <-
exp_assignment |>
filter(exp_name == 'Onboarding') |>
mutate(exp_end = exp_date + days(7)) |>
left_join(game_purchases,
by = join_by(user_id, exp_end >= purch_date)) |>
group_by(variant, user_id) |>
summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
.groups = "drop")
t_test_stats_boxed <-
amounts_boxed |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
t_test_stats_boxed |>
kable(digits = 3)

| variant | n | mean | sd |
|---|---|---|---|
| variant 1 | 50275 | 1.352 | 5.613 |
| control | 49897 | 1.369 | 5.766 |
t_test(t_test_stats_boxed)
$statistic
[1] 0.4920715
$`p-value`
[1] 0.6226699
7.4.4 Pre-Post Analysis
The original SQL query is something like this, but we make some tweaks to get this to work more naturally using dbplyr.
SELECT
CASE WHEN a.created BETWEEN '2020-01-13' AND '2020-01-26' THEN 'pre'
WHEN a.created BETWEEN '2020-01-27' AND '2020-02-09' THEN 'post'
END AS variant,
count(distinct a.user_id) AS cohorted,
count(distinct b.user_id) AS opted_in,
count(distinct b.user_id) * 1.0 / count(DISTINCT a.user_id) AS pct_optin,
count(distinct a.created) AS days
FROM game_users a
LEFT JOIN game_actions b ON a.user_id = b.user_id
AND b.action = 'email_optin'
WHERE a.created BETWEEN '2020-01-13' AND '2020-02-09'
GROUP BY 1
;
The first tweak is to translate the b.action = 'email_optin' part of the join into a filter, which we embed in the left_join() portion of our code (mirroring how it works in the SQL). The second tweak is to move pct_optin out of the grouped aggregation portion of the query, as in the SQL it refers to what we will later have as opted_in and cohorted. (In fact, I don't bother with pct_optin at all, as it's easy to calculate when we use it in making a table.)
Note that COUNT(DISTINCT col) becomes n_distinct(col). We use keep = TRUE and suffix = c("", ".y") to store what is b.user_id in the SQL as user_id.y.
We calculate not_opted_in so that our table is better prepared for the chisq.test() we will pass it to later on.
opt_in_df <-
game_users |>
filter(between(created,
as.Date('2020-01-13'),
as.Date('2020-02-09'))) |>
left_join(game_actions |>
filter(action == 'email_optin'), by = "user_id",
keep = TRUE,
suffix = c("", ".y")) |>
mutate(variant = if_else(created <= '2020-01-26', 'pre', 'post')) |>
group_by(variant) |>
summarize(cohorted = n_distinct(user_id),
opted_in = n_distinct(user_id.y),
days = n_distinct(created),
.groups = "drop") |>
mutate(not_opted_in = cohorted - opted_in) |>
collect()

opt_in_df |>
select(variant, days, opted_in, not_opted_in, cohorted) |>
adorn_totals(c("row")) |>
mutate(`% opted in` = prettyNum(100.0 * opted_in/cohorted, digits = 4)) |>
rename(Yes = opted_in, No = not_opted_in, Total = cohorted) |>
flextable() |>
add_header_row(values=c("", "Opted in", ""),
colwidths = c(2, 2, 2))

|         |      | Opted in |        |        |            |
|---------|------|----------|--------|--------|------------|
| variant | days | Yes      | No     | Total  | % opted in |
| pre     | 14   | 14,489   | 10,173 | 24,662 | 58.75      |
| post    | 14   | 11,220   | 16,397 | 27,617 | 40.63      |
| Total   | 28   | 25,709   | 26,570 | 52,279 | 49.18      |
res <-
opt_in_df |>
select(-cohorted, -days) |>
column_to_rownames(var = "variant") |>
chisq.test(correct = FALSE)
res$statistic
X-squared
1712.075
res$p.value
[1] 0
The \(p\)-value is so small that it underflows to zero in double-precision floating point.
7.4.5 Natural Experiment Analysis
To compare purchase rates between users in the United States and Canada, we count distinct purchasers by country:

by_country <-
game_users |>
filter(country %in% c('United States', 'Canada')) |>
left_join(game_purchases, by = "user_id",
keep = TRUE,
suffix = c("", ".y")) |>
group_by(country) |>
summarize(cohorted = n_distinct(user_id),
          # opted_in here actually counts distinct purchasers; the name
          # appears to be carried over from the pre-post query above
          opted_in = n_distinct(user_id.y),
.groups = "drop") |>
mutate(pct_purchased = 1.0 * opted_in/cohorted) |>
collect()
by_country |> kable()

| country | cohorted | opted_in | pct_purchased |
|---|---|---|---|
| United States | 45012 | 4958 | 0.1101484 |
| Canada | 20179 | 5011 | 0.2483275 |
res <-
by_country |>
select(-pct_purchased) |>
column_to_rownames(var = "country") |>
chisq.test(correct = FALSE)
res$statistic
X-squared
1447.273
res$p.value
[1] 1.122178e-316
To avoid a warning, we disconnect from the database once we’re done with it.
dbDisconnect(db, shutdown = TRUE)