library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)
library(flextable)
library(janitor)
7 Experiment Analysis
7.1 Strengths and Limits of Experiment Analysis with SQL
I would reframe this discussion. We might be using a database because the results of an experiment are stored there.
An alternative approach might have the results of an experiment extracted, exported to a CSV or Excel file, attached to an email, and sent to you for analysis.1 But extracted from where? If the data are in a database, it would be better to cut out the middleman and just get the data directly.
The limitations of SQL essentially vanish when we access the data using dbplyr. R can do any statistical calculation we can think of, so we have no need of an online statistical calculator (though such calculators can help us frame our analyses and check our results).
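For example, the tail probabilities that an online calculator reports are one-liners in base R (the statistic values below are placeholders, not results from this chapter):

# p-value for a chi-squared statistic of 3.84 with 1 degree of freedom
pchisq(3.84, df = 1, lower.tail = FALSE)
# two-sided p-value for a t statistic of 1.96 with 100 degrees of freedom
2 * pt(-1.96, df = 100)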
Cathy does say that “many databases allow developers to extend SQL functionality with user-defined functions (UDFs) … but they are beyond the scope of this book.” For PostgreSQL there are PL/Python and PL/R, which allow creation of functions in Python and R, respectively. When I first started to use PostgreSQL, these extensions seemed pretty exciting and, because I was maintaining my own databases, I could use them. But over time, I found maintaining UDFs to be more effort than could be justified and I no longer use them. Instead, if I need to do analysis in Python or R, I will extract data from the database, do the analysis, then write data back to the database. While this likely partly reflects the kinds of analysis I do, I think that UDFs are likely to be off limits to most users because of the additional complexity of turning code into UDFs and because many users would lack sufficient database privileges to create these.
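As a concrete illustration of that round trip, here is a minimal sketch; the table and column names are hypothetical:

library(DBI)

con <- dbConnect(duckdb::duckdb())
dbWriteTable(con, "scores", data.frame(id = 1:5, x = c(2, 4, 4, 4, 5)))

# Extract the data from the database ...
scores <- dbReadTable(con, "scores")

# ... do the analysis in R (here, standardizing a column) ...
scores$z <- as.vector(scale(scores$x))

# ... then write the results back to the database.
dbWriteTable(con, "scores_scaled", scores, overwrite = TRUE)
dbDisconnect(con, shutdown = TRUE)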
7.2 The Data Set
As discussed in Tanimura (2021), there are four tables used in this chapter. We can get these data from the GitHub site associated with Tanimura (2021).2
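If the files are not already in a local data directory, something like the following could fetch them first; the base_url here is a placeholder to be pointed at the book's GitHub site:

# Sketch: download the four CSVs into data/ before loading them.
# base_url is a placeholder, not the actual location of the book's data.
base_url <- "https://example.com/path/to/data/"
files <- c("game_users.csv", "game_actions.csv",
           "game_purchases.csv", "exp_assignment.csv")
dir.create("data", showWarnings = FALSE)
for (f in files) {
  download.file(paste0(base_url, f), file.path("data", f), mode = "wb")
}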
db <- dbConnect(duckdb::duckdb())
game_users <-
tbl(db, "read_csv_auto('data/game_users.csv')") |>
compute(name = "game_users")
game_actions <-
tbl(db, "read_csv_auto('data/game_actions.csv')") |>
compute(name = "game_actions")
game_purchases <- tbl(db, "read_csv_auto('data/game_purchases.csv')")
exp_assignment <- tbl(db, "read_csv_auto('data/exp_assignment.csv')")
7.3 Types of Experiments
I would reword the first paragraph here to the following for clarity (edits in italics):
There is a wide range of experiments. If you can change something that a user, customer, constituent, or other entity experiences, you can in theory test the effect of that change on some outcome.
7.3.1 Experiments with Binary Outcomes: The Chi-Squared Test
To better match the approach of the book, I essentially create the contingency table in the database. An alternative approach would have been to collect() after summarize() and then do the statistical analysis in R; in fact, many of the functions in R are better set up for this approach. However, it means bringing more data into R and doing more of the calculation there. If the experiment is very large, or you would rather the database server do more of the work, then the approach below may be preferred. (I show a sketch of the collect()-based alternative after the test results below.)
cont_tbl <-
exp_assignment |>
left_join(game_actions, by = "user_id") |>
group_by(variant, user_id) |>
summarize(completed =
coalesce(any(action == "onboarding complete", na.rm = TRUE),
FALSE),
.groups = "drop") |>
count(variant, completed) |>
mutate(completed = if_else(completed, "Yes", "No")) |>
pivot_wider(names_from = completed, values_from = n) |>
collect()
Using the packages janitor and flextable, I mimic the nicely formatted output shown in the book:
cont_tbl |>
adorn_totals(c("row", "col")) |>
mutate(`% Complete` = prettyNum(Yes/Total * 100, digits = 4)) |>
flextable() |>
add_header_row(values=c("", "Completed onboarding", ""),
colwidths = c(1, 2, 2))

|           | Completed onboarding |        |         |            |
|-----------|----------------------|--------|---------|------------|
| variant   | Yes                  | No     | Total   | % Complete |
| variant 1 | 38,280               | 11,995 | 50,275  | 76.14      |
| control   | 36,268               | 13,629 | 49,897  | 72.69      |
| Total     | 74,548               | 25,624 | 100,172 | 74.42      |
Now I can do the Chi-squared test. I need to turn variant into the row names of the contingency table so that we have a simple \(2 \times 2\) numeric table as the input to our statistical test, and I use column_to_rownames() to this end. I then pipe the result into the base R function chisq.test(). I specify correct = FALSE so that my result matches what I got from the online calculator I found. I then display the Chi-squared statistic and the \(p\)-value.
res <-
cont_tbl |>
column_to_rownames(var = "variant") |>
chisq.test(correct = FALSE)
res$statistic
X-squared
157.0758
res$p.value
[1] 4.926953e-36
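As noted earlier, the alternative is to collect() the per-user outcomes and let R build the contingency table itself. A minimal sketch of that approach, reusing the same pipeline as above but stopping before count():

completions <-
  exp_assignment |>
  left_join(game_actions, by = "user_id") |>
  group_by(variant, user_id) |>
  summarize(completed =
              coalesce(any(action == "onboarding complete", na.rm = TRUE),
                       FALSE),
            .groups = "drop") |>
  collect()

# table() builds the 2 x 2 contingency table that chisq.test() expects
chisq.test(table(completions$variant, completions$completed),
           correct = FALSE)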
7.3.2 Experiments with Continuous Outcomes: The t-Test
amounts <-
exp_assignment |>
filter(exp_name == 'Onboarding') |>
left_join(game_purchases, by = "user_id") |>
group_by(variant, user_id) |>
summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
.groups = "drop")
t_test_stats <-
amounts |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
t_test_stats |>
kable(digits = 3)

| variant | n | mean | sd |
|---|---|---|---|
| variant 1 | 50275 | 3.688 | 19.22 |
| control | 49897 | 3.781 | 18.94 |
We can write a small function to which we pass a data frame as df. The calculations assume that df contains two rows (one for each group) and columns named mean, sd, and n giving the mean, standard deviation, and number of observations, respectively, for each group.
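In other words, the function computes

\[
t = \frac{\left|\bar{x}_1 - \bar{x}_2\right|}{\sqrt{s_1^2/n_1 + s_2^2/n_2}},
\]

with the two-sided \(p\)-value taken from a \(t\) distribution with \(n_1 + n_2\) degrees of freedom, a simplification of the Welch formula that makes no practical difference with samples of this size.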
t_test <- function(df) {
  # Absolute difference between the two group means
  mean_diff <- abs(df$mean[1] - df$mean[2])
  # Standard error of the difference in means
  se_diff <- sqrt(sum(df$sd^2 / df$n))
  t_stat <- mean_diff / se_diff
  # Two-sided p-value from the t distribution
  p <- pt(t_stat, df = sum(df$n))
  p_val <- 2 * min(p, 1 - p)
  return(list("statistic" = t_stat, "p-value" = p_val))
}
t_test(t_test_stats)
$statistic
[1] 0.7765458
$`p-value`
[1] 0.4374286
These values line up with those obtained from the online calculator I found.
An alternative approach would be to collect() the underlying data and do the \(t\)-test in R.
t_test_data <-
amounts |>
select(variant, amount) |>
collect()
t.test(formula = amount ~ variant, data = t_test_data)
Welch Two Sample t-test
data: amount by variant
t = 0.77655, df = 100165, p-value = 0.4374
alternative hypothesis: true difference in means between group control and group variant 1 is not equal to 0
95 percent confidence interval:
-0.1426893 0.3299478
sample estimates:
mean in group control mean in group variant 1
3.781218 3.687589
Reassuringly, the \(t\)-statistic and \(p\)-value match those from our t_test() function above. We can repeat the analysis restricted to users who completed onboarding:

t_test_stats_2 <-
amounts |>
inner_join(game_actions, by = "user_id") |>
filter(action == "onboarding complete") |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
t_test_stats_2 |>
kable(digits = 3)

| variant | n | mean | sd |
|---|---|---|---|
| variant 1 | 38280 | 4.843 | 21.899 |
| control | 36268 | 5.202 | 22.049 |
t_test(t_test_stats_2)
$statistic
[1] 2.229649
$`p-value`
[1] 0.02577371
7.4 Challenges with Experiments and Options for Rescuing Flawed Experiments
7.4.1 Variant Assignment
7.4.2 Outliers
For each variant, we calculate the share of users who completed onboarding and made at least one purchase:

exp_assignment |>
left_join(game_purchases, by = "user_id", keep = TRUE,
suffix = c("", ".y")) |>
inner_join(game_actions, by = "user_id") |>
filter(action == "onboarding complete",
exp_name == 'Onboarding') |>
group_by(variant) |>
summarize(total_cohorted = n_distinct(user_id),
purchasers = n_distinct(user_id.y),
.groups = "drop") |>
mutate(pct_purchased = purchasers * 100.0 / total_cohorted) |>
kable(digits = 2)

| variant | total_cohorted | purchasers | pct_purchased |
|---|---|---|---|
| control | 36268 | 4988 | 13.75 |
| variant 1 | 38280 | 4981 | 13.01 |
7.4.3 Time Boxing
Here a non-equi join via join_by() keeps only purchases made no later than seven days after each user's assignment date:

amounts_boxed <-
exp_assignment |>
filter(exp_name == 'Onboarding') |>
mutate(exp_end = exp_date + days(7)) |>
left_join(game_purchases,
by = join_by(user_id, exp_end >= purch_date)) |>
group_by(variant, user_id) |>
summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
.groups = "drop")
t_test_stats_boxed <-
amounts_boxed |>
group_by(variant) |>
summarize(n = n(),
mean = mean(amount, na.rm = TRUE),
sd = sd(amount, na.rm = TRUE)) |>
collect()
t_test_stats_boxed |>
kable(digits = 3)

| variant | n | mean | sd |
|---|---|---|---|
| variant 1 | 50275 | 1.352 | 5.613 |
| control | 49897 | 1.369 | 5.766 |
t_test(t_test_stats_boxed)
$statistic
[1] 0.4920715
$`p-value`
[1] 0.6226699
7.4.4 Pre-Post Analysis
The original SQL query is something like this, but we make some tweaks to get this to work more naturally using dbplyr.
SELECT
CASE WHEN a.created BETWEEN '2020-01-13' AND '2020-01-26' THEN 'pre'
WHEN a.created BETWEEN '2020-01-27' AND '2020-02-09' THEN 'post'
END AS variant,
count(distinct a.user_id) AS cohorted,
count(distinct b.user_id) AS opted_in,
count(distinct b.user_id) * 1.0 / count(DISTINCT a.user_id) AS pct_optin,
count(distinct a.created) AS days
FROM game_users a
LEFT JOIN game_actions b ON a.user_id = b.user_id
AND b.action = 'email_optin'
WHERE a.created BETWEEN '2020-01-13' AND '2020-02-09'
GROUP BY 1
;
The first tweak is to translate the b.action = 'email_optin' part of the join into a filter, which we embed in the left_join() portion of our code (mirroring how it works in the SQL). The second tweak is to move pct_optin out of the grouped aggregation portion of the query, as in the SQL it refers to what we will later have as opted_in and cohorted. (In fact, I don't bother with pct_optin at all, as it's easy to calculate when we use it in making a table.)
Note that COUNT(DISTINCT col) becomes n_distinct(col). We use keep = TRUE and suffix = c("", ".y") to store what is b.user_id in the SQL as user_id.y.
We calculate not_opted_in so that our table is better prepared for the chisq.test() we will pass it to later on.
opt_in_df <-
game_users |>
filter(between(created,
as.Date('2020-01-13'),
as.Date('2020-02-09'))) |>
left_join(game_actions |>
filter(action == 'email_optin'), by = "user_id",
keep = TRUE,
suffix = c("", ".y")) |>
mutate(variant = if_else(created <= '2020-01-26', 'pre', 'post')) |>
group_by(variant) |>
summarize(cohorted = n_distinct(user_id),
opted_in = n_distinct(user_id.y),
days = n_distinct(created),
.groups = "drop") |>
mutate(not_opted_in = cohorted - opted_in) |>
collect()

opt_in_df |>
select(variant, days, opted_in, not_opted_in, cohorted) |>
adorn_totals(c("row")) |>
mutate(`% opted in` = prettyNum(100.0 * opted_in/cohorted, digits = 4)) |>
rename(Yes = opted_in, No = not_opted_in, Total = cohorted) |>
flextable() |>
add_header_row(values=c("", "Opted in", ""),
colwidths = c(2, 2, 2))

|         |      | Opted in |        |        |            |
|---------|------|----------|--------|--------|------------|
| variant | days | Yes      | No     | Total  | % opted in |
| pre     | 14   | 14,489   | 10,173 | 24,662 | 58.75      |
| post    | 14   | 11,220   | 16,397 | 27,617 | 40.63      |
| Total   | 28   | 25,709   | 26,570 | 52,279 | 49.18      |
res <-
opt_in_df |>
select(-cohorted, -days) |>
column_to_rownames(var = "variant") |>
chisq.test(correct = FALSE)
res$statistic
X-squared
1712.075
res$p.value
[1] 0
The \(p\)-value is so small that it underflows to zero in double-precision floating point.
7.4.5 Natural Experiment Analysis
To compare purchase rates between users in the United States and Canada, we count distinct purchasers by country:

by_country <-
game_users |>
filter(country %in% c('United States', 'Canada')) |>
left_join(game_purchases, by = "user_id",
keep = TRUE,
suffix = c("", ".y")) |>
group_by(country) |>
summarize(cohorted = n_distinct(user_id),
          # opted_in here actually counts distinct purchasers; the name
          # appears to be carried over from the pre-post query above
          opted_in = n_distinct(user_id.y),
.groups = "drop") |>
mutate(pct_purchased = 1.0 * opted_in/cohorted) |>
collect()
by_country |> kable()

| country | cohorted | opted_in | pct_purchased |
|---|---|---|---|
| United States | 45012 | 4958 | 0.1101484 |
| Canada | 20179 | 5011 | 0.2483275 |
res <-
by_country |>
select(-pct_purchased) |>
column_to_rownames(var = "country") |>
chisq.test(correct = FALSE)
res$statistic
X-squared
1447.273
res$p.value
[1] 1.122178e-316
To avoid a warning, we disconnect from the database once we’re done with it.
dbDisconnect(db, shutdown = TRUE)