7  Experiment Analysis

7.1 Strengths and Limits of Experiment Analysis with SQL

I would reframe this discussion. We might be using a database because the results of an experiment are stored there.

An alternative approach might have the results of an experiment are extracted and exported to a CSV or Excel file and attached to an email and sent to you for analysis.1 But extracted from where? If the data are in a database, it would be better to cut out the middleman and just get the data directly.

The limitations of SQL essentially vanish when we access the data using dbplyr. R can do any statistical calculation we can think of, so we have no need of an online statistical calculator (though such calculators can help us frame our analyses and check our results).

Cathy does say that “many databases allow developers to extend SQL functionality with user-defined functions (UDFs) … but they are beyond the scope of this book.” For PostgreSQL there are PL/Python and PL/R, which allow creation of functions in Python and R, respectively. When I first started to use PostgreSQL, these extensions seemed pretty exciting and, because I was maintaining my own databases, I could use them. But over time, I found maintaining UDFs to be more effort than could be justified and I no longer use them. Instead, if I need to do analysis in Python or R, I will extract data from the database, do the analysis, then write data back to the database. While this likely partly reflects the kinds of analysis I do, I think that UDFs are likely to be off limits to most users because of the additional complexity of turning code into UDFs and because many users would lack sufficient database privileges to create these.

library(DBI)
library(tidyverse)
library(dbplyr)
library(knitr)
library(flextable)
library(janitor)

7.2 The Data Set

As discussed in Tanimura (2021), there are four tables used in this chapter. We can get these data from the GitHub site associated with Tanimura (2021).2

db <- dbConnect(duckdb::duckdb())

game_users <-
  tbl(db, "read_csv_auto('data/game_users.csv')") |>
  compute(name = "game_users")
game_actions <- 
  tbl(db, "read_csv_auto('data/game_actions.csv')") |>
  compute(name = "game_actions")
game_purchases <- tbl(db, "read_csv_auto('data/game_purchases.csv')")
exp_assignment <- tbl(db, "read_csv_auto('data/exp_assignment.csv')")

7.3 Types of Experiments

I would reword the first paragraph here to the following for clarity (edits in italics):

There is a wide range of experiments, If you can change something that a user, customer, constituent, or other entity experiences, you can in theory test the effect of that change on some outcome.

7.3.1 Experiments with Binary Outcomes: The Chi-Squared Test

To better match the approach of the book, I essentially create the contingency table in the database. An alternative approach would have been to collect() after summarize() and then do the statistical analysis in R. In fact, many of the functions in R are better set-up for this approach. However, this means bring more data into R and doing more calculation in R. If the experiment is very large or you would rather the database server do more of the work, then the approach below may be preferred.

cont_tbl <-
  exp_assignment |>
  left_join(game_actions, by = "user_id") |>
  group_by(variant, user_id) |>
  summarize(completed = 
              coalesce(any(action == "onboarding complete", na.rm = TRUE),
                       FALSE),
            .groups = "drop") |>
  count(variant, completed) |>
  mutate(completed = if_else(completed, "Yes", "No")) |>
  pivot_wider(names_from = completed, values_from = n) |>
  collect()

Using the packages janitor and flextable, I mimic the nicely formatted output shown in the book:

cont_tbl |>
  adorn_totals(c("row", "col")) |>
  mutate(`% Complete` = prettyNum(Yes/Total * 100, digits = 4)) |>
  flextable() |>
  add_header_row(values=c("", "Completed onboarding", ""),
                 colwidths = c(1, 2, 2))

Completed onboarding

variant

Yes

No

Total

% Complete

variant 1

38,280

11,995

50,275

76.14

control

36,268

13,629

49,897

72.69

Total

74,548

25,624

100,172

74.42

Now I can do the Chi-squared test. I need to turn variant into the row names of the contingency table so that we want a simple \(2 \times 2\) numeric table as the input to our statistical test and I use column_to_rownames() to this end. I then pipe the result into the base R function chisq.test(). I specified correct = FALSE so that my result matched what I got from the online calculator I found. I then display the Chi-squared statistic and the \(p\)-value.

res <- 
  cont_tbl |>
  column_to_rownames(var = "variant") |>
  chisq.test(correct = FALSE)

res$statistic
X-squared 
 157.0758 
res$p.value
[1] 4.926953e-36

7.3.2 Experiments with Continuous Outcomes: The t-Test

amounts <- 
  exp_assignment |>
  filter(exp_name == 'Onboarding') |>
  left_join(game_purchases, by = "user_id") |>
  group_by(variant, user_id) |>
  summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
            .groups = "drop")

t_test_stats <-
  amounts |>
  group_by(variant) |>
  summarize(n = n(),
            mean = mean(amount, na.rm = TRUE),
            sd = sd(amount, na.rm = TRUE)) |>
  collect()

t_test_stats |>
  kable(digits = 3)
variant n mean sd
variant 1 50275 3.688 19.22
control 49897 3.781 18.94

We can make a small function that we can pass a data frame to as df. The calculations assume that df contains two rows (one for each group) and columns named mean, sd, and n for the mean, standard deviation, and number of observations, respectively, in each group.

t_test <- function(df) {
  mean_diff = abs(df$mean[1] - df$mean[2])
  se_diff <- sqrt(sum(df$sd^2 / df$n))
  t_stat <- mean_diff / se_diff
  p <- pt(t_stat, df = sum(df$n))
  p_val <- 2 * min(p, 1 - p)
  return(list("statistic" = t_stat, "p-value" = p_val))
}

t_test(t_test_stats)
$statistic
[1] 0.7765458

$`p-value`
[1] 0.4374286

These values line up with those obtained from the online calculator I found.

An alternative approach would be to collect() the underlying data and do the \(t\)-test in R.

t_test_data <-
  amounts |>
  select(variant, amount) |>
  collect()

t.test(formula = amount ~ variant, data = t_test_data)

    Welch Two Sample t-test

data:  amount by variant
t = 0.77655, df = 100165, p-value = 0.4374
alternative hypothesis: true difference in means between group control and group variant 1 is not equal to 0
95 percent confidence interval:
 -0.1426893  0.3299478
sample estimates:
  mean in group control mean in group variant 1 
               3.781218                3.687589 
t_test_stats_2 <-
  amounts |>
  inner_join(game_actions, by = "user_id") |>
  filter(action == "onboarding complete") |>
  group_by(variant) |>
  summarize(n = n(),
            mean = mean(amount, na.rm = TRUE),
            sd = sd(amount, na.rm = TRUE)) |>
  collect()

t_test_stats_2 |>
  kable(digits = 3)
variant n mean sd
variant 1 38280 4.843 21.899
control 36268 5.202 22.049
t_test(t_test_stats_2)
$statistic
[1] 2.229649

$`p-value`
[1] 0.02577371

7.4 Challenges with Experiments and Options for Rescuing Flawed Experiments

7.4.1 Variant Assignment

7.4.2 Outliers

exp_assignment |>
  left_join(game_purchases, by = "user_id", keep = TRUE,
            suffix = c("", ".y")) |>
  inner_join(game_actions, by = "user_id") |>
  filter(action == "onboarding complete",
         exp_name == 'Onboarding') |>
  group_by(variant) |>
  summarize(total_cohorted = n_distinct(user_id),
            purchasers = n_distinct(user_id.y),
            .groups = "drop") |>
  mutate(pct_purchased = purchasers * 100.0 / total_cohorted) |>
  kable(digits = 2)
variant total_cohorted purchasers pct_purchased
control 36268 4988 13.75
variant 1 38280 4981 13.01

7.4.3 Time Boxing

amounts_boxed <- 
  exp_assignment |>
  filter(exp_name == 'Onboarding') |>
  mutate(exp_end = exp_date + days(7)) |>
  left_join(game_purchases, 
            by = join_by(user_id, exp_end >= purch_date)) |>
  group_by(variant, user_id) |>
  summarize(amount = sum(coalesce(amount, 0), na.rm = TRUE),
            .groups = "drop")


t_test_stats_boxed <-
  amounts_boxed |>
  group_by(variant) |>
  summarize(n = n(),
            mean = mean(amount, na.rm = TRUE),
            sd = sd(amount, na.rm = TRUE)) |>
  collect()

t_test_stats_boxed |>
  kable(digits = 3)
variant n mean sd
variant 1 50275 1.352 5.613
control 49897 1.369 5.766
t_test(t_test_stats_boxed)
$statistic
[1] 0.4920715

$`p-value`
[1] 0.6226699

7.4.4 Pre-Post Analysis

The original SQL query is something like this, but we make some tweaks to get this to work more naturally using dbplyr.

SELECT 
  CASE WHEN a.created BETWEEN '2020-01-13' AND '2020-01-26' THEN 'pre'
     WHEN a.created BETWEEN '2020-01-27' AND '2020-02-09' THEN 'post'
     END AS variant,
  count(distinct a.user_id) AS cohorted,
  count(distinct b.user_id) AS opted_in,
  count(distinct b.user_id) * 1.0 / count(DISTINCT a.user_id) AS pct_optin,
  count(distinct a.created) AS days
FROM game_users a
LEFT JOIN game_actions b ON a.user_id = b.user_id 
  AND b.action = 'email_optin'
WHERE a.created BETWEEN '2020-01-13' AND '2020-02-09'
GROUP BY 1
;

The first tweak is to translate the b.action = 'email_optin' part of the join into a filter, which we embed in the left_join() portion of our code (mirroring how it works in the SQL). The second tweak is to move the pct_optin out of the grouped aggregation portion of the query, as in the SQL it is referring to what we will later have as opted_in and cohorted. (In fact, I don’t bother with `pct_optin at all, as it’s easy to calculate when we use it in making a table.)

Note that COUNT(DISTINCT col) becomes n_distinct(col). We use keep = TRUE and suffix = c("", ".y") to store what is b.user_id in the SQL as user_id.y.

We calculate not_opted_in so that our table is better prepared for the chisq.test() we will pass it to later on.

opt_in_df <-
  game_users |>
  filter(between(created, 
                 as.Date('2020-01-13'), 
                 as.Date('2020-02-09'))) |>
  left_join(game_actions |> 
              filter(action == 'email_optin'), by = "user_id",
            keep = TRUE,
            suffix = c("", ".y")) |>
  mutate(variant = if_else(created <= '2020-01-26', 'pre', 'post')) |>
  group_by(variant) |>
  summarize(cohorted = n_distinct(user_id),
            opted_in = n_distinct(user_id.y),
            days = n_distinct(created),
            .groups = "drop") |>
  mutate(not_opted_in = cohorted - opted_in) |>
  collect()
opt_in_df |>
  select(variant, days, opted_in, not_opted_in, cohorted) |>
  adorn_totals(c("row")) |>
  mutate(`% opted in` = prettyNum(100.0 * opted_in/cohorted, digits = 4)) |>
  rename(Yes = opted_in, No = not_opted_in, Total = cohorted) |>
  flextable() |>
  add_header_row(values=c("", "Opted in", ""),
                 colwidths = c(2, 2, 2))

Opted in

variant

days

Yes

No

Total

% opted in

pre

14

14,489

10,173

24,662

58.75

post

14

11,220

16,397

27,617

40.63

Total

28

25,709

26,570

52,279

49.18

res <- 
  opt_in_df |> 
  select(-cohorted, -days) |>
  column_to_rownames(var = "variant") |>
  chisq.test(correct = FALSE)

res$statistic
X-squared 
 1712.075 
res$p.value
[1] 0

7.4.5 Natural experiment analysis

by_country <- 
  game_users |>
  filter(country %in% c('United States', 'Canada')) |>
  left_join(game_purchases, by = "user_id",
            keep = TRUE,
            suffix = c("", ".y")) |>
  group_by(country) |>
  summarize(cohorted = n_distinct(user_id),
            opted_in = n_distinct(user_id.y),
            .groups = "drop") |>
  mutate(pct_purchased = 1.0 * opted_in/cohorted) |>
  collect()

by_country |> kable()
country cohorted opted_in pct_purchased
United States 45012 4958 0.1101484
Canada 20179 5011 0.2483275
res <- 
  by_country |> 
  select(-pct_purchased) |>
  column_to_rownames(var = "country") |>
  chisq.test(correct = FALSE)

res$statistic
X-squared 
 1447.273 
res$p.value
[1] 1.122178e-316

To avoid a warning, we disconnect from the database once we’re done with it.

dbDisconnect(db, shutdown = TRUE)

  1. Ew!↩︎

  2. I compute() two tables so that I can refer to them using SQL below.↩︎