library(tidyverse)
library(DBI)
library(knitr)
library(dbplyr)
9 Conclusion
<- dbConnect(duckdb::duckdb())
db
<-
legislators_terms tbl(db, "read_csv_auto('data/legislators_terms.csv')") |>
compute(name = "legislators_terms")
9.1 Funnel Analysis
The issue described in this section of Tanimura (2021) is more difficult to get to using dbplyr
. This seems the natural translation of the ideas represented in the SQL there.
|>
users left_join(step_one, by = "user_id",
keep = TRUE, suffix = c("", "_1")) |>
left_join(step_two, by = "user_id",
keep = TRUE, suffix = c("", "_2")) |>
summarize(all_users = n_distinct(user_id),
step_one_users = n_distinct(user_id_1),
step_two_users = n_distinct(user_id_2)) |>
mutate(pct_step_one = step_one_users/all_users,
pct_one_to_two = step_two_users / step_one_users)
This yields the second of the two options provided, which seems to be the implicitly preferred one if we want to capture all step_two_users
—even if some customers go directly to step_two
—though it does make pct_one_to_two
a bit more difficult to interpret in such cases.
9.2 Churn, Lapse, and Other Definitions of Departure
This query seems easier to follow with a separate window specification (WINDOW w AS
) and using a CTE. Because avg(INTERVAL)
is not defined in DuckDB, I use the epoch()
function to convert the interval to seconds.1
WITH gap_intervals AS (
SELECT id_bioguide, term_start,
lag(term_start) OVER w AS prev,
lag(term_start) OVER w) as gap_interval
age(term_start, FROM legislators_terms
WHERE term_type = 'rep'
AS (partition BY id_bioguide
WINDOW w ORDER BY term_start))
SELECT avg(epoch(gap_interval)) AS avg_gap
FROM gap_intervals
WHERE gap_interval IS NOT NULL;
avg_gap |
---|
69709258 |
Translating this query into dplyr
is straightforward, with again most of the work being done in creation of gap_intervals
. Note that we can refer to prev
created in an earlier part of the mutate
call in creating gap_interval
. It seems that dbplyr
detects this kind of situation and cleverly puts the calculation of prev
into a subquery.
<-
gap_intervals |>
legislators_terms filter(term_type == 'rep') |>
group_by(id_bioguide) |>
window_order(term_start) |>
mutate(prev = lag(term_start),
gap_interval = age(term_start, prev)) |>
ungroup() |>
select(id_bioguide, term_start, prev, gap_interval)
This makes it effectively a single line of code to transform the data into the average shown above. Note that, after collect()
, we use as.duration()
from the lubridate
package to transform the result from a number of seconds back into a duration.2 The duration type from lubridate
provides a user-friendly print method that is easier for humans to interpret.
|>
gap_intervals summarize(avg_gap = mean(epoch(gap_interval))) |>
collect() |>
mutate(avg_gap = as.duration(avg_gap)) |>
kable()
avg_gap |
---|
69709258.3546315s (~2.21 years) |
With gap_intervals
in our pocket, the following query is much simpler than the SQL shown in the book.
<-
gap_months |>
gap_intervals mutate(gap_months = year(gap_interval) * 12 + month(gap_interval)) |>
count(gap_months, name = "instances") |>
arrange(gap_months) |>
ungroup()
|>
gap_months collect(n = 3) |>
kable()
gap_months | instances |
---|---|
NA | 11236 |
1 | 25 |
2 | 4 |
The following plot better matches what is shown in the book, where it seems the caption is incorrect.
|>
gap_months filter(between(gap_months, 14, 49)) |>
ggplot(aes(x = gap_months, y = instances)) +
geom_bar(stat = "identity")
<-
latest_date |>
legislators_terms summarize(max = max(term_start)) |>
pull()
<-
intervals |>
legislators_terms filter(term_type == 'rep') |>
group_by(id_bioguide) |>
summarize(max_date = max(term_start), .groups = "drop") |>
mutate(interval_since_last = age(latest_date, max_date))
|>
intervals mutate(years_since_last = year(interval_since_last)) |>
count(years_since_last) |>
arrange(years_since_last) |>
collect(n = 3) |>
kable()
years_since_last | n |
---|---|
0 | 6 |
1 | 440 |
2 | 1 |
|>
intervals mutate(months_since_last = year(interval_since_last) * 12 +
month(interval_since_last)) |>
count(months_since_last, name = "reps") |>
mutate(status = case_when(months_since_last <= 23 ~ 'Current',
<= 48 ~ 'Lapsed',
months_since_last TRUE ~ 'Churned')) |>
group_by(status) |>
summarize(total_reps = sum(reps, na.rm = TRUE)) |>
kable()
status | total_reps |
---|---|
Churned | 10685 |
Current | 446 |
Lapsed | 105 |
9.3 Basket Analysis
This seems like a case where the programmability of R and the SQL-generation capability of dbplyr
to address this more comprehensively. To make this tangible, let’s make some data and put it in the database:
<- tribble(
purchases ~customer_id, ~product,
1, "bananas",
2, "apples",
2, "oranges",
3, "apples",
3, "oranges",
3, "bananas",
4, "bananas",
4, "apples",
4, "oranges",
4, "passionfruit") |>
mutate(customer_id = as.integer(customer_id)) |>
copy_inline(db, df = _)
Now, let’s make a function that performs as much of the analysis in the database as possible. The only data we bring into R relates to the distinct number of products purchased by customers. So let’s say some customers purchase one product, others two, three, or four, while some customers purchase six products. In this case n_prods = c(1, 2, 3, 4, 6)
and that is the only data we need in R from the database. If p = 2
(i.e., we’re interested in product pairs), we then create all the combinations of 2 products for each value in n_prods
. For a customer with one product, there is no pair. For a customer with two products, there is one pair. For a customer with three products, there are two pairs: {1, 2}
and {2, 3}
. These combinations are created in R and the resulting table (combos
) is passed back to the database.
From there, everything is processed in the database.
In creating cust_prods
, each unique product purchased by each customer is given a within-customer ID (prod_num
) using the row_number()
window function, and note is made of the number of unique products purchased by the customer (n_products
). We then join cust_prods
with combos
using (customer_id, n_products)
to create base
, which contains customer IDs, as well as the “baskets” of products represented in the form of prod_num
values stored in p_1
, p_2
, etc.
Next, we want to add the p
actual products in each basket from cust_prods
to the data stored in base
. For example, we want to turn p_1
of 1
into product_1
of "apples"
. The get_prod()
function does this for one product at a time and we use Reduce
from base R to repeatedly apply this idea p
times.
Finally, we collapse the product_i
fields into a single basket
field analogous to what is done in Tanimura (2021).
<- function(df, p) {
get_baskets
<- function(n, p) {
get_combos if (n < p) return(NULL)
<- t(matrix(combn(sort(n), p), nrow = p))
combos colnames(combos) <- paste0("p_", 1:p)
<- as_tibble(combos)
combos $row_num <- 1:nrow(combos)
comboscross_join(tibble(n_products = n), combos)
}
<-
n_prods |>
df distinct(customer_id, product) |>
group_by(customer_id) |>
summarize(n_products = n()) |>
select(n_products) |>
distinct() |>
pull()
<-
combos |>
n_prods lapply(get_combos, p = p) |>
bind_rows() |>
copy_inline(db, df = _)
<-
cust_prods |>
df group_by(customer_id) |>
window_order(product) |>
mutate(prod_num = row_number()) |>
ungroup() |>
group_by(customer_id) |>
mutate(n_products = n()) |>
ungroup()
<-
base |>
cust_prods select(customer_id, n_products) |>
distinct() |>
inner_join(combos, by = join_by(n_products)) |>
select(-n_products)
<- function(i) {
get_prod |>
cust_prods rename_with(function(x) paste0("p_", i), "prod_num") |>
inner_join(base) |>
rename_with(function(x) paste0("product_", i), "product") |>
select(customer_id, row_num, starts_with("product_"))
}
Reduce(inner_join, lapply(1:p, get_prod)) |>
arrange(customer_id, row_num) |>
mutate(basket =
sql(paste0("concat(", paste0("product_", 1:p, collapse = ", ', ', "), ")"))) |>
compute() |>
select(customer_id, basket)
}
Now we can apply our function to our data to get two-item baskets.
<- get_baskets(purchases, 2)
pairs
|> kable() pairs
customer_id | basket |
---|---|
2 | apples, oranges |
3 | apples, bananas |
3 | apples, oranges |
3 | bananas, oranges |
4 | apples, bananas |
4 | apples, oranges |
4 | apples, passionfruit |
4 | bananas, oranges |
4 | bananas, passionfruit |
4 | oranges, passionfruit |
And we can identify the most popular baskets easily:
|>
pairs count(basket) |>
arrange(desc(n)) |>
kable()
basket | n |
---|---|
apples, oranges | 3 |
apples, bananas | 2 |
bananas, oranges | 2 |
apples, passionfruit | 1 |
bananas, passionfruit | 1 |
oranges, passionfruit | 1 |
But the function also works for three-item baskets …
get_baskets(purchases, 3) |> kable()
customer_id | basket |
---|---|
3 | apples, bananas, oranges |
4 | apples, bananas, oranges |
4 | apples, bananas, passionfruit |
4 | apples, oranges, passionfruit |
4 | bananas, oranges, passionfruit |
… and four-item baskets too.
get_baskets(purchases, 4) |> kable()
customer_id | basket |
---|---|
4 | apples, bananas, oranges, passionfruit |
9.4 Resources
9.5 Final Thoughts
In PostgreSQL, the value returned is of
interval
type, which cannot be interpreted by R in any case. The value in seconds can be, however.↩︎Note that the terminology of
lubridate
differs from that used by many SQL implementations. What thelubridate
cheat-sheet calls a duration is aninterval
in PostgreSQL or DuckDB. What thelubridate
cheat-sheet calls an interval is defined by a start date and an end date.↩︎