Chapter 3 of Tanimura (2021) is really two chapters. The first part of the chapter discusses some finer details of dates, date-times, and timestamps. This material is important, but a little technical and bewildering without some concrete use cases. This complexity is compounded by my use of DuckDB here versus PostgreSQL in Tanimura (2021), as some details work a little differently from one backend to the other. Finally, while there are some interesting issues in using dbplyr with timestamp data, these probably need concrete data and settings to be fully understood. As such, I have put the code related to this part of Chapter 3 of Tanimura (2021) in Section 1, part of the appendix.
The second part of the chapter is where Tanimura (2021) really starts to take off. This is where it starts playing around with real data.
```r
db <- dbConnect(duckdb::duckdb())

retail_sales <- 
  tbl(db, "read_csv_auto('data/us_retail_sales.csv')") |>
  compute(name = "retail_sales")
```
Now that we have the data in R, a few questions arise.
First, why would we move it to a database? Second, how can we move it to a database? Related to the second question: which database do we want to move it to?
Taking these questions in turn, the first one is a good question. With retail_sales as a local data frame, we can run almost all the analyses below with only the slightest modifications. The modifications needed are to replace every instance of window_order() with arrange().1 The “almost all” relates to the moving average, which relies on window_frame() from dbplyr, which has no exact equivalent in dplyr.
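As a hypothetical sketch of the substitution involved (mtcars here stands in for any table; this is not code from Tanimura (2021)):

```r
library(dplyr)
library(dbplyr)

# Local data frame: ordering for window functions such as lag()
# is specified with arrange():
mtcars |>
  group_by(cyl) |>
  arrange(mpg, .by_group = TRUE) |>
  mutate(mpg_lag = lag(mpg))

# Remote table: the analogous ordering is specified with
# dbplyr's window_order():
db <- DBI::dbConnect(duckdb::duckdb())
mtcars_db <- copy_to(db, mtcars)

mtcars_db |>
  group_by(cyl) |>
  window_order(mpg) |>
  mutate(mpg_lag = lag(mpg))
```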
So the first point is that “almost all” implies an advantage for using dbplyr. On occasion, the SQL engine provided by PostgreSQL will allow us to do some data manipulations more easily than we can in R. But of course, there are many cases when the opposite is true. That said, why not have both? Store data in a database and collect() as necessary to do things that R is better at?
Second, performance can be better using SQL. Compiling a version of this chapter using dplyr with a local data frame for the remainder took just under 14 seconds. Using a PostgreSQL backend, it took 8.5 seconds. Adding back in the queries with window_order() that I removed so that I could compile with dplyr, and using DuckDB as the backend, the document took 8.3 seconds to compile (this beat out PostgreSQL doing the same in 9.6 seconds). While these differences are not of great practical significance, they could be with more demanding tasks. A database will not always beat R or Python for performance, but it often will, and having the option to use a database backend is a good thing.
Third, having the data in a database allows you to interrogate the data using SQL. If you are more familiar with SQL, or just know how to do a particular task in SQL, this can be beneficial.2
Fourth, I think there is merit in separating the tasks of acquiring and cleaning data from the task of analysing those data. Many data analysts have a workflow that entails ingesting and cleaning data for each analysis task. My experience is that it is often better to do the ingesting and cleaning once and then reuse the cleaned data in subsequent analyses. A common pattern involves reusing the same data in many analyses, and it can be helpful to divide the tasks in the way that using an SQL database encourages. Also, the skills involved in ingesting and cleaning data can be different from those for analysing those data, so sometimes it makes sense for one person to do one task, push the data to a database, and then have someone else do some or all of the analysis.
Regarding the second question, there are a few options, but for this chapter we will use DuckDB.
To install DuckDB, all we have to do is install.packages("duckdb").
Then we can create a connection as follows.
Here we use the default of an in-memory database. At the end of this chapter, we discuss how we could store the data in a file (say, legislators.duckdb) if we want persistent storage.
```r
retail_sales |>
  filter(kind_of_business == 'Retail and food services sales, total') |>
  select(sales_month, sales) |>
  ggplot(aes(x = sales_month, y = sales)) +
  geom_line()
```
```sql
SELECT date_part('year', sales_month) AS sales_year,
  sum(sales) AS sales
FROM retail_sales
WHERE kind_of_business = 'Retail and food services sales, total'
GROUP BY 1;
```
```sql
SELECT date_part('year', sales_month) AS sales_year,
  kind_of_business, sum(sales) AS sales
FROM retail_sales
WHERE kind_of_business IN ('Book stores', 'Sporting goods stores',
  'Hobby, toy, and game stores')
GROUP BY 1, 2
ORDER BY 1;
```
Displaying records 1 - 10

| sales_year | kind_of_business            | sales |
|-----------:|-----------------------------|------:|
| 1992       | Book stores                 |  8327 |
| 1992       | Sporting goods stores       | 15583 |
| 1992       | Hobby, toy, and game stores | 11251 |
| 1993       | Book stores                 |  9108 |
| 1993       | Sporting goods stores       | 16791 |
| 1993       | Hobby, toy, and game stores | 11651 |
| 1994       | Book stores                 | 10107 |
| 1994       | Sporting goods stores       | 18825 |
| 1994       | Hobby, toy, and game stores | 12850 |
| 1995       | Book stores                 | 11196 |
3.2.2 Comparing Components
```r
retail_sales |>
  filter(kind_of_business %in% c('Book stores', 'Sporting goods stores',
                                 'Hobby, toy, and game stores')) |>
  mutate(sales_year = year(sales_month)) |>
  group_by(sales_year, kind_of_business) |>
  summarize(sales = sum(sales, na.rm = TRUE), .groups = "drop") |>
  ggplot(aes(x = sales_year, y = sales, color = kind_of_business)) +
  geom_line() +
  theme(legend.position = "top")
```
```sql
<SQL>
SELECT
  sales_year,
  MAX(CASE WHEN (kind_of_business = 'mens') THEN sales END) AS mens_sales,
  MAX(CASE WHEN (kind_of_business = 'womens') THEN sales END) AS womens_sales
FROM (
  SELECT sales_year, kind_of_business, SUM(sales) AS sales
  FROM (
    SELECT
      sales_month,
      naics_code,
      CASE WHEN (kind_of_business = 'Women''s clothing stores') THEN 'womens' WHEN NOT (kind_of_business = 'Women''s clothing stores') THEN 'mens' END AS kind_of_business,
      reason_for_null,
      sales,
      EXTRACT(year FROM sales_month) AS sales_year
    FROM retail_sales
    WHERE (kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores'))
  ) q01
  GROUP BY sales_year, kind_of_business
) q02
GROUP BY sales_year
```
```sql
<SQL>
SELECT *, (sales * 100.0) / total_sales AS pct_total_sales
FROM (
  SELECT *, SUM(sales) OVER (PARTITION BY sales_month) AS total_sales
  FROM retail_sales
  WHERE (kind_of_business IN ('Men''s clothing stores', 'Women''s clothing stores'))
) q01
```
```sql
SELECT sales_month,
  avg(sales) OVER w AS moving_avg,
  count(sales) OVER w AS records_count
FROM retail_sales
WHERE kind_of_business = 'Women''s clothing stores'
WINDOW w AS (ORDER BY sales_month
             ROWS BETWEEN 11 PRECEDING AND CURRENT ROW)
```
```r
date_dim <- 
  tibble(date = seq(as.Date('1993-01-01'), as.Date('2020-12-01'),
                    by = "1 month")) |>
  copy_to(db, df = _, overwrite = TRUE, name = "date_dim")
```
```sql
WITH jan_jul AS (
  SELECT sales_month, sales
  FROM retail_sales
  WHERE kind_of_business = 'Women''s clothing stores'
  AND date_part('month', sales_month) IN (1, 7))
SELECT a.date, b.sales_month, b.sales
FROM date_dim a
INNER JOIN jan_jul b
  ON b.sales_month BETWEEN a.date - interval '11 months' AND a.date
WHERE a.date BETWEEN '1993-01-01' AND '2020-12-01';
```
```sql
WITH sales_months AS (
  SELECT DISTINCT sales_month
  FROM retail_sales
  WHERE sales_month BETWEEN '1993-01-01' AND '2020-12-01')
SELECT a.sales_month, avg(b.sales) AS moving_avg
FROM sales_months a
JOIN retail_sales b
  ON b.sales_month BETWEEN a.sales_month - interval '11 months'
    AND a.sales_month
  AND b.kind_of_business = 'Women''s clothing stores'
GROUP BY 1
ORDER BY 1
LIMIT 3;
```
```sql
SELECT sales_month, sales,
  sum(sales) OVER w AS sales_ytd
FROM retail_sales
WHERE kind_of_business = 'Women''s clothing stores'
WINDOW w AS (PARTITION BY date_part('year', sales_month)
             ORDER BY sales_month)
LIMIT 3;
```
```sql
SELECT kind_of_business, sales_month, sales,
  lag(sales_month) OVER w AS prev_month,
  lag(sales) OVER w AS prev_month_sales
FROM retail_sales
WHERE kind_of_business = 'Book stores'
WINDOW w AS (PARTITION BY kind_of_business ORDER BY sales_month)
```
With PostgreSQL, we would use to_char(sales_month,'Month'); with DuckDB, the equivalent is monthname(sales_month). To use an approach that works with either backend, we draw on arguments to the lubridate function month().
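A sketch of the backend-agnostic approach (not the exact code used elsewhere in this chapter; the translation of lubridate's month() is handled by dbplyr for each backend):

```r
library(dplyr)
library(lubridate)

# With label = TRUE and abbr = FALSE, month() yields the full month
# name; dbplyr translates this to the appropriate SQL for the backend
# (e.g., to_char() on PostgreSQL, monthname() on DuckDB).
retail_sales |>
  mutate(month_name = month(sales_month, label = TRUE, abbr = FALSE)) |>
  select(sales_month, month_name)
```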
Having created a database connection, we can write the local data frame to the database using (say) copy_to().
We could specify temporary = FALSE if we wanted the data to be there permanently.3
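A sketch of what this might look like (here legislators_terms is a hypothetical local data frame and db a connection to a file-backed database):

```r
library(dplyr)

# temporary = FALSE keeps the table in the database after the
# session ends (assuming db is not an in-memory connection);
# overwrite = TRUE replaces any existing table of the same name.
copy_to(db, legislators_terms, name = "legislators_terms",
        temporary = FALSE, overwrite = TRUE)
```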
3.5.1 Using PostgreSQL
3.5.2 Read-only databases
In some cases, you will have access to a database, but no write privileges for that database. In such a case, copy_inline() can be useful.4 Note that it seems you cannot interrogate a table created using copy_inline() using SQL, though it will behave in most respects just like a table created using copy_to() when using dbplyr. It is useful to note that copy_inline() is probably not a good solution if your data are hundreds of thousands of rows or more because the table is effectively turned into literal SQL.
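A minimal sketch of copy_inline(), assuming db is a read-only connection and df is a small local data frame:

```r
library(dplyr)
library(dbplyr)

df <- data.frame(id = 1:3, value = c("a", "b", "c"))

# copy_inline() embeds the data directly in the generated SQL as
# literal values, so nothing is written to the database:
df_remote <- copy_inline(db, df)

# The embedded values are visible in the generated query:
df_remote |> show_query()

# From here, df_remote can be joined to remote tables as usual.
```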
I requested a tweak to dplyr that would have avoided the need to do this, but my request was denied. Given how awesome dbplyr/dplyr is, I cannot complain.↩︎
Though I will argue later that transitioning to dplyr/dbplyr is actually not difficult.↩︎
Obviously this would not make sense if db is a connection to an in-memory database.↩︎
I requested this function for a common use case I have. Thank you to the dbplyr team for making it happen.↩︎